There’s More Than One Way To Wrangle Data – Recasting with Google, R and Python

Just as facts can be woven in many different ways to produce many different narratives, so data visualisation lends itself to storytelling and not just data wrangling. The extra dimension involved in the thing we currently call ‘Data Journalism’ is that there are many different tools with which we can weave data into the narrative.

So let’s go through three (and free) ways to recast data. The simplest way is more likely to be what you take away, but to make data work for you rather than working for your data, you will need to go above and beyond proprietary software. It’s a matter of scale, control and transparency.

Most people who use Excel or Access would recast data in the form of a Pivot Table. In that sense, by ‘recast’ I mean bucketing your data using the various columns and data types available. For these examples I’ll be using data from The Guardian which was recently used to make an interactive on aid donations to Somalia. Go here for the data.
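To make 'bucketing' concrete before touching any particular tool, here is a minimal sketch in Python of what a pivot-table style recast does: group rows by one column and total another. The sample rows and the column names (`date`, `donor`, `amount`) are made up for illustration and are not the layout of the Guardian spreadsheet.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are assumptions for illustration,
# not the actual headers of the Guardian dataset.
SAMPLE_CSV = """date,donor,amount
2011-07-20,Donor A,100
2011-07-20,Donor B,250
2011-07-21,Donor A,50
"""


def total_by(rows, key_column, value_column):
    """Bucket rows by key_column and sum value_column within each bucket."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)


# Read the CSV text into a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Recast once by day; the same function recasts by donor or any other column.
totals_by_day = total_by(rows, "date", "amount")
for day in sorted(totals_by_day):
    print(day, totals_by_day[day])
```

Swapping `"date"` for `"donor"` in the call gives the total per donor instead, which is all a pivot table is doing when you drag a different column into the row labels.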

Here’s how to get the total donations by day using just Google Docs:

 

If you’re looking to use data analysis as a part of your journalistic career, you’re going to have to handle large and dirty datasets. The data scientist’s tool of choice is R. It’s free and open source. It is maintained by an academic community and so is very powerful for statistical analysis and visualizations. It has a learning curve but there are many resources online to help you out. Here’s a little taster with some code to handle the column categories in the data. Here is the code for you to download.

 

Now, for the very brave, let’s do it all like it should be done: in code! Journalistically, you want your work to go a long way, especially if you’ve sweated over a keyboard for it. In that sense you’d want to fetch your data, recast it and then pipe it through to a visual in one fell swoop. You’d also want to be able to follow your data through every step of manipulation to check for quality. Most importantly, you’ll want your audience to see what it is you are doing and recast the data for their own needs if they so choose. So here’s how to recast a very different dataset using Python. Download the code here.
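As a rough illustration of that fetch, recast and pipe idea, here is a small sketch that keeps each step in its own function so the data can be inspected at every stage. It is not the downloadable script: the inline data, the column names and the function names `parse`, `count_by_donor` and `to_lines` are all hypothetical, and the "fetch" step is faked with a string so the example runs without a network connection.

```python
import csv
import io

# Hypothetical stand-in for fetched data; a real script would read a file or URL.
RAW = """donor,country,amount
Donor A,UK,100
Donor B,US,250
Donor A,UK,50
"""


def parse(text):
    """Step 1: turn raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by_donor(rows):
    """Step 2: recast the rows into a count of donations per unique donor."""
    counts = {}
    for row in rows:
        donor = row["donor"].strip()
        if donor not in counts:
            counts[donor] = 0
        counts[donor] += 1
    return counts


def to_lines(counts):
    """Step 3: produce sorted 'donor,count' lines ready to feed a chart or file."""
    return ["%s,%d" % (donor, counts[donor]) for donor in sorted(counts)]


rows = parse(RAW)
counts = count_by_donor(rows)

# Quality check between steps: every input row must land in exactly one bucket.
assert sum(counts.values()) == len(rows)

for line in to_lines(counts):
    print(line)
```

Because each stage returns plain Python data, a reader can stop after any step, look at what came out, and swap in their own recast without touching the rest.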

One thought on “There’s More Than One Way To Wrangle Data – Recasting with Google, R and Python”

  1. Evidence of the power of working in the open: Raynor Vliegendhart, an actual developer, has made my code “more Pythonic” i.e. better. Changes are:

    # Refactoring part 1 (line 46):
    k not in d.keys()
    # can be replaced by:
    k not in d

    # Refactoring part 2 (lines 55-61):
    sortedKeys = d.keys()
    sortedKeys.sort()
    # can be replaced by:
    sortedKeys = sorted(d)

    He made a pull request on GitHub and I merged it. https://github.com/DataMinerUK/Blog/blob/master/UniqueDonorParser.py

    So my work is being made better for me. I’m learning to be a better Pythonist and my work is being made more efficient for me!
