The foundation of any data project lies in cleaning, sorting and connecting the data through all the steps needed for analysis and presentation. This is the lesson I am learning at the moment, working with a dataset on the order of 10,000–50,000 rows. There are no clear steps involved. There is no methodology. There is no “10 Steps to Understanding Your Data” guide.
What I have are pipes and a workflow system. These need to be constructed afresh for each data project, as they depend on the type, structure and cleanliness of the original data. Each node could be a parser, a refining tool or a database query. With each iteration of the data-exploration process, the data needs to be piped through various parts of this circuit. I talked about this in my previous post, and I will draw up my network flow and post my scripts to GitHub with every project I complete. A neat trick I’ve just learnt is to use command line scripts to pipe your data through individual parsing steps using standard in and out, so as to avoid clunky paths.
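A minimal sketch of that trick, using only standard Unix tools on some made-up CSV data: each stage reads standard input and writes standard output, so the data flows through every parsing step without any intermediate files or hard-coded paths.

```shell
# Feed a tiny CSV through a chain of stdin/stdout stages.
printf 'name,score\ncarol,3\nalice,9\nbob,5\n' \
  | tail -n +2 \
  | sort -t, -k2,2 -n -r \
  | head -n 1
# tail -n +2  : drop the CSV header row
# sort        : sort numerically (descending) on the score column
# head -n 1   : keep only the top-scoring row
# prints "alice,9"
```

Each stage here is a stock utility, but any of them could just as easily be one of your own parsing scripts, as long as it reads stdin and writes stdout.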
Being able to see this circuit and having control over the flows make recursive journalism possible, and the Miso release, Dataset, is built with this sentiment in mind. Alex Graul, the Guardian developer behind the project, explains:
I’ve seen this working in action: I updated a Google Spreadsheet and the resulting chart updated automatically. I am looking forward to working with the library, as a journalist and a developer, to really glean the benefits of Open Source. Building a properly documented and described piece of software is a massive job, and it needs to be done meticulously for the Open Source community to take it on. Miso is for anyone who wishes to call themselves a ‘developer’ or a ‘journalist’. It is organisation-neutral and part of a growing trend of opening up newsroom tools.