I was recently asked to speak at the inaugural meetup of HacksHackers Canterbury. I wasn’t sure what proportion of developers and journalists would turn up or what they would like to hear. So I decided to give my top ten tips for working on a data-driven project and managing the relations between what the journalists need and what the developers need. These observations are based on my time at The Guardian but also on rays of enlightenment from experience at the BBC and CNN, as well as struggling with my own data digging projects at ScraperWiki. I introduced the topic by pointing out that developers love cats, so put cats everywhere.
For a team of developers working on a project, file, variable and method names should follow consistent formats and be labelled logically. This is a practice they all know of even if they do not adhere to it strictly. Sharing, reusing and publishing code require certain standards and best practices. Yet, traditionally, journalism was not a team event and journalists were allowed to organise themselves any which way they liked. Notes were written and filed (hopefully). Now, all journalists have email, Twitter, Google Docs, spreadsheets and Evernote: a digital cacophony spread across platforms, each with its own filing system. Dealing with an array of files filled with data at various stages of aggregation and cleaning, along with documents on what was done, is a nightmare. Not only is it a pain that slows the process, it is a danger to the integrity of the story. The lack of clear organisation increases the probability of introducing human error in constructing the data. So no files called “temp part 3.4 final revised” should ever exist. For command line use, never put spaces in a name (it’s a pain). Choose a logical convention and stick with it. Two common ones are thisIsCamelCase and this_is_snake_case. Use one or, if you have different structures, use both to make this apparent: for example, use camel case for file names and match them to the tables, and snake case for the column headings in the table.
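To make the naming advice concrete, here is a minimal Python sketch that turns a messy name into snake case. The function name to_snake_case is my own invention for illustration, and the rules it applies (split camel case, collapse spaces and punctuation into underscores) are just one possible convention, not a standard.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a free-form or camelCase name to snake_case (hypothetical helper)."""
    # Insert an underscore at every lower-to-upper boundary:
    # "thisIsCamelCase" becomes "this_Is_Camel_Case".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of spaces, dots or other punctuation into one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()

print(to_snake_case("thisIsCamelCase"))  # this_is_camel_case
print(to_snake_case("Price Per Unit"))   # price_per_unit
```

Running every column heading through one function like this is an easy way to keep a whole project on the same convention.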
Understand The Structure:
A really easy tip for journalists is to open files in a text editor. These are free to download and any text editor will do. Often, what you see in your application (Excel, OpenOffice, Google Docs, etc.) is not what is actually seen by a program. Even if the journalist can’t do anything about it, the person dealing with the data needs to be aware so that they can communicate proper time frames to the journalist. Often, getting the data into the right structure is just as time intensive as building an interactive to display it. But there are simple things the journalist can do. When working with CSVs (spreadsheets), make sure there are no carriage returns in the cells. Identify the lowest common denominator in data dependencies and record those rather than the resultant data point. For example, if you have a quantity and a price per quantity that give you a total price, log the quantity and the price per quantity, as the total price can be calculated from those two. Always add an id column to your dataset so that, if the data needs to be joined, artefacts are more easily found.
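The three CSV tips above can be sketched in a few lines of Python using only the standard csv module. The sample data, the column names (item, quantity, price_per_unit) and the function names are all made up for illustration; treat this as a sketch of the idea rather than a finished cleaning script.

```python
import csv
import io

def find_line_breaks(rows):
    """Return (row_number, column_name) for every cell containing a line break."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if isinstance(value, str) and ("\r" in value or "\n" in value):
                problems.append((row_number, column))
    return problems

def add_ids(rows):
    """Give every row a sequential id so records can be traced after a join."""
    return [{"id": i, **row} for i, row in enumerate(rows, start=1)]

def total_price(row):
    """Derive the total from the two stored values instead of storing it."""
    return int(row["quantity"]) * float(row["price_per_unit"])

# Made-up sample: the quoted cell in the second record hides a line break.
sample = 'item,quantity,price_per_unit\napples,3,0.50\n"pears\n(organic)",2,1.25\n'
rows = list(csv.DictReader(io.StringIO(sample)))

print(find_line_breaks(rows))  # [(2, 'item')]
print(total_price(rows[0]))    # 1.5
```

In a spreadsheet the second record looks like any other row, which is exactly why it is worth checking from outside the application.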
Do The Journalism First:
This may seem very obvious but working with data requires another layer of questions before you pass it on. Besides looking to see if there are interesting stories in the data you need to ask yourself: Is this the full data? What calculations can I do to check? Are there columns I can join to add value to the data set? Do I know what all the columns mean and how they were generated? Has there been a change in the way a column has been measured/gathered? What results can be caused by artefacts? In that sense, don’t just hand over a download of the data to the data architect, give a link to where you found it.
Get Comfortable With Your Tools:
This applies to the budding data journalists. Ideally, you will have to use an array of tools. Know how they work and what they work best for. Know how they can be combined to reduce the time frame or the probability of introducing errors. Be comfortable with them but don’t feel you have to master them all. If you want to dedicate your time to mastery, pick the coding language, as software tools always change or disappear entirely. This is the real difficulty with working with data. Each project will require intensive use of only a subset of the tools you have in your tool box. Because projects can take 6 to 8 weeks, you learn to use those tools better. But it may be a good while before you use any particular one again, and by then you’ve lost the mastery and have to jog your brain a bit. Developers do it all the time. So seek their help, comfort and advice.
Always Be Learning:
All developers have to keep learning to keep up with the web, and journalism is going the same way. I really enjoy this part but I realise that I am in a unique situation. I would strongly urge editors and managers to give journalists time to pick up new skills: if not on their own, then with training courses or, better yet, peer-to-peer programming and tutoring. Regardless, there are a lot of free online resources which are great for beginners, including Codecademy, LearnCodeTheHardWay and a range of courses from Coursera and Udacity. The main lesson is never to expect to know everything or to have to do everything. The best practice is to build atop someone else’s work. Just adhere to best practices in terms of openness and attribution.
The Command Line Is Your Friend:
One thing that was and still is quite hard for me (but really should have been the starting point) is using the command line. It is hugely beneficial to everyone to run the computer through the terminal. It will also impress the developers. For manipulating files it is much faster than any piece of software. What I found enlightening (this was passed to me by a developer at The Guardian) is to use the command line to pipe CSV files through Python parsing scripts rather than writing a script that looks at one file and prints to another. This means I can write individual scripts that each change one thing, keep those, file them and reuse them without having to write one long script.
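As a rough illustration of that piping idea, here is a minimal Python filter that reads CSV from standard input and writes CSV to standard output, doing one small job (trimming stray whitespace in every cell). The file name strip_cells.py, the function name and the other file names in the comment are hypothetical; the real scripts I was shown may have looked quite different.

```python
import csv
import sys

def strip_cells(infile, outfile):
    """Copy CSV from infile to outfile, trimming whitespace around every cell."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in csv.reader(infile):
        writer.writerow([cell.strip() for cell in row])

# Act as a filter only when data is piped in, for example:
#   python strip_cells.py < raw.csv | python another_step.py > clean.csv
# (raw.csv, another_step.py and clean.csv are placeholder names.)
if __name__ == "__main__" and not sys.stdin.isatty():
    strip_cells(sys.stdin, sys.stdout)
```

Because the script never names a file itself, the same small step can be dropped into any chain of other steps, which is what makes it reusable.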
Work Open Source:
If you’re going to take away one tip, this would be it. This is the most useful. What I mean by work open source is work as if you’re in a team and are going to publish your workings, and even as if you are going to make a tutorial. This means you comment your code, structure your code and your files, and put it all in a place where you can find it when you have forgotten what you did. It may seem needless and tedious at first but it pays off in spades in the long term. It can also attract attention and connect you to the type of people you need to know, so share and share alike. You may sometimes feel like the cat here, horrified. But if people point out that you did something in a less than pretty way, it means you will learn the more elegant way to do things and are less likely to make the faux pas again. There is nothing to be gained from keeping your learning and your work closed.
Do Fun Things:
If you are thinking of learning to code to augment your journalism, then go to hack days and conferences and socialise in the developer community. They are very good fun. Find a HacksHackers and get talking to people. Newsroom developers are lucky in that even news organisations have adopted the hackday ethos of creativity. Journalists have never had the opportunity to take a day or even a week (as is the case at The Guardian) to work on any idea they have. This is how investigative journalism should be brought back into the newsroom: by having days where every journalist can pitch an idea and get a team together to see if a feasible investigation can be made. Sadly, that is not the case. But developers have this amazing ‘right’ to hackdays. They are all about creativity, meeting new people and getting new perspectives (plus finding out what’s going on in different development circles). So take advantage of these when you can.
Catch Bad Guys:
Developers like creating things. They like making fun things or functional things. What they don’t like is having to tear things apart or spend days digging up and constructing data. They don’t like having to call people up about it. Many developers can and will do all these things, but the typical development community is centred around building software, libraries and websites. Data-driven investigations are where the data journalism niche lies and this, to me, is the most exciting area of journalistic potential. This is also where the developers are most appreciative of your skills.
Think About Others:
At the talk I said I haven’t met a rude or egotistical programmer since I started, which Martin Belam pointed out is owing to my mileage. I have met egotistical and rude people who are also programmers, but what I meant by this (which may also be naive) is that I haven’t met a programmer who thinks he/she can make/do/build everything themselves (unlike some journalists I’ve met). They are always looking on Stack Overflow and GitHub, always use libraries, and ask others for the proper syntax of a language they haven’t used in ages. When it comes to design and creativity you get egos, but when it comes to just learning the nuts and bolts of code, I have found everyone quite humble. That being said, I feel it is very important that the journalist in the equation does not get an ego. I feel the roles as they stand now have the journalist throwing data or ideas to developers, telling them what they want (which is usually something they’ve seen on a competitor’s site) and expecting the results in a timely fashion.
For the future of quality journalism through a connective, social, immediate medium, each person in the workflow has to understand and know a fair bit about everyone else’s role.