Sadly, there is no GitHub repo for this particular data dig. Just under half of the data we looked at are in PDFs. However, there is some code on ScraperWiki.
The code I have is not great, I’ll be the first to admit that. I really just wanted it to do the job. I’m slowly beginning to do things ‘developer-style’. My first step is to actually ensure the structure of the code is robust. So I split each scraper (the bulk Purchase Award Contracts and two types of packaged Purchase Award Contracts had to be scraped separately) into 3–4 steps.
For instance, looking at the packaged contracts, I:
- Fetched the link to each text file for all the years
- For each text file, fetched all of its content into one long string (these two steps could easily be combined, but if the content changes or gets wiped I will always have the original content in a separate file. Keep all original records; it’s just the journalist in me)
- Parsed the content of each text file for the relevant information. Separating these two steps keeps ScraperWiki run time to a minimum when making calls to a website; that time is limited, so you want your parsing kept apart from your scraping
- Port allocation reports were structured differently from the Purchase Awards, so they were parsed with a different script from the one in step 2
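The separation of stages described above can be sketched roughly as below. This is a minimal illustration, not the actual ScraperWiki code: the link pattern, the record layout and all the sample data are made up for the example.

```python
import re

def extract_text_links(index_html):
    """Stage 1: collect the link to each text file from an index page.
    Assumes a simple href="...txt" pattern, purely for illustration."""
    return re.findall(r'href="([^"]+\.txt)"', index_html)

def parse_contracts(raw_text):
    """Stage 3: pull contract number, port and quantity out of the raw
    string saved in stage 2 (one record per line in this toy format)."""
    rows = []
    for line in raw_text.splitlines():
        m = re.match(r'(\d{10})\s+(\w[\w ]*?)\s+(\d+)\s*$', line)
        if m:
            rows.append({"contract": m.group(1),
                         "port": m.group(2),
                         "quantity": int(m.group(3))})
    return rows

index_html = '<a href="fy2011.txt">FY2011</a> <a href="fy2012.txt">FY2012</a>'
raw = "2000000123 Houston 500\n2000000456 New Orleans 1200\nnot a record"
print(extract_text_links(index_html))  # stage 1: the links
print(parse_contracts(raw))            # stage 3: parsed records
```

Because stage 2 saves the raw string before stage 3 touches it, the parser can be re-run as often as needed without hitting the website again.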
I downloaded these as CSVs and put them into a SQL database. For the fiscal year we were interested in, I pulled together the Excel files of the Cargo Availability Reports by copy and paste. I put this into the database and joined Purchase Award Contract to Cargo Availability Report on contract number, port and quantity. Other fields, like company and commodity, were free text, so they might not match even when they referred to the same thing; since I wanted to avoid duplicating entries on the join, the structured fields looked like the safest keys to me.
The Excel files were not consistent in their heading structure, so pulling them together programmatically would have taken considerably more effort. Given time and better programming skills, I would have liked to do that, and to write a combinatorics programme to match the contracts to the cargo reports when the quantities were split over several shipments.
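The core of such a combinatorics programme would be a subset-sum search: which combination of shipment quantities adds up to a contract’s total? A brute-force sketch, with invented numbers (fine for the handful of shipments per contract you would expect here, though it grows exponentially with the list length):

```python
from itertools import combinations

def match_shipments(contract_quantity, shipments):
    """Return the first combination of shipment quantities that sums to
    the contract total, or None if no combination fits."""
    for r in range(1, len(shipments) + 1):
        for combo in combinations(shipments, r):
            if sum(combo) == contract_quantity:
                return combo
    return None

# A 1200-unit contract split across two of four candidate shipments:
print(match_shipments(1200, [500, 700, 300, 400]))  # → (500, 700)
```

In practice you would also want to constrain candidates by contract number and port first, so the search only runs over shipments that could plausibly belong together.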
The biggest complication in getting the data together was finding out where everything was and how it was all related. Another obstacle, one that highlights how difficult it was to piece the data together, is that the reference numbers for 4 different documents share the same number structure. For instance, the packaged PDFs on the dreadful web portal have “Bid Invitation” numbers that look exactly like the “Purchase Requisition” numbers in the Cargo Availability Reports, but in fact refer to the same structured number (2000000XXX) in the filename!
There is public data and data made for public use. These are not the same thing.