Scrape Up, Mash Up, Fuse Up

So I’m in the US, preparing to roll out events. To decide where to go, I needed data: numbers on the type of people we’d like to attend our events. To generate good data projects we need the right cohort of guests: media folks (including journalism students) and people who can code in Ruby, Python and/or PHP. We’d also like to get data analysts, or Data Scientists as they are now known, particularly those who use R.

So with assistance from Ross Jones and Friedrich Lindenberg, I scraped the locations (plus websites and telephone numbers where available) of all the media outlets in the US (changing the structure to suit its intended purpose, which is why there are various tables), data conferences, Ruby, PHP, Python and R meetups, B2B publishers (Informa and McGraw-Hill) and the top 10 journalism schools. I added small sets by hand, such as HacksHackers chapters. All in all, nearly 12,800 data points. I had never used Fusion Tables before but I had heard good things. So I mashed up the data and imported it into Fusion Tables. So here it is (click on the image, as sadly wordpress.com does not support iframes):

Click to explore on Fusion Tables
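
For anyone wanting to try the mash-up step, here is a rough sketch of how differently shaped scraped records could be folded into one table with a category column before import. It is not the code used for this post: the column names, the `normalise`, `mash_up` and `to_csv` helpers and the sample records are all made up for illustration.

```python
import csv
import io

# Hypothetical shared schema; the real tables behind this post are not shown.
FIELDS = ["name", "category", "city", "state", "website", "phone"]


def normalise(record, category):
    """Map one scraped record onto the shared schema, leaving gaps blank."""
    row = {field: (record.get(field) or "").strip() for field in FIELDS}
    row["category"] = category
    return row


def mash_up(sources):
    """Combine {category: [records]} into one flat list of rows."""
    return [normalise(rec, cat) for cat, recs in sources.items() for rec in recs]


def to_csv(rows):
    """Serialise rows as CSV text, a format most mapping tools can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample records, for illustration only.
sources = {
    "media outlet": [
        {"name": "Example Gazette", "city": "Springfield", "state": "IL"},
    ],
    "python meetup": [
        {"name": "Example Py Group", "city": "Portland", "state": "OR",
         "website": "http://example.org"},
    ],
}

rows = mash_up(sources)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the category as a column, rather than as one table per source, is what makes filtering and aggregating by category possible later on.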

Sadly there is a lot of overlap, so not all the points are visible. Google Earth explodes points that share the same spot, but it couldn’t handle this much data when I exported it. Once we decide where best to go I can home in on exact addresses. I wanted to use the map to pinpoint concentrations, so a heat map of the points was mostly what I was looking for.

Click to explore on Fusion Tables
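
If you only need the concentrations and not the picture, a crude stand-in for a heat map is to count points per grid cell. This is a minimal sketch, assuming each data point has already been geocoded to a latitude and longitude; the `concentrations` function, the rounding precision and the sample coordinates are my own assumptions, not anything Fusion Tables does internally.

```python
from collections import Counter


def concentrations(points, precision=1):
    """Count points per grid cell by rounding latitude and longitude.

    One decimal place of latitude is roughly 11 km, so overlapping or
    nearby points fall into the same cell.
    """
    cells = Counter()
    for lat, lon in points:
        cells[(round(lat, precision), round(lon, precision))] += 1
    # Busiest cells first.
    return cells.most_common()


# Made-up coordinates: three around New York City, one in Chicago.
points = [(40.71, -74.00), (40.73, -73.99), (40.69, -74.02), (41.88, -87.63)]

hot_spots = concentrations(points)
for cell, count in hot_spots:
    print(cell, count)
```

Counting this way also sidesteps the overlap problem, since stacked points add to a cell's total instead of hiding each other.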

Using Fusion Tables I then broke down the data for the hot spots. I looked at the category proportions and, using filter and aggregate, made pie charts (see New York City for example). The downside I found with Fusion Tables is that the colour schemes cannot be adjusted (I had to fix them up using Gimp) and the filters can only be combined with AND (there is no OR option). The downside with US location data is the similarity of place names across states (and places that share a name with a state), so I had to eyeball the data. So here is the breakdown for each region, where the size of each pie chart corresponds to the number of data points for that location. Sizes are relative within a region, not across regions.

Of course media outlets would outnumber coding meetups, universities and HacksHackers Chapters, but they would be a better measure of population size and city economy.
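
The two snags above, AND-only filters and place names repeated across states, are easy to work around once the data is outside the tool. The following is a sketch under stated assumptions, not what I actually ran: the row layout, the `matches_any` and `category_shares` helpers and the sample rows are hypothetical.

```python
from collections import Counter


def matches_any(row, wanted_categories):
    """OR filter: keep a row if its category is any one of the wanted ones."""
    return row["category"] in wanted_categories


def category_shares(rows, city, state):
    """Share of each category within one (city, state) pair.

    Matching on city AND state keeps Portland, OR apart from Portland, ME.
    """
    counts = Counter(
        r["category"] for r in rows if (r["city"], r["state"]) == (city, state)
    )
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Made-up rows, for illustration only.
rows = [
    {"category": "media outlet", "city": "Portland", "state": "OR"},
    {"category": "media outlet", "city": "Portland", "state": "OR"},
    {"category": "ruby meetup", "city": "Portland", "state": "OR"},
    {"category": "python meetup", "city": "Portland", "state": "OR"},
    {"category": "media outlet", "city": "Portland", "state": "ME"},
]

coders = [r for r in rows if matches_any(r, {"ruby meetup", "python meetup"})]
shares = category_shares(rows, "Portland", "OR")
print(len(coders), shares)
```

The shares are the slice sizes for one pie chart, and the total count behind them is what would set the chart's size relative to the other locations in its region.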

What I’ve learnt from this is:

  1. Free tools are simple to use if you play around with them
  2. They can be limiting for visual mashups
  3. The potential of your outcome is proportional to your data size, not your tool functionality (you can always use multiple tools)
  4. To work with different sources of data you need to think about your database structure and your outcome beforehand
  5. Manipulate your data in the database, not in your tool, and always keep the integrity of the source data
  6. Having data feed into your outcome changes your role from event reporter to source

This all took me about a week, in between doing other ScraperWiki stuff and speaking at HacksHackers NYC. If I were better at coding I imagine this could be done in a day, no problem.

2 thoughts on “Scrape Up, Mash Up, Fuse Up”

  1. Hi,
    I’ve got a question about plotting the pie charts on the Google map. How is this done?
    With Google Fusion Tables? Or other scripts?
    Thx.

    • Hi. Fusion Tables does not allow this type of overlay, so I made the pie charts separately from the map and placed them on it myself.
