Data Scraping and Data Cleaning

People keep telling me that collecting data is no fun, and that coding and cleaning data is a massive headache. I would agree that it takes a lot of work and a lot of problem-solving. However, I have found this data collection and coding process quite enjoyable (with some much-appreciated direction from my data visualization professor in getting started with the data scraper). Here is a Google Sheets document with four sheets. We are using Google Sheets for the scraper since Excel/Sheets is a tool that I am relatively familiar with. The sheets titled “raw” and “processed” are the original data and the output from the initial iteration of the coding tool. The sheets titled “raw2” and “processed2” are the versions that I have recently updated.

For this update, I reorganized the output of the scraper, omitted the names output, and added a section to output the links to the articles. I had to use slightly different protocols for the “departments” section of the list than for the other sections. I also spent a while cleaning the data, and I added a section for academic discipline. This was easiest to do by hand, since there were fewer than 40 categories and the entries were already organized by discipline. It was as simple as copy-pasting and dragging down to the next break in the list (I inserted breaks between the sections of the list). I also removed the result/finding output (the all-caps red lettering on the Not A Fluke page). It seems like it may be very difficult to chop up and code that information without looking at each case by hand. I’ll see how tedious it is to do by hand, and then I’ll revisit the idea of using a tool to do it automatically.

The data cleaning process for this project is quite interesting to me. Traditionally, I have thought of data cleaning as something that happens once, after data collection. With the kind of data this project uses, data cleaning is more of an iterative process: run the data through the scraper, tweak the scraper, run the data again, clean the data, and so on. This all happens prior to the ‘usual’ data cleaning that would take place once the data is organized in a statistical analysis program (dealing with missing cases, collapsing values, etc.).

As can be seen, there are quite a few errors in the output starting around the “Biology” section of the list. This is as far as I got with one facet of the data cleaning. One part of the scraper finds the first period and uses it as a reference point. Fortunately, the formatting of the list is very consistent. Unfortunately, some people’s names have a first or middle initial. This throws off the scraper, and it comes back with errors. I was able to see where errors came up and remove the periods following the initials. I have not yet decided whether I want to go through the rest of the cases by hand or find a way to make the scraper selectively ignore periods when they are attached to a single letter. The other primary data cleaning that I did fell into two categories: combining rows when one entry had been split between two rows (and the reverse), and fixing punctuation for consistency throughout the list.
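
As a rough sketch of what the selective-ignore approach might look like (the cell reference A2 and the exact patterns here are my own illustration, not the formulas actually in the coding tool), a regex replacement in Google Sheets could strip the period after any lone capital letter before the scraper looks for the first period:

    =FIND(".", A2)                                        position of the first period (the current reference point)
    =REGEXREPLACE(A2, "\b([A-Z])\.", "$1")                drops periods that follow a single capital letter (initials)
    =FIND(".", REGEXREPLACE(A2, "\b([A-Z])\.", "$1"))     the first remaining period then marks the real break

With the initials’ periods removed first, the reference point should land in the same place whether or not a name has an initial.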

I believe the next step (after finishing cleaning the data) will be to incorporate the Carnegie Classification data. A bit further down the road will be the by-hand portion of the data collection, unless I figure out a way to scrape the text from the news articles.
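
When it comes time to fold in the Carnegie Classification data, one lightweight option (just a sketch; the sheet name “carnegie” and the column layout here are hypothetical) would be a lookup keyed on the institution name:

    =VLOOKUP(B2, carnegie!A:B, 2, FALSE)     pulls the classification for the institution named in B2, or #N/A if the names do not match exactly

The exact-match requirement means the institution names would need the same cleanup treatment as the rest of the list before the lookup behaves.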

 

Note: here is the coding tool for reference.

One thought on “Data Scraping and Data Cleaning”

  1. Nice. Looks like some solid progress. I find the puzzle aspect of cleaning this sort of thing pretty fun. Data work like this also makes me think hard about how I have people enter data. My focus is on keeping elements separate and guided so that I can chop things up without all this hand work.

    The next level for doing things like ignoring periods in people’s middle initials will likely require regex. There are many tutorials out there, but that’s a decent site for learning the basics. It’s an amazingly powerful skill set to master; it applies across a variety of programming languages and works in Excel and Google Sheets as well.

    For instance, \s[A-Z][.] would return single capital letters preceded by a space and followed by a period, while something like [a-z][.]\s would return any lowercase letter followed by a period and a space.
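
    In Google Sheets specifically (just as an illustration, with a raw entry sitting in A2), those patterns drop straight into the built-in regex functions:

        =REGEXEXTRACT(A2, "\s[A-Z][.]")              returns the first space–capital–period match in the entry, e.g. “ Q.”
        =REGEXREPLACE(A2, "\s([A-Z])[.]", " $1")     rewrites that match without the trailing period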
