Thesis, Data, Coding, and a Data Scraper

As I mentioned in a previous post, the working title for my thesis is “Campus Sexual Misconduct: Exploring Faculty as Perpetrators and Institutional Power.” The data I will be using for my thesis come from NOT A FLUKE, a project that keeps a running record of news articles about cases of faculty sexual assault and harassment. For my thesis, I will be coding some or all of these news articles (current N = 613). I will be coding for a variety of variables, including type of institution, [power] relationship between the faculty member and the victim, type of misconduct, disciplinary action and outcome, and legal action and outcome.

I have developed a draft of my coding tool here. This version of the coding tool has labels where there will eventually be numerical values. I find this approach easier in Google Sheets/Excel because, unlike SPSS, SAS, etc., either there is no option for using both labels and values or I cannot find such an option. Two of the variables don’t have explicitly defined levels yet: ‘type of institution’ and ‘faculty academic discipline.’ The latter is rather easy to categorize (and is already labelled on the NOT A FLUKE page), but collapsing it into manageable chunks for statistical analysis requires some theoretical or technical precedent for sorting fields into categories. Type of institution is slightly more complicated, but I believe I may end up using this resource. I hope to have these variables nailed down after a meeting, hopefully this week, with our resident sociology of education expert, Dr. Tressie Cottom.
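To illustrate the eventual move from labels to numerical values, here is a minimal sketch of how a codebook could drive that conversion. The variable names, levels, numeric codes, and the missing-value code are all hypothetical placeholders, not the actual levels in my coding tool.

```python
# Hypothetical codebook: variable name -> {label: numeric code}.
# These levels are placeholders for illustration only.
CODEBOOK = {
    "type_of_misconduct": {"harassment": 1, "assault": 2, "other": 3},
    "legal_action": {"no": 0, "yes": 1},
}

# Hypothetical code for labels that do not appear in the codebook.
MISSING = -9


def encode_row(row, codebook=CODEBOOK, missing=MISSING):
    """Return a copy of one coded case with labels replaced by numeric codes."""
    encoded = {}
    for variable, value in row.items():
        levels = codebook.get(variable)
        if levels is None:
            # Not a coded variable (e.g. an ID or a URL), so keep it unchanged.
            encoded[variable] = value
        else:
            # Match labels case-insensitively and ignore stray whitespace.
            encoded[variable] = levels.get(value.strip().lower(), missing)
    return encoded


example = encode_row(
    {"case_id": "17", "type_of_misconduct": "Assault", "legal_action": "yes"}
)
```

Keeping the mapping in one place means the labelled sheet stays readable for coding by hand, while the numeric version can be regenerated whenever the codebook changes.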

I have also been working on a codebook to accompany the coding tool, which will make life easier when working with the purely numerical version of the coding tool. I will provide a link to the codebook in a blog post on some testing of the coding tool sometime in the next few days. You can infer most of the information that would be contained in it from the coding tool.

The main challenge with this data source is that it is a bit disorganized, at least technically speaking. It would make life a lot easier if it were in a spreadsheet, so that’s what I will try to do. If possible, I will work with my instructor for my data visualization course to construct a data scraper to automatically extract at least some of the data and/or organize it in a spreadsheet. I think that the most important function of the data scraper would be simply pulling the links to the articles out of the NOT A FLUKE web page, obtaining the name of the university in question* for each article, and dumping these either into the coding tool or into a separate spreadsheet to then be imported into the coding tool. The second most valuable function would be to obtain other variable information from the NOT A FLUKE web page, such as ‘faculty university position’ or ‘institution outcome.’ The third most valuable (and likely most complicated) function of the data scraper would be to obtain variable information from the articles themselves.
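As a rough idea of the first function, here is a minimal sketch that pulls links and their link text out of a chunk of HTML and formats them as CSV, using only Python's standard library. The sample HTML is invented for illustration; I have not yet inspected how the NOT A FLUKE page is actually marked up, so extracting the university name would need rules specific to that layout and is left out here.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html):
    """Return a list of (link text, url) pairs found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def links_to_csv(links):
    """Format the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["link_text", "url"])
    writer.writerows(links)
    return buffer.getvalue()


# Invented sample markup; the real page will almost certainly differ.
SAMPLE_HTML = """
<p>Example University: <a href="https://example.com/news/1">Professor resigns</a></p>
<p>Sample College: <a href="https://example.com/news/2">Lecturer
   suspended</a></p>
"""

links = extract_links(SAMPLE_HTML)
csv_text = links_to_csv(links)
```

The resulting CSV text can be saved to a file and imported into Google Sheets or Excel. In practice the HTML would come from a saved copy of the page or a download step, which is not shown here.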

I expect that I will need to do much of the coding by hand, but I think it will be a valuable exercise to practice using a data scraper and managing messy data generally.


*This will be particularly helpful for merging in institutional information if I use the Carnegie Classification resource.
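A minimal sketch of that kind of merge, assuming each scraped case carries a university name and the institutional file has one row per institution. The field names (`university`, `name`, `classification`) and the example values are hypothetical, not taken from any real classification file.

```python
def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace so names match."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def merge_institution_info(cases, institutions):
    """Attach a classification to each case by matching on university name.

    Cases whose university is not found get None, so they can be fixed by hand.
    """
    lookup = {normalize(inst["name"]): inst["classification"] for inst in institutions}
    merged = []
    for case in cases:
        row = dict(case)  # copy so the original case is left untouched
        row["classification"] = lookup.get(normalize(case["university"]))
        merged.append(row)
    return merged


# Invented example rows for illustration only.
cases = [
    {"url": "https://example.com/news/1", "university": "Example  University"},
    {"url": "https://example.com/news/2", "university": "Unlisted College"},
]
institutions = [{"name": "example university", "classification": "Category A"}]

merged = merge_institution_info(cases, institutions)
```

Exact matching on a cleaned-up name will miss abbreviations and alternate names (for example, a campus listed under its system name), so unmatched rows would still need a manual pass.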