Thesis, Data, Coding, and a Data Scraper

As I mentioned in a previous post, the working title for my thesis is “Campus Sexual Misconduct: Exploring Faculty as Perpetrators and Institutional Power.” The data for my thesis come from NOT A FLUKE, a project that keeps a running record of news articles about cases of faculty sexual assault and harassment. I will be coding some or all of these news articles (current N = 613) for a variety of variables, including type of institution, [power] relationship between the faculty member and the victim, type of misconduct, disciplinary action and outcome, and legal action and outcome.

I have developed a draft of my coding tool here. This version of the coding tool has labels where there will eventually be numerical values. I find this approach easier in Google Sheets/Excel because either there is no option for using both labels and values (like in SPSS, SAS, etc.), or I cannot find such an option. Two of the variables don’t have explicitly defined levels yet: ‘type of institution’ and ‘faculty academic discipline.’ The latter is rather easy to categorize (and already labelled on the NOT A FLUKE page), but collapsing it into manageable chunks for statistical analysis requires some theoretical or technical precedent for sorting fields into categories. Type of institution is slightly more complicated, but I believe I may end up using this resource. I hope to have these variables nailed down after a meeting this week with our resident sociology of education expert, Dr. Tressie Cottom.

I have also been working on a codebook to accompany the coding tool, which will make life easier when working with the purely numerical version of the coding tool. I will provide a link to the codebook in a blog post on some testing of the coding tool sometime in the next few days. In the meantime, you can infer most of the information it will contain from the coding tool itself.
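To make the label-to-value step concrete, here is a minimal sketch of how a codebook could translate the labelled version of the coding tool into the numerical one. The variable names, levels, and numeric codes below are hypothetical placeholders, not the actual contents of my codebook.

```python
# Hypothetical codebook: variable names, levels, and codes are placeholders,
# not the real levels from the coding tool.
CODEBOOK = {
    "type_of_misconduct": {"harassment": 1, "assault": 2, "other": 3},
    "legal_action": {"no": 0, "yes": 1},
}


def encode_row(row, codebook=CODEBOOK, missing=None):
    """Replace text labels with numeric codes; unknown labels become `missing`."""
    encoded = {}
    for variable, value in row.items():
        levels = codebook.get(variable)
        if levels is None:
            # Not a coded variable (e.g. a URL), so keep the value as-is.
            encoded[variable] = value
        else:
            # Ignore stray whitespace and capitalization in the label.
            encoded[variable] = levels.get(value.strip().lower(), missing)
    return encoded
```

Keeping the mapping in one place means the labelled spreadsheet stays readable while coding, and the numerical version can be regenerated whenever the codebook changes.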

The main challenge with this data source is that it is a bit disorganized, at least technically speaking. It would make life a lot easier if it were in a spreadsheet, so that’s what I will try to do. If possible, I will work with the instructor of my data visualization course to construct a data scraper that automatically extracts at least some of the data and/or organizes it in a spreadsheet. I think that the most important function of the data scraper would be simply pulling the links to the articles out of the NOT A FLUKE web page, obtaining the name of the university in question* for each article, and dumping these either into the coding tool or into a separate spreadsheet to then be imported into the coding tool. The second most valuable function would be to obtain other variable information from the NOT A FLUKE web page, such as ‘faculty university position’ or ‘institution outcome.’ The third most valuable (and likely most complicated) function of the data scraper would be to obtain variable information from the articles themselves.
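As a rough idea of what the first function might look like, here is a minimal sketch that pulls article links out of a page’s HTML and writes them out as spreadsheet-ready CSV text. It uses only Python’s standard library and assumes the page’s HTML has already been saved or downloaded as a string. The column names are my own placeholders, and I have not yet checked how the NOT A FLUKE page is actually structured, so getting the university name for each link would need extra logic specific to that page.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, link text) pairs for every external link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # URL of the <a> tag currently being read
        self._text = []     # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip in-page anchors and relative links; keep full URLs only.
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (url, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def links_to_csv(links):
    """Return CSV text with one row per link, ready to import into a sheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["article_url", "link_text"])  # hypothetical column names
    writer.writerows(links)
    return buffer.getvalue()
```

The CSV text from `links_to_csv(extract_links(page_html))` could be saved to a file and imported into Google Sheets or Excel, where each row becomes one article to code.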

I expect that I will need to do much of the coding by hand, but I think it will be a useful, valuable exercise to practice using a data scraper and to manage messy data generally.


*This will be particularly helpful for merging in institutional information if I use the Carnegie Classification resource.
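A merge like this could be sketched as follows, matching on a cleaned-up version of the university name. The field names (`university`, `name`, `classification`) are hypothetical, and real institution names will likely need more careful matching than this simple normalization provides.

```python
def normalize(name):
    """Lowercase, drop periods, and collapse repeated spaces."""
    return " ".join(name.lower().replace(".", "").split())


def merge_institution_info(coded_rows, institutions):
    """Attach institution-level fields to each coded article row.

    Rows with no matching institution are returned unchanged.
    """
    lookup = {normalize(inst["name"]): inst for inst in institutions}
    merged = []
    for row in coded_rows:
        match = lookup.get(normalize(row["university"]), {})
        # Copy everything except the institution's own name field.
        extra = {key: value for key, value in match.items() if key != "name"}
        merged.append({**row, **extra})
    return merged
```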

Digital Portfolios

One of my goals this semester is to develop my own digital portfolio. I’ve been a bit hesitant about working on this in the past because portfolio development has always occurred in very forced, inorganic classroom settings. This semester, in a fully online course dedicated to data visualization and the ability to communicate and speak data visually (and digitally), the task is much more organic. I’m concerned with the organic, genuine aspect of it because it’s my personal portfolio. I want such an expression of myself to be authentic.

So it has to be ‘authentic.’ But what else? ‘Professional’, ‘aesthetic’, and ‘interesting’ come to mind. I think these four aspects encapsulate the essence of what I currently want out of a digital portfolio. Throughout the semester, I will try to return to these concepts to keep my goals in mind, explicate what exactly they mean and how to put them into practice, and critically reflect on the concepts in the first place and consider changes and additions to them.

I already explained what authenticity means to me, but I think it would be useful to describe in more detail (if still quite broadly) what these other three concepts (professional, aesthetic, and interesting) mean to me. For my portfolio to be professional, it needs to be organized, accessible, inviting, and topical to my expertise. For it to be aesthetic, it needs to be sleek, pleasing to the eye, and easy to navigate. For it to be interesting, it needs to be focused, incorporate multimedia elements, and provide different amounts of information depending on the reader’s needs. I think that these four categories bleed into each other a bit and they certainly rely upon each other to best express themselves.


Portfolios out there:


With these goals in mind, I would like to turn to some examples of preexisting digital portfolios out in the big wide internet. First to discuss is Dr. Tressie McMillan Cottom. I am currently taking my second class with her. Given that she is a scholar of digital sociology (not to shortchange her fantastic work in the sociology of education), I think it’s no surprise that Dr. Cottom’s presence on the internet is impressive. Part of that presence is her website. Although the content of the website is impressive in its own right, the way that the content is presented is what I’m interested in studying for the development of my own digital portfolio.

I think that the website does a good job of walking the line between minimalism and visual interest. Color scheme, font choice, use of imagery, and menu layout all play into this. The aesthetic quality of a website also signals dedication to quality of presentation. The implication is that either the person to whom the portfolio belongs dedicated a large amount of work to creating and/or designing the site themselves, or that they deemed it valuable enough to hire someone to do it for them (let’s leave discussions of the implicit class and capital assumptions in that last comment about choice for another time).

Another thing that I find immediately apparent about Dr. Cottom’s website is how the content varies by how deep one ventures into it and by where one chooses to look. If I went to her website seeking professional information about her, there is a link to her CV right at the top: ten pages of publications, affiliations, and experience right there in bland, professional text. If I went to her website wanting to know more about who she is, I would immediately notice that the menu bar at the top is not the place for me (however, it does tell me that she is a professional, hard-working person just from the tab labels). I might then scroll down, and as I scroll, I see a professional headshot and a blurb about her. Further down are testimonials about her books. Then come blog posts of hers and even a few memes (specifically ones that signal not only internet proficiency but involvement in certain racial spaces on the internet) before getting to the end of the page with contact information and a site map. I think that this demonstrates an interesting approach: the front page expresses more humanity and individuality the further one reads, while also providing targeted information depending on what the reader is looking for.


I stumbled across another portfolio I thought was interesting here. I found Dr. Healy’s website through a site that a colleague of mine shared with me while discussing the data visualization course that I am taking. The site is a draft of a book that Dr. Healy is writing on visualizing social data using R. I thought it was an interesting site to look through and wanted to find more information about the author. Lo and behold, he has a digital portfolio, also in website format like Dr. Cottom’s. Similar to Dr. Cottom’s, Dr. Healy’s website places his headshot further down the front page. The layout is sleek, but maybe a little too flat and lacking in style, pictures, or other media for my taste. I think that it does do a good job of putting forth a specific presentation of self, and of being easy to navigate. I found it a bit odd that his CV was available under ‘publications’ instead of in an ‘about’ section or something similar.

I did, however, like the use of a flowchart/concept map under the resources tab. Visually displaying knowledge of and engagement in research interests is something that I want to reflect in my portfolio.

One minor detail that I noticed when taking these screenshots is that tabs do not stay highlighted once you navigate to that page. I think that these sorts of small details don’t mean much for a website that isn’t likely to be heavily trafficked, but I think it’s the sort of detail that can make interactive data, whether qualitative or quantitative, harder to engage with and understand. Something to keep in mind for later.


Briefly, I thought it would be amusing to talk about a poor example of website design, and maybe more specifically, an example of how what makes a website design ‘good’ changes over time. I don’t have an example of this in digital portfolio format, but I feel like website design principles still apply. Here is the famous/infamous official Warner Bros. Space Jam website. It is famous for the fact that Warner Bros. has done virtually nothing to update the website in the 22 years since the movie’s release. This website illustrates the other side of font choice, color scheme, ease of navigation, etc. Have a look around. I think that it’s worth a minute or two of amusement at least. I think one of the most important takeaways is that this website was supposedly* once an official, reputable website for a feature film, albeit a comedy.



So what about these websites do I want to incorporate into my own digital portfolio? And more generally, what do I want my digital portfolio to look like?

  • Sleek, aesthetic
    • No more than two fonts aside from my name
    • Reserved, somewhat muted color scheme
  • Engaging, interesting
    • Use of pictures and multimedia
    • Focused on charts, graphs, other visuals that represent my research interests and abilities
    • Graphics and blog posts that are generally engaging, topical.
  • Professional, informative
    • Contains CV, contact info
    • Is oriented towards an educated, research- and data-minded audience, but not excessively technical or jargon-heavy.
  • Authentic
    • Reflects my research interests and attitudes
    • Shows personality and individuality

Stay tuned for a separate post containing a wireframe of my future digital portfolio’s front page coming later today.




*I don’t know for sure. I was 4 at the time.