Category Archives: Data Viz

Recreational Data Visualization

As is probably painfully obvious to anyone who has met me, I have a lot of “Rubik’s” cubes (hardly any of my ‘speedcubes’ and twisty puzzles are actually Rubik’s brand). In my spare time, I practice solving a basic 3x3x3 speedcube like this one. My general goal is to get faster. My mean solve time these days is around 26 or 27 seconds. Given that the current world record is 4.59s and the world record average* is 5.80s, I still have plenty of room for improvement.

In an attempt to track my progress, I have been keeping a spreadsheet of all of my solves and what date the solve occurred on.  I have some very basic charts in the spreadsheet tracking overall distribution and a time trend. I just thought I’d mention this here because, although not social science in any way, shape, or form, I may still play around with this data as I am learning data visualization tools. I’m always happy for feedback on my personal projects, so if there are any interesting ideas for analyses or visualizations pertaining to this data, I’d love to hear them.

 

*In speedcubing competitions, a competitor’s “average” is a trimmed mean, averaging the middle 3 of 5 solves by ignoring the fastest and slowest solve.
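
To make the trimmed mean concrete, here is a minimal Python sketch. The function name `average_of_five` and the sample times are made up for illustration; they are not taken from my spreadsheet.

```python
def average_of_five(times):
    """Trimmed mean of five solves: drop the fastest and slowest, average the rest."""
    if len(times) != 5:
        raise ValueError("expected exactly five solve times")
    middle = sorted(times)[1:-1]  # discard the best and the worst solve
    return sum(middle) / len(middle)


# Hypothetical solve times in seconds (illustrative only)
sample = [27.1, 24.8, 31.5, 26.0, 26.9]
print(round(average_of_five(sample), 2))
```

Because the extremes are dropped, one lucky or one disastrous solve does not move the result, which is the point of using a trimmed mean rather than a plain mean.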

Visualizations and Data Scraping Update

My data is coming along well, albeit a bit slowly. I have finished what I can using Excel to organize and parse out the data from the NOT A FLUKE site itself. The finished output (that I have yet to input into my coding tool) can be found here in the left half of the ‘processed 2’ sheet. The one remaining step is to incorporate the Carnegie Classification Listings so that I have geographic data as well as private/public, profit/nonprofit, and general school classification. It is unlikely that I will be able to incorporate variables that require hand-coding (specifics of misconduct, faculty relationship to victim, victim characteristics) this semester.

Now, what do I want to do with the data I have? How can it be visualized? Here are some research questions that I would like to explore through data visualizations:

 

How does faculty outcome vary geographically?

How does faculty outcome vary by faculty position in the university (president, dean, department chair, professor, etc)?

How does faculty position vary by public/private or by profit/nonprofit (or both)? – This could be getting at overall transparency and accountability of different kinds of institutions.

How does faculty outcome vary by year? Is there a change after the Obama-era Title IX reform?

 

For faculty outcome, it would be easiest to code it into a binary variable “retention” denoting whether they stayed with the university or not. For my full thesis, I intend to do a somewhat more complicated analysis of faculty outcomes.
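
As a rough sketch of what that binary coding could look like, here is a small Python example. The outcome labels in `LEFT` and `STAYED` are hypothetical placeholders, not the actual categories in my coding tool, and anything that does not match is left as missing rather than guessed.

```python
# Hypothetical outcome labels; the real categories would come from the coding tool.
LEFT = {"resigned", "fired", "retired"}
STAYED = {"suspended", "reprimanded", "no action"}


def code_retention(outcome):
    """Return 1 if the faculty member stayed, 0 if they left, None if unclear."""
    label = outcome.strip().lower()
    if label in STAYED:
        return 1
    if label in LEFT:
        return 0
    return None  # unknown or ambiguous outcomes stay missing


print([code_retention(o) for o in ["Resigned", "suspended", "unclear"]])
```

Keeping the two label sets explicit makes it easy to see, and later revise, which outcomes count as retention.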

 

These are some of my thoughts at the moment. I will have an update on the data and tests with visualizing it later this week.

Data Scraping and Data Cleaning

People keep telling me that collecting data is no fun, that coding and cleaning data is a massive headache. I would agree that it takes a lot of work, and there is a lot of problem-solving required. However, I have found this data collection and coding process quite enjoyable (with some much appreciated direction from my data visualization professor getting started with the data scraper). Here is a Google Sheets document with four sheets. We are using Google Sheets for the scraper since Excel/Sheets is a tool that I am relatively familiar with. The sheets titled “raw” and “processed” are the original data and output from the initial iteration of the coding tool. The sheets titled “raw2” and “processed2” are the versions that I have recently updated.

For this update, I reorganized the output of the scraper, omitted the names output, and added a section to output the links to the articles. I had to use slightly different protocols for the “departments” section of the list and for all the other sections of the list. I also spent a while cleaning the data and added a section for academic discipline. This was easiest to do by hand, as there were fewer than 40 categories, and they were already organized by discipline. It was as simple as copy-pasting and dragging down to the next break in the list (I inserted breaks between the sections of the list). I also removed the result/finding output (the all caps red lettering on the Not A Fluke page). It seems like it may be very difficult to chop up and code that information without looking at each case by hand. I’ll see how tedious it is to do by hand, and then I’ll revisit the idea of using a tool to do it automatically.

The data cleaning process for this project is quite interesting to me. Traditionally, I have learned of data cleaning as something that happens once, and after data collection. With the kind of data that this project is using, data cleaning is more of an iterative process: run the data through the scraper, tweak the scraper, run the data again, clean the data, etc. This is all prior to the ‘usual’ data cleaning that would happen once the data is organized in a statistical analysis program (dealing with missing cases, collapsing values, etc.).

As can be seen, there are quite a few errors in the output starting around the “Biology” section of the list. This is as far as I got with one facet of the data cleaning. One of the parts of the scraper finds the first period and uses it as a reference point. Fortunately, the formatting of the list is very consistent. Unfortunately, some people’s names have a first or middle initial. This throws off the scraper and it comes back with errors. I was able to reference where errors come up and remove the periods following initialed names. I have not yet decided if I want to go through the rest of the cases by hand, or find a way to make the scraper selectively ignore periods when they are adjoined to a single letter. The other primary data cleaning that I did fell into two categories: combining rows when one entry had been split between two rows (and the reverse), and fixing punctuation for consistency throughout the list.
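
The idea of selectively ignoring periods that follow a single letter can be sketched outside of Google Sheets. The Python below is only an illustration of the logic, not the scraper itself; the function name `first_real_period` and the sample entry are hypothetical, and the real list entries may be formatted differently.

```python
import re

# A lone capital letter at the end of the text so far, e.g. "J" or "Jane Q"
INITIAL = re.compile(r"(?:^|\s)[A-Z]$")


def first_real_period(entry):
    """Index of the first period that does not follow a single-letter initial, or -1."""
    for match in re.finditer(r"\.", entry):
        if not INITIAL.search(entry[:match.start()]):
            return match.start()
    return -1


# Hypothetical entry (illustrative only)
sample = "J. Smith, Example University. Fired."
cut = first_real_period(sample)
print(sample[:cut])
```

This skips the period after an initial and stops at the next one. It would still trip on abbreviations longer than one letter (such as "Dr."), so some hand checking would remain.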

I believe the next step (after finishing cleaning the data) will be to incorporate the Carnegie Classification data. A bit further down the road will be the by-hand portion of the data collection, unless I figure out a way to scrape the text from the news articles.

 

Note: here is the coding tool for reference.

Thesis, Data, Coding, and a Data Scraper

As I mentioned in a previous post, the working title for my thesis is, “Campus Sexual Misconduct: Exploring Faculty as Perpetrators and Institutional Power.” The data I will be using for my thesis comes from NOT A FLUKE, a project that keeps a running record of news articles of cases of faculty sexual assault and harassment. For my thesis, I will be coding some or all of these news articles (Current N = 613). I will be coding for a variety of variables including type of institution, [power] relationship between the faculty member and the victim, type of misconduct, disciplinary action and outcome, and legal action and outcome.

I have developed a draft of my coding tool here. This version of the coding tool has labels where there will eventually be numerical values. I find this way easier in Google Sheets/Excel because either there is no option for using both labels and values (like in SPSS, SAS, etc.), or I cannot find such an option. Two of the variables don’t have explicitly defined levels yet. These variables are ‘type of institution’ and ‘faculty academic discipline.’ The latter is rather easy to categorize (and already labelled on the NOT A FLUKE page), but to collapse it into manageable chunks for statistical analysis requires some theoretical or technical precedent for sorting which fields into which categories. Type of institution is slightly more complicated, but I believe I may end up using this resource. I will have these variables nailed down after a meeting with our resident sociology of education expert, Dr. Tressie Cottom, hopefully this week.

I have also been working on a codebook to accompany the coding tool, which will make life easier when working with the purely numerical version of the coding tool. I will provide a link to the codebook in a blog post on some testing of the coding tool some time in the next few days. You can infer most of the information that would be contained in it from the coding tool.

The main challenge with this data source is that it is a bit disorganized, at least technically speaking. It would make life a lot easier if it were somehow in a spreadsheet, so that’s what I will try to do. If possible, I will work with my instructor for my data visualization course to construct a data scraper to automatically extract at least some of the data and/or organize it in a spreadsheet. I think that the most important function of the data scraper would be simply pulling the links to the articles out of the NOT A FLUKE web page, obtaining the name of the university in question* for each article, and dumping these either into the coding tool or into a separate spreadsheet to then be imported into the coding tool. The second most valuable function would be to obtain other variable information from the NOT A FLUKE web page, such as ‘faculty university position’ or ‘institution outcome.’ The third most valuable (and likely most complicated) function of the data scraper would be to obtain variable information from the articles themselves.
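
As a minimal sketch of the first function (pulling links out of a page), here is one way it could look in Python using only the standard library. The class name `LinkCollector` and the HTML snippet are hypothetical; I am assuming the articles appear as ordinary `<a href="...">` links, and the real page markup may look quite different.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag along with its link text."""

    def __init__(self):
        super().__init__()
        self.links = []     # list of (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical snippet (illustrative only)
sample_html = '<p>Jane Doe, Example University. <a href="https://example.com/story">News story</a></p>'
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Each collected pair could then be written out as one row of a spreadsheet, ready to be pasted or imported into the coding tool.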

I expect that I will need to do much of the coding by hand, but I think it will be a useful, valuable exercise to practice using a data scraper and to manage messy data generally.

 

*This will be particularly helpful for merging in institutional information if I use the Carnegie Classification resource.
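
A rough sketch of that kind of merge, keyed on the university name, might look like the following. All of the field names (`university`, `name`, `control`, `state`) and the sample rows are hypothetical; the actual Carnegie Classification file uses its own column names, which I have not checked here.

```python
def normalize(name):
    """Lowercase and collapse whitespace so small formatting differences still match."""
    return " ".join(name.lower().split())


def merge_institution_info(cases, institutions):
    """Left-join case rows to institution rows on the university name."""
    lookup = {normalize(row["name"]): row for row in institutions}
    merged = []
    for case in cases:
        info = lookup.get(normalize(case["university"]), {})
        merged.append({
            **case,
            "control": info.get("control"),  # e.g. public/private; None when unmatched
            "state": info.get("state"),
        })
    return merged


# Hypothetical rows (illustrative only)
cases = [
    {"university": "Example  State University", "position": "professor"},
    {"university": "Unlisted College", "position": "dean"},
]
institutions = [{"name": "Example State University", "control": "public", "state": "VA"}]
for row in merge_institution_info(cases, institutions):
    print(row)
```

Unmatched universities keep their case information and simply get empty institutional fields, which makes it easy to spot the names that need fixing by hand.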

Digital Portfolios

One of my goals this semester is to develop my own digital portfolio. I’ve been a bit hesitant about working on this in the past because the development of such has always occurred in very forced, inorganic classroom settings. This semester, in a fully-online course dedicated to data visualization and the ability to communicate and speak data visually (and digitally), the task is much more organic. I’m concerned with the organic, genuine aspect of it because it’s my personal portfolio. I want such an expression of myself to be authentic.

So it has to be ‘authentic.’ But what else? ‘Professional’, ‘aesthetic’, and ‘interesting’ come to mind. I think these four aspects encapsulate the essence of what I currently want out of a digital portfolio. Throughout the semester, I will try to return to these concepts to keep my goals in mind, explicate what exactly they mean and how to put them into practice, and critically reflect on the concepts in the first place and consider changes and additions to them.

I already explained what authenticity means to me, but I think it would be useful to describe in more detail (if still quite broadly) what these other three concepts (professional, aesthetic, and interesting) mean to me. For my portfolio to be professional, it needs to be organized, accessible, inviting, and topical to my expertise. For it to be aesthetic, it needs to be sleek, pleasing to the eye, and easy to navigate. For it to be interesting, it needs to be focused, incorporate multimedia elements, and provide different amounts of information depending on the reader’s needs. I think that these four categories bleed into each other a bit and they certainly rely upon each other to best express themselves.

 

Portfolios out there:

 

With these goals in mind, I would like to turn to some examples of preexisting digital portfolios out in the big wide internet. First to discuss is Dr. Tressie McMillan Cottom. I am currently taking my second class with her. Being a scholar of digital sociology (not to short change her fantastic work in the sociology of education), I think it’s no surprise that Dr. Cottom’s presence on the internet is impressive. Part of that presence is her website. Although the content of the website is impressive out of context, the way that the content is presented is what I’m interested in studying for the development of my own digital portfolio.

I think that the website does a good job of walking the line between minimalism and visual interest. Color scheme, font choice, use of imagery, and menu layout play into this. The aesthetic quality of a website also signals dedication to quality of presentation. The implication is that either the person to whom the portfolio belongs dedicated a large amount of work to creating and/or designing the site themselves, or that they deemed it valuable enough to hire someone to do for them (let’s leave discussions of the implicit class and capital assumptions in that last comment about choice for another time).

Another thing that I find immediately apparent about Dr. Cottom’s website is how the content varies by how deep one ventures into it and by where one chooses to look. If I went to her website seeking professional information about her, there is a link to her CV right at the top. Ten pages of publications, affiliations, and experience right there in bland, professional text. If I went to her website wanting to know more about who she is, I would immediately notice that the menu bar at the top is not the place for me (however, it does tell me that she is a professional, hard-working person just from the tab labels). I might then scroll down, and as I scroll, I see a professional headshot and a blurb about her. Further down are testimonials about her books. Then come blog posts of hers and even a few memes (specifically ones that signal not only internet proficiency but involvement in certain racial spaces on the internet) before getting to the end of the page with contact information and a site map. I think that this demonstrates an interesting idea of expressing more humanity and individuality as one reads through the front page of Dr. Cottom’s website, but also providing targeted information depending on what the reader is looking for.

 

I stumbled across another portfolio I thought was interesting here. I found Dr. Healy’s website through a site that a colleague of mine shared with me while discussing the data visualization course that I am taking. The site is a draft of a book that Dr. Healy is writing on visualizing social data using R. I thought it was an interesting site to look through and wanted to find more information about the author. Lo and behold he has a digital portfolio, also in website format like Dr. Cottom’s. Similar to Dr. Cottom, Dr. Healy’s website leaves his headshot later on the front page. The layout is sleek, but maybe a little too flat and lacking in style, pictures, or other media for my taste. I think that it does do a good job of putting forth a specific presentation of self, and of being easy to navigate. I found it a bit odd that his CV was available under ‘publications’ instead of in an ‘about’ section or something similar.

I did, however, like the use of a flowchart/concept map under the resources tab. Visually displaying knowledge and engagement in research interests is something that I want to reflect in my portfolio.

One minor detail that I noticed when taking these screenshots is that tabs do not stay highlighted once you navigate to that page. I think that these sorts of small details don’t mean much for a website that isn’t likely to be heavily trafficked, but I think it’s the sort of detail that can make interactive data, whether qualitative or quantitative, harder to engage with and understand. Something to keep in mind for later.

 

Briefly, I thought it would be amusing to talk about a poor example for website design, and maybe more specifically, an example about how what makes a website design ‘good’ changes over time. I don’t have an example of this in digital portfolio format, but I feel like website design principles still apply. Here is the famous/infamous official Warner Bros Space Jam website.  It is famous for the fact that Warner Bros have done virtually nothing to update the website in the 22 years since the movie’s release. This website illustrates the other side of font choice, color scheme, ease of navigation, etc. Have a look around. I think that it’s worth a minute or two of amusement at least. I think one of the most important takeaways is that this website was supposedly* once an official, reputable website for a feature film, albeit a comedy.

 

 

So what about these websites do I want to incorporate into my own digital portfolio? And more generally, what do I want my digital portfolio to look like?

  • Sleek, aesthetic
    • No more than two fonts aside from my name
    • Reserved, somewhat muted color scheme
  • Engaging, interesting
    • Use of pictures and multimedia
    • Focused on charts, graphs, other visuals that represent my research interests and abilities
    • Graphics and blog posts that are generally engaging, topical.
  • Professional, informative
    • Contains CV, contact info
    • Is oriented towards an educated, research- and data-minded audience, but not excessively technical or jargon-heavy.
  • Authentic
    • Reflects my research interests and attitudes
    • Shows personality and individuality

Stay tuned for a separate post containing a wireframe of my future digital portfolio’s front page coming later today.

 

 

 

*I don’t know for sure. I was 4 at the time.

Data Visualization and My Final Semester of Grad School

Since this is my first (publicly visible) post, I should introduce myself. I am a second-year graduate student in the VCU sociology department. I am an in-person student on the thesis track, and also a teaching assistant for two of the professors in the department. I had originally planned to pursue a Ph.D. and become a professor, but after spending some close-up time with an academic department, I have decided that the life of a professor is not the life for me. After I graduate, I hope to find a job as a statistician, data manager, research methodologist, or any other job working with data, statistics, or research. These are not only my strengths, but also what I find most enjoyable and relaxing in the world of sociology, so a career in these areas would suit me well.

I chose to take a course in data visualization this semester with an eye to the future, specifically to my fast approaching career in statistics, data, and the nuts and bolts of sociological research. Studying data visualization will help me develop my skills in ‘speaking’ data and presenting and conceptualizing data for various audiences. I hope that these skills will improve my effectiveness in any of the occupations that I have listed above. I hope to get a broad understanding of various data visualization tools and build experience with both specific tools and with learning new tools generally. I also hope to develop my ability to advertise my skills through a digital portfolio.

I think that my introduction to the power of data visualization techniques was this well-known video:

I believe it was shown in an undergraduate sociology class that I took in spring 2014. I have relatively little experience with applied data visualization. The extent of my experience at the start of this semester is in using the data visualization tools in SPSS and Excel in the context of the roughly 7-to-9 undergraduate and graduate methodology and statistics classes that I have been either a student or a teaching assistant for.

Considering that I hope to tailor my work in this course to my research interests as much as possible, it would make sense to introduce my research interests now. [Very] broadly, I am interested in organizational power and inequality. As a sociologist, I am interested in the big three categories of social inequality and stratification: race, class, and gender. I have two primary research interests: patriarchy and sexual misconduct in higher education, and pornography.

The working title for my thesis is, “Campus Sexual Misconduct: Faculty as Perpetrators and Institutional Power.” I am using the Not A Fluke data set (Current N=600). It is a compilation of publicly available news articles documenting cases of faculty committing acts of sexual misconduct against students. I will be coding the data for statistical analysis and also incorporating publicly available data about institutional information and classification for the universities involved.

At this point in my education I have strong computer, statistics, and methodological skills, but little experience with data visualization tools, both in terms of what options are available and in terms of how to use them. I hope to expand my knowledge and expertise in data visualization to improve my skills as a sociologist for my entry into the job market this summer.