Data Scraping

It is evident that the internet has completely changed how we interact with each other. Social media has contributed to this ongoing change as we become “a digital society”. The internet has also changed where we shop, how we receive our daily dose of news, and how we show off our new purchases. It has likewise changed how we earn academic degrees and how we communicate with colleagues overseas. More importantly, the internet and the web generate an enormous amount of data that can be used for social research. Consequently, the internet has changed how we as sociologists or social researchers conduct our studies and collect our data.

Data scraping refers to the process of retrieving data or information from a source such as the web, social media, or a local computer. In recent years, web scraping systems have been developed that rely on techniques such as parsing and Natural Language Processing (NLP), which simulate human language processing in order to extract useful information automatically. Kiser (2016) explains that “NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way”. With the help of NLP, researchers can perform various tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation (Kiser, 2016). Data scraping and data mining tools can be versatile in social research because they give researchers the flexibility to collect data that already exists. They enable a form of “live” social research, in which online data may allow quicker and less expensive research findings.

Source: Socially Aware Blog
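To make the parsing step described above more concrete, here is a minimal sketch in Python using only the standard library. It assumes the page has already been downloaded; the HTML in `SAMPLE_PAGE` is invented for illustration, and the names `TextExtractor`, `extract_text`, and `word_counts` are hypothetical helpers, not part of any tool mentioned in this post. It strips the markup, keeps the visible text, and counts words, which is roughly the raw material that NLP techniques would then work on.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def word_counts(text):
    """Lowercase the text and count each word (a very crude tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Invented page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Local news</h1>
<p>The council approved the budget.</p>
<p>The budget funds new schools.</p>
<script>var x = 1;</script>
</body></html>
"""

text = extract_text(SAMPLE_PAGE)
counts = word_counts(text)
print(text)
print(counts.most_common(3))
```

A real project would fetch pages over the network and would likely use a dedicated NLP library for tokenization; both are left out here to keep the sketch self-contained.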

Even though this is a new approach in social research, some social researchers have already opted for this type of methodology. In 2014, researcher Priscilla Murphy used text mining and semantic network analysis to examine the question “How have three decades of change in Chinese society, business, and politics influenced its official media coverage of emerging partnerships and interactions with US business?” She applied Centering Resonance Analysis (CRA) to 929 articles published in China Daily and Xinhua News Service between 1979 and 2011. She then used cluster analysis and factor analysis to group news stories that received similar coverage and to reveal a theme for each cluster (Murphy, 2014). Her findings revealed 17 different themes over that period. She notes that her findings suggest that “A sense of historical inevitability surrounded China’s progress from the periphery to the core of globalization, while the US moved from sole player to one of many partners for China” (Murphy, 2014).

In 2012, researcher Wu He conducted a study titled “Examining students’ online interaction in a live video streaming environment using data mining and text mining”. In this study, He applied data mining and text mining techniques to examine the dynamics of social interaction in a web-based learning environment built on a live video streaming (LVS) system, analyzing the online questions and chat messages exchanged among students and between students and their instructors. The findings suggest that there is a correlation between the number of questions students asked and their final grades.

Source: He (2012)

In 2011, Bruns et al. used web crawling methodologies “for mapping the Australian political blogosphere and tracking how information is disseminated across it”. The researchers first tracked blogging activity and scraped new blog posts as they were announced through Really Simple Syndication (RSS) feeds. Then, they used a custom-made tool to differentiate between types of content, making it easier to analyze only the crucial content written by the bloggers. Finally, they examined the data using link network mapping and textual analysis tools to establish key themes. In their findings, the authors suggest that Australian political bloggers consistently write about current political affairs but often interpret them differently from mainstream news outlets. This study seems especially interesting now as we enter a new era of “fake news”. I think it would be really interesting to replicate the study to examine the current political blogosphere in the United States.
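The first and last steps of that workflow (reading posts from an RSS feed, then mapping who links to whom) can be sketched in a few lines of Python. This is not the tooling Bruns et al. used; it is a simplified illustration using only the standard library. The feed in `SAMPLE_FEED`, its `.example` domains, and the helper names `LinkCollector` and `link_edges` are all made up, and it assumes each post's HTML sits in the RSS `description` element.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Invented RSS 2.0 feed standing in for a real blog's feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Politics Blog</title>
  <link>http://blog-a.example/</link>
  <item>
    <title>Budget reaction</title>
    <link>http://blog-a.example/budget</link>
    <description>&lt;p&gt;See &lt;a href="http://blog-b.example/post1"&gt;this reply&lt;/a&gt;
      and &lt;a href="http://news.example/story"&gt;the report&lt;/a&gt;.&lt;/p&gt;</description>
  </item>
  <item>
    <title>Poll watch</title>
    <link>http://blog-a.example/polls</link>
    <description>&lt;p&gt;More at &lt;a href="http://blog-b.example/post2"&gt;blog B&lt;/a&gt;.&lt;/p&gt;</description>
  </item>
</channel></rss>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def link_edges(feed_xml):
    """Return (source_site, target_site) pairs for links found in each post."""
    root = ET.fromstring(feed_xml)
    edges = []
    for item in root.iter("item"):
        source = urlparse(item.findtext("link", default="")).netloc
        collector = LinkCollector()
        collector.feed(item.findtext("description", default=""))
        for href in collector.links:
            target = urlparse(href).netloc
            # Ignore links without a host and links back to the same site.
            if target and target != source:
                edges.append((source, target))
    return edges


edges = link_edges(SAMPLE_FEED)
in_links = Counter(target for _, target in edges)
print(edges)
print(in_links.most_common())
```

Counting in-links per site is only the simplest possible network measure. A replication of the study would need many feeds, repeated collection over time, and a proper network analysis library.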


Bruns, A. et al. “Mapping The Australian Networked Public Sphere”. Social Science Computer Review 29.3 (2011): 277-287. Web.

He, Wu. “Examining Students’ Online Interaction In A Live Video Streaming Environment Using Data Mining And Text Mining”. Computers in Human Behavior 29.1 (2013): 90-102. Web.

Kiser, Matt. “Introduction To Natural Language Processing (NLP) 2016 – Algorithmia”. Algorithmia. N.p., 2017. Web. 26 Mar. 2017.

Murphy, Priscilla, and Marilena Olguta Vilceanu. “Official Chinese Media Representations Of US Business, 1979–2011: A Text Mining Approach”. International Communication Gazette 76.8 (2014): 682-702. Web.



One thought on “Data Scraping”

  1. Figuring out how to use social media analysis tools to detect ‘fake news’ would be quite an accomplishment. I wonder if fake news travels through the media network differently than real news? Can we detect fake news if it never appears on a certain website, e.g. HuffingtonPost? How has the Russian fake news been detected? Is it all hunt and peck? Or is there a systematic way to uncover it?
