This week, we will start collecting social media data (YouTube comments in class, Twitter timelines in the lab) and examine the use of regular expressions for searching text data.


For information on how to access a variety of APIs, see APIs for social scientists: A collaborative review

1. In-class data-collection exercise

We will set up a YouTube API account in the Google Cloud console here: https://console.cloud.google.com/apis/api/youtube.googleapis.com
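
Tip: keep the API key out of your scripts. Below is a minimal sketch of one common approach, storing the key in ~/.Renviron and reading it with Sys.getenv(); the variable name YOUTUBE_API_KEY is just an example, not something the exercise code requires.

  # In ~/.Renviron (one line, then restart R):
  # YOUTUBE_API_KEY=the-key-you-created-in-the-Google-Cloud-console

  api_key <- Sys.getenv("YOUTUBE_API_KEY")
  if (api_key == "") stop("Set YOUTUBE_API_KEY in your .Renviron first.")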

Setting up the relevant directories (an R sketch follows this list):

  • Imagine you’re starting a new research project
  • Create a project directory on your computer
  • Create a Code/ directory within the project directory
  • Create a Data/ directory within the project directory
  • Within the Data/ directory create the following:
    • Comments/, Video_IDs/
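
A minimal R sketch of the directory setup above, assuming you run it from the top level of your new project directory (the directory names are exactly those in the list):

  # Create the project sub-directories described above
  dir.create("Code")
  dir.create("Data")
  dir.create(file.path("Data", "Comments"))
  dir.create(file.path("Data", "Video_IDs"))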

Setting up the preliminary data:

  • Create a .csv file listing 3-5 YouTube channels (e.g. FoxNews, Breitbart, MSNBC, New York Times, Democracy Now!)
    • e.g. by creating a data.frame in R (see the sketch after this list) or by creating one in Excel or Google Sheets and exporting it to .csv
    • It should have two columns: one for the proper name of the channel, and the other for the channel name as it appears on YouTube
      • For example, a proper name of a channel might be “Democracy Now!”, where the channel name on YouTube is “democracynow”
  • Save the .csv file in the Data/ directory
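
If you take the R route, a sketch along these lines should work (the channels are the examples from above; the column names and the file name channels.csv are placeholders, so use whatever the exercise code expects):

  # Build the channel list as a data.frame and export it to Data/
  channels <- data.frame(
    proper_name  = c("Democracy Now!", "MSNBC", "Fox News"),
    channel_name = c("democracynow", "msnbc", "FoxNews")
  )
  write.csv(channels, file.path("Data", "channels.csv"), row.names = FALSE)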

Work through the following code to download the comments on videos from the YouTube channels that you saved in your .csv file. Save and uncompress this file to your Code/ directory: YouTube_Exercise_Code.zip
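
The zip file contains the code we will actually use; purely for orientation, here is a rough sketch of what a single request to the YouTube Data API commentThreads endpoint looks like from R with httr and jsonlite (the video ID is a placeholder, and the real exercise code handles pagination, quota limits, and errors):

  library(httr)
  library(jsonlite)

  api_key  <- Sys.getenv("YOUTUBE_API_KEY")  # see the .Renviron note above
  video_id <- "dQw4w9WgXcQ"                  # placeholder video ID

  # One page of top-level comments for a single video
  resp <- GET(
    "https://www.googleapis.com/youtube/v3/commentThreads",
    query = list(part = "snippet", videoId = video_id,
                 maxResults = 100, key = api_key)
  )
  comments <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  # The comment text sits (roughly) at:
  # comments$items$snippet$topLevelComment$snippet$textDisplay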


2. Introduction to text analysis

Keyword approaches to text analysis

Question: Why might we worry about coming up with keywords to use for the Twitter Streaming API to find all tweets about the coronavirus? How might keyword selection bias the tweets one collects?
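
To make the question concrete, here is a toy R sketch (invented example tweets) showing how two different keyword lists capture different subsets of the same corpus, and how both can miss relevant posts that use other vocabulary:

  tweets <- c("covid cases are rising again",
              "the coronavirus pandemic changed everything",
              "got my vaccine today, arm is sore",
              "wuhan virus conspiracy theories everywhere")

  keywords_a <- c("coronavirus", "covid")
  keywords_b <- c("coronavirus", "covid", "vaccine", "pandemic")

  hits <- function(kw, x) grepl(paste(kw, collapse = "|"), x, ignore.case = TRUE)

  sum(hits(keywords_a, tweets))  # 2 tweets captured
  sum(hits(keywords_b, tweets))  # 3 tweets captured; neither list catches the fourth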

For those who are interested, I reference the following paper in the lecture about keywords:

Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
American Journal of Political Science, 2017
Gary King, Patrick Lam, and Margaret Roberts

Bag-of-words approaches to text analysis

After this lecture, you might consider reading Text Preprocessing For Unsupervised Learning by Denny & Spirling (2018), which provides an overview of bag-of-words approaches. After the lecture (and, optionally, the Denny & Spirling reading), you should also be able to understand the following passage from the article A Bayesian Hierarchical Topic Model for Political Texts by Justin Grimmer (2010):

  1. Preparing the Texts for Analysis

Using the thousands of press releases from all Senate offices, the expressed agenda model measures the priorities senators communicate to their constituents through press releases. To perform this analysis, a set of preprocessing steps are performed on the press releases, all of which are well established in the literature on the statistical analysis of text (Manning et al. 2008). The first step discards the order of words in the press release, leaving an unordered set of words remaining (Hopkins and King forthcoming; Quinn et al. forthcoming). Although one might expect the order of words to be crucial to understanding the sentiment expressed in a text, identifying the topic of a press release should be invariant to permutations of word order. Certain topics, such as the Iraq war, should result in specific words appearing with high frequency (troop, war, iraqi) irrespective of whether the senator supports or opposes the war.

Next, all the words are placed into lower case and all punctuations are removed. Then, I applied the Porter stemming algorithm to each word (Porter 1980). The stemming algorithm takes as an input a word and returns the word’s basic building block, or stem. For example, the stemming algorithm takes the words family, families and returns famili.

After stemming the words in each document, I counted the number of occurrences of each word in the corpus, the total set of press releases. All words that do not occur in at least 0.5% of press releases were removed (Quinn et al. forthcoming). Finally, I removed all stop words (e.g., around, whereas, why, whether), along with any word that appears in over 90% of any individual senator’s press releases. This ensures that each senator’s press releases are not grouped together based upon language unique to each senator, yet unrelated to the topic of the document.

After preprocessing the press releases, 1988 unique stems remain, along with 3,715,293 stem observations in the 24,236 press releases. Each document is represented as a w x 1 vector, where w are the number of stems that remain after the preprocessing (in this example, w = 1988).
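
The preprocessing pipeline Grimmer describes can be approximated in a few lines with the quanteda package. This is a sketch on a toy corpus, not Grimmer's actual code; the 0.5% document-frequency threshold is the one from the passage, and the per-senator 90% filter is omitted:

  library(quanteda)

  # Toy corpus standing in for the press releases
  docs <- c("The senator opposes the Iraq war and the troop surge.",
            "Families across the state need support for their families.",
            "The senator announced new funding for Iraq war veterans.")

  toks <- tokens(docs, remove_punct = TRUE) |>  # discard word order and punctuation
    tokens_tolower() |>                         # lower-case everything
    tokens_remove(stopwords("en")) |>           # drop stop words
    tokens_wordstem()                           # Porter-style stemming (families -> famili)

  dtm <- dfm(toks) |>
    dfm_trim(min_docfreq = 0.005, docfreq_type = "prop")  # keep stems in >= 0.5% of docs

  dtm  # each document is now a vector of stem counts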

Words-in-context approaches to text analysis

For those who are interested, I will reference the following two papers in the lecture about words-in-context approaches:

Using Word Order in Political Text Classification with Long Short-term Memory Models
Political Analysis, 2020
Charles Chang and Michael Masterson

Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research
Journal of Politics, 2022
Pedro L. Rodriguez and Arthur Spirling
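
As a small preview of what "context" means computationally, the quanteda sketch below builds a word-word co-occurrence matrix within a five-word window; counts of this kind are the raw material that word-embedding methods are trained on (toy sentences, illustrative only):

  library(quanteda)

  docs <- c("the senator opposed the war in iraq",
            "the senator supported the troops in iraq")

  toks <- tokens(docs, remove_punct = TRUE)

  # Co-occurrence counts within a 5-word window around each term
  cooc <- fcm(toks, context = "window", window = 5)
  cooc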

Lab & exercises

R Tutorial: Parsing Strings

Data (Trump & Obama Twitter timelines): TrumpObama.rds

Reference code from the video tutorial: Regex_Example.R
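
If you want to experiment before the exercise, a sketch along these lines should work once TrumpObama.rds is in your Data/ directory. I am assuming the object has a text column; check its structure first, since the exact column names may differ:

  library(stringr)

  timelines <- readRDS(file.path("Data", "TrumpObama.rds"))
  str(timelines)  # inspect the structure; the column name below is an assumption

  tweet_text <- timelines$text

  # Pull out hashtags and @-mentions with regular expressions
  hashtags <- str_extract_all(tweet_text, "#\\w+")
  mentions <- str_extract_all(tweet_text, "@\\w+")

  # How many tweets mention "America" (case-insensitive)?
  sum(str_detect(tweet_text, regex("america", ignore_case = TRUE)))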

Exercise

Exercise R file (includes instructions): String_Parsing_Exercise.R

Exercise R file solution: String_Parsing_Exercise_Solution.R