This week will have a somewhat shorter lecture, and then a lab as usual. You will be introduced to term frequency-inverse document frequency (TF-IDF), Receiver Operating Characteristic (ROC) curves, and keywords in context (KWIC).

Presentations

Lennart is presenting:
Fake News on Twitter during the 2016 U.S. Presidential Election.
Nir Grinberg, Kenneth Joseph, Lisa Friedland, Briony Swire-Thompson, and David Lazer

Victor is presenting:
Who Leads? Who Follows? Measuring Issue Attention and Agenda Setting by Legislators and the Mass Public Using Social Media Data.
Pablo Barberá, Andreu Casas, Jonathan Nagler, Patrick J. Egan, Richard Bonneau, John T. Jost, and Joshua A. Tucker

Readings

No required readings this week.

Nevertheless, below are a few references that you might use to understand the material that we cover in class.

An Introduction to ROC Analysis
Pattern Recognition Letters, 2006, 27 (8): 861-874
Tom Fawcett

An Introduction to Statistical Learning (with Applications in R)
New York, NY: Springer, 2013.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
On ROC Curves, pp. 147-149

Introduction to Information Retrieval
New York, NY: Cambridge University Press, 2009.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
On TF-IDF, pp. 117-120
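
For orientation, the two quantities covered in these references can be computed by hand. The sketch below is written in Python purely for illustration (the lab itself uses R) and is not the lab code. It assumes one common TF-IDF variant, the raw term count times log(N / df), and computes the area under the ROC curve as the share of positive/negative pairs in which the positive case receives the higher score (ties count as half). The toy documents, labels, and scores are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term in each tokenized document by count * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive outranks the negative, 0.5 if they tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy corpus (hypothetical): three short tokenized "documents".
docs = [
    "the senate passed the bill".split(),
    "the house debated the budget".split(),
    "the budget bill failed".split(),
]
weights = tf_idf(docs)

# Toy classifier output (hypothetical): true labels and predicted scores.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In this toy corpus, "the" appears in every document, so its weight is zero, while "senate" appears in only one and gets the largest weight in the first document. That is the intuition behind using TF-IDF to down-weight uninformative words.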

Finally, as an extra, the article below provides a nice example of how to combine supervised and unsupervised learning techniques to answer important questions in the study of social media and politics. The authors manually annotate tweets as civil or uncivil (polite/impolite), then apply a supervised learning model (lasso) to classify all the tweets they collected, and finally apply an unsupervised LDA topic model to see which topics have the most uncivil posts. They do so to examine the level of incivility directed at politicians and how incivility differs depending on the political topic under discussion. A similar paper would work well as a thesis topic for master's students in the class who are still searching for an idea. From what I can tell, their analysis is also conducted wholly in R and uses ggplot for graphing (as do many papers in political science).

The Dynamics of Political Incivility on Twitter
SAGE Open, 2020: 1-15
Yannis Theocharis, Pablo Barberá, Zoltán Fazekas, and Sebastian Adrian Popa

Lectures

I will mention the articles below:

From Isolation to Radicalization: Anti-Muslim Hostility and Support for ISIS in the West
American Political Science Review, 2019, 113 (1): 173-194
Tamar Mitts

Gendered Language on the Economics Job Market Rumors Forum
American Economic Association: Papers & Proceedings, 2018, 108 (May): 175-179
Alice H. Wu

Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems
Political Analysis, 2018, 26 (1): 120-128
Andrew Peterson and Arthur Spirling

Elusive Consensus: Polarization in Elite Communication on the COVID-19 Pandemic
Science Advances, 2020, 6 (28): 1-5
Jon Green, Jared Edgerton, Daniel Naftel, Kelsey Shoub, and Skyler J. Cranmer

How State and Protester Violence Affect Protest Dynamics
Journal of Politics, Forthcoming: 1-39
Zachary C. Steinert-Threlkeld, Alexander Chan, and Jungseock Joo

Viral Visualizations: How Coronavirus Skeptics Use Orthodox Data Practices to Promote Unorthodox Science Online
CHI Conference on Human Factors in Computing Systems (CHI ’21), May 8-13, Yokohama, Japan, 2021: 1-18
Crystal Lee, Tanya Yang, Gabrielle Inchoco, Graham M. Jones, and Arvind Satyanarayan

Lab

The .R file and model objects below differ slightly from (and improve on) what is described in the video, so you may notice relatively minor changes between the downloadable R file and the one discussed in the video.

Lab code: TF-IDF.R

Tweets from Members of Congress: MOC_Tweets.rds

Tokenized data: TFIDF_Tokens.rds

Fitted models: elastic_model_counts.rds
Fitted models: elastic_model_counts_auc.rds
Fitted models: elastic_model_tfidf.rds
Fitted models: elastic_model_auc_tfidf.rds