This week we will discuss supervised machine learning methods for analyzing social media data.


Readings

Introduction to supervised learning

  1. An Introduction to Statistical Learning (with Applications in R)
    New York, NY: Springer, 2013.
    Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
    Read only chapters 1, 3, and 5
    For Chapter 5, you do not have to read the material on bootstrapping (§5.2 and beyond). I recommend it, however, because the bootstrap is a common statistical tool that is widely applicable and generally easy to implement.

For anyone interested in machine learning more generally, I recommend reading the full book. It and its somewhat more technical companion, The Elements of Statistical Learning (local copy here), are very well known in both the machine learning literature and the social sciences.

Methods

There are many machine learning methods. To become familiar with at least one commonly used approach, read one of the following articles on machine learning for political science data:

  1. Tree-Based Models for Political Science Data
    American Journal of Political Science, 2018, 62 (3): 729-744
    Jacob M. Montgomery and Santiago Olivella

  2. Separating the Wheat from the Chaff: Applications of Automated Document Classification Using Support Vector Machines
    Political Analysis, 2014, 22 (2): 224-242
    Vito D’Orazio, Steven T. Landis, Glenn Palmer, and Philip Schrodt

Applications to social media data

Read one of the following two applications to social media data:

  1. A Bad Workman Blames His Tweets: The Consequences of Citizens’ Uncivil Twitter Use When Interacting with Party Candidates
    Journal of Communication, 2016, 66 (6): 1-25
    Yannis Theocharis, Pablo Barberá, Zoltán Fazekas, Sebastian Adrian Popa, and Olivier Parnet

  2. Predicting and Interpolating State-Level Polls Using Twitter Textual Data
    American Journal of Political Science, 2017, 61 (2): 490-503
    Nicholas Beauchamp

Optional: For those who are interested, many researchers pay ordinary internet users to classify social media texts for the purpose of creating a training set. The following article examines how well these ordinary annotators perform classification tasks compared with political science experts.

  1. Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data
    American Political Science Review, 2016, 110 (2): 278-295
    Kenneth Benoit, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov

Lecture

In the lecture, I will refer to the following articles:

  1. Signals of Public Opinion in Online Communication: A Comparison of Methods and Data Sources
    ANNALS of the American Academy of Political and Social Science, 2015, 659 (1): 95-107
    Sandra González-Bailón and Georgios Paltoglou

  2. Automated Text Classification of News Articles: A Practical Guide
    Political Analysis, 2021, 29 (1): 19-42
    Pablo Barberá, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler

To learn more about qualitative annotation for creating a labeled dataset, the classic book on the topic is the following:

Content Analysis: An Introduction to Its Methodology
SAGE Publishing, Thousand Oaks, CA, 2018
Klaus Krippendorff

Also, see this:

Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications
O’Reilly, Sebastopol, CA, 2012
James Pustejovsky and Amber Stubbs

Some articles introducing machine learning in the social sciences more generally:

Machine Learning Methods That Economists Should Know About
Annual Review of Economics, 2019, 11: 685-725
Susan Athey and Guido W. Imbens

Big Data: New Tricks for Econometrics
Journal of Economic Perspectives, 2014, 28 (2): 3-28
Hal R. Varian

Labs

Cross-validation

The code for cross-validation: Perplexity_Cross_Validation.R
The output data from running cross-validation for all k (used to make the graph): cv_out.rds

A summary of cross-validation

  1. Split your training set into k groups at random, where k is typically 5 or 10.
  2. For each of the k groups:
    • Set all observations in the current group aside as a validation set
    • Fit your machine learning model to the data in the remaining training set (i.e., the data from the other k-1 groups)
    • Use your fitted model to make predictions for the outcome in the validation set (the data you set aside). This is your out-of-sample test: making predictions for data that the model didn’t see.
    • Save your measure(s) of how well the model predicted values in the validation set (e.g., accuracy, precision, recall, F score, area under the curve)
  3. Summarize the cross-validated performance of the model by, for example, taking the average of the F scores (or any other performance metric)
  4. Depending on what you are cross-validating, choose the hyperparameters or the supervised learning model that performed best
  5. If you had separated your labeled data into a training set and a test set, use the best-performing model from your cross-validation to predict values in the test set. This gives a final, unbiased estimate of out-of-sample performance (on data that was not used in the cross-validation procedure).
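
Below is a minimal sketch of this procedure in base R. It is illustrative only (the lab’s actual implementation is in Perplexity_Cross_Validation.R): the data frame labeled_data, its 0/1 outcome column y, and the predictors x1 and x2 are assumptions made for the example, and logistic regression stands in for whatever model you are tuning.

# Minimal k-fold cross-validation sketch (illustrative; see
# Perplexity_Cross_Validation.R for the lab's code). Assumes a data frame
# `labeled_data` with a 0/1 outcome `y` and predictor columns x1 and x2.
set.seed(123)
k <- 5

# Step 1: assign each observation to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(labeled_data)))

# Step 2: loop over folds, holding each one out as a validation set
accuracy <- numeric(k)
for (i in 1:k) {
  validation <- labeled_data[folds == i, ]
  training   <- labeled_data[folds != i, ]

  # Fit the model on the k-1 training folds (logistic regression here,
  # standing in for whatever supervised learner you are tuning)
  fit <- glm(y ~ x1 + x2, data = training, family = binomial)

  # Predict the held-out observations (the out-of-sample test)
  pred_prob  <- predict(fit, newdata = validation, type = "response")
  pred_class <- ifelse(pred_prob > 0.5, 1, 0)

  # Save a performance measure for this fold (accuracy, for simplicity)
  accuracy[i] <- mean(pred_class == validation$y)
}

# Step 3: summarize cross-validated performance across folds
mean(accuracy)

From here, steps 4 and 5 amount to comparing this averaged metric across candidate models or hyperparameter values and then evaluating the winner once on the held-out test set.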

Applied supervised learning case

Lab code: Supervised_Learning_Lab.R
Tweets from Members of Congress: MOC_Tweets.rds
Tokenized tweet data: SML_Tokens.rds
Fitted random forests model: rf_model.rds
Fitted elastic net model: elastic_model.rds
The caret library in R: caret
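
To show how these pieces fit together, here is a hedged sketch of a caret workflow. The outcome column name party and the use of SML_Tokens.rds as a ready-to-model data frame are assumptions for illustration; the lab’s actual code in Supervised_Learning_Lab.R is authoritative. The "rf" and "glmnet" methods require the randomForest and glmnet packages, respectively.

# Illustrative caret workflow (a sketch, not the lab's exact code).
# Assumes SML_Tokens.rds loads as a data frame with a factor outcome
# `party` and token-based predictor columns.
library(caret)

sml_tokens <- readRDS("SML_Tokens.rds")

# 5-fold cross-validation, as in the summary above
ctrl <- trainControl(method = "cv", number = 5)

# Random forest (compare with the fitted model in rf_model.rds)
rf_fit <- train(party ~ ., data = sml_tokens, method = "rf", trControl = ctrl)

# Elastic net (compare with elastic_model.rds); caret tunes alpha and lambda
enet_fit <- train(party ~ ., data = sml_tokens, method = "glmnet", trControl = ctrl)

# Cross-validated performance for each candidate model
rf_fit$results
enet_fit$results

By default, caret’s train() cross-validates over a small grid of tuning parameter values, so the results tables report the hyperparameter settings alongside the performance metrics, which is exactly the comparison described in steps 3 and 4 of the cross-validation summary.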