During this week, we will discuss how to classify the topics discussed by politicians and users on social media.
Presentations
André, Max, and Nikolaj are presenting:
A 61-million-person Experiment in Social Influence and Political Mobilization
Robert M. Bond, Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler (2012)
Readings
- Probabilistic Topic Models
Communications of the ACM, 2012, 55 (4): 77-84
David M. Blei
This is a nice non-technical introduction to topic models. Please use this article to get the intuition behind these models.
- Computer-Assisted Text Analysis for Comparative Politics
Political Analysis, 2015, 23 (2): 254-277
Christopher Lucas, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley
- Latent Dirichlet Allocation
Journal of Machine Learning Research, 2003, 3: 993-1022
David M. Blei, Andrew Y. Ng, and Michael I. Jordan
Skim this one. This paper was the first to introduce LDA and gets cited often, but it is probably far more technical than the papers you are used to. That's totally okay. For this class, you just need to get the intuition behind these types of models.
Read one of the following empirical applications to get a sense of how topic models are used to study social media data:
- Elites Tweet to Get Feet Off the Streets: Measuring Regime Social Media Strategies During Protest
Political Science Research & Methods, 2019, 7 (4): 815-834
Kevin Munger, Richard Bonneau, Jonathan Nagler, and Joshua A. Tucker
- Who Leads? Who Follows? Measuring Issue Attention and Agenda Setting by Legislators and the Mass Public Using Social Media Data
American Political Science Review, 2019, 113 (4): 883-901
Pablo Barberá, Andreu Casas, Jonathan Nagler, Patrick J. Egan, Richard Bonneau, John T. Jost, and Joshua A. Tucker
In addition to the lecture, there are two videos for the topic models class. The first introduces how topic models work in the abstract, and the lab introduces their use for an applied question in political science.
Labs
Lab 1
In the first lab, I simulate data to show you how an LDA model works (this will be essentially equivalent to what I walked through in the lecture). Please download the code and run it step by step to follow along with the video.
Note that in this lab, the goal of walking through the R code is to show you how an LDA model assumes that texts are created. Obviously it is “wrong”: when people write texts they do not pick a topic and then pick a word, but instead write in full sentences. The goal of showing the data-generating process (DGP) by simulating data is to give you some sense of how an LDA model is imagining that texts are generated. LDA models are designed to model that process to uncover what the topics are and what their distributions are within each document. But even though the DGP is clearly wrong on its face, as you will see in Lab 2, it nevertheless works surprisingly well with real-world data.
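To make the DGP concrete before you open the lab code, here is a minimal sketch of that generative process. The topics, vocabulary, and Dirichlet parameter below are made-up illustrative values, not the ones used in LDA_DGP.R:
# Minimal sketch of the LDA generative process: draw a document's topic proportions,
# then for each word draw a topic and then a word from that topic.
# All values below are illustrative, not the lab's.
set.seed(42)
vocab <- c("covid", "vaccine", "mask", "tax", "budget", "jobs")
topics <- rbind(                      # each topic is a distribution over the vocabulary
  health  = c(0.35, 0.30, 0.25, 0.04, 0.03, 0.03),
  economy = c(0.03, 0.04, 0.03, 0.30, 0.30, 0.30)
)
K <- nrow(topics)
alpha <- rep(0.5, K)                  # Dirichlet prior on topic proportions
rdirichlet1 <- function(a) { x <- rgamma(length(a), a); x / sum(x) }  # one Dirichlet draw
simulate_doc <- function(n_words = 20) {
  theta <- rdirichlet1(alpha)                 # 1. the document's mix of topics
  words <- replicate(n_words, {
    z <- sample(K, 1, prob = theta)           # 2. pick a topic for this word
    sample(vocab, 1, prob = topics[z, ])      # 3. pick a word from that topic
  })
  paste(words, collapse = " ")
}
docs <- replicate(5, simulate_doc())
docs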
The code used in this first lab is here: LDA_DGP.R
For reference, at the end of the first lab I refer to the following articles:
A Correlated Topic Model of Science
Annals of Applied Statistics, 2007, 1 (1): 17-35
David M. Blei and John D. Lafferty
How to Analyze Political Attention with Minimal Assumptions and Costs
American Journal of Political Science, 2010, 54 (1): 209-228
Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev
A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases
Political Analysis, 2010, 18 (1): 1-35
Justin Grimmer
Structural Topic Models for Open-Ended Survey Responses
American Journal of Political Science, 2014, 58 (4): 1064-1082
Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand
Lab 2
In this lab, we will apply a 100-topic LDA model to tweets from US Members of Congress to examine discussions about COVID-19 by Democrats and Republicans over time. Who was quicker to set the agenda around COVID-19, Democrats or Republicans? Make a guess if you know something about US politics.
The lab cleans and analyzes the Members of Congress data in essentially the same way as Barberá et al. (2019) and Munger et al. (2019). So if you want to apply an LDA model in a similar fashion to what was done in those two applied papers from the readings, this lab shows how.
Before working through the video, please install the following libraries, which are add-ons to the text analysis library "quanteda". I use one or two functions from these libraries in the lab:
install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
The R file for this lab can be found here: Topic_Models_Lab.R
The data are the tweets from US Members of Congress, which can be found here: MOC_Tweets.rds
Tokenized data object (the output of one of the steps so you can avoid waiting for 15+ minutes): Tokens.rds
100-topic LDA model object (the output of one of the steps so you can avoid waiting for over an hour): model_lda_100.rds
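If you would like a rough preview of the workflow before watching the video, here is a minimal sketch of how a dataset like this can be turned into a document-feature matrix and handed to an LDA model. It uses the topicmodels package as one common way to fit LDA; the column name "text", the preprocessing choices, and the trimming threshold are assumptions on my part, so treat Topic_Models_Lab.R as the authoritative version:
# Minimal sketch (not the lab file): fit a 100-topic LDA to the Members of Congress tweets
library(quanteda)
library(topicmodels)
tweets <- readRDS("MOC_Tweets.rds")                        # assumed: a data frame with a "text" column
corp <- corpus(tweets, text_field = "text")
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfm_tweets <- dfm_trim(dfm(toks), min_docfreq = 5)         # drop very rare terms (illustrative threshold)
dfm_tweets <- dfm_subset(dfm_tweets, ntoken(dfm_tweets) > 0)  # LDA cannot handle empty documents
dtm <- convert(dfm_tweets, to = "topicmodels")             # hand the dfm to the topicmodels package
lda_100 <- LDA(dtm, k = 100, control = list(seed = 123))   # the slow step (an hour or more)
terms(lda_100, 10)                                         # top 10 words per topic, to spot the COVID-19 topics
head(posterior(lda_100)$topics)                            # per-tweet topic proportions, for the party-over-time comparison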
Calculating the number of topics (Optional)
One step in using topic models that was not discussed in the lab is how to choose k, the number of topics. For a given dataset, should k be 2, 5, 10, 20, 50, 100, or more?
Both Munger et al. (2019) and Barberá et al. (2019) use a cross-validation technique to calculate some statistics to determine the number of topics to use with their data. In both papers, the authors selected k = 100. However, as you will see on pages 824-825 in Munger et al. (2019), and on page 889 and in Appendix G.2 in Barberá et al. (2019), they calculate perplexity and log-likelihood statistics for a variety of k to determine what number of topics is optimal.
Perplexity and log-likelihood statistics are relatively technical, so I will not go into these in detail here. These measures essentially capture how well a model fits the data: if you try a model with k = 30, does it make better topic predictions than if you try a model with, say, k = 100? We use cross-validation to calculate these statistics. This means repeatedly fitting a topic model to, say, 90% of the data and then trying to make predictions for the 10% of the data that the model didn’t use. We’ll cover cross-validation shortly.
However, because I want you to have all of the technical tools necessary to write a paper like Munger et al. (2019) or Barberá et al. (2019), I have written code for you that calculates these statistics for any topic model you wish to run. If you can understand and run the code provided below, you will have all of the technical skills necessary to collect and analyze data for a paper like the Munger et al. (2019) article; there is essentially nothing else, technically, that you would need to know. For an article like Barberá et al. (2019), you would also need to learn about vector auto-regression (a time series model), but that would be it.
I have pretty heavily documented each step in the code provided below so that you can follow along. The code outputs a graph that is analogous to Figure 2 on page 825 in Munger et al. (2019), and to Figure A12 in Barberá et al. (2019) on page 27 of their Appendix.
The code for calculating these statistics (and the graph) is here: Perplexity_CV.R
The cross-validation bit takes over 2 hours to run (or probably 6+ hours on a slower laptop), so I provide you with the resulting model object here in case you don’t want to run it yourself: cv_out.rds
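If you just want the gist of what that script does, here is a minimal, single-fold sketch of the idea. It assumes the dtm object created in the Lab 2 sketch above and uses the topicmodels package's perplexity() function; Perplexity_CV.R does the full repeated cross-validation and produces the graph:
# Minimal single-fold sketch of held-out perplexity across several values of k
# (assumes `dtm` is the document-term matrix from Lab 2; the values of k are illustrative)
library(topicmodels)
set.seed(123)
holdout <- sample(nrow(dtm), round(0.1 * nrow(dtm)))       # hold out 10% of the documents
train_dtm <- dtm[-holdout, ]
test_dtm  <- dtm[holdout, ]
ks <- c(10, 30, 50, 100)
perp <- sapply(ks, function(k) {
  fit <- LDA(train_dtm, k = k, control = list(seed = 123))
  perplexity(fit, newdata = test_dtm)                      # lower = better fit on held-out tweets
})
plot(ks, perp, type = "b", xlab = "Number of topics (k)", ylab = "Held-out perplexity")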