The final Presidential debate of 2016 was as heated as the previous two—well demonstrated by the following name-calling exchange:

CLINTON: ...[Putin would] rather have a puppet as president of the United States.
TRUMP: No puppet. No puppet.
CLINTON: And it's pretty clear...
TRUMP: You're the puppet!
CLINTON: It's pretty clear you won't admit ...
TRUMP: No, you're the puppet.

It is easy to base our opinions of the debate, and of the differences between the presidential candidates, on excerpts like this and on memorable one-liners. But are small extracts representative of the debate as a whole? Moreover, how can we objectively analyse what was said, who got to say the most, and how the candidates differed in their responses? One approach is to turn to basic text analysis, which is easily implemented in R (using a package such as “quanteda”).

To begin, all we need is a transcript of the debate, with each candidate’s answers saved in a different text (.txt) file. Once loaded into R as a corpus (i.e. an object that contains both our files), we have all we need to get started.

A simple summary of the texts tells us a few interesting things straight away.

          Words used   Unique words   Sentences
Clinton        8,190          1,493         446
Trump          8,077          1,167         620


We see that both candidates say roughly the same number of words (“Words used”) but that Clinton uses a greater variety of words (“Unique words”) in her responses. In fact, she uses approximately 28 percent more unique terms than her opponent. Clinton also has longer sentences than Trump. This can be deduced from the fact that the candidates use approximately the same number of words but Trump spreads his across 174 more sentences than Clinton. Based on this basic evidence, we can say that in the third debate Trump preferred shorter sentences with more repeated words, and Clinton longer sentences with more word variation.
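The counts and comparisons above can be reproduced in a few lines. The article's analysis was done in R with quanteda; what follows is only a minimal Python sketch using the standard library, with a deliberately naive tokeniser and sentence splitter (both are assumptions, and a real tokeniser will give somewhat different totals). The `summarise` helper is a hypothetical name, and the sample text is the short exchange quoted at the top, not the full transcript.

```python
import re

# Words are runs of letters, digits and apostrophes (straight or curly).
WORD = re.compile(r"[a-z0-9'’]+")

def summarise(text):
    """Return (words used, unique words, sentences) for one speaker's text."""
    words = WORD.findall(text.lower())
    # Naive split on ., ! and ?; abbreviations such as "U.S." would be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(set(words)), len(sentences)

# Toy input taken from the exchange quoted in the article.
sample = "No puppet. No puppet. You're the puppet!"
print(summarise(sample))  # (7, 4, 3)

# Figures reported in the summary table.
clinton = {"words": 8190, "unique": 1493, "sentences": 446}
trump = {"words": 8077, "unique": 1167, "sentences": 620}

# How many more unique terms Clinton used, relative to Trump.
extra_unique = clinton["unique"] / trump["unique"] - 1
# Average sentence length in words for each candidate.
clinton_len = clinton["words"] / clinton["sentences"]
trump_len = trump["words"] / trump["sentences"]

print(f"{extra_unique:.0%} more unique terms")            # 28% more unique terms
print(f"{clinton_len:.1f} vs {trump_len:.1f} words per sentence")  # 18.4 vs 13.0
print(trump["sentences"] - clinton["sentences"], "more sentences")  # 174
```

The average sentence lengths make the contrast concrete: about 18 words per sentence for Clinton against about 13 for Trump.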

Going further, we can see the similarities and differences between the vocabularies of the two candidates by creating a Venn diagram (using the “venneuler” package in R) of the unique words used by each candidate and the words they have in common.

Roughly 28 percent of the words used in the debate are used by both candidates. As mentioned above, and as her larger circle in the Venn diagram shows, Clinton uses more unique words than Trump.
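The quantities behind such a Venn diagram are plain set operations on each speaker's vocabulary. The sketch below is in Python rather than the R used for the article, and it assumes that the shared percentage is measured against all distinct words used by either speaker; the article does not state its denominator. The function names and the two toy sentences are illustrative only.

```python
import re

WORD = re.compile(r"[a-z0-9'’]+")

def vocabulary(text):
    """Set of distinct lowercased words in a text."""
    return set(WORD.findall(text.lower()))

def overlap(text_a, text_b):
    """Sizes of the three Venn regions plus the shared fraction of all distinct words."""
    a, b = vocabulary(text_a), vocabulary(text_b)
    shared, union = a & b, a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        # Assumption: share is relative to the union of both vocabularies.
        "shared_fraction": len(shared) / len(union) if union else 0.0,
    }

# Toy sentences, not taken from the transcript.
result = overlap("We will protect social security", "We will have strong borders")
print(result)
# {'only_a': 3, 'only_b': 3, 'shared': 2, 'shared_fraction': 0.25}
```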

To get a better feeling for what the candidates actually said, we can make a document-feature matrix (DFM). A DFM is a simple matrix that counts the number of times a word or set of words (e.g. every pair of adjacent words, or “bi-gram”) appears in each document. Before making a DFM, punctuation is usually removed, along with common “stop words” (e.g. the, a, an, by), and all words are generally changed to lowercase.
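To make the structure of a DFM concrete, here is a minimal Python sketch (the article itself used quanteda in R). It lowercases the text, drops punctuation and stop words, and counts what remains per document. The stop-word list is a tiny illustrative one, and `build_dfm` and the two toy documents are hypothetical, not drawn from the transcript.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9'’]+")
# Tiny illustrative stop-word list; real lists contain a few hundred entries.
STOP_WORDS = {"the", "a", "an", "by", "and", "of", "to", "is"}

def build_dfm(documents, stop_words=STOP_WORDS):
    """Return (features, matrix): one row of counts per document, one column per word."""
    counts = {
        name: Counter(w for w in WORD.findall(text.lower()) if w not in stop_words)
        for name, text in documents.items()
    }
    # Columns are every word seen in any document, in alphabetical order.
    features = sorted(set().union(*counts.values()))
    matrix = {name: [c[f] for f in features] for name, c in counts.items()}
    return features, matrix

# Toy documents for illustration only.
docs = {
    "doc_a": "The people want a plan.",
    "doc_b": "People, people love the wall.",
}
features, matrix = build_dfm(docs)
print(features)         # ['love', 'people', 'plan', 'wall', 'want']
print(matrix["doc_a"])  # [0, 1, 1, 0, 1]
print(matrix["doc_b"])  # [1, 2, 0, 1, 0]
```

Each row is a document and each column a feature, so comparing two speakers amounts to comparing two rows.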

A simple way to visualise the results of a DFM is to plot them as a word cloud. First, Trump’s top words:

Now Clinton’s top words:

The size and darkness of the words correspond to their prominence. For example, the most prominent word in both clouds is “people”. Trump said the word “people” 49 times and Clinton 37 times. However, words out of context do not always mean much. We can instead plot pairs of words, or “bi-grams”, to get more of a sense of how the words fit together.

By looking at the bi-grams, the context behind the candidates’ words becomes clearer. We can easily pick out “planned parenthood”, “social security” and “women’s rights” in Clinton’s top bi-grams, all phrases whose meaning depends on their component words being paired together. In Trump’s word cloud, the contrasting terms “strong borders” and “open borders” are prominent, along with “make america” and “america great”. Interestingly, none of the word clouds includes the word “puppet”.
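Counting bi-grams only requires sliding a window of two words across the text, and the same window generalises to any n. The following is a minimal Python sketch rather than the quanteda code behind the article's word clouds; the `ngrams` helper and the toy sentence are illustrative, and stop words are left in, which may differ from the article's preprocessing.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9'’]+")

def ngrams(text, n=2):
    """Count every run of n adjacent words (n=2 gives bi-grams, n=3 tri-grams)."""
    words = WORD.findall(text.lower())
    # A text shorter than n words simply yields an empty Counter.
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

# Toy text for illustration, not a transcript excerpt.
text = "Make America great again. Make America safe."
bigrams = ngrams(text, n=2)
print(bigrams.most_common(1))  # [('make america', 2)]
print(len(ngrams(text, n=3)))  # 5
```

Note that this sketch lets n-grams run across sentence boundaries ("again make" is counted here), which a more careful version would avoid by splitting into sentences first.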

We can go even further and make tri-grams or, for that matter, “n”-grams of any length to get a better sense of the context in which these words were spoken. Beyond simply increasing n, text analysis should be paired with an understanding of what’s going on between the texts (over time and across space) and behind the words. As Trump quipped at a fundraiser on the 20th of October, “Michelle Obama gives a speech, and everyone loves it. It’s fantastic. They think she is absolutely great. My wife, Melania, gives the exact same speech and people get on her case.”

It was the context behind the re-use of Michelle Obama’s speech by Melania Trump that made that particular case of text re-use controversial. Text analysis can detect the overlap, but it is up to the savvy analyst to interpret the findings and to decide the substantive significance of the re-use.

Returning to the third presidential debate of 2016, text analysis techniques can be used to show the different choices the two candidates made in terms of the words, phrases, and sentences they used. These simple analyses can also be expanded to compare this latest debate with its predecessors to get a broader picture of the candidates’ approaches. Which strategies and talking points will pay off, though? I’ll leave that to Nate Silver to predict.

This article was first published at the Oxford Q-Step Centre’s blog


