What Big Data can teach political scientists

Big Data is now a buzzword in the political science field. Some might call this hype. Others see unlocking the power of “Big Data” as the most significant transformation in research this century.

In the world of research, Big Data seems to be living up to its promise. And the results include a wave of new and inspiring projects.

What is Big Data?

Big Data is not simply research that uses a large set of observations. It might be thought of as re-imagining large-N inquiries, dealing with hundreds of thousands and, in some cases, even millions of observations. Big Data means giant N.

But it is more than a question of quantity. In their well-known TED Talk, Erez Lieberman Aiden and Jean-Baptiste Michel helpfully distinguish between two axes of research: practical and, for want of a better word, “awesome”. They suggest practicality is still the core of Big Data, which pushes the boundaries of what is technically feasible. Big Data analysis, as a field, urges advances in computing, methods, data availability, mathematics and more. These advances allow us to push projects further.

With “awesome”, Aiden and Michel harness an Americanism with good cause: Big Data allows us to engage with big transformations over the longue durée. We can now investigate lengthy trends that political scientists are usually ill-equipped to map with traditional datasets.

Moreover, this has affected the nature and nuance of possible research questions. Instead of focusing on what an individual politician says, we can assess millions of speeches over hundreds of years to show how political language changes over time, or how a specific kind of contentious issue develops. With millions of observations at our disposal, and with methods that are applied properly, we can observe previously unseen patterns; analyse political behaviour at both the aggregate and grassroots levels; and capture previously unexplored phenomena.

While Big Data analysis pushes practitioners to the extremes on both the feasibility and the “awesome” axes, it does so with fewer barriers to entry. What was impossible two decades ago is practicable in a matter of minutes with little more than a laptop, free statistical software such as R, or knowledge of simple programming languages like Python.

It is therefore not surprising that an increasing number of papers employ Big Data methods. These range from the use of House of Commons speeches to trace ministerial responsiveness (Eggers and Spirling 2014), to new measures of political ideology based on 100 million observations of financial contribution records (Bonica 2014).

The Benefits of Big Data

The future of exploratory science lies along the extremes of these two axes. Using Big Data, we can engage with some of the biggest questions in a variety of scientific disciplines, including the social sciences, humanities, and natural sciences.

This fevered trend is driven by two exciting developments. The first is the improved speed in capturing data, which currently doubles every year. The second is the “rapidly advancing techniques of artificial intelligence, whether natural language processing, pattern recognition or machine learning”.

More specifically, Big Data advances political science research in three important ways. First, Big Data aids hypothesis generation. With the availability of masses of new data and our newfound ability to manipulate and investigate it quickly and cheaply, we can observe patterns that we have not observed before. From this improved ability to describe and explore data comes the possibility of generating new and interesting hypotheses.

Second, Big Data helps identify instrumental variables. When political scientists cannot measure the phenomenon of interest directly, they sometimes use a proxy (or “instrument”) that is closely correlated with the variable of interest. According to Clark and Golder, “Big data can help to the extent that it makes previously unobservable variables observable, thereby reducing the need for an instrument, or by making new potential instruments available” (2015: p. 67).

Third, Big Data allows us to scale research up and down more effectively. We can design experiments on a scale previously impossible in the social sciences thanks to “granular data” (Grimmer 2015), while expanding hand-coded material into larger datasets with machine learning. At the same time, with more data across a greater array of contexts, researchers can formulate and test hypotheses at a more detailed level.

Examples: What can we do with Big Data?

Big Data analysis can be applied in many interesting ways. Below are just two examples.

Forecasting: Measuring Political Sentiment

A growing number of political scientists rely on Twitter data to measure political preferences. Twitter is a massive data stream, with some 200 billion tweets per year.

We can analyse this data quite cheaply with a technique called supervised and aggregated sentiment analysis (SASA). First, a large subset of the texts (in this case, tweets) is analysed by human coders and scored according to the sentiment each text conveys. Usually, coding distinguishes between positive and negative sentiments, but a more complex scheme can be used as well. Next, the coded data is fed into an algorithm that “learns” what distinguishes a positive text from a negative one. Finally, that algorithm is applied to the whole dataset.
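The three steps above (hand-code a subset, train an algorithm, apply it to everything) can be sketched in a few lines of Python. This is a minimal illustration only: it uses a simple naive Bayes classifier as a stand-in, the tweets and labels are invented, and published SASA implementations may estimate the aggregate distribution in a different way.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real pipelines clean text far more carefully.
    return text.lower().split()

def train(labelled):
    """Learn word counts per sentiment from (text, label) pairs coded by humans."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of hand-coded texts
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the most probable label for one uncoded text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus log likelihood, with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def aggregate(texts, model):
    """Share of texts assigned to each sentiment category."""
    labels = Counter(classify(t, model) for t in texts)
    return {label: n / len(texts) for label, n in labels.items()}

# Hypothetical hand-coded tweets, invented for illustration.
coded = [
    ("great speech strong candidate", "positive"),
    ("love this candidate great plan", "positive"),
    ("terrible debate weak candidate", "negative"),
    ("awful plan terrible speech", "negative"),
]
model = train(coded)
uncoded = ["great plan", "weak and awful", "strong speech"]
shares = aggregate(uncoded, model)
```

In this toy run, `shares` holds the estimated proportion of positive and negative tweets in the uncoded set, which is the kind of aggregate quantity a forecaster would track over a campaign.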

How can political scientists use this information? If we can gauge how people feel about a particular phenomenon based on tweets, we can similarly use tweets to measure their sentiments toward political candidates. This is useful for forecasting election outcomes. In an earlier blog post on this website, for example, Andrea Ceron, Luigi Curini, and Stefano M. Iacus discuss their use of SASA to analyse the 2012 Italian primary election.

Measuring Political Ideology

Computational text analysis has recently been applied to large sets of speeches from parliaments to measure the ideological position of legislators (e.g. Lowe and Benoit, 2013; Schwarz et al., 2015). Traditionally, political scientists have relied on vote records in order to estimate ideal points. But in parliamentary systems, where party discipline is high and debate is relatively open, such methods are not particularly effective: voting is often strategic and reveals little in terms of ideology.

Debates, on the other hand, can yield significant textual data that can be analysed and used to estimate ideological positions. The digitisation efforts of some parliaments (including the UK and the US legislatures) have made huge sets of speech data available to researchers, dating from the early nineteenth century. With existing algorithms, millions of speeches can be scaled in a matter of hours.

Researchers use computer algorithms such as Wordscores or Wordfish to code this mass of textual data. Both programs rely on relative word frequencies. The former is a so-called “supervised” method and requires an expert to code two reference texts (each at either extreme of the political spectrum). The algorithm subsequently scales speeches (“virgin texts”) according to the similarity of word use compared to the reference texts.
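The Wordscores logic described above can be illustrated with a short Python sketch. It is a simplified reading of the method, not the reference implementation: the function names and the tiny example texts are invented here, and the rescaling of raw scores that is commonly applied afterwards is left out.

```python
from collections import Counter

def rel_freq(text):
    """Relative frequency of each word in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_scores(references):
    """Score each word from reference texts given as (text, expert_score) pairs."""
    freqs = [(rel_freq(text), score) for text, score in references]
    vocab = {w for f, _ in freqs for w in f}
    scores = {}
    for w in vocab:
        total = sum(f.get(w, 0.0) for f, _ in freqs)
        # Weight each reference score by the probability that a reader who
        # sees this word is reading that reference text.
        scores[w] = sum(f.get(w, 0.0) / total * s for f, s in freqs)
    return scores

def score_text(text, scores):
    """Frequency-weighted mean word score of a virgin text (scored words only)."""
    freqs = {w: f for w, f in rel_freq(text).items() if w in scores}
    total = sum(freqs.values())
    if total == 0:
        return None  # no overlap with the reference vocabulary
    return sum(f / total * scores[w] for w, f in freqs.items())

# Hypothetical reference texts at the two extremes (-1 = left, +1 = right).
references = [
    ("welfare equality welfare public", -1.0),
    ("markets freedom markets public", 1.0),
]
scores = word_scores(references)
position = score_text("welfare public markets markets", scores)
```

Words used only by one reference text inherit that text's score, while words shared equally (here "public") land in the middle, so a virgin text is placed according to which vocabulary it leans on.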

With Wordfish, which falls into the “unsupervised” category, the algorithm estimates an underlying latent dimension itself, and places speeches (or other texts) in this one-dimensional space. Here, the challenge of validating measures lies at the post-estimation stage, where researchers have to demonstrate that they are actually capturing a dimension of conflict.
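The latent dimension that Wordfish estimates comes from a Poisson model of word counts. The notation below is one common way of writing that model; the symbol names are chosen here for illustration rather than taken from the text above.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\lambda_{ij} = \exp\left(\alpha_i + \psi_j + \beta_j \theta_i\right)
```

Here $y_{ij}$ is the count of word $j$ in text $i$, $\alpha_i$ absorbs differences in text length, $\psi_j$ absorbs how common word $j$ is overall, $\beta_j$ captures how strongly word $j$ discriminates along the latent dimension, and $\theta_i$ is the estimated position of text $i$. Nothing in the model guarantees that $\theta_i$ is an ideological dimension, which is why validation after estimation matters.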

Despite the expected challenges of any new research regime, Big Data offers a promising new avenue for gauging political preferences in parliament—information that should benefit research on institutional change, decision-making and many other areas.

Harnessing the power of Big Data

Big Data is not immune to critique. One of the most important challenges with Big Data is, somewhat paradoxically, its scale. As the number of observations increases, so does the risk of false positives (type I errors). That’s why Justin Grimmer, Assistant Professor at Stanford, calls for Big Data scientists to become social scientists. I could not agree more. Expertise is necessary to make sense of all the data: no computer algorithm can substitute for a deep understanding of the subject matter, nor can it replace sound causal inference.
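One way to make the false-positive risk concrete is to look at what happens when a large dataset invites many separate tests. The sketch below assumes independent tests at a fixed significance level, which is an assumption added here for illustration rather than a claim made above, and the function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests

# Chance of at least one spurious finding for 1, 10 and 100 tests.
rates = {m: familywise_error_rate(m) for m in (1, 10, 100)}
```

Under these assumptions a single test carries a 5% chance of a spurious finding, but the chance of at least one spurious finding grows quickly with the number of tests. A correction such as the Bonferroni threshold is one simple guard, though it is no substitute for a sound research design.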

As political scientists, we need to think deeply about how to scale up our research and think critically about the questions we are asking, the hypotheses that we formulate, the causal claims we assert, and—generally—how we design our research.

If we get it right, we can harness the power of millions, or even billions of observations. And just imagine what questions we may answer. “We’re really just getting under way,” says Gary King, the renowned Harvard statistician. “But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”

References

Bonica, Adam (2014). “Mapping the Ideological Marketplace”. American Journal of Political Science 58 (2): pp. 367–386.

Clark, William Roberts and Matt Golder (2015). “Big Data, Causal Inference, and Formal Theory: Contradictory Trends in Political Science?” PS: Political Science & Politics 48: pp. 65–70.

Eggers, Andrew C. and Arthur Spirling (2014). “Ministerial Responsiveness in Westminster Systems: Institutional Choices and House of Commons Debate, 1832–1915”. American Journal of Political Science 58 (4): pp. 873–887.

Grimmer, Justin (2015). “We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together.” PS: Political Science & Politics 48: pp. 80–83.

Lohr, Steve (11 February 2012). “The Age of Big Data.” The New York Times. Available at: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html.

Monroe, Burt L. (2011). “The Five Vs of Big Data Political Science: Introduction to the Virtual Issue on Big Data in Political Science.” Political Analysis 19: pp. 66–86.

Lowe, Will and Kenneth Benoit (2013). “Validating Estimates of Latent Traits from Textual Data Using Human Judgment as a Benchmark”. Political Analysis 21 (3): pp. 298–313.

Slapin, Jonathan B. and Sven-Oliver Proksch (2008). “A Scaling Model for Estimating Time-Series Party Positions from Texts”. American Journal of Political Science 52 (3): pp. 705–722.

Schwarz, Daniel, Denise Traber, and Kenneth Benoit (2015). “Estimating Intra-Party Preferences: Comparing Speeches to Votes”. Political Science Research and Methods FirstView: pp. 1–18.

A Special Series: Advances in Political Science Methods

This article is part of our Advances in Political Science Methods series. Big data, computer science, experimental methods, and computational text analysis are part of an ever-growing range of methods embraced by political science. This new series, co-hosted by the Oxford University Politics Blog and the Oxford Q-Step Centre, is all about “methods”.

What advances have we seen in recent years? What can we learn today that we could not a decade ago? And what is the future of methods in political science? Find out more in our Advances in Political Science Methods series.

Niels Goet
