Forecasting elections with social media? Yes, we can. Almost…

With the failure of traditional forecasting methods to accurately predict the outcomes of the UK General Election of May 2015, can social media based predictions do any better? In this article, Andrea Ceron, Luigi Curini, and Stefano M. Iacus (University of Milan and VOICES from the Blogs) find that supervised and aggregated sentiment analysis (SASA) applied in proportional electoral systems produces the most accurate forecasts of election results.

The exponential growth of social media and social network sites, such as Facebook and Twitter, and their potential impact on real-world politics have increasingly attracted the attention of scholars in recent years. Among other things, researchers have started to explore social media as a device to assess the popularity of politicians, to track the political alignment of social media users, and to compare citizens’ political preferences expressed online with those reported by polls. Analysing social media during an electoral campaign can indeed be very interesting for a number of reasons. Besides being cheaper and faster than traditional surveys, social media analysis can monitor an electoral campaign on a daily (or even hourly) basis. Consequently, it becomes possible to nowcast a campaign, that is, to track trends in real time and capture possible sudden changes (so-called “momentum”) in public opinion, for example in reaction to a TV debate, faster than traditional polls can. Some scholars, however, go even further, claiming that analysing social media allows a reliable forecast of the final result. This is quite fascinating, as forecasting an election is one of the few exercises in social science where an independent measure of the outcome that a model is trying to predict is clearly and indisputably available, i.e., the vote share of candidates (and/or parties) at the ballot box.

To reach this aim, however, at least two challenges need to be overcome. Last year, while attending a conference, we heard a speaker argue that Giuseppe Civati had won the primary election of the Italian Democratic Party, at least on Twitter. The speaker justified this statement by asserting that everyone the speaker followed on Twitter was posting messages in favour of Civati: therefore, Civati should have won! After using VOICES from the Blogs to collect and analyse almost 600,000 tweets that discussed the primary election and were posted in the three weeks leading up to polling day, we can confidently say that this was not the case. In fact, Civati was the third (and therefore, the last) candidate in terms of declared support on Twitter, clearly behind Matteo Renzi as well as Gianni Cuperlo. This example warns us against the risk of political homophily and selective exposure, which is always present regardless of the promise of a virtual world where everyone can freely connect with anyone else.

Moreover, drawing a random sample from Internet Big Data is extremely complex, more so than in traditional surveys. There is no comprehensive phone list of the entire Internet community to which standard sampling techniques could be applied. In addition, no reliable information about the individual traits of social media users is currently accessible, which makes a stratified sample unfeasible. However, unlike traditional surveys, where we have to rely on a sample precisely because analysing the universe is unattainable, with social media the entire universe is in principle available, at least the universe of public posts. Let’s leave aside the technical challenge of getting access to such a “universe” (a far from irrelevant task), and let’s suppose that we were able to collect it. The difficult part for the researcher would only then begin: how does one analyse such a large amount of data? How would one extract politically significant meaning from the data?

This is clearly a methodological problem. For example, is it enough to count the volume of data related to candidates or parties to try to predict the final electoral result? Let us revisit the example of the Italian primary election, but this time concentrate on the 2012 centre-left election. In November 2012, Matteo Renzi had approximately 73,000 mentions on Twitter (i.e., posts that contained the word “Renzi”), while Pierluigi Bersani reached approximately 26,000 mentions. According to these numbers, Renzi should have comfortably beaten Bersani, with approximately 73 per cent of the two candidates’ combined mentions; however, Bersani won at the polls with a 10 per cent margin in the first round (and over 20 per cent in the second round). Of course, this should not be that surprising. Indeed, the number of mentions is indicative only of notoriety (positive or negative alike), not of the popularity of, or the (potential) support (at least online) for, a politician (Ceron et al., 2015a).
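The arithmetic behind this volume-based reading can be sketched in a few lines of Python. The mention counts are the approximate figures quoted above; treating the share of mentions as a predicted vote share in a two-way comparison is the naive assumption being illustrated here, not a recommended method, and the function name is purely illustrative.

```python
# Approximate November 2012 Twitter mention counts quoted in the text.
mentions = {"Renzi": 73_000, "Bersani": 26_000}


def mention_shares(counts):
    """Return each name's percentage share of the total mentions."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


shares = mention_shares(mentions)
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of mentions")
```

On these counts Renzi receives roughly 74 per cent of the two candidates' combined mentions, which is the opposite of what happened at the polls.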

We recently conducted a meta-analysis of 219 social media electoral forecasts related to 89 different elections held between 2008 and 2015 (Ceron et al., 2015b). Overall, the Mean Absolute Error (MAE) of social-media-based predictions was higher than 7. Note that survey polls in the same subset of elections produced an MAE slightly lower than 2. Compared to surveys, the predictive power of social media therefore appears rather poor prima facie. However, in some cases, social media predictions were actually comparable to (if not better than) survey polls.
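For readers unfamiliar with the metric, the MAE of a forecast is the average absolute gap between predicted and actual vote shares across parties or candidates. A minimal Python sketch follows; the party names and figures are invented for illustration and do not come from the meta-analysis.

```python
def mean_absolute_error(forecast, actual):
    """Average absolute difference between forecast and actual shares."""
    if forecast.keys() != actual.keys():
        raise ValueError("forecast and actual must cover the same parties")
    return sum(abs(forecast[p] - actual[p]) for p in actual) / len(actual)


# Hypothetical vote shares (per cent) for three made-up parties.
forecast = {"Party A": 40.0, "Party B": 35.0, "Party C": 25.0}
actual = {"Party A": 45.0, "Party B": 30.0, "Party C": 25.0}

mae = mean_absolute_error(forecast, actual)
print(f"MAE = {mae:.2f}")
```

Here the forecast misses two parties by 5 points each and gets the third exactly right, so the MAE is 10/3, or about 3.33.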

Our aim has therefore been to ascertain the reasons that could explain the accuracy of the electoral forecast, focusing in particular on the method adopted to analyse social media. We differentiated between three approaches. The first comprises computational approaches, based either on volume data (such as the number of mentions related to a party or candidate, or the occurrence of particular hashtags) or on endorsement data (such as the number of Twitter followers, Facebook friends or “likes” received on Facebook walls). The second comprises sentiment analysis approaches, which pay attention to the language and try to attach a qualitative meaning to the comments (posts, tweets) published by social media users by employing automated tools for sentiment analysis (i.e., natural language processing models or pre-defined ontological dictionaries). The third is what we call supervised and aggregated sentiment analysis (SASA), that is, techniques that exploit human coding in their process and focus on estimating the aggregated distribution of opinions, rather than on the individual classification of each single text (Ceron et al., 2016).

In more detail, the SASA method is based on a two-stage process (Ceron et al., 2015a). In the first stage, human coders read and code a subsample of the documents. This subsample, which has no particular statistical property, represents a training set that will be used by the second stage of the algorithm to classify all the unread documents (the test set). In the second stage, the aggregated statistical estimation of the SASA algorithms extends the accuracy of human coding to the whole population of posts, allowing one to properly recover the opinions expressed on social networks.

The SASA approach, first introduced by Hopkins and King (2010), aims to solve two different problems. First, users on social media use natural language, which evolves continuously and varies depending on who is actually writing (male, female, young, old, officer, journalist, etc.) and on the particular topic (soccer, politics, music, etc.). In addition, metaphorical or ironic sentences as well as jargon, contractions and neologisms are used in different and new ways every time. This puts under stress all unsupervised methods based on ontological dictionaries, as well as statistical methods based on natural language processing (NLP) models, when it comes to accurately capturing sentiment. For these reasons, supervised human coding of a training set is a cornerstone of the SASA methodology. Human coding, in fact, reduces misclassification errors, given that human coders are more effective than ontological dictionaries in recognising all the specificities of the language and in interpreting the texts and the author’s attitude. Second, by directly estimating the aggregated distribution of opinions, SASA produces more reliable aggregate results in a context where the aggregate (i.e., the final vote share of parties and/or candidates) is what concerns us.
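The idea of estimating the aggregate distribution directly, rather than classifying each text, can be illustrated with a deliberately simplified Python sketch. It assumes only two opinion categories and uses the presence of single words as features. The function names, the toy texts and the closed-form least-squares step are illustrative assumptions in the spirit of Hopkins and King (2010), not the actual SASA implementation. The sketch relies on the identity that the rate at which a word appears in the uncoded texts is a mixture of its rates in positive and negative texts, weighted by the unknown share of positive texts.

```python
def presence_rates(docs, vocab):
    """For each word in vocab, the share of docs that contain it."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    return [sum(w in tokens for tokens in token_sets) / len(docs) for w in vocab]


def estimate_positive_share(coded_pos, coded_neg, uncoded, vocab):
    """Estimate the share of positive texts in `uncoded` without labelling any.

    Solves s = p * a + (1 - p) * b for p by least squares, where a and b are
    word-presence rates in the hand-coded positive and negative texts and s is
    the word-presence rate in the uncoded texts.
    """
    a = presence_rates(coded_pos, vocab)
    b = presence_rates(coded_neg, vocab)
    s = presence_rates(uncoded, vocab)
    denom = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    if denom == 0:
        raise ValueError("vocabulary does not separate the two categories")
    p = sum((si - bi) * (ai - bi) for si, ai, bi in zip(s, a, b)) / denom
    return min(1.0, max(0.0, p))  # keep the estimate a valid proportion


# Toy data, invented for illustration.
coded_pos = ["great candidate", "great speech"]
coded_neg = ["awful candidate", "awful speech"]
uncoded = ["great rally", "great debate", "great plan", "awful rally"]

share = estimate_positive_share(coded_pos, coded_neg, uncoded, ["great", "awful"])
print(f"Estimated positive share: {share:.2f}")
```

With these toy inputs the estimate is 0.75, matching the three-to-one split of the uncoded texts, even though no individual text was ever classified.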

Our meta-analysis, in this respect, shows that SASA increases the accuracy of the forecasts by a remarkable 3.7 points compared with forecasts based on a purely computational approach, and by 2.6 points compared with other sentiment analysis techniques based on ontological dictionaries, which are no more effective than computational methods in improving the accuracy of the prediction.

Although highly relevant, the method is not the only factor affecting the accuracy of the prediction. The electoral system matters too. When elections are held under proportional representation, social media forecasts are remarkably more precise. This effect is due to the lower incentive to cast a strategic vote. Because every vote counts in proportional electoral systems, citizens are freer to behave according to their sincere preferences. As a consequence, we observe a higher congruence between opinions expressed online and actual voting behaviour. Conversely, when there is an incentive to behave strategically, the analysis of the opinions expressed online becomes less relevant, because voters may express their sincere preference online while casting a strategic vote at the polls. This suggests that when some elements foster coherence between online opinions and offline behaviour, the accuracy of social-media-based predictions is heightened. The fact that our analysis consistently shows a lower error in elections with a high turnout and, at the same time, a huge volume of comments points in the same direction.

In sum, despite the well-known limits and challenges faced by social media analysis, there are reasons to be optimistic about the capability of sentiment analysis to become (if it is not already) a useful complement to traditional offline polls. But in this respect a word of caution is needed: Big Data is likely to contribute only so long as the desired qualities of the data are not negatively correlated with the quantity of data (Clark and Golder, 2015). The method employed, as well as the (institutional) context in which you run the analysis, makes a difference!

References

Ceron, Andrea, Luigi Curini and Stefano M. Iacus (2015a). “Using sentiment analysis to monitor electoral campaigns: method matters. Evidence from the United States and Italy”, Social Science Computer Review, 33(1): 3-20.

Ceron, Andrea, Luigi Curini and Stefano M. Iacus (2015b). “Social Media and Elections. A meta-analysis of online-based electoral forecasts”, in Kai Arzheimer, Jocelyn Evans and Michael Lewis-Beck (eds.), The Handbook of Electoral Behaviour, Sage, forthcoming.

Ceron, Andrea, Luigi Curini and Stefano M. Iacus (2016). Forecasting and Nowcasting Elections Using Social Media: Just By Chance? London: Ashgate, forthcoming.

Clark, William Roberts and Matt Golder (2015). “Big Data, Causal Inference, and Formal Theory: Contradictory Trends in Political Science?”, PS: Political Science & Politics, 48(1): 65-70.

Hopkins, Daniel J. and Gary King (2010). “A method of automated nonparametric content analysis for social science”, American Journal of Political Science, 54(1): 229-247.

This post is part of our Decision 2015 series. Corresponding author: Luigi Curini (luigi.curini@unimi.it).
