Estimating the Effect of Feature Selection in Computational Text Analysis

In this post, I discuss pre-processing decisions in relation to an often-used application of text analysis: scaling. I’ll be using a new tool, the preText package for the R statistical software, to investigate the potential effect of different pre-processing options on our estimates. Replication material for this post may be found on my GitHub page.

Feature Selection and Scaling

Scaling algorithms rely on the bag-of-words (BoW) assumption, i.e. the idea that we can reduce a text to its individual words, treat those words as if they were drawn independently from a “bag”, and still gain meaningful insights from the relative distribution of words across a corpus.
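To make the assumption concrete, here is a minimal Python sketch (standard library only). The two sentences and the helper name `bag_of_words` are made up for illustration; they are not taken from the campaign speeches.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("taxes will rise and jobs will fall")
b = bag_of_words("jobs will rise and taxes will fall")

# The two sentences make opposite claims, yet their bags are identical:
# the representation keeps word frequencies and discards word order.
print(a == b)      # True
print(a["will"])   # 2
```

This is exactly the information a scaling model gets to work with, which is why the choice of which words to keep matters so much.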

For the demonstration below, I’ll be using the same selection of campaign speeches as in one of my earlier blog posts, in which I used a couple of simple off-the-shelf tools to measure linguistic complexity in campaign speeches by now-president Donald Trump and the then Democratic nominee Hillary Clinton. Transcripts of Hillary Clinton’s speeches are available here. The press releases section of Donald Trump’s website contains speeches in the form of “remarks as prepared for delivery”.

In scaling models, we reduce our corpus of texts to a document-term matrix (DTM) and consider the variation in term use across the matrix. There are a considerable number of pre-processing decisions we can make at this stage. Here, I’ll follow Denny and Spirling’s working paper on pre-processing for unsupervised learning, and use their preText package for R to show the effects of seven such decisions: removing punctuation, removing numbers, lowercasing, stemming, removing stop words, including n-grams, and removing infrequently used terms. The key challenge here is to retain meaningful features.
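As a rough illustration of how these decisions change the DTM, the Python sketch below builds a toy DTM under switchable options. It is not the preText implementation: the function names, the tiny stop word list and the two example sentences are all assumptions for the sake of the example, and only four of the decisions are covered (stemming, n-grams and infrequent-term removal are left out to keep it short).

```python
import re
from collections import Counter
from itertools import product

STOP_WORDS = {"the", "a", "and", "of", "to", "we", "will"}  # toy list (assumption)

def preprocess(text, remove_punct=True, remove_numbers=True,
               lowercase=True, remove_stop=True):
    """Tokenise a text under a chosen set of pre-processing options."""
    if lowercase:
        text = text.lower()
    if remove_punct:
        text = re.sub(r"[^\w\s]", " ", text)
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    if remove_stop:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

def build_dtm(docs, **options):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(preprocess(d, **options)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

# Made-up example documents, not taken from the speeches
docs = ["We will cut taxes by 15 percent.", "Taxes, taxes, and more TAXES!"]

vocab, dtm = build_dtm(docs)
print(vocab)  # ['by', 'cut', 'more', 'percent', 'taxes']
print(dtm)    # [[1, 1, 0, 1, 1], [0, 0, 1, 0, 3]]

# Skipping lowercasing splits "taxes" into three separate features
print(len(build_dtm(docs, lowercase=False)[0]))  # 7

# Every on/off combination of the four options gives one specification
names = ["remove_punct", "remove_numbers", "lowercase", "remove_stop"]
specs = [dict(zip(names, flags)) for flags in product([True, False], repeat=4)]
print(len(specs))  # 16
```

With all seven binary decisions, the same enumeration yields 2^7 = 128 possible specifications, which is why a systematic comparison is more practical than inspecting each DTM by hand.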

The preText package allows us to estimate the effect of pre-processing on the composition of our DTM by considering how the ranking of pairwise distances between documents changes from one pre-processing specification to another. For each specification, we measure the distance (e.g. cosine or Euclidean) between every pair of documents in the resulting DTM, and look at the degree to which the rank order of these pairwise distances changes. Subsequently, we can consider the top k document pairs that change the most in rank order, and calculate the rank difference for each of these pairs between a given specification and all other specifications. We can then take the mean of these differences across all top k pairs.
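The rank-comparison idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the package’s code: it compares only two specifications rather than averaging over all others, it omits any normalisation the package may apply, and the two small matrices are invented. Rows must refer to the same documents in the same order in both matrices, while the columns (terms) may differ.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 minus the cosine similarity of two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def pair_ranks(dtm):
    """Rank every document pair by distance (rank 1 = closest pair)."""
    pairs = list(combinations(range(len(dtm)), 2))
    ordered = sorted(pairs, key=lambda p: cosine_distance(dtm[p[0]], dtm[p[1]]))
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}

def mean_top_k_rank_change(dtm_a, dtm_b, k):
    """Mean absolute rank change of the k document pairs that move the most."""
    ranks_a, ranks_b = pair_ranks(dtm_a), pair_ranks(dtm_b)
    changes = sorted((abs(ranks_a[p] - ranks_b[p]) for p in ranks_a), reverse=True)
    return sum(changes[:k]) / k

# Toy DTMs for the same three documents under two hypothetical specifications
dtm_a = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]
dtm_b = [[2, 1], [1, 0], [0, 1]]

print(pair_ranks(dtm_a))  # {(1, 2): 1, (0, 1): 2, (0, 2): 3}
print(pair_ranks(dtm_b))  # {(0, 1): 1, (0, 2): 2, (1, 2): 3}
print(mean_top_k_rank_change(dtm_a, dtm_b, k=1))  # 2.0
print(mean_top_k_rank_change(dtm_a, dtm_b, k=2))  # 1.5
```

In this toy case, documents 1 and 2 go from being the closest pair to the most distant one, so that pair dominates the score. A large value signals that the choice of specification reshuffles which documents look alike.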

Results

To generate a measure of influence for a pre-processing decision (i.e. a measure of its effect on our DTM), we can take this preText score as the dependent variable and run a linear regression with all pre-processing decisions included as dummy variables. For my data, the results are shown in the graph below.

Negative coefficients imply that a pre-processing decision does not produce unusual results, while positive coefficients imply that it does (in other words: we should be worried about the estimates that fall to the right of the zero line).
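The regression step itself is ordinary least squares on a set of 0/1 indicators. The Python sketch below shows the mechanics with a small hand-written solver. The scores are synthetic and constructed by me purely for illustration (they are not the results from this post), only three of the seven decisions are included, and no confidence intervals are computed.

```python
from itertools import product

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    for col in range(p):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [row[p] for row in A]

steps = ["punctuation", "numbers", "lowercase"]   # three decisions, for brevity
specs = list(product([0, 1], repeat=len(steps)))  # full factorial: 8 specifications

# SYNTHETIC scores (assumption): a baseline of 0.10, plus 0.05 whenever
# the "lowercase" step is switched on. Real scores would come from preText.
scores = [0.10 + 0.05 * s[2] for s in specs]

X = [[1, *s] for s in specs]  # intercept column plus one dummy per decision
coefs = ols(X, scores)
for name, b in zip(["(intercept)"] + steps, coefs):
    print(f"{name:12s} {b:.3f}")
```

Because the synthetic scores depend only on lowercasing, the regression recovers a coefficient of about 0.05 for that step and about zero for the other two, which is the pattern we would read as "lowercasing produces more unusual results" in the coefficient plot.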

Overall, it seems that our different feature selection options would not substantively change our results in a scaling exercise for this corpus. The preText coefficients are below zero in some cases, and all 95 per cent confidence intervals include zero.

Still, even if we do find that a pre-processing decision has a strong effect (i.e. produces an unusual result), this does not mean that we should not use it. All in all, feature selection should always be informed by theory.

And as the set of tools that we have at our disposal to improve pre-processing quality grows, so should the time we spend thinking about what we are aiming to measure, and what features should therefore be the focus of our attention.

Tags: Feature selection text analysis

Niels Goet
