Investigating Bias Between Popular News Sources with Unsupervised
Topic Modeling and Sentiment Analysis.
All analyses done using Python.
Background
The media plays a vital role in society, keeping people informed and educated. What
the media covers provides citizens with knowledge of current and important events. What
the media covers, however, can also lead to a disproportionate idea of how frequent
or significant an event may be, a phenomenon also referred to as the availability heuristic.
Recently, there has been a large emphasis on 'doom scrolling', or
spending considerable amounts of time focusing on negative news. This is not without
consequences, as research has started to show that news can not only shape our political
beliefs, such as attitudes towards immigrants and other countries, but can also potentially
influence the health of the economy, as well as the mental and physical health of the consumer.
Data Collection
News Articles
The dataset used for this project is the NELA-GT-2020 dataset from the Harvard Dataverse.
For my investigation, I filtered the data to include only articles published by
six popular news sources in the United States: CBS, CNN, Fox News, The New York Times, USA Today,
and the Washington Post. Due to the size of the dataset, I randomly selected 3,000 articles from each news source.
Below is a snapshot of the data prior to any text preprocessing.
Also, at this time I've decided to analyze only news headlines. Analyzing the article contents is
something I'll continue to explore; however, given the length of full articles, labeling an article's sentiment is
difficult because many different sentiments are often expressed throughout an article, and multiple topics
may be discussed. For simplicity, I've presented only results from the analysis of headlines.
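A rough sketch of the filtering and sampling step described above is shown below. It assumes the articles have already been loaded into a pandas DataFrame; the file name, source identifiers, and column names are placeholders rather than the dataset's exact schema.

```python
import pandas as pd

# Assumed layout: one row per article with 'source' and 'title' columns
articles = pd.read_csv("nela_gt_2020_articles.csv")

sources = ["cbsnews", "cnn", "foxnews", "newyorktimes", "usatoday", "washingtonpost"]
subset = articles[articles["source"].isin(sources)]

# Randomly sample 3,000 articles per source for a balanced comparison
sampled = (
    subset.groupby("source", group_keys=False)
          .apply(lambda g: g.sample(n=3000, random_state=42))
          .reset_index(drop=True)
)
headlines = sampled["title"].tolist()
```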
Data Preparation
Text Preprocessing
The functions used to preprocess the text are available in the NLTK library.
To prepare the data, I performed standard text preprocessing steps prior to
passing the data into my models. This included stop word removal, contraction
expansion, conversion to lowercase, and word tokenization. After
tokenization, bigrams and trigrams were created to account for common phrases
whose words tend to occur together. Finally, word lemmatization was performed on the tokenized text.
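A minimal sketch of these steps is shown below, assuming `headlines` holds the raw headline strings. Contraction expansion is omitted for brevity, and the phrase-detection thresholds are assumptions rather than the exact values used.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases, Phraser

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(headline):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", headline.lower())
    return [t for t in tokens if t not in stop_words and len(t) > 2]

tokenized = [tokenize(h) for h in headlines]

# Learn frequently co-occurring bigrams and trigrams, then lemmatize the tokens
bigram = Phraser(Phrases(tokenized, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[tokenized], threshold=10))
processed = [[lemmatizer.lemmatize(t) for t in trigram[bigram[doc]]] for doc in tokenized]
```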
The Gensim package was then used to create a dictionary of tokens. The dictionary was
subsequently filtered to exclude any tokens that appeared in fewer than 15 headlines or
in more than 50% of the headlines, as such tokens are unlikely to help separate topics.
Following the text preprocessing, I created two different corpus objects to pass into my
topic models: the commonly used Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF)
representations.
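A sketch of the dictionary and corpus construction with Gensim, continuing from the `processed` token lists above:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# Token dictionary, dropping tokens that appear in fewer than 15 headlines
# or in more than 50% of headlines
dictionary = Dictionary(processed)
dictionary.filter_extremes(no_below=15, no_above=0.5)

# Bag-of-Words corpus: (token_id, count) pairs per headline
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]

# TF-IDF corpus derived from the Bag-of-Words counts
tfidf = TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
```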
Topic Modeling
Latent Dirichlet Allocation (LDA)
LDA is an unsupervised topic modeling algorithm. It assumes each document is a mixture of topics,
and each topic is a probability distribution over specific words. Thus the representation of each
topic is a combination of words, instead of a single word. The LDA algorithm evaluates the words
in each document and labels each document as a combination of topics whose contributions sum to one.
Each document can then be assigned to the topic that makes up the majority of its mixture.
I've read conflicting opinions on whether or not using the TF-IDF corpus as input to LDA models is appropriate;
however, I have used the topic modeling output from the Bag of Words corpus regardless.
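A sketch of fitting the LDA model with Gensim and reading off a document's dominant topic. The `passes` and `random_state` values are assumptions; the 35 topics match the number selected later from the coherence scores.

```python
from gensim.models import LdaModel

lda_bow = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=35,
    passes=10,
    random_state=42,
)

# Topic mixture for the first headline; the dominant topic is the one
# with the largest share of the distribution
doc_topics = lda_bow.get_document_topics(bow_corpus[0])
dominant_topic, share = max(doc_topics, key=lambda pair: pair[1])
```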
Latent Semantic Indexing (LSI)
LSI is another unsupervised topic modeling algorithm for learning hidden topics in text.
It reduces the dimensionality of the corpus matrix via Singular Value Decomposition into a k-dimensional
matrix, where k is the number of topics for the document set. The rows of this matrix represent document
vectors, with each column (topic) coordinate representing the role that topic plays in that document.
Similar documents (in terms of topics) are thus nearer to each other within the k-dimensional space.
The output of the LSI algorithm differs from that of the LDA models in that it does not provide a dominant
topic for each document. Instead, it provides an overall set of topics that encompass the document set and groups
similar documents together in space.
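A sketch of fitting the LSI model on the TF-IDF corpus; the 50 topics match the number selected later.

```python
from gensim.models import LsiModel

lsi = LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=50)

# Each headline maps to a vector of topic coordinates; similar headlines
# end up close together in this 50-dimensional space
first_headline_vector = lsi[tfidf[bow_corpus[0]]]
print(lsi.print_topics(num_topics=5, num_words=8))
```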
Topic Modeling Results
Parameter Selection
For each model, I used the coherence score at each candidate number of topics to evaluate
model performance. In general, the higher the coherence, the better the model.
Additionally, the LDA models can be evaluated with the
perplexity measure, which I also examined. Perplexity, in contrast to coherence, is
smaller in better performing models.
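A sketch of computing both measures for a fitted LDA model with Gensim; the `c_v` coherence variant is an assumption.

```python
from gensim.models import CoherenceModel

# Topic coherence: higher is better
coherence = CoherenceModel(
    model=lda_bow,
    texts=processed,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()

# Gensim reports a per-word log perplexity bound; lower raw perplexity is better
log_perplexity = lda_bow.log_perplexity(bow_corpus)
print(coherence, log_perplexity)
```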
Because these models are unsupervised, however, it's important to visually inspect the topics produced by each algorithm. After doing this, I decided to select 35 topics for the LDA Bag of Words model, 60 topics for the LDA TF-IDF model, and 50 topics for the LSI model.
Topics
Each model produced the specified number of topics. As mentioned above, the
LDA models assign each document to a specific topic, whereas the LSI model provides
a set of topics for the entire document set. Below are snapshots of the topic
output from each model. Below these snapshots is an interactive visualization of the LDA TF-IDF
topic model built with the pyLDAvis package and its Gensim integration. It uses dimensionality
reduction (principal coordinate analysis by default) to project the topics onto a 2-D space. This
visualization indicates how well the topics separate; the goal is to minimize the overlap
between different topics. There is some overlap within my visualization, however for this
project I felt comfortable continuing with the model.
Most Frequent Topics from the LDA Bag of Words Model:
Most Frequent Topics from the LDA TF-IDF Model:
First Five Topics from the LSI TF-IDF Model:
Visualizing the LDA Bag of Words Topic Model
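A sketch of how this interactive visualization can be generated, shown here with the Bag of Words LDA model from above. The module path `pyLDAvis.gensim_models` applies to recent pyLDAvis releases; older versions used `pyLDAvis.gensim`.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Prepare and save the interactive topic map for the fitted LDA model
vis = gensimvis.prepare(lda_bow, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```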
Sentiment Analysis
Before performing sentiment analysis, I looked through the topics output by the LDA Bag of Words topic
model and selected five topics that I thought might present conflicting opinions between different
news sources. I then used two sentiment analysis algorithms (VADER and TextBlob) to evaluate
the news headlines within my dataset.
Valence Aware Dictionary and sEntiment Reasoner (VADER)
VADER is a commonly used lexicon- and rule-based algorithm available through the NLTK library. It was
initially developed for the analysis of social media posts, but has been shown to perform well with
other types of text. To perform sentiment analysis, I passed in the
unprocessed headlines, and the VADER algorithm performs the preprocessing steps itself. These
preprocessing steps are comparable to the steps I performed for topic modeling above.
VADER uses a pre-defined dictionary that contains emotionally charged words with weights
reflecting the strength or intensity of each word. Examples of positive words might be 'great', 'amazing',
and 'beautiful', but terms such as 'bad ass' are also accounted for in the
dictionary and noted as positive. VADER also accounts for negation words such as
'not', which can flip the direction of the sentiment in a text.
The algorithm evaluates the combination of positive, neutral, and negative terms and
outputs an overall polarity score between [-1, 1]. Scores below -0.05 indicate
a negative sentiment, whereas scores above 0.05 indicate a positive sentiment. Scores within
[-0.05, 0.05] are defined as neutral.
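A minimal sketch of labeling a headline with VADER's compound score, using the same ±0.05 thresholds described above (the example headline is made up):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

def label_sentiment(headline, threshold=0.05):
    """Map VADER's compound score onto negative / neutral / positive."""
    compound = analyzer.polarity_scores(headline)["compound"]
    if compound < -threshold:
        return "negative"
    if compound > threshold:
        return "positive"
    return "neutral"

print(label_sentiment("Stocks plunge as recession fears grow"))
```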
TextBlob
TextBlob is another lexicon- and rule-based algorithm. It utilizes the NLTK and Pattern
libraries to perform a wide variety of NLP tasks. For sentiment analysis, it largely
performs the same steps as the VADER algorithm, but uses a different dictionary of
defined terms and weights.
In addition to evaluating the sentiment of a given text, TextBlob can evaluate its
subjectivity, which I also do in this project.
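A minimal sketch of scoring a headline with TextBlob (again, the example headline is made up):

```python
from textblob import TextBlob

headline = "Economy shows surprising resilience despite grim forecasts"
sentiment = TextBlob(headline).sentiment

# polarity is in [-1, 1]; subjectivity is in [0, 1]
print(sentiment.polarity, sentiment.subjectivity)
```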
Sentiment Analysis Results
VADER
Below is an image of the output from the VADER algorithm. Each headline is labeled as displaying
positive, neutral, or negative sentiment.
The combined output for the VADER algorithm is presented in the figures below. Displayed are the overall
sentiments of all headlines in the data, as well as the sentiments within five topics
selected from the LDA Bag of Words topic model. The title of each graph is the topic number
followed by the keywords associated with that topic. Additionally, each topic references the
single headline that is most representative of it, which is also included in the title of
each graph. These graphs are adjusted for the number of headlines each source published within
that topic, to allow a fairer comparison.
For the most part, it seems that the majority sentiment is actually neutral, which is somewhat
surprising to me. However, this is using the predefined threshold of 0.05 to define
neutrally labeled sentiments. This threshold can be increased or decreased, resulting
in headlines being labeled as neutral more often or less often, respectively.
Overall, it does appear that USA Today provides the most frequently positive headlines, whereas
CBS and the New York Times provide the most frequently neutral headlines. Fox News and the
Washington Post, on the other hand, are most frequently negative.
Within each of the five topics, there does seem to be some disparity in the sentiment labels, which I
find interesting. I won't discuss each topic specifically, but the figures below display each
topic and source clearly.
Sentiment Analysis Output: VADER Algorithm
Sentiment Analysis Output for Five Chosen Topics: VADER Algorithm
TextBlob
Interestingly, the sentiment analysis results were strikingly different for TextBlob than for
the VADER algorithm. TextBlob was
considerably more likely to label headlines as neutral, while almost never labeling a headline
as negative. When evaluating the source code, the only considerable difference
I could see between how the two models work was that they each use different dictionaries,
and thus the emotional words and associated weights differ. As my analyses are unsupervised,
it's difficult to quantitatively evaluate the performance of each model; however, I found the VADER
algorithm to be more accurate upon personal evaluation.
For TextBlob, I've chosen to present the overall polarity (sentiment) scores as well as the
subjectivity scores. Polarity scores fall between [-1, 1]: values less than zero are defined
as having a negative sentiment, scores greater than zero indicate a positive sentiment, and a score
of zero represents neutral. Subjectivity scores fall between [0, 1], with 0 meaning 'very objective'
and 1 meaning 'very subjective'.
In the images below, it's apparent that as headlines become increasingly
polar (emotional), they become more subjective. I've also included box plot figures below to more
accurately present the polarity and subjectivity data. The median headline for each source is relatively
neutral in sentiment, with CBS and the New York Times displaying the tightest distributions around the
median. For the subjectivity data, CBS and the New York Times show the highest level of objectivity (least
subjective) in their headlines, while USA Today and the Washington Post display more subjective headlines
(although still relatively objective).
Polarity by Subjectivity Scores for each News Source: TextBlob
Boxplot view of Subjectivity and Polarity Scores for News Headlines
Closing
Overall, I felt that the results from this project were very interesting, as there do seem to
be some differences between sources in their sentiments toward different topics. 2020 was a unique year
in which a few topics seem to have dominated the headlines more than in other years, so it would be interesting
to extend this analysis to earlier years.
Ideally, I'd like to continue working on this while researching different algorithms that would enable me to
more precisely measure 'doomscrolling'. One limitation of
unsupervised learning is that there are no labels I can use to train the algorithms and improve the
sentiment classification of each headline. Further, I'd like to find methods that would let me evaluate the
entirety of each news article more accurately, not just the headline. It may be beneficial to explore different
topic modeling algorithms such as HDP and NMF, which I plan to study next. There are also many other sentiment
analysis algorithms which may be better suited for longer text input such as entire news articles.