I want to tackle sentiment analysis using R in a simple way, just to get me started. With this in mind, we begin by loading all the packages we’ll be using.

library(readr)
library(dplyr)
library(tidytext)
library(tokenizers)
library(stopwords)
library(ggplot2)

Then we need to load our dataset. The data comes from the Kaggle Fake and Real News dataset.

# read both files with readr's read_csv so they load as tibbles
Fake <- read_csv('~/fakenews/Fake.csv')
True <- read_csv('~/fakenews/True.csv')

I want to merge the two datasets, but first we need to create a new column that tells us which dataset each article came from.

Fake$news <- 'fake'
True$news <- 'real'

data <- rbind(Fake,True)

Now we can start the data cleaning. As a first pass, we’ll do a simple tokenization of the title and text variables and then remove the stopwords using the snowball list from the stopwords package.

title <- tibble(news = data$news, text = data$title)

corpus <- tibble(news = data$news, corpus = data$text)

tidy_title <- title %>%
  unnest_tokens(word, text, token = 'words') %>%
  filter(!(word %in% stopwords(source = 'snowball')))

tidy_corpus <- corpus %>%
  unnest_tokens(word, corpus, token = 'words') %>%
  filter(!(word %in% stopwords(source = "snowball")))

With the tidy data we can select the ten most frequent title words for each news group.

p0 <- tidy_title %>%
  group_by(news, word) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  slice(1:10)

Fake news titles mention video and trump by a large margin, with 8477 and 7874 appearances respectively. In the real news titles, trump is also one of the most mentioned words, coming in first with 4883 appearances, followed by u.s with 4187 and says with 2981.

10 most frequent words in fake and real news titles
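The plotting code for that figure isn’t shown in the post; a faceted bar chart built from p0 with ggplot2 is one way to reproduce it. This is only a sketch, and the reorder_within/scale_x_reordered helpers (from tidytext) and the layout are my own choices:

# one possible version of the chart above
p0 %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, news)) %>%   # keep bars ordered within each facet
  ggplot(aes(word, n, fill = news)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ news, scales = 'free') +
  labs(x = NULL, y = 'count')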

Now we prepare the data for the sentiment analysis. I’m interested in classifying the words into sentiments such as joy, anger, fear or surprise, so I’ll be using the NRC lexicon from Saif Mohammad and Peter Turney.
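One practical note that isn’t in the original post: current versions of tidytext fetch the NRC lexicon through the textdata package, so the first call to get_sentiments('nrc') will ask permission to download it. A quick sanity check that it’s available:

# install.packages("textdata")   # needed once for the NRC lexicon
nrc <- get_sentiments('nrc')
head(nrc)   # a tibble with word and sentiment columns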

p1 <- tidy_title %>%
  inner_join(get_sentiments('nrc')) %>%
  group_by(sentiment, news) %>%
  summarise(n = n()) %>%
  mutate(prop = n/sum(n))

sentiment from news titles

Disgust seems to be the most common sentiment in fake news titles, while trust is the lowest, even though it still comes in above 50%. Overall, fake news titles seem to carry more “sentiment” than real news titles in this particular dataset, even for positive sentiments like joy and surprise.
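As before, the plotting code isn’t included in the post; a stacked bar per sentiment built from p1, with a dashed line marking the 50% split, is one way to draw a figure like the one above (the layout is my own choice):

# prop sums to 1 within each sentiment, so stacked bars show the fake vs. real split
p1 %>%
  ggplot(aes(sentiment, prop, fill = news)) +
  geom_col() +
  geom_hline(yintercept = 0.5, linetype = 'dashed') +
  coord_flip() +
  labs(x = NULL, y = 'proportion of sentiment words')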

p2 <- tidy_corpus %>%
  inner_join(get_sentiments('nrc')) %>%
  group_by(sentiment, news) %>%
  summarise(n = n()) %>%
  mutate(prop = n/sum(n))

For the news body text we can see that the same sentiments are prevalent, but the proportions are lower than for the titles. A fake news article loses trust as the reader takes more time to read it; it also becomes less negative and shows less fear.

sentiment from news text
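To make the “lower than the titles” comparison explicit, the two sets of proportions can be lined up side by side. This is a quick sketch assuming the p1 and p2 objects from above; the prop_title and prop_text names are mine:

# sentiment proportions in titles vs. article text, side by side
p1 %>%
  select(sentiment, news, prop_title = prop) %>%
  left_join(p2 %>% select(sentiment, news, prop_text = prop),
            by = c('sentiment', 'news'))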

An improvement we could make here is to use our own stopwords and change the way we build the tokens. There were instances where trump and trump’s didn’t count as the same token, and if we used this data to train a model that could become problematic.
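A rough sketch of what that could look like (not from the original post): strip the possessive ending during cleaning and extend the snowball list with a few custom stopwords. The extra words below are only placeholders:

# collapse possessives so "trump's" and "trump" become the same token,
# and filter with an extended stopword list
my_stopwords <- c(stopwords(source = 'snowball'), 'said', 'will', 'also')  # example additions

tidy_title_clean <- title %>%
  unnest_tokens(word, text, token = 'words') %>%
  mutate(word = gsub("['\u2019]s$", '', word)) %>%   # drop a trailing 's (straight or curly apostrophe)
  filter(!(word %in% my_stopwords))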