Leonardo Feitosa: Text Mining and Sentiment Analysis

Text Mining analysis of Jaws by Peter Benchley

Text mining is a fun and powerful way of analyzing texts, including books, newspaper and scientific articles, as well as blog posts and even tweets! With it, we can perform sentiment analysis to see which types of words are more often used in a given text, what they are intended to mean, and a lot more. This post is just a scratch at the surface.

I decided to analyze the text in the Jaws novel by Peter Benchley because it’s one of my favorite books of all time - one of the few I’ve read in English - and because it enabled me to use a lot of cool tools for #dataviz like the wordcloud plot.

All the code used is accessible through the arrow buttons on the left.

hide

# Read in the image
img <- readJPEG(here("data", "jaws2.jpg"))


## Read in the data 
jaws_text <- pdf_text(here("data", "jaws.pdf"))

jaws_text_p50 <- jaws_text[50]

hide

## Wrangling text

jaws_text_tidy <- data.frame(jaws_text) %>% 
  mutate(text_full = str_split(jaws_text, pattern = "\\r\n")) %>% 
  unnest(text_full) %>% 
  mutate(text_full = str_squish(text_full))

## Tidying the data

jaws_df <- jaws_text_tidy %>% 
  slice(-(1:5)) %>% 
  mutate(part = case_when(
    str_detect(text_full, "PART") ~ text_full,
    TRUE ~ NA_character_
  )) %>% 
  fill(part) %>% 
  separate(col = part, into = c("pt", "no"), sep = " ") %>% 
  mutate(no = as.numeric(no))

## Counts of words
jaws_tokens <- jaws_df %>% 
  unnest_tokens(word, text_full) %>% 
  select(-jaws_text)

# Counts
jaws_wordcount <- jaws_tokens %>% 
  count(no, word)

hide

## Removing stop words
jaws_nonstop_words <- jaws_tokens %>% 
  anti_join(stop_words)

hide

#Counts of non stop words
nonstop_counts <- jaws_nonstop_words %>% 
  count(no, word) %>% 
  mutate(word = str_replace(word, pattern = "jaws.txt", replacement = "")) %>%
  mutate(word = str_squish(word)) %>% 
  slice(-(1:65))

## Counts of most frequent words by part
top_9_words <- nonstop_counts %>% 
  group_by(no) %>% 
  arrange(-n) %>%
  slice(1:9)

hide

# Plot
ggplot(data = top_9_words, aes(x = word, y = n)) +
  geom_col(fill = "royalblue4",
           alpha = 0.8,
           color = "black") +
  facet_wrap(~ no, scales = "free") +
  labs(x = "Words",
       y = "Number per chapter",
       title = "Top 9 words per part in Jaws by Peter Benchley") +
  coord_flip() +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.title = element_text(size = 12, face = "bold", color = "black"),
        axis.text = element_text(size = 10, color = "gray19"),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 11, face = "bold", color = "black"))

Worldcloud

In this figure I plotted the 50 most used words by the author along the book. As you can see, Brody, the main character - the police inspector who kills the great white shark in the Jaws movie - is the most used word along the book.

hide

top_50_words <- nonstop_counts %>% 
  group_by(no) %>% 
  arrange(-n) %>% 
  slice(1:50)

top_50_words %>% 
  filter(no == 1) %>% 
  ggplot(aes(label = word)) +
  background_image(img) +
  geom_text_wordcloud_area(aes(size = n, color = n),
                      shape = "triangle-upright") +
  scale_size_area(max_size = 60) +
  scale_color_fish(option = "Lepomis_megalotis") +
  theme_bw()

Sentiment analysis

For this sentiment analysis, I used the #NRC lexicon to figure out the most likely meaning of each words used in the text outside the most common words - a, an, the, he, she, it, etc. These are known as stop words in the text mining jargon and are easily treated with the textdata package. Here, I plotted the most common sentiments expressed in the book on each of the three parts. Really wished these lexicons were available in Portuguese so I could do a sentiment analysis of the Brazilian president’s tweets like David Robinson did on this awesome blog post: http://varianceexplained.org/r/trump-tweets/

hide

## NRC
jaws_nrc <- jaws_nonstop_words %>% 
  inner_join(get_sentiments("nrc"))

# 
jaws_nrc_counts <- jaws_nrc %>% 
  count(no, sentiment) %>% 
  mutate(sentiment = case_when(
    sentiment == "trust" ~ "Trust",
    sentiment == "surprise" ~ "Surprise",
    sentiment == "sadness" ~ "Sadness",
    sentiment == "positive" ~ "Positive",
    sentiment == "negative" ~ "Negative",
    sentiment == "joy" ~ "Joy",
    sentiment == "fear" ~ "Fear",
    sentiment == "disgust" ~ "Disgust",
    sentiment == "anticipation" ~ "Anticipation",
    sentiment == "anger" ~ "Anger"
  ))

hide

# Make the finalized plot

ggplot(data = jaws_nrc_counts, aes(x = sentiment, y = n)) +
  geom_col(fill = "royalblue4",
           alpha = 0.7,
           color = "black") +
  facet_wrap(~ no) +
  labs(x = "Sentiment based on the NRC lexicon",
       y = "Number of words",
       title = "Distribution of sentiments across the Jaws book") +
  coord_flip() +
  theme_bw() +
  theme(axis.title = element_text(size = 12, face = "bold", color = "black"),
        axis.text = element_text(size = 11, color = "black"),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(size = 12, color = "black", face = "bold"),
        panel.grid = element_blank())

That’s it for now. Learning curve is still ongoing.

Citation: Benchley, Peter. Jaws.


```{.r .distill-force-highlighting-css}