Billboard Baggins

Ram Mukund Kripa 8/03/2020

The Task

This file aims to identify trends in sentiments, and develop models to classify songs by genres and explicitness on the basis of their lyrics.

Packages

library(tidyverse)
library(here)
library(lubridate)
library(tidytext)
library(topicmodels)
library(tm)
library(knitr)
library(billboard)

Obtaining and Wrangling Data

Getting data: Billboard Lyrics & Spotify

Joining the lyrics and spotify datasets, because each provides useful information!

bilb_dat <- spotify_track_data %>%
  select(year,artist_name,track_name,explicit) %>%
  right_join(lyrics, by = c("artist_name" = "artist", 
                            "track_name" = "title",
                            "year" = "year"
                            )) %>%
  drop_na(lyrics) %>%
  drop_na(explicit)
head(bilb_dat)
## # A tibble: 6 x 5
##   year  artist_name    track_name       explicit lyrics                         
##   <chr> <chr>          <chr>            <lgl>    <chr>                          
## 1 1960  Johnny Preston Running Bear     FALSE    "On the bank of the river\nSto…
## 2 1960  Mark Dinning   Teen Angel       FALSE    "Teen Angel\nTeen Angel\nTeen …
## 3 1960  Jimmy Jones    Handy Man        FALSE    "Comma, comma, comma, comma, c…
## 4 1960  Elvis Presley  Stuck on You     FALSE    "[Verse 1]\nYou can shake an a…
## 5 1960  Connie Francis Everybody's Som… FALSE    "The tears I cried for you cou…
## 6 1960  The Brothers … Greenfields      FALSE    "Once there were greenfields\n…

Tokenizing and Stemming

billb_tokens <- bilb_dat %>%
  unnest_tokens(output = word, input = lyrics) %>%
  # removing stop words
  anti_join(stop_words) %>%
  # remove chorus 
  filter(word!="chorus") %>%
  # remove numbers 
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # stemming
  mutate(word = SnowballC::wordStem(word))

Part 1: Positive or Negative?

This analysis aims to identify whether songs through the years in the billboard 100 tend to be positive or negative, and how that has changed over time. There is a general sentiment among audioheads that songs in the sixties and eighties were much “happier” than they are today. Let’s test that theory!

What does the sentiment distribution of songs look like?

song_sents <- billb_tokens %>% 
  inner_join(get_sentiments("bing")) %>%
  mutate(value = if_else(condition = c(sentiment =="positive"),
                         true = 1,
                         false = -1)
         ) %>%
  group_by(year,track_name) %>%
  summarize(value = sum(value))

song_sents %>%
  ggplot(mapping = aes(x = value,y = ..density..))+
  geom_histogram(fill ="orange")+
  labs(x = "Total Song Sentiment",
       y = "Density",
       title = "Histogram of Sentiments")

Overall song sentiment is clearly centered at zero, which makes sense. We would expect the distribution to be roughly normal!

song_sents %>%
  group_by(year) %>%
  summarize(avg_value = mean(value), miny = min(value), maxy = max(value)) %>%
  mutate(year = as.numeric(year)) %>%
  ggplot(mapping = aes(x = year,
                       y = avg_value,
                       ymin=miny,
                       ymax = maxy))+
  geom_pointrange(color = "green")+
  scale_x_continuous(breaks = seq(from = 1960, to = 2015, by = 5))+
  labs(x = "Year",
       y = "Range of Values of Sentiment",
       title = "Sentiment over Time")

It would seem as though Billboard hot 100 songs have been a little on the negative side lately, which concurs with the opinion of the general public that songs have been a little depressing lately. (I’m looking at you, Post Malone)

Part 2: Unsupervised Learning (Can we find song types based on what words the song is contains?)

This analysis aims to identify the genre or type of song on the basis of how frequently certain tokens occur!

Removing Low TF terms

billb_topkens <- billb_tokens %>%
  count(track_name, word) %>%
  bind_tf_idf(term = word, document = track_name, n = n) 

new_bilb <- billb_topkens %>%
  group_by(track_name) %>%
  top_n(5, wt = tf) %>%
  left_join(billb_tokens)

new_bilb %>%
  group_by(word) %>%
  summarize(count = n()) %>%
  top_n(10, wt = count) %>%
  ggplot(mapping = aes(x = fct_reorder(word,count), y = count))+
  geom_col(color = "tomato", fill = "yellow")+
  coord_flip()+
  labs(x = "Word Stem",
       y = "Count",
       title = "Most popular words in Songs",
       subtitle = "Among Billboard hot 100")

Creating the Doc Term Matrix

bilb_dtm <- billb_topkens %>%
  cast_dtm(document = track_name, term = word, value = n)

bilb_dtm
## <<DocumentTermMatrix (documents: 2252, terms: 16555)>>
## Non-/sparse entries: 106861/37174999
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

Creating the 5 topic Model

bilb_model <- LDA(bilb_dtm, k = 5, control = list(seed = 123))
bilb_model
## A LDA_VEM topic model with 5 topics.

Visualizing the TF Model

bilb_model %>%
  tidy() %>%
  group_by(topic) %>%
  top_n(5, wt = beta) %>%
  ungroup %>%
  mutate(topic = as_factor(topic)) %>%
  ggplot(mapping = aes(x = fct_reorder(term,beta), y = beta, fill = topic))+
  geom_col()+
  facet_wrap(~ topic, scales = "free")+
  coord_flip()+
  labs(x = "Token",
       y = "Frequency",
       title = "Most Frequent Tokens by Type of Song")

1 and 2/3 seem to represent Love songs and Dance songs respectively, but the others seem difficult to understand. I don’t think that this topic model was very successful. I think it is partly because many factors, not just lyrics, contribute to the genre of a song.

Part 3: Supervised learning: Can we determine what words make a song explicit?

What are the top tokens in Explicit vs. Non Explicit Songs

billb_tokens %>%
  group_by(explicit,word) %>%
  summarize(count = n()) %>%
  group_by(explicit) %>%
  top_n(10, wt = count) %>%
  ungroup %>%
  mutate(word = as_factor(word))%>%
  group_by(explicit) %>%
  arrange(count,
          .by_group = TRUE) %>%
  ggplot(mapping = aes(x = word,
                       y = count,
                       fill = explicit))+
  geom_col()+
  coord_flip()+
  facet_wrap(~explicit,scales = "free")+
  labs(x="WORD",
       y = "COUNT",
       title = "Word Count for Explicit and Non-Explicit Songs")

DTM and Model

library(caret)

bilb_dat_x <- billb_tokens %>%
  group_by(track_name) %>%
  summarize(explicit = mean(explicit)) %>%
  mutate(explicit = as_factor(explicit))

bilbx_tree <- train(x = as.matrix(bilb_dtm),
                     y = factor(bilb_dat_x$explicit),
                     method = "ranger",
                     num.trees = 10,
                     importance = "impurity",
                     trControl = trainControl(method = "oob"))
## Growing trees.. Progress: 50%. Estimated remaining time: 38 seconds.
## Growing trees.. Progress: 50%. Estimated remaining time: 41 seconds.

Model testing

bilbx_tree$finalModel %>%
  # importance
  ranger::importance() %>%
  # framing
  enframe(name = "variable", value = "varimp") %>%
  top_n(n = 20, wt = varimp) %>%
  # plotting
  ggplot(aes(x = fct_reorder(variable, varimp), y = varimp)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(x = "Token",
       y = "Variable importance",
       title = "Words that make a song Explicit")

This seemed really successful! Most of the words in this list seem like they could be censored, causing a song to be labelled as “explicit.” Words like the top two are definitely found in most songs that are marked “explicit.”

Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       macOS Catalina 10.15.5      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/London               
##  date     2020-07-24                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date       lib source        
##  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
##  backports      1.1.8      2020-06-17 [1] CRAN (R 4.0.0)
##  billboard    * 0.1.0      2017-09-04 [1] CRAN (R 4.0.2)
##  blob           1.2.1      2020-01-20 [1] CRAN (R 4.0.0)
##  broom          0.5.6      2020-04-20 [1] CRAN (R 4.0.0)
##  callr          3.4.3      2020-03-28 [1] CRAN (R 4.0.0)
##  caret        * 6.0-86     2020-03-20 [1] CRAN (R 4.0.2)
##  cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.0.0)
##  class          7.3-17     2020-04-26 [1] CRAN (R 4.0.2)
##  cli            2.0.2      2020-02-28 [1] CRAN (R 4.0.0)
##  codetools      0.2-16     2018-12-24 [1] CRAN (R 4.0.2)
##  colorspace     1.4-1      2019-03-18 [1] CRAN (R 4.0.0)
##  crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
##  data.table     1.12.8     2019-12-09 [1] CRAN (R 4.0.0)
##  DBI            1.1.0      2019-12-15 [1] CRAN (R 4.0.0)
##  dbplyr         1.4.4      2020-05-27 [1] CRAN (R 4.0.0)
##  desc           1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
##  devtools       2.3.0      2020-04-10 [1] CRAN (R 4.0.0)
##  digest         0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
##  dplyr        * 1.0.0      2020-05-29 [1] CRAN (R 4.0.0)
##  e1071          1.7-3      2019-11-26 [1] CRAN (R 4.0.0)
##  ellipsis       0.3.1      2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.0)
##  fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
##  farver         2.0.3      2020-01-16 [1] CRAN (R 4.0.0)
##  forcats      * 0.5.0      2020-03-01 [1] CRAN (R 4.0.0)
##  foreach        1.5.0      2020-03-30 [1] CRAN (R 4.0.2)
##  fs             1.4.2      2020-06-30 [1] CRAN (R 4.0.2)
##  generics       0.0.2      2018-11-29 [1] CRAN (R 4.0.0)
##  ggplot2      * 3.3.2      2020-06-19 [1] CRAN (R 4.0.0)
##  glue           1.4.1      2020-05-13 [1] CRAN (R 4.0.0)
##  gower          0.2.2      2020-06-23 [1] CRAN (R 4.0.2)
##  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.0.0)
##  haven          2.3.1      2020-06-01 [1] CRAN (R 4.0.0)
##  here         * 0.1        2017-05-28 [1] CRAN (R 4.0.0)
##  hms            0.5.3      2020-01-08 [1] CRAN (R 4.0.0)
##  htmltools      0.5.0      2020-06-16 [1] CRAN (R 4.0.0)
##  httr           1.4.1      2019-08-05 [1] CRAN (R 4.0.0)
##  ipred          0.9-9      2019-04-28 [1] CRAN (R 4.0.2)
##  iterators      1.0.12     2019-07-26 [1] CRAN (R 4.0.2)
##  janeaustenr    0.1.5      2017-06-10 [1] CRAN (R 4.0.2)
##  jsonlite       1.7.0      2020-06-25 [1] CRAN (R 4.0.0)
##  knitr        * 1.29       2020-06-23 [1] CRAN (R 4.0.0)
##  labeling       0.3        2014-08-23 [1] CRAN (R 4.0.0)
##  lattice      * 0.20-41    2020-04-02 [1] CRAN (R 4.0.2)
##  lava           1.6.7      2020-03-05 [1] CRAN (R 4.0.2)
##  lifecycle      0.2.0      2020-03-06 [1] CRAN (R 4.0.0)
##  lubridate    * 1.7.9      2020-06-08 [1] CRAN (R 4.0.0)
##  magrittr       1.5        2014-11-22 [1] CRAN (R 4.0.0)
##  MASS           7.3-51.6   2020-04-26 [1] CRAN (R 4.0.2)
##  Matrix         1.2-18     2019-11-27 [1] CRAN (R 4.0.2)
##  memoise        1.1.0      2017-04-21 [1] CRAN (R 4.0.0)
##  ModelMetrics   1.2.2.2    2020-03-17 [1] CRAN (R 4.0.2)
##  modelr         0.1.8      2020-05-19 [1] CRAN (R 4.0.0)
##  modeltools     0.2-23     2020-03-05 [1] CRAN (R 4.0.2)
##  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.0)
##  nlme           3.1-148    2020-05-24 [1] CRAN (R 4.0.2)
##  NLP          * 0.2-0      2018-10-18 [1] CRAN (R 4.0.2)
##  nnet           7.3-14     2020-04-26 [1] CRAN (R 4.0.2)
##  pillar         1.4.4      2020-05-05 [1] CRAN (R 4.0.0)
##  pkgbuild       1.0.8      2020-05-07 [1] CRAN (R 4.0.0)
##  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
##  pkgload        1.1.0      2020-05-29 [1] CRAN (R 4.0.0)
##  plyr           1.8.6      2020-03-03 [1] CRAN (R 4.0.0)
##  prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.0.0)
##  pROC           1.16.2     2020-03-19 [1] CRAN (R 4.0.2)
##  processx       3.4.3      2020-07-05 [1] CRAN (R 4.0.2)
##  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.2)
##  ps             1.3.3      2020-05-08 [1] CRAN (R 4.0.0)
##  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)
##  R6             2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
##  ranger         0.12.1     2020-01-10 [1] CRAN (R 4.0.2)
##  Rcpp           1.0.5      2020-07-06 [1] CRAN (R 4.0.2)
##  readr        * 1.3.1      2018-12-21 [1] CRAN (R 4.0.0)
##  readxl         1.3.1      2019-03-13 [1] CRAN (R 4.0.0)
##  recipes        0.1.13     2020-06-23 [1] CRAN (R 4.0.2)
##  remotes        2.1.1      2020-02-15 [1] CRAN (R 4.0.0)
##  reprex         0.3.0      2019-05-16 [1] CRAN (R 4.0.0)
##  reshape2       1.4.4      2020-04-09 [1] CRAN (R 4.0.0)
##  rlang          0.4.6      2020-05-02 [1] CRAN (R 4.0.0)
##  rmarkdown      2.3        2020-06-18 [1] CRAN (R 4.0.0)
##  rpart          4.1-15     2019-04-12 [1] CRAN (R 4.0.2)
##  rprojroot      1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
##  rstudioapi     0.11       2020-02-07 [1] CRAN (R 4.0.0)
##  rvest          0.3.5      2019-11-08 [1] CRAN (R 4.0.0)
##  scales         1.1.1      2020-05-11 [1] CRAN (R 4.0.0)
##  sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
##  slam           0.1-47     2019-12-21 [1] CRAN (R 4.0.2)
##  SnowballC      0.7.0      2020-04-01 [1] CRAN (R 4.0.2)
##  stringi        1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
##  stringr      * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
##  survival       3.1-12     2020-04-10 [1] CRAN (R 4.0.2)
##  testthat       2.3.2      2020-03-02 [1] CRAN (R 4.0.0)
##  tibble       * 3.0.2      2020-07-07 [1] CRAN (R 4.0.2)
##  tidyr        * 1.1.0      2020-05-20 [1] CRAN (R 4.0.0)
##  tidyselect     1.1.0      2020-05-11 [1] CRAN (R 4.0.0)
##  tidytext     * 0.2.5      2020-07-11 [1] CRAN (R 4.0.2)
##  tidyverse    * 1.3.0      2019-11-21 [1] CRAN (R 4.0.0)
##  timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.2)
##  tm           * 0.7-7      2019-12-12 [1] CRAN (R 4.0.2)
##  tokenizers     0.2.1      2018-03-29 [1] CRAN (R 4.0.2)
##  topicmodels  * 0.2-11     2020-04-19 [1] CRAN (R 4.0.2)
##  usethis        1.6.1      2020-04-29 [1] CRAN (R 4.0.0)
##  utf8           1.1.4      2018-05-24 [1] CRAN (R 4.0.0)
##  vctrs          0.3.1      2020-06-05 [1] CRAN (R 4.0.0)
##  withr          2.2.0      2020-04-20 [1] CRAN (R 4.0.0)
##  xfun           0.15       2020-06-21 [1] CRAN (R 4.0.0)
##  xml2           1.3.2      2020-04-23 [1] CRAN (R 4.0.0)
##  yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library