The following is a code walkthrough for my analysis of “The most recommended discs on Reddit”. This walkthrough will be of interest to anyone 1) interested in the details of that analysis, 2) looking for examples of how to use RedditExtractoR, or 3) who just loves looking at okay-ish code.

Code and walkthrough

For the intrepid, here’s a walkthrough of how I generated the database used to make the results figures. I provide the code for the first figure; the other figures are just simple modifications of it.

The code below is not run within this post directly; instead it is copy-pasted from my working document. It should, however, give you the general process, and I provide the final data table dynamically at the bottom.

Here’s the general process I wrote out for myself.

# Reddit comment extraction

# General idea
# - download all comments/posts with “recommend” or other terms in post
# - extract all words and counts of their occurrence
# - filter to include words of named discs
# - rank and plot!
# PACKAGES ----------------------------------------------------------------

library(tidyverse)
library(RedditExtractoR)
library(tm)

I downloaded disc data from alldiscs.com and saved it to a csv - check them out!

# DATA SOURCES ------------------------------------------------------------

all_discs <- read_csv("./data/data_raw/all_discs_raw2.csv")

Here’s the real meat of the process. RedditExtractoR has a function called get_reddit() which takes search terms (here I used “recommend”, “best disc”, and “suggest”), gathers the URLs of relevant posts, and extracts their comments. I set cn_threshold = 3, which narrows the results to include only posts with 3 or more comments. Lastly, you set the subreddit of interest, ours being r/discgolf. The function has some defaults, like limiting the number of pages it crawls, in order to reduce the total amount of information and processing time. As such, the results we have represent a sample and not some totality of all r/discgolf posts ever (whew!).

I think a lot of people use a Python library called PRAW for extracting Reddit comments, but I’m not very familiar with Python. Maybe some day!

# SCRAPE REDDIT -----------------------------------------------------------

closeAllConnections()

# recommend, purchase, best disc, suggestions

URLs_recommend <- get_reddit(search_terms = "recommend",
                             cn_threshold = 3,
                             subreddit = "discgolf")

URLs_best <- get_reddit(search_terms = "best disc",
                        cn_threshold = 3,
                        subreddit = "discgolf")

URLs_suggest <- get_reddit(search_terms = "suggest",
                           cn_threshold = 3,
                           subreddit = "discgolf")


URLs <- bind_rows(URLs_recommend, URLs_best, URLs_suggest)

closeAllConnections()

save(URLs, file = "./data/data_output/reddit_disc_URLs.Rdata")
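
Saving the scrape means a later session can skip the (slow) Reddit crawl and simply reload the object with base R’s load():

# reload the scraped URLs in a later session without re-scraping
load("./data/data_output/reddit_disc_URLs.Rdata")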

Here we take the URLs, remove duplicate comments, and then feed them to tm::Corpus. This nifty package helps remove punctuation, numbers, white space, etc. It can also filter to include only English-language words (or whatever language), but I didn’t want to do that since disc names are not necessarily English words. For this bit I mostly followed what these folks did (https://rpubs.com/SmilodonCub/586863), hence my getting a little lost at exactly how the matrix conversion worked - I’m much more comfortable with tibbles/data.frames, so the last step converts back to that!

# WRANGLE COMMENTS --------------------------------------------------------

df_comments <- dplyr::select(URLs, comment) %>% 
  # remove non-distinct comments (mostly deleted ones)
  distinct()

# VectorSource() expects a character vector with one element per document,
# so pull out the comment column rather than passing the whole data frame
commentCorpus <- Corpus(VectorSource(df_comments$comment))

# We pipe the corpus through several tm_map() transformations
commentCorpus <- commentCorpus %>%
  tm_map(removePunctuation) %>%          # eliminate punctuation
  tm_map(removeNumbers) %>%              # no numbers
  tm_map(stripWhitespace) %>%            # collapse extra white space
  tm_map(content_transformer(tolower))   # lower-case (wrapped so tm keeps the corpus class)

# TermDocumentMatrix() builds a matrix with one row per unique word and one
# column per document; rowSums() totals each word's occurrences across all
# comments, and sort() ranks the words from most to least frequent
commentCorpus_mat <- as.matrix(TermDocumentMatrix(commentCorpus))
commentCorpus_wordFreq <- sort(rowSums(commentCorpus_mat), decreasing = TRUE)

# convert to a tibble so I'm no longer lost (enframe() gives two columns:
# "name" = the word, "value" = its count)
df_word_frq <- enframe(commentCorpus_wordFreq)

Next I took the list of discs from alldiscs and simplified it. I do some pretty hacky things below, like using multiple separate() calls to break up ugly names…it occurred to me later that str_remove() would be much more elegant. A newer version of this code for another project does that instead - whoops! (For the curious, I sketch that alternative right after the code block below.) Anyway, it works and we get a dataframe of mold names, manufacturers, and some categorizations of the discs. I also excluded a few words that showed up a toooon but, from personal knowledge of discs, figured those were discussions using those words rather than actually being about those discs.

# FILTER ALL DISCS LIST ---------------------------------------------------

# NB - The nature of this beast likely produces some strange results. For instance, if someone spells "Teebird" as "Tee bird" it's going to get chopped up and lost. If people use slang for disc names, it's going to get lost. Common misspellings like "buzz" instead of "buzzz" result in losses. Some discs also have names that are common words (e.g. "truth", "spin", "birdie") and so are likely overrepresented.


df_flt_discs <- all_discs %>% 
  # remove parentheticals
  separate(mold, into = c("mold", "trash"), sep = "[(]") %>% 
  # remove dashes
  separate(mold, into = c("mold", "trash"), sep = "-") %>% 
  dplyr::select(-trash) %>% 
  mutate(mold = tolower(mold)) %>% 
  # remove duplicates (e.g. "buzzz os" and "buzzz ss" all just get counted as "buzzz" due to the join below)
  distinct(mold, .keep_all = TRUE) %>% 
  # a few discs were getting pulled in that didn't make sense because they're just common words
  filter(!mold %in% c("tee", "money", "max", "birdie"))
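
For the curious, here is roughly what that more elegant str_remove() version would look like. This is a sketch of the approach rather than the exact code from the newer project, and the regexes are my own guesses at covering the parenthetical and dash cases:

# sketch: strip "(...)" and "-..." suffixes with str_remove() instead of
# the separate()/select(-trash) dance above (str_remove() and str_trim()
# come from stringr, which loads with the tidyverse)
df_flt_discs2 <- all_discs %>%
  mutate(mold = mold %>%
           str_remove("\\(.*$") %>%   # drop everything from "(" onward
           str_remove("-.*$") %>%     # drop everything from "-" onward
           str_trim() %>%             # tidy leftover whitespace
           tolower()) %>%
  distinct(mold, .keep_all = TRUE) %>%
  filter(!mold %in% c("tee", "money", "max", "birdie"))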

Finally, we join together the word frequency dataframe and the dataframe of disc molds. Using inner_join() retains only the rows with matching values in both dataframes. For example, a row with “teebird” in the word frequency dataframe gets retained since that’s a known mold in the disc dataframe, but rows with words like “the”, “and”, or “hello” are dropped since those are not disc names. A tiny toy example of that behavior follows, then the real thing.
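
To make that concrete, here is a toy illustration with made-up words and counts (not the real data):

library(tidyverse)

# hypothetical word counts, shaped like the output of enframe() above
toy_words <- tibble(name  = c("the", "teebird", "hello", "destroyer"),
                    value = c(900, 42, 17, 35))

# hypothetical disc list
toy_discs <- tibble(mold = c("teebird", "destroyer", "wraith"))

# only "teebird" and "destroyer" appear in both tables, so only they survive
inner_join(toy_words, toy_discs, by = c("name" = "mold"))
#> # A tibble: 2 x 2
#>   name      value
#>   <chr>     <dbl>
#> 1 teebird      42
#> 2 destroyer    35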

# JOIN WORD FREQUENCIES WITH LIST OF DISC MOLDS ---------------------------


# NB: df_word_frq is already sorted by frequency (from the sort() above) and
# inner_join() preserves that row order, so row_number() doubles as a rank
df_disc_frq <- inner_join(df_word_frq, df_flt_discs, by = c("name" = "mold")) %>% 
  group_by(disc_type) %>% 
  # rank within each disc type
  mutate(type_rank = row_number()) %>% 
  ungroup() %>% 
  # overall rank across all discs
  mutate(total_rank = row_number())

After all that, we’re left with a joined dataframe of our discs and the frequency with which they are discussed (called “value” in the dataframe below). You can explore it - go on, touch it. The table is dynamic!
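
If you’re wondering how to embed a dynamic table like that in an R Markdown post, the DT package is one common way to do it. This is just a sketch of that approach - I’m not claiming it’s exactly what powers the table here:

library(DT)

# render df_disc_frq as a sortable, searchable HTML widget
datatable(df_disc_frq,
          rownames = FALSE,
          options = list(pageLength = 10))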

Last but not least, here’s the code to make the Top 20 discs plot seen in the main post.

# PLOT TOP 20 DISCS OF ANY TYPE -------------------------------------------


# all discs ungrouped by type

(p_top20 <- df_disc_frq %>%
    filter(total_rank <= 20) %>%
    mutate(name = str_to_title(name)) %>%
    ggplot(aes(x = reorder(name, value), y = value, fill = disc_type)) +
    geom_segment(aes(yend = 0, xend = reorder(name, value)), color = "black") +
    geom_point(size = 5, shape = 21, color = "black") +
    coord_flip() +
    theme_minimal(base_size = 15) +
    theme(legend.position = c(0.8, 0.25),
          legend.background = element_rect(fill = "white")) +
    scale_fill_viridis_d() +
    labs(x = "Disc Mold", y = "Number of Mentions", fill = "Disc Type"))
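
If you want to write the figure to disk, ggsave() does the trick; the file path and dimensions below are placeholders, not the ones from the original post:

# save the plot; path and size are arbitrary placeholders
ggsave("./figures/top20_discs.png", plot = p_top20,
       width = 8, height = 6, dpi = 300)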