I’m a fan of Chris Albon’s recent project #machinelearningflashcards on Twitter where generalized topics and methodologies are drawn out with key takeaways. It’s a great approach to sharing concepts about machine learning for everyone and a timely refresher for those of us who frequently forget algorithm basics.

I leveraged Maëlle Salmon’s recent blog post on the Faces of #rstats Twitter heavily as a tutorial for this attempt at extracting data from Twitter to download the #machinelearningflashcards.

Source Repo for this work: jasdumas/ml-flashcards


Directions

1. Load libraries:

For this project I used rtweet to connect to the Twitter API and search for relevant tweets by hashtag, dplyr to filter and pipe things, stringr to clean up the tweet text, and magick to process the images. kableExtra and knitr are only used to render the preview table of tweets.

Note: I previously ran into trouble when downloading ImageMagick and wrote up the errors and the approaches I tried, in case you fall into the same trap I did: How to install imagemagick on MacOS

library(rtweet)
library(dplyr)
library(magick)
library(stringr)
library(kableExtra)
library(knitr)

2. Get tweets for the hashtag and keep only the ones posted by Chris Albon:

ml_tweets <- search_tweets("#machinelearningflashcards", n = 500, include_rts = FALSE) %>%
  filter(screen_name == "chrisalbon") # keep only tweets from Chris Albon's account

# preview the first three tweets and the first five columns
mt <- ml_tweets[1:3, 1:5]

kable(mt, format = "html") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE)
screen_name  user_id   created_at           status_id           text
chrisalbon   11518572  2017-05-09 22:51:43  862077650772164608  Mean Squared Error #machinelearningflashcards https://t.co/K1iDqLV5DD
chrisalbon   11518572  2017-05-09 18:15:39  862008178527031296  R-Squared #machinelearningflashcards https://t.co/73gR8tb5PA
chrisalbon   11518572  2017-05-09 16:23:04  861979845563105280  Motivation For Kernel PCA #machinelearningflashcards https://t.co/AhLB91gHBh
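
3. Clean the tweet text so it can be used as an image file name:

ml_tweets$clean_text <- ml_tweets$text
ml_tweets$clean_text <- str_replace_all(ml_tweets$clean_text, "#[a-zA-Z0-9]{1,}", "") # remove the hashtags
ml_tweets$clean_text <- str_replace(ml_tweets$clean_text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "") # remove the url link
ml_tweets$clean_text <- str_replace_all(ml_tweets$clean_text, "[[:punct:]]", "") # remove all punctuation, not just the first mark
ml_tweets$clean_text <- str_trim(ml_tweets$clean_text) # drop leftover whitespace so file names do not end in a space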
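
To see what the cleaning does without calling the Twitter API, here is a small sketch that runs the same three patterns on two of the tweet texts from the table above. The helper name clean_tweet_text is hypothetical (it is not part of the original script), and it assumes the intent is to strip every hashtag and punctuation mark and to trim leftover whitespace, so it uses str_replace_all and str_trim.

```r
library(stringr)

# Hypothetical helper, not part of the original script: applies the same
# three patterns to a character vector and trims leftover whitespace.
clean_tweet_text <- function(text) {
  text <- str_replace_all(text, "#[a-zA-Z0-9]{1,}", "")                  # drop hashtags
  text <- str_replace(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "")    # drop the link and anything after it
  text <- str_replace_all(text, "[[:punct:]]", "")                       # drop all punctuation
  str_trim(text)                                                         # drop leading/trailing spaces
}

# Two tweet texts copied from the preview table
examples <- c(
  "Mean Squared Error #machinelearningflashcards https://t.co/K1iDqLV5DD",
  "R-Squared #machinelearningflashcards https://t.co/73gR8tb5PA"
)

cleaned <- clean_tweet_text(examples)
print(cleaned)
```

Note that removing punctuation also removes the hyphen, so "R-Squared" becomes "RSquared" in the file name.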

4. Write a function that downloads each flashcard image from the media_url column, names the file after the cleaned tweet text, and saves it into a folder:

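save_image <- function(df) {
  # make sure the output folder exists before writing into it
  dir.create("data", showWarnings = FALSE)
  for (i in seq_len(nrow(df))) {
    # tweets without a readable image fail here; try() lets the loop carry on
    image <- try(image_read(df$media_url[i]), silent = FALSE)
    if (!inherits(image, "try-error")) {
      # name the file from the df argument rather than the global ml_tweets
      image %>%
        image_scale("1200x700") %>%
        image_write(paste0("data/", df$clean_text[i], ".jpg"))
    }
  }
  cat("Function complete...\n")
}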

5. Apply the function:

save_image(ml_tweets)
## Function complete...
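
A quick way to check what was saved is to list the .jpg files in the output folder. The sketch below is an illustration, not part of the original script: list_flashcards is a hypothetical helper, and the demo runs against empty placeholder files in a temporary folder so that it works without any downloaded images.

```r
# Hypothetical helper, not part of the original script: returns the
# flashcard titles (file names without the .jpg extension) in a folder.
list_flashcards <- function(dir = "data") {
  files <- list.files(dir, pattern = "\\.jpg$")
  sort(tools::file_path_sans_ext(files))
}

# Demo with empty placeholder files in a temporary folder
demo_dir <- file.path(tempdir(), "flashcards_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Mean Squared Error.jpg", "RSquared.jpg", "notes.txt")))

titles <- list_flashcards(demo_dir)
print(titles)
```

After running save_image(ml_tweets), calling list_flashcards("data") should return one title per flashcard that was downloaded successfully.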

At the end of this process you can view all of the #machinelearningflashcards in one location! Thanks to Chris Albon for his work on this, and I’m looking forward to re-running this script to gain additional knowledge from new #machinelearningflashcards that are developed in the future!