Bruno Vidigal: Premier League: Twitter Data Visualization with R

Bruno Vidigal

Liverpool and Everton met on Saturday 17 October 2020 in the fifth round of the English Premier League (EPL), the most competitive and thrilling national tournament in the world - it’s just the best. As in any good, traditional derby, the teams brought a lot of rivalry to the pitch, and the result was a 2-2 draw full of controversy and drama. As I am passionate about both statistics and football/soccer (the beautiful game), I decided to analyze what was happening on Twitter during the match. To do so, I used the great R package rtweet to pull tweets with the hashtags #EVELIV and #MerseySideDerby.

Data collection: rtweet

There are many good tutorials on the web about how to pull Twitter data with rtweet, so I will assume you already know how to get your Twitter API access and will just show you the parameters used to get the tweets.

# load packages -----------------------------------------------------------
library(rtweet)
library(tidyverse)
library(lubridate)
library(scales)
library(gridExtra)
library(grid)
library(magick)

After loading the packages, you need to pass the authentication tokens provided by your Twitter App to the function create_token(). You can then use the function search_tweets() to get your data.

create_token(
  app = "Liverpool tweets",
  consumer_key = "xxxxxxxxxxxxxxxxxxxxxxx",
  consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxx",
  access_token = "xxxxxxxxxxxxxxxxxxxxxxx",
  access_secret = "xxxxxxxxxxxxxxxxxxxxxxx"
)

tweets <- search_tweets("EVELIV OR MerseysideDerby", n = 100000,
                        include_rts = FALSE, lang = "en", retryonratelimit = TRUE)

Data Manipulation

Although I set n to 100,000, I got about 40k rows (39,982), which I assume is the number of English-language tweets (retweets excluded) with these hashtags that the search returned. The next step is to keep only the columns we need: the API gives you 90 columns, and in this exercise we’re interested in just a few of them.

tweets <- tweets %>% select(user_id, status_id, created_at, screen_name, text, 
                            display_text_width, is_quote, is_retweet, favorite_count,
                            retweet_count, quote_count, hashtags, mentions_screen_name)

Bear in mind that the times returned by the Twitter API are in UTC. The game took place in Liverpool (UTC+1 at that time of year) and started at 12:30 pm local time, that is, 11:30 am UTC. We take Liverpool Football Club’s first and last tweets of the match as its start (11:31:49 UTC) and end (13:28:15 UTC). Using mutate() and case_when() we create a new variable, moment, with three categories: pre-game, game and post-game. Here we’re only interested in the tweets posted during the match.

game_started_at <- as.POSIXct('2020-10-17 11:31:49', tz = 'UTC')
game_ended_at <- as.POSIXct('2020-10-17 13:28:15', tz = 'UTC')

tweets <- tweets %>% 
  mutate(
    moment = case_when(
      created_at >= game_started_at & created_at <= game_ended_at ~ "game",
      created_at > game_ended_at ~ "post-game", 
      TRUE ~ "pre-game"))
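
As a quick sanity check of the time-zone conversion and of the three categories, here is a minimal base R sketch. The time zone name "Europe/London" and the five toy timestamps are my assumptions for illustration; they are not part of the collected data.

```r
# Kick-off in local time; "Europe/London" is an assumed tz name for Liverpool
kickoff_local <- as.POSIXct("2020-10-17 12:30:00", tz = "Europe/London")
kickoff_utc <- format(kickoff_local, tz = "UTC", usetz = TRUE)
print(kickoff_utc)  # "2020-10-17 11:30:00 UTC"

# Same boundaries as in the post
game_started_at <- as.POSIXct("2020-10-17 11:31:49", tz = "UTC")
game_ended_at <- as.POSIXct("2020-10-17 13:28:15", tz = "UTC")

# Toy timestamps (made up): one before, three during, one after the match
toy <- data.frame(
  created_at = as.POSIXct(c("2020-10-17 11:00:00", "2020-10-17 11:31:49",
                            "2020-10-17 12:45:00", "2020-10-17 13:28:15",
                            "2020-10-17 14:00:00"), tz = "UTC")
)

# Base R equivalent of the case_when() logic above (boundaries inclusive)
toy$moment <- ifelse(toy$created_at > game_ended_at, "post-game",
                     ifelse(toy$created_at >= game_started_at, "game", "pre-game"))

moment_counts <- table(toy$moment)
print(moment_counts)
```

Note that both boundaries are inclusive, so a tweet posted exactly at the start or end time is counted as part of the game.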

Data Visualization

We load the Premier League and Twitter logos to make the graph look nicer. In this exercise I am using the function image_read() from the package magick.

premier_league_logo <- image_read("premier_league_logo.jpg")
twitter_logo <- image_read("Twitter_bird_logo.png")

Now we’re ready to plot the time series of the number of tweets with the hashtags #EVELIV and #MerseySideDerby during the match. However, let’s first create the graphic without customizing it, so that we can later compare it with the version that has the proper logos and colors. This is the best way to appreciate how important customization is. As statisticians/data analysts/data scientists, we need to know how to present a good graphic to an end user/client.

tweets %>% 
  filter(moment == "game") %>%
  ggplot(aes(created_at)) + 
  geom_freqpoly(binwidth = 60)
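
Because created_at is a date-time, the binwidth is measured in seconds, so binwidth = 60 means the line shows roughly the number of tweets per minute. The base R sketch below reproduces that counting on a handful of made-up timestamps (an assumption for illustration only); the exact bin edges ggplot2 chooses may differ slightly from whole minutes.

```r
# Toy timestamps (made up): offsets in seconds from 11:35:00 UTC
created_at <- as.POSIXct("2020-10-17 11:35:00", tz = "UTC") +
  c(5, 20, 59, 60, 61, 185)

# Floor each timestamp to the start of its minute (60-second bins)
minute_bin <- as.POSIXct(floor(as.numeric(created_at) / 60) * 60,
                         origin = "1970-01-01", tz = "UTC")

# Count tweets per minute
per_minute <- as.data.frame(table(minute = format(minute_bin, "%H:%M")),
                            responseName = "n_tweets")
print(per_minute)
```

Minutes without any tweet (11:37 in this toy example) simply do not appear in the table, whereas the frequency polygon would draw them as zero.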

And now we make the same graphic, adding the Premier League colors, the logos and the main events with annotate(). Also check Michael Toth’s blog for more tips on how to brand your graphs.

source('theme_epl.R') # customize theme for English Premier League

tweets %>% 
  filter(moment == "game") %>%
  ggplot(aes(created_at)) + 
  geom_freqpoly(binwidth = 60, size = 1.2, col = '#3d195b') +
  xlab('Time of the match - UTC') + ylab('') +  
  scale_y_continuous(labels = comma) +
  labs(title = "#tweets (#EVELIV and #MerseysideDerby) over Everton vs Liverpool", 
       subtitle = "English Premier League - 5th round, 17th October 2020",
       caption = "@vidigal_br") + 
  annotate("text", x = as.POSIXct("2020-10-17 11:35:00", tz = 'UTC'), y = 1100, 
           label = "1st Goal \n Sadio Mane", colour = '#D00027', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 11:42:00", tz = 'UTC'), y = 30, 
           label = "Virgil van Dijk \n replaced", colour = '#D00027', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 11:51:00", tz = 'UTC'), y = 1100, 
           label = "1st Goal \n Michael Keane", colour = '#274488', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 13:03:00", tz = 'UTC'), y = 800, 
           label = "2nd Goal \n Salah", colour = '#D00027', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 13:11:57", tz = 'UTC'), y = 1100, 
           label = "2nd Goal \n Calvert-Lewin", colour = '#274488', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 13:20:59", tz = 'UTC'), y = 30, 
           label = "Red card \n Richarlison", colour = '#274488', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 13:25:10", tz = 'UTC'), y = 1650, 
           label = "VAR: NO GOAL \n Offside", colour = '#D00027', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 12:28:00", tz = 'UTC'), y = 800, 
           label = "Interval", colour = '#3d195b', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 11:55:00", tz = 'UTC'), y = 1600, 
           label = "1st Half", colour = '#3d195b', size = 8, family = "sans") +
  annotate("text", x = as.POSIXct("2020-10-17 13:05:00", tz = 'UTC'), y = 1600, 
           label = "2nd Half", colour = '#3d195b', size = 8, family = "sans") +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 11:35:00", tz = 'UTC'), y = 1000, 
    xend = as.POSIXct("2020-10-17 11:35:00", tz = 'UTC'), yend = 450, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 11:40:00", tz = 'UTC'), y = 100,
    xend = as.POSIXct("2020-10-17 11:40:00", tz = 'UTC'), yend = 300, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 11:51:00", tz = 'UTC'), y = 1000,
    xend = as.POSIXct("2020-10-17 11:51:00", tz = 'UTC'), yend = 480, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 13:03:00", tz = 'UTC'), y = 650, 
    xend = as.POSIXct("2020-10-17 13:03:00", tz = 'UTC'), yend = 480, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 13:11:57", tz = 'UTC'), y = 950,
    xend = as.POSIXct("2020-10-17 13:11:57", tz = 'UTC'), yend = 650, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 13:20:59", tz = 'UTC'), y = 150,
    xend = as.POSIXct("2020-10-17 13:20:59", tz = 'UTC'), yend = 300, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(
    geom = "segment", x = as.POSIXct("2020-10-17 13:25:10", tz = 'UTC'), y = 1550,
    xend = as.POSIXct("2020-10-17 13:25:10", tz = 'UTC'), yend = 1450, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) + 
  geom_vline(xintercept = as.POSIXct("2020-10-17 12:21:08", tz = 'UTC'), linetype="dotted", 
             color = "#3d195b", size=1.5) + 
  geom_vline(xintercept = as.POSIXct("2020-10-17 12:36:07", tz = 'UTC'), linetype="dotted", 
             color = "#3d195b", size=1.5) +
  annotation_custom(rasterGrob(premier_league_logo),
                    xmin = as.POSIXct("2020-10-17 11:15:00", tz = 'UTC'), 
                    xmax = as.POSIXct("2020-10-17 11:25:00", tz = 'UTC'), 
                    ymin = -100, ymax = -250) +
  annotation_custom(rasterGrob(twitter_logo),
                    xmin = as.POSIXct("2020-10-17 13:20:00", tz = 'UTC'), 
                    xmax = as.POSIXct("2020-10-17 13:25:00", tz = 'UTC'), 
                    ymin = -250, ymax = -120) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme_epl(angle = 0, size = 20, size_title = 30,
            size_subt = 18, size_caption = 16)
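
The annotate() calls above repeat the same pattern for every event. One possible alternative, shown here only as a sketch, is to keep the events in a data frame and draw them with one geom_text() and one geom_segment() layer. The events table and the randomly generated toy_tweets are hypothetical names and data I introduce for illustration; the times, labels, colors and y positions of the four goals are copied from the code above.

```r
library(ggplot2)

# Hypothetical event table: the four goals, with values taken from the annotate() calls
events <- data.frame(
  time = as.POSIXct(c("2020-10-17 11:35:00", "2020-10-17 11:51:00",
                      "2020-10-17 13:03:00", "2020-10-17 13:11:57"), tz = "UTC"),
  label = c("1st Goal \n Sadio Mane", "1st Goal \n Michael Keane",
            "2nd Goal \n Salah", "2nd Goal \n Calvert-Lewin"),
  colour = c("#D00027", "#274488", "#D00027", "#274488"),
  y_text = c(1100, 1100, 800, 1100),  # height of the label
  y_from = c(1000, 1000, 650, 950),   # start of the arrow
  y_to = c(450, 480, 480, 650)        # tip of the arrow
)

# Made-up tweet timestamps spread over the match, so the sketch runs without the Twitter data
set.seed(1)
toy_tweets <- data.frame(
  created_at = as.POSIXct("2020-10-17 11:31:49", tz = "UTC") + runif(5000, 0, 6986)
)

p <- ggplot(toy_tweets, aes(created_at)) +
  geom_freqpoly(binwidth = 60, colour = "#3d195b") +
  # One text layer for all event labels
  geom_text(data = events, inherit.aes = FALSE,
            aes(x = time, y = y_text, label = label, colour = colour)) +
  # One segment layer for all arrows
  geom_segment(data = events, inherit.aes = FALSE,
               aes(x = time, xend = time, y = y_from, yend = y_to),
               arrow = arrow(length = unit(2, "mm"))) +
  # Use the hex codes stored in the colour column as they are
  scale_colour_identity()
```

The y positions were tuned for the real tweet counts, so with the toy data the labels sit far above the line; the point of the sketch is only that adding an event becomes adding a row to the table.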

As one can easily see, the spikes in the time series are explained by the main events of the match. People on Twitter were reacting as the match evolved and the goals went in. But it was not only the goals: for instance, when the Liverpool player Virgil van Dijk got injured and had to be replaced, the number of tweets went up immediately. The peak came at the end of the match, when Liverpool scored a goal but VAR detected an offside. It could have been the winning goal for the Reds, but after a moment of euphoria the goal was disallowed and the referee blew the final whistle.
