This post shows how to scrape Brazil’s Presidential Election Data from TSE.
I won’t lie to you. The very first time I downloaded election data from the Tribunal Superior Eleitoral (TSE), I did it manually … and it was painful.
28 files to download and unzip … MANUALLY!
But it was also clear to me that this procedure wasn’t the recommended one. I had to figure out a better way to do it.
Thus, looking for a simple way to solve this problem, I found this question on Stack Overflow:
And Hadley answered it with class.
With this valuable info in hand, I needed to find the downloadable links. See the gif below, or check this YouTube video, where the resolution is higher. The tip is to locate the download links by right-clicking the page and choosing View page source.
The R packages used are: xml2, rvest, stringr, purrr and data.table.
The R code below is pretty much what Hadley had posted; I just adapted it to my problem. The object page receives the HTML page we saw in the video above: read_html() from the xml2 package parses the HTML. After that we apply html_nodes() to find the links, html_attr() to extract each URL, and str_subset() to keep the files ending in .zip while excluding the ones ending in .sha.
library(xml2)
library(rvest)
library(stringr)

page <- read_html("https://www.tse.jus.br/hotsites/pesquisas-eleitorais/resultados_anos/boletim_urna/2018/boletim_urna_2_turno.html")

zip_files <- page %>%
  html_nodes("a") %>%                  # find all links
  html_attr("href") %>%                # get the url
  str_subset("\\.zip") %>%             # keep those that end in .zip
  str_subset("\\.sha", negate = TRUE)  # drop the ones ending in .sha
Once you have run the code above, the loop below downloads each file, unzips it and saves the contents to your machine.
for (i in seq_along(zip_files)) {
  temp <- tempfile()                            # temporary file for the zip
  download.file(zip_files[i], temp)             # download one archive
  unzip(temp, exdir = "data/elections_2018")    # extract its CSVs
  unlink(temp)                                  # delete the temporary zip
}
As we are lazy (or should I say smart enough), let’s list all the data files at once with the function list.files().
csvs_to_read <- list.files(
  path = "data/elections_2018",
  pattern = ".*(bweb_2t).*csv$",  # second-round (2º turno) bulletin files
  recursive = TRUE,
  full.names = TRUE
)
That done, you can use the fantastic R function map_df() from purrr coupled with fread() from data.table.
In a few seconds you get your data (nearly 3 million rows and 1.3 GB) ready to be analysed.
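A minimal sketch of that last step. The two demo CSVs, their column names and the temp directory are made up for illustration; with the real data you would pass csvs_to_read instead, and the semicolon separator and Latin-1 encoding are assumptions to verify against your downloaded TSE files.

```r
library(purrr)
library(data.table)

# Hypothetical demo: write two small semicolon-separated CSVs to a temp dir
tmp <- tempdir()
fwrite(data.table(x = 1:2, y = c("a", "b")), file.path(tmp, "bweb_2t_demo1.csv"), sep = ";")
fwrite(data.table(x = 3:4, y = c("c", "d")), file.path(tmp, "bweb_2t_demo2.csv"), sep = ";")

csvs <- list.files(tmp, pattern = "bweb_2t.*csv$", full.names = TRUE)

# map_df() applies fread() to each file and row-binds the results into
# one data frame; sep and encoding are assumed to match the TSE layout
df <- map_df(csvs, ~ fread(.x, sep = ";", encoding = "Latin-1"))
nrow(df)  # 4
```

The same pattern scales to the full set of second-round files: map_df() simply stacks whatever fread() returns for each path.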
That’s all, folks. Web scraping Brazil’s presidential election data is pretty simple.
For attribution, please cite this work as
Vidigal (2021, Jan. 9). Bruno Vidigal: Web Scraping Brazil's Presidential Election Data. Retrieved from https://www.brunovidigal.com/posts/2021-01-09-web-scraping-brazils-presidential-election-data/
BibTeX citation
@misc{vidigal2021web,
  author = {Vidigal, Bruno},
  title = {Bruno Vidigal: Web Scraping Brazil's Presidential Election Data},
  url = {https://www.brunovidigal.com/posts/2021-01-09-web-scraping-brazils-presidential-election-data/},
  year = {2021}
}