Department of Basic Education: Schools Database
INTRODUCTION
A few years ago, I was a student pursuing a Master of Commerce degree. My topic concerned the relationship between education, labour and business outcomes. Although my pursuit of the qualification is now defunct, it left behind several artefacts worth writing up. In this post, we will discuss the South African Schools Database. The Department of Basic Education (DBE) regularly publishes versions of the database, including a number of notable variables such as each school's location, contact information, and number of learners and teachers. The data is published in a non-standardised manner, which makes it an interesting data-wrangling task.
DATA SCRAPING
The first hurdle is the volume of files published on the DBE website. It is possible to download all 200 files by hand and save them to a directory of your choosing. However, such a process would be tedious and error prone (speaking from experience, of course). The R programming language is a perfect companion for this task. To download the files, we can use two important packages, rvest and xml2.
First, we pass the URL to the read_html function. Thereafter, we use the html_elements function to select the HTML tag of interest. In this case, we are interested in the "a" tag, specifically its href (link) attribute.
We store the result in a data.frame object and use a regular expression to filter for values that start with a punctuation character followed by "Link" and contain the term "forcedownload". Finally, we prepend our base URL to each file path. The resulting data.frame contains all 200 downloadable files.
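A minimal sketch of that step follows. The page URL, the exact filter patterns and the object names (base_url, dbe_links) are illustrative assumptions, not the original script's values.

library(rvest)   # read_html(), html_elements(), html_attr()
library(dplyr)

# Hypothetical masterlist page; substitute the actual DBE URL.
base_url <- "https://www.education.gov.za"
page <- read_html(paste0(base_url, "/EMIS/EMISDownloads.aspx"))

# Collect every href attribute from the "a" tags on the page.
dbe_links <- data.frame(
  href = page %>% html_elements("a") %>% html_attr("href")
)

# Keep hrefs that start with a punctuation character followed by
# "Link" and also contain "forcedownload", then prepend the base URL.
dbe_links <- dbe_links %>%
  filter(grepl("^[[:punct:]]Link", href),
         grepl("forcedownload", href)) %>%
  mutate(full_url = paste0(base_url, href))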
DOWNLOADS, SO MANY DOWNLOADS.
With the links in hand, we can tackle the next hurdle: downloading the files. Ordinarily, a single download command would do, for example the one below.
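The original one-liner is not shown; as a stand-in, base R's download.file plays the same role (the destination file name here is illustrative).

# Download the first file in our list; mode = "wb" keeps the
# binary xlsx intact on Windows. (Illustrative file name.)
download.file(url = dbe_links$full_url[1],
              destfile = "./Schools_Db/masterlist_1.xlsx",
              mode = "wb")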
However, the DBE datasets are saved in either xlsx or xls format, with a prompt on click to download the file. There are probably ways around this issue; luckily, xml2 has a convenient function to handle it. In the code below, we use a for-loop to download each file in our dataset and save it in a specified subfolder. To avoid breaking the loop when an error occurs, we wrap each download in the try function. As an add-on, we print a statement after each download. Depending on your internet connection speed and the website's response time, this script can take about five minutes to download all the files.
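A sketch of that loop, assuming xml2::download_html() is the convenient function in question and using illustrative file names:

library(xml2)   # download_html()

dir.create("./Schools_Db", showWarnings = FALSE)

for(i in seq_len(nrow(dbe_links))){
  destfile <- file.path("./Schools_Db", paste0("masterlist_", i, ".xlsx"))
  # try() keeps one failed download from halting the whole loop.
  try(download_html(dbe_links$full_url[i], file = destfile))
  print(paste("Downloaded file", i, "of", nrow(dbe_links)))
}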
BRING YOUR BROOMS CAUSE IT’S A MESS!
The next set of tasks is reading and cleaning the Excel files. In software such as Excel or SPSS, these tasks would be cumbersome. With R, Python or other programming languages, it is possible to handle many files at once. Below, we use the readxl and purrr R packages to iteratively read and clean the files.
# Packages used below
library(readxl)   # excel_sheets(), read_xlsx()
library(dplyr)    # mutate() and the pipe
library(purrr)    # map()
library(tidyr)    # unnest()
library(janitor)  # clean_names()

# All_Excel_Reader --------------------------------------------------------
# Read every sheet of a workbook into a list of data frames.
all_excel <- function(path){
  collect_sheets <- excel_sheets(path)
  per_sheet <- vector("list", length(collect_sheets))
  for(i in seq_along(collect_sheets)){
    per_sheet[[i]] <- read_xlsx(path = path,
                                sheet = collect_sheets[i])
  }
  return(per_sheet)
}

# All_Masterlist ----------------------------------------------------------
# List the downloaded workbooks and read each one with all_excel().
EDU_Dbs <- data.frame(master_list = list.files(path = "./Schools_Db",
                                               full.names = TRUE,
                                               pattern = "\\.xlsx$")) %>%
  mutate(schools_db = map(master_list, all_excel))

# Unnest to one row per sheet, then coerce every column to character
# so that sheets with conflicting column types can be stacked.
EDU_Dbs <- EDU_Dbs %>%
  unnest(schools_db)
EDU_Dbs$schools_db <- lapply(EDU_Dbs$schools_db, sapply, as.character)
EDU_Dbs$schools_db <- lapply(EDU_Dbs$schools_db, as.data.frame)
EDU_Dbs <- EDU_Dbs %>%
  unnest(schools_db)

# Standardise the column names (e.g. "School Name" -> school_name).
EDU_Dbs <- EDU_Dbs %>%
  clean_names()
The resulting data.frame contains hundreds of thousands of rows and nearly 60 columns. Interestingly, most of these variables are effectively the same field under differing naming conventions, such as emisno = natemis = oldnatemis = newnatemis. The inconsistent naming extends to other variables such as GPS coordinates and centre details. The spectacularly unoptimised script is available here. The scraper script, cleaning script and downloaded xlsx files are all available in this GitHub repository.
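Those duplicate identifiers can then be collapsed into a single column. A sketch using dplyr::coalesce, assuming the cleaned names emisno, natemis, oldnatemis and newnatemis all survive clean_names():

library(dplyr)

# Take the first non-missing identifier per row. All columns are
# character after the earlier coercion, so coalesce() applies cleanly.
# (Column names are assumptions based on the conventions noted above.)
EDU_Dbs <- EDU_Dbs %>%
  mutate(school_id = coalesce(emisno, natemis, oldnatemis, newnatemis))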
CONCLUSION
Using R to clean data is a wise choice. This post demonstrated one implementation on a relatively small dataset. The final dataset can be used to match the school performance reports regularly published by the Department of Basic Education.
REFERENCES
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Firke, S. (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://CRAN.R-project.org/package=janitor
Wickham, H., & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Ooms, J. (2021). writexl: Export Data Frames to Excel 'xlsx' Format. R package version 1.4.0. https://CRAN.R-project.org/package=writexl
Wickham, H. (2021). rvest: Easily Harvest (Scrape) Web Pages. R package version 1.0.2. https://CRAN.R-project.org/package=rvest