This is the `rsuite` version of the original tests published by Jozef Hajnala on GitLab.
The scripts are conceptually the same as the originals, with these changes:
- a new R script using `awk`: `06_fread_awk.R`
- modification of the project structure to follow the `rsuite` paradigm
- addition of comments to the scripts
- generation of a results data frame: `show_results.R`
- plots for memory and average time per operation
- all packages are installed exclusively for this project, under the folder `packages`, isolated from the global environment
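The idea behind the `awk` variant can be sketched in the shell alone: filter the rows of a CSV before R reads them, so that only matching rows reach memory. The sample file, the column position and the filter value below are hypothetical stand-ins, not taken from `06_fread_awk.R`.

```shell
# Hypothetical sample data; the real airline CSV files have many more columns.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
Year,Origin,Dest
2008,LAX,JFK
2008,SFO,LAX
2008,LAX,ORD
EOF

# Keep the header (NR == 1) plus the rows whose 2nd field equals "LAX".
filtered=$(awk -F, 'NR == 1 || $2 == "LAX"' "$tmpdir/sample.csv")
printf '%s\n' "$filtered"

rm -rf "$tmpdir"
```

In R, a command of this kind can be passed to `data.table::fread` through its `cmd` argument, so the filtered stream is parsed directly.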
The data is the same as in the original project (airlines): CSV files located here.
The files are downloaded compressed with bzip2, then expanded to `csv`.
bash bench.sh R/01_base.R &> results/out_base.txt
bash bench.sh R/02_fread.R &> results/out_fread.txt
bash bench.sh R/03_readr.R &> results/out_readr.txt
bash bench.sh R/04_fread_grep.R &> results/out_fread_grep.txt
bash bench.sh R/05_readr_grep.R &> results/out_readr_grep.txt
bash bench.sh R/06_fread_awk.R &> results/out_fread_awk.txt
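The six commands follow one naming pattern, so they can also be generated in a loop. This is a minimal sketch that only prints the commands (a dry run); it assumes each output file is named after the script with its numeric prefix removed, as in the commands above.

```shell
# Dry run: build and print the benchmark command for each script
# instead of executing it.
scripts="R/01_base.R R/02_fread.R R/03_readr.R R/04_fread_grep.R R/05_readr_grep.R R/06_fread_awk.R"

commands=""
for script in $scripts; do
  base=$(basename "$script" .R)   # e.g. 04_fread_grep
  name=${base#*_}                 # drop the numeric prefix -> fread_grep
  cmd="bash bench.sh $script &> results/out_${name}.txt"
  commands="${commands}${cmd}
"
done
printf '%s' "$commands"
```

To actually run the benchmarks, each printed line can be executed from the project root with `bash`, since `&>` is a bash redirection.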
library(dplyr)
source("R/show_results.R")
df <- get_results_df() %>%
    arrange(-mem_gb) %>%
    select(result_file, rscript, mem_gb, avg_secs) %>%
    mutate_if(is.numeric, round, digits = 2) %>%
    print()
#>          result_file           rscript mem_gb avg_secs
#> 1      out_readr.txt      R/03_readr.R  27.01   121.57
#> 2       out_base.txt       R/01_base.R  21.93   293.52
#> 3      out_fread.txt      R/02_fread.R  15.24    26.86
#> 4 out_readr_grep.txt R/05_readr_grep.R   1.64    28.19
#> 5 out_fread_grep.txt R/04_fread_grep.R   1.62    10.01
#> 6  out_fread_awk.txt  R/06_fread_awk.R   1.47    25.60
This is the same table, with the original description of each script added.
#>                               description           rscript mem_gb
#> 1     `readr::read_csv` + `purrr:map_dfr`      R/03_readr.R  27.01
#> 2       `utils::read.csv` + `base::rbind`       R/01_base.R  21.93
#> 3       `data.table::fread` + `rbindlist`      R/02_fread.R  15.24
#> 4 `readr::read_csv`+ `pipe()` from `grep` R/05_readr_grep.R   1.64
#> 5         `data.table::fread` from `grep` R/04_fread_grep.R   1.62
#> 6          `data.table::fread` from `awk`  R/06_fread_awk.R   1.47
#>   avg_secs
#> 1   121.57
#> 2   293.52
#> 3    26.86
#> 4    28.19
#> 5    10.01
#> 6    25.60
- mem_gb = Maximum resident set size, in gigabytes
- avg_secs = Average of real time and user time as measured by `time`, in seconds
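As a worked example of the two definitions, the sketch below derives both columns from raw measurements. The input numbers are made up for illustration, and the conversion assumes that the resident set size is reported in kilobytes and that 1 GB = 1024 × 1024 kB; the actual parsing in `show_results.R` may differ.

```shell
# Hypothetical raw measurements (not taken from the benchmark results).
real_secs=40.0        # wall-clock ("real") time, seconds
user_secs=20.0        # user CPU time, seconds
max_rss_kb=2097152    # maximum resident set size, assumed to be in kB

# avg_secs: mean of real time and user time.
avg_secs=$(awk -v r="$real_secs" -v u="$user_secs" \
  'BEGIN { printf "%.2f", (r + u) / 2 }')

# mem_gb: kilobytes -> gigabytes (assuming 1 GB = 1024 * 1024 kB).
mem_gb=$(awk -v kb="$max_rss_kb" \
  'BEGIN { printf "%.2f", kb / (1024 * 1024) }')

echo "avg_secs=$avg_secs mem_gb=$mem_gb"
```

With these inputs the script prints `avg_secs=30.00 mem_gb=2.00`.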
library(ggplot2)
library(scales)
df$rscript <- reorder(df$rscript, df$mem_gb)
ggplot(df, aes(x = rscript, y = mem_gb)) +
    geom_col() +
    scale_y_continuous(breaks = pretty_breaks(n = 20)) +
    coord_flip()
df$rscript <- reorder(df$rscript, df$avg_secs)
ggplot(df, aes(x = rscript, y = avg_secs)) +
    geom_col() +
    scale_y_continuous(breaks = pretty_breaks(n = 20)) +
    coord_flip()
- Download and install the RSuite client. Available for Linux, Mac and Windows.
- Install the `rsuite` package with `rsuite install`
- Clone or download this repository.
- Change to this repo folder and install the dependencies in its own isolated, reproducible environment. Use `rsuite proj depsinst`
- Build the project with `rsuite proj build`
- Download the data by running this from the console: `Rscript R/data_prep.R`
- Run each of the tests. Example: `bash bench.sh R/01_base.R &> results/out_base.txt`. See above for the rest.
- Generate a comparative table with the results by running `Rscript R/show_results.R`
- Original article by Jozef Hajnala: How data.table’s fread can save you a lot of time and memory, and take input from shell commands
- Original repository: https://gitlab.com/jozefhajnala/fread-benchmarks
- Article by Nick Strayer: Using AWK and R to parse 25tb
- This repository by Alfonso R. Reyes