fread-benchmarks-rsuite

Introduction

This is the rsuite version of the original tests published by Jozef Hajnala in Gitlab.

It is conceptually the same scripts but adding these changes:

a new R script using awk: 06_fread_awk.R
modification of the project structure using the rsuite paradigm
addition of comments on scripts
generation of results dataframe: show_results.R
plots for memory and average time per operation
packages are all installed exclusively for this project, under the folder packages, insulated from the global environment

Data

The data is the same used in the original project (airlines), CSV files located here. The files are downloaded compressed as bunzip, then expanded as csv.

Running the benchmarks

For base R

bash bench.sh R/01_base.R &> results/out_base.txt

For data.table::fread

bash bench.sh R/02_fread.R &> results/out_fread.txt

For readr::read_csv

bash bench.sh R/03_readr.R &> results/out_readr.txt

For data.table::fread with grep

bash bench.sh R/04_fread_grep.R &> results/out_fread_grep.txt

For readr::read_csv with grep

bash bench.sh R/05_readr_grep.R &> results/out_readr_grep.txt

For data.table::fread with awk

bash bench.sh R/06_fread_awk.R &> results/out_fread_awk.txt

Results

Tables

library(dplyr)

source("R/show_results.R")

df <- get_results_df() %>% 
    arrange(-mem_gb) %>% 
    select(result_file, rscript, mem_gb, avg_secs) %>% 
    mutate_if(is.numeric, round, digits = 2) %>% 
    print()
#>          result_file            rscript mem_gb avg_secs
#> 1      out_readr.txt      R/03_readr.R   27.01   121.57
#> 2       out_base.txt       R/01_base.R   21.93   293.52
#> 3      out_fread.txt      R/02_fread.R   15.24    26.86
#> 4 out_readr_grep.txt R/05_readr_grep.R    1.64    28.19
#> 5 out_fread_grep.txt R/04_fread_grep.R    1.62    10.01
#> 6  out_fread_awk.txt  R/06_fread_awk.R    1.47    25.60

This is the same table but including the original description for the scripts.

#>                               description            rscript mem_gb
#> 1     `readr::read_csv` + `purrr:map_dfr`      R/03_readr.R   27.01
#> 2       `utils::read.csv` + `base::rbind`       R/01_base.R   21.93
#> 3       `data.table::fread` + `rbindlist`      R/02_fread.R   15.24
#> 4 `readr::read_csv`+ `pipe()` from `grep` R/05_readr_grep.R    1.64
#> 5         `data.table::fread` from `grep` R/04_fread_grep.R    1.62
#> 6          `data.table::fread` from `awk`  R/06_fread_awk.R    1.47
#>   avg_secs
#> 1   121.57
#> 2   293.52
#> 3    26.86
#> 4    28.19
#> 5    10.01
#> 6    25.60

mem_gb = Maximum resident set size, gigabytes
avg_secs = Average of real time and user time as measured by time, seconds

Memory required (GB)

library(ggplot2)
library(scales)

df$rscript <- reorder(df$rscript, df$mem_gb)
ggplot(df, aes(x = rscript, y = mem_gb)) +
    geom_col() +
    scale_y_continuous(breaks=pretty_breaks(n=20)) +
    coord_flip()

Average Time (seconds)

df$rscript <- reorder(df$rscript, df$avg_secs)
ggplot(df, aes(x = rscript, y = avg_secs)) +
    geom_col() +
    scale_y_continuous(breaks=pretty_breaks(n=20)) +
    coord_flip()

How to reproduce this project yourself

Download and install the RSuite client. Available for Linux, Mac and Windows.
Install the rsuite package with rsuite install
Clone or download this repository.
Change to this repo folder and install the dependencies on its own isolated reproducible environment. Use rsuite proj depsinst
Build the project with rsuite proj build
Download the data running this from the console Rscript R/data_prep.R
Run each of the tests. Example: bash bench.sh R/01_base.R &> results/out_base.txt. See above for the rest.
Generate a comparative table with the results running Rscript R/show_results.R

References

Original article by Jozef Hajnala: How data.table’s fread can save you a lot of time and memory, and take input from shell commands
Original repository: https://gitlab.com/jozefhajnala/fread-benchmarks
Article by Nick Strayer: Using AWK and R to parse 25tb
This repository by Alfonso R. Reyes
Rsuite client

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

fread-benchmarks-rsuite

Introduction

Data

Running the benchmarks

For base R

For data.table::fread

For readr::read_csv

For data.table::fread with grep

For readr::read_csv with grep

For data.table::fread with awk

Results

Tables

Memory required (GB)

Average Time (seconds)

How to reproduce this project yourself

References

Files

README.md

Latest commit

History

README.md

File metadata and controls

fread-benchmarks-rsuite

Introduction

Data

Running the benchmarks

For base R

For data.table::fread

For readr::read_csv

For data.table::fread with grep

For readr::read_csv with grep

For data.table::fread with awk

Results

Tables

Memory required (GB)

Average Time (seconds)

How to reproduce this project yourself

References