fread: performance difference with/without specifying colClasses #6105

scipima · 2024-04-29T10:42:40Z

Hello,
I noticed a large performance difference in fread() with and without specifying colClasses().
If i specify the classes of the toy dataset below, the execution takes considerably longer than without specifying any classes.
There is a similar question on StackOverflow (currently without answer), but in that case it is not entirely clear whether the issue is relative to the calculation also present in the script or due to fread() itself.
So, i created the script below which deals exclusively with fread().
In one case, I also specify the keys when reading in the data, but that does not seem to make any difference.

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

set.seed(1234)
vote_id <- sample(x = 100000:200000, size = 20000, replace = FALSE)
pers_id <- sample(x = 100000:200000, size = 900, replace = FALSE)
dates <- sample(x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"), size = 250, replace = FALSE)
result <- sample(x = -1:1, size = 1000000, replace = TRUE)
big_dt <- data.table::data.table( 
  vote_id = vote_id,
  pers_id = pers_id,
  dates = dates,
  result = result)
# write data  
data.table::fwrite(x = big_dt, "big_dt.csv")
# becnmark fread
res = bench::mark(
  fread_classes_keys = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") ), 
    key = c("dates", "vote_id", "pers_id")),
  fread_classes = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") )),
  fread = data.table::fread(
    file = "big_dt.csv"),
  check = FALSE, iterations = 10)
print(res)
sessionInfo()

Results

> source("/home/scipima/Documents/github/ep_rollcall/test_scripts/fread_test.R", encoding = "UTF-8")
# A tibble: 3 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 fread_clas…  2.54s   2.55s     0.390   159.3MB    0.780    10    20      25.6s
2 fread_clas…  2.51s   2.51s     0.397   146.9MB    0.794    10    20      25.2s
3 fread       8.75ms 12.59ms    83.4      16.8MB    8.34     10     1    119.9ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning messages:
1: Item 2 has 900 rows but longest item has 1000000; recycled with remainder. 
2: Some expressions had a GC in every iteration; so filtering is disabled.

Output of sessionInfo()

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 21.3

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nl_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=nl_BE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.4.0    magrittr_2.0.3    bench_1.1.3       cli_3.6.2        
 [5] tools_4.4.0       pillar_1.9.0      glue_1.7.0        tibble_3.2.1     
 [9] utf8_1.2.4        fansi_1.0.6       vctrs_0.6.5       data.table_1.15.0
[13] jsonlite_1.8.8    lifecycle_1.0.4   pkgconfig_2.0.3   rlang_1.1.3      
[17] profmem_0.6.0

The text was updated successfully, but these errors were encountered:

tdhock · 2024-04-29T16:53:36Z

I confirm, here is the atime version of the benchmark (runs in less than 1 sec, and you can still clearly see the 10x difference in memory usage, and 100x difference in time)

atime.res <- atime::atime(setup={
  set.seed(1234)
  result <- sample(x = -1:1, size = N, replace = TRUE)
  Nsmall <- N/2
  Nseq <- seq(Nsmall,Nsmall*2)
  vote_id <- sample(Nseq, size = N, replace=TRUE)
  pers_id <- sample(Nseq, size = N, replace=TRUE)
  dates <- sample(
    x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
    size = N, replace = TRUE)
  big_dt <- data.table::data.table( 
    vote_id,
    pers_id = rep(pers_id,l=length(vote_id)),
    dates,
    result)
  # write data  
  data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_keys = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") ), 
  key = c("dates", "vote_id", "pers_id")
), fread_classes = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") )
), fread = data.table::fread(
  file = "big_dt.csv"))
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)

scipima · 2024-04-29T17:34:08Z

Wow, thanks for the quick reply.
Then my question is: is this the intended behaviour? My (naive) assumption was that, by specifying the classes, the command would have run faster rather than slower, as the guessing part was out of the picture. Consider that this toy example is very simple, as most classes are integer except one date column.
Again, thanks for the quick feedback!!

tdhock · 2024-04-29T18:00:10Z

I think this is intended, because if you switch Date to IDate, you get the expected speedup.

atime.res <- atime::atime(setup={
  set.seed(1234)
  result <- sample(x = -1:1, size = N, replace = TRUE)
  Nsmall <- N/2
  Nseq <- seq(Nsmall,Nsmall*2)
  vote_id <- sample(Nseq, size = N, replace=TRUE)
  pers_id <- sample(Nseq, size = N, replace=TRUE)
  dates <- sample(
    x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
    size = N, replace = TRUE)
  big_dt <- data.table::data.table( 
    vote_id,
    pers_id = rep(pers_id,l=length(vote_id)),
    dates,
    result)
  # write data  
  data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_IDate = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    IDate = c("dates") )
), fread_classes = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") )
), fread = data.table::fread(
  file = "big_dt.csv"
), result=TRUE)
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)
atime.res$measurements[expr.name=="fread"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]

Below we see that fread defaults to reading IDate, and when you specify colClasses Date you get that instead of IDate (and above plot shows that IDate is much more efficient than Date).

> atime.res$measurements[expr.name=="fread"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <IDat>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0
> atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <Date>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0
> atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <IDat>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0

So you should be able to get speedups by switching Date to IDate in your colClasses.
If that solves your issue, can you please close?

scipima · 2024-04-29T18:04:38Z

Thanks a lot!
Best, Marco

MichaelChirico · 2024-04-29T18:08:44Z

Thanks for investigating Toby. I agree with OP that this is not ideal behavior & with the expectation that it would be more efficient to specify colClasses as a rule. That we default to IDate instead of Date is something of an implementation detail -- e.g. new users of data.table might not even know this class exists.

I think we should close this with an improved user experience, e.g. a message() in this case that displays once per session advising the user to (1) specify IDate instead or (2) just let fread() guess the class itself.

Alternatively, since IDate inherits from Date, probably fread() shouldn't do anything in this case -- the user's request (I want a Date column) is satisfied, since inherits(x$dates) will be TRUE even when we return an IDate column. If they explicitly want a Date-not-IDate column, I think it's reasonable for the user to be responsible for the coercion.

MichaelChirico · 2024-04-29T18:58:10Z

Investigating and this situation is even worse than I thought.

I originally thought the inefficiency is because the core fread returns an IDate, and we were apply as.Date(<IDate>).

But actually, the colClasses override is first leading to the IDate being coerced to character, and then to IDate, meaning we're doing, in effect, as.Date(as.character(<IDate>)), which is way slower -- as.Date.character() in effect has to re-do all the inefficient work of parsing the dates that fread already did!

scipima · 2024-05-03T14:43:18Z

Thanks for investigating Toby. I agree with OP that this is not ideal behavior & with the expectation that it would be more efficient to specify colClasses as a rule. That we default to IDate instead of Date is something of an implementation detail -- e.g. new users of data.table might not even know this class exists.

I think we should close this with an improved user experience, e.g. a message() in this case that displays once per session advising the user to (1) specify IDate instead or (2) just let fread() guess the class itself.

Alternatively, since IDate inherits from Date, probably fread() shouldn't do anything in this case -- the user's request (I want a Date column) is satisfied, since inherits(x$dates) will be TRUE even when we return an IDate column. If they explicitly want a Date-not-IDate column, I think it's reasonable for the user to be responsible for the coercion.

HI all,
thanks for the replies.
I think that Micheal's idea to get a warning once per session would be ideal - the end user would then be aware of what is entailed by interpreting the column as Date rather than IDate.
Thanks again, marco

Towards #6105. This PR is a simple refactor to reduce code nesting / improve readability. It makes the actual implementation of #6105, #6107, much simpler.

tdhock added the performance label Apr 29, 2024

tdhock added the fread label Apr 29, 2024

scipima closed this as completed Apr 29, 2024

MichaelChirico reopened this Apr 29, 2024

This was referenced Apr 29, 2024

Refactor colClasses handling for readability #6106

Merged

Defer as.Date() coercion to R level #6107

Merged

MichaelChirico added a commit that referenced this issue Jun 1, 2024

Refactor colClasses handling for readability (#6106)

6a4bd54

Towards #6105. This PR is a simple refactor to reduce code nesting / improve readability. It makes the actual implementation of #6105, #6107, much simpler.

MichaelChirico closed this as completed in #6107 Sep 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fread: performance difference with/without specifying colClasses #6105

fread: performance difference with/without specifying colClasses #6105

scipima commented Apr 29, 2024

tdhock commented Apr 29, 2024

scipima commented Apr 29, 2024 •

edited

Loading

tdhock commented Apr 29, 2024

scipima commented Apr 29, 2024

MichaelChirico commented Apr 29, 2024 •

edited

Loading

MichaelChirico commented Apr 29, 2024

scipima commented May 3, 2024 •

edited

Loading

fread: performance difference with/without specifying colClasses #6105

fread: performance difference with/without specifying colClasses #6105

Comments

scipima commented Apr 29, 2024

tdhock commented Apr 29, 2024

scipima commented Apr 29, 2024 • edited Loading

tdhock commented Apr 29, 2024

scipima commented Apr 29, 2024

MichaelChirico commented Apr 29, 2024 • edited Loading

MichaelChirico commented Apr 29, 2024

scipima commented May 3, 2024 • edited Loading

scipima commented Apr 29, 2024 •

edited

Loading

MichaelChirico commented Apr 29, 2024 •

edited

Loading

scipima commented May 3, 2024 •

edited

Loading