Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread: performance difference with/without specifying colClasses #6105

Closed
scipima opened this issue Apr 29, 2024 · 7 comments · Fixed by #6107
Closed

fread: performance difference with/without specifying colClasses #6105

scipima opened this issue Apr 29, 2024 · 7 comments · Fixed by #6107

Comments

@scipima
Copy link

scipima commented Apr 29, 2024

Hello,
I noticed a large performance difference in fread() with and without specifying colClasses().
If i specify the classes of the toy dataset below, the execution takes considerably longer than without specifying any classes.
There is a similar question on StackOverflow (currently without answer), but in that case it is not entirely clear whether the issue is relative to the calculation also present in the script or due to fread() itself.
So, i created the script below which deals exclusively with fread().
In one case, I also specify the keys when reading in the data, but that does not seem to make any difference.

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

set.seed(1234)
vote_id <- sample(x = 100000:200000, size = 20000, replace = FALSE)
pers_id <- sample(x = 100000:200000, size = 900, replace = FALSE)
dates <- sample(x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"), size = 250, replace = FALSE)
result <- sample(x = -1:1, size = 1000000, replace = TRUE)
big_dt <- data.table::data.table( 
  vote_id = vote_id,
  pers_id = pers_id,
  dates = dates,
  result = result)
# write data  
data.table::fwrite(x = big_dt, "big_dt.csv")
# becnmark fread
res = bench::mark(
  fread_classes_keys = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") ), 
    key = c("dates", "vote_id", "pers_id")),
  fread_classes = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") )),
  fread = data.table::fread(
    file = "big_dt.csv"),
  check = FALSE, iterations = 10)
print(res)
sessionInfo()

Results

> source("/home/scipima/Documents/github/ep_rollcall/test_scripts/fread_test.R", encoding = "UTF-8")
# A tibble: 3 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 fread_clas…  2.54s   2.55s     0.390   159.3MB    0.780    10    20      25.6s
2 fread_clas…  2.51s   2.51s     0.397   146.9MB    0.794    10    20      25.2s
3 fread       8.75ms 12.59ms    83.4      16.8MB    8.34     10     1    119.9ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning messages:
1: Item 2 has 900 rows but longest item has 1000000; recycled with remainder. 
2: Some expressions had a GC in every iteration; so filtering is disabled. 

Output of sessionInfo()

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 21.3

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nl_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=nl_BE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.4.0    magrittr_2.0.3    bench_1.1.3       cli_3.6.2        
 [5] tools_4.4.0       pillar_1.9.0      glue_1.7.0        tibble_3.2.1     
 [9] utf8_1.2.4        fansi_1.0.6       vctrs_0.6.5       data.table_1.15.0
[13] jsonlite_1.8.8    lifecycle_1.0.4   pkgconfig_2.0.3   rlang_1.1.3      
[17] profmem_0.6.0    
@tdhock
Copy link
Member

tdhock commented Apr 29, 2024

I confirm, here is the atime version of the benchmark (runs in less than 1 sec, and you can still clearly see the 10x difference in memory usage, and 100x difference in time)

atime.res <- atime::atime(setup={
  set.seed(1234)
  result <- sample(x = -1:1, size = N, replace = TRUE)
  Nsmall <- N/2
  Nseq <- seq(Nsmall,Nsmall*2)
  vote_id <- sample(Nseq, size = N, replace=TRUE)
  pers_id <- sample(Nseq, size = N, replace=TRUE)
  dates <- sample(
    x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
    size = N, replace = TRUE)
  big_dt <- data.table::data.table( 
    vote_id,
    pers_id = rep(pers_id,l=length(vote_id)),
    dates,
    result)
  # write data  
  data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_keys = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") ), 
  key = c("dates", "vote_id", "pers_id")
), fread_classes = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") )
), fread = data.table::fread(
  file = "big_dt.csv"))
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)

image

@tdhock tdhock added the fread label Apr 29, 2024
@scipima
Copy link
Author

scipima commented Apr 29, 2024

Wow, thanks for the quick reply.
Then my question is: is this the intended behaviour? My (naive) assumption was that, by specifying the classes, the command would have run faster rather than slower, as the guessing part was out of the picture. Consider that this toy example is very simple, as most classes are integer except one date column.
Again, thanks for the quick feedback!!

@tdhock
Copy link
Member

tdhock commented Apr 29, 2024

I think this is intended, because if you switch Date to IDate, you get the expected speedup.

atime.res <- atime::atime(setup={
  set.seed(1234)
  result <- sample(x = -1:1, size = N, replace = TRUE)
  Nsmall <- N/2
  Nseq <- seq(Nsmall,Nsmall*2)
  vote_id <- sample(Nseq, size = N, replace=TRUE)
  pers_id <- sample(Nseq, size = N, replace=TRUE)
  dates <- sample(
    x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
    size = N, replace = TRUE)
  big_dt <- data.table::data.table( 
    vote_id,
    pers_id = rep(pers_id,l=length(vote_id)),
    dates,
    result)
  # write data  
  data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_IDate = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    IDate = c("dates") )
), fread_classes = data.table::fread(
  file = "big_dt.csv",
  colClasses = list(
    integer = c("vote_id", "pers_id", "result"),
    Date = c("dates") )
), fread = data.table::fread(
  file = "big_dt.csv"
), result=TRUE)
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)
atime.res$measurements[expr.name=="fread"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]

image

Below we see that fread defaults to reading IDate, and when you specify colClasses Date you get that instead of IDate (and above plot shows that IDate is much more efficient than Date).

> atime.res$measurements[expr.name=="fread"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <IDat>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0
> atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <Date>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0
> atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]
   vote_id pers_id      dates result
     <int>   <int>     <IDat>  <int>
1:       1       1 2010-05-13      0
2:       2       1 2010-04-08      0

So you should be able to get speedups by switching Date to IDate in your colClasses.
If that solves your issue, can you please close?

@scipima
Copy link
Author

scipima commented Apr 29, 2024

Thanks a lot!
Best, Marco

@scipima scipima closed this as completed Apr 29, 2024
@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 29, 2024

Thanks for investigating Toby. I agree with OP that this is not ideal behavior & with the expectation that it would be more efficient to specify colClasses as a rule. That we default to IDate instead of Date is something of an implementation detail -- e.g. new users of data.table might not even know this class exists.

I think we should close this with an improved user experience, e.g. a message() in this case that displays once per session advising the user to (1) specify IDate instead or (2) just let fread() guess the class itself.

Alternatively, since IDate inherits from Date, probably fread() shouldn't do anything in this case -- the user's request (I want a Date column) is satisfied, since inherits(x$dates) will be TRUE even when we return an IDate column. If they explicitly want a Date-not-IDate column, I think it's reasonable for the user to be responsible for the coercion.

@MichaelChirico
Copy link
Member

Investigating and this situation is even worse than I thought.

I originally thought the inefficiency is because the core fread returns an IDate, and we were apply as.Date(<IDate>).

But actually, the colClasses override is first leading to the IDate being coerced to character, and then to IDate, meaning we're doing, in effect, as.Date(as.character(<IDate>)), which is way slower -- as.Date.character() in effect has to re-do all the inefficient work of parsing the dates that fread already did!

@scipima
Copy link
Author

scipima commented May 3, 2024

Thanks for investigating Toby. I agree with OP that this is not ideal behavior & with the expectation that it would be more efficient to specify colClasses as a rule. That we default to IDate instead of Date is something of an implementation detail -- e.g. new users of data.table might not even know this class exists.

I think we should close this with an improved user experience, e.g. a message() in this case that displays once per session advising the user to (1) specify IDate instead or (2) just let fread() guess the class itself.

Alternatively, since IDate inherits from Date, probably fread() shouldn't do anything in this case -- the user's request (I want a Date column) is satisfied, since inherits(x$dates) will be TRUE even when we return an IDate column. If they explicitly want a Date-not-IDate column, I think it's reasonable for the user to be responsible for the coercion.

HI all,
thanks for the replies.
I think that Micheal's idea to get a warning once per session would be ideal - the end user would then be aware of what is entailed by interpreting the column as Date rather than IDate.
Thanks again, marco

MichaelChirico added a commit that referenced this issue Jun 1, 2024
Towards #6105. This PR is a simple refactor to reduce code nesting / improve readability. It makes the actual implementation of #6105, #6107, much simpler.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants