fread: performance difference with/without specifying colClasses #6105
Comments
I confirm. Here is the atime version of the benchmark (it runs in less than 1 second, and you can still clearly see the 10x difference in memory usage and the 100x difference in time):
atime.res <- atime::atime(setup={
set.seed(1234)
result <- sample(x = -1:1, size = N, replace = TRUE)
Nsmall <- N/2
Nseq <- seq(Nsmall,Nsmall*2)
vote_id <- sample(Nseq, size = N, replace=TRUE)
pers_id <- sample(Nseq, size = N, replace=TRUE)
dates <- sample(
x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
size = N, replace = TRUE)
big_dt <- data.table::data.table(
vote_id,
pers_id = rep(pers_id, length.out=length(vote_id)),
dates,
result)
# write data
data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_keys = data.table::fread(
file = "big_dt.csv",
colClasses = list(
integer = c("vote_id", "pers_id", "result"),
Date = c("dates") ),
key = c("dates", "vote_id", "pers_id")
), fread_classes = data.table::fread(
file = "big_dt.csv",
colClasses = list(
integer = c("vote_id", "pers_id", "result"),
Date = c("dates") )
), fread = data.table::fread(
file = "big_dt.csv"))
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)
Wow, thanks for the quick reply.
I think this is intended, because if you switch Date to IDate, you get the expected speedup.
atime.res <- atime::atime(setup={
set.seed(1234)
result <- sample(x = -1:1, size = N, replace = TRUE)
Nsmall <- N/2
Nseq <- seq(Nsmall,Nsmall*2)
vote_id <- sample(Nseq, size = N, replace=TRUE)
pers_id <- sample(Nseq, size = N, replace=TRUE)
dates <- sample(
x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"),
size = N, replace = TRUE)
big_dt <- data.table::data.table(
vote_id,
pers_id = rep(pers_id, length.out=length(vote_id)),
dates,
result)
# write data
data.table::fwrite(x = big_dt, "big_dt.csv")
}, fread_classes_IDate = data.table::fread(
file = "big_dt.csv",
colClasses = list(
integer = c("vote_id", "pers_id", "result"),
IDate = c("dates") )
), fread_classes = data.table::fread(
file = "big_dt.csv",
colClasses = list(
integer = c("vote_id", "pers_id", "result"),
Date = c("dates") )
), fread = data.table::fread(
file = "big_dt.csv"
), result=TRUE)
plot(atime.res)
atime.refs <- atime::references_best(atime.res)
atime.pred <- predict(atime.refs, seconds=0.01, kilobytes=100)
plot(atime.pred)
atime.res$measurements[expr.name=="fread"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]
Below we see that fread defaults to reading IDate, and when you specify colClasses Date you get that instead of IDate (and the plot above shows that IDate is much more efficient than Date).
> atime.res$measurements[expr.name=="fread"]$result[[1]]
vote_id pers_id dates result
<int> <int> <IDat> <int>
1: 1 1 2010-05-13 0
2: 2 1 2010-04-08 0
> atime.res$measurements[expr.name=="fread_classes"]$result[[1]]
vote_id pers_id dates result
<int> <int> <Date> <int>
1: 1 1 2010-05-13 0
2: 2 1 2010-04-08 0
> atime.res$measurements[expr.name=="fread_classes_IDate"]$result[[1]]
vote_id pers_id dates result
<int> <int> <IDat> <int>
1: 1 1 2010-05-13 0
2: 2 1 2010-04-08 0
So you should be able to get speedups by switching Date to IDate in your colClasses.
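For reference, the suggested change can be sketched on a tiny file. This is only an illustration under stated assumptions: the temporary file, the five-row toy table, and the object names (csv_path, toy, dt_idate, dt_date) are hypothetical and not part of the benchmark above.

```r
library(data.table)

# Hypothetical toy file; the path and contents are assumptions for illustration.
csv_path <- tempfile(fileext = ".csv")
toy <- data.table(
  vote_id = 1:5,
  dates = as.Date("2010-01-01") + 0:4)
fwrite(toy, csv_path)

# Request IDate explicitly, which is the class fread was shown to pick by default.
dt_idate <- fread(
  csv_path,
  colClasses = list(integer = "vote_id", IDate = "dates"))

# Request Date, which is the slower variant reported in this thread.
dt_date <- fread(
  csv_path,
  colClasses = list(integer = "vote_id", Date = "dates"))

# Inspect the class of the date column in each result.
class(dt_idate$dates)
class(dt_date$dates)

unlink(csv_path)
```

On a file this small the timing difference will not be visible; the sketch only shows where the class name goes in colClasses.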
Thanks a lot!
Thanks for investigating Toby. I agree with OP that this is not ideal behavior & with the expectation that it would be more efficient to specify I think we should close this with an improved user experience, e.g. a Alternatively, since IDate inherits from Date, probably
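The inheritance relationship mentioned above can be checked directly. The following is a small sketch with hypothetical variable names (d, id); it only inspects the two classes and does not claim to explain where the slowdown comes from.

```r
library(data.table)

# The same calendar day as a base Date and as a data.table IDate.
d <- as.Date("2010-05-13")
id <- as.IDate("2010-05-13")

class(id)             # c("IDate", "Date")
inherits(id, "Date")  # TRUE, so code that dispatches on Date also accepts IDate
typeof(id)            # "integer"
typeof(d)             # "double"
```

Because IDate carries the Date class, code written against Date should generally accept an IDate column, although the underlying storage type differs.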
Investigating and this situation is even worse than I thought. I originally thought the inefficiency is because the core fread returns an IDate, and we were apply But actually, the
Hi all,
Hello,
I noticed a large performance difference in fread() with and without specifying colClasses. If I specify the classes of the toy dataset below, the execution takes considerably longer than without specifying any classes.
There is a similar question on StackOverflow (currently unanswered), but in that case it is not entirely clear whether the issue is related to the calculation also present in the script or due to fread() itself. So, I created the script below, which deals exclusively with fread().
In one case, I also specify the keys when reading in the data, but that does not seem to make any difference.
# Minimal reproducible example; please be sure to set verbose=TRUE where possible!
Results
Output of sessionInfo()