'by' operations much slower when verbose=TRUE #6286

joshhwuu · 2024-07-16T22:25:10Z

Seen when benchmarking #6228: it seems that 'by' operations run much slower with verbose=TRUE.

dt = data.table(a = 1:1000000)
system.time(copy(dt)[, 1, by = a])
#   user  system elapsed 
#  0.263   0.001   0.202

system.time(copy(dt)[, 1, by = a, verbose=TRUE])
#   user  system elapsed 
#  1.375   1.505   2.820

My initial guess is that the culprits are the numerous calls to clock() made when verbose=TRUE.

# Output of sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Vancouver
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.15.99 testthat_3.2.1.1   devtools_2.4.5     usethis_2.2.3     

loaded via a namespace (and not attached):
 [1] miniUI_0.1.1.1    compiler_4.4.1    brio_1.1.5        promises_1.3.0   
 [5] Rcpp_1.0.12       stringr_1.5.1     callr_3.7.6       later_1.3.2      
 [9] fastmap_1.2.0     mime_0.12         R6_2.5.1          htmlwidgets_1.6.4
[13] desc_1.4.3        profvis_0.3.8     rprojroot_2.0.4   shiny_1.8.1.1    
[17] rlang_1.1.4       cachem_1.1.0      stringi_1.8.4     httpuv_1.6.15    
[21] fs_1.6.4          pkgload_1.3.4     memoise_2.0.1     cli_3.6.2        
[25] withr_3.0.0       magrittr_2.0.3    ps_1.7.6          processx_3.8.4   
[29] digest_0.6.35     rstudioapi_0.16.0 xtable_1.8-4      remotes_2.5.0    
[33] lifecycle_1.0.4   vctrs_0.6.5       glue_1.7.0        urlchecker_1.0.1 
[37] sessioninfo_1.2.2 pkgbuild_1.4.4    purrr_1.0.2       tools_4.4.1      
[41] ellipsis_0.3.2    htmltools_0.5.8.1

joshhwuu · 2024-07-16T22:47:25Z

Investigating:

I replaced all calls to clock() with wallclock(), and performance improved dramatically:

# with wallclock()
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.008s elapsed (0.007s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.005s elapsed (0.041s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.031s for 1000000 groups
#   eval(j) took 0.028s for 1000000 calls
# 0.274s elapsed (0.273s cpu) 
#    user  system elapsed 
#   0.322   0.020   0.289

# with clock(), i.e. current
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.008s elapsed (0.024s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.005s elapsed (0.043s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.453s for 1000000 groups
#   eval(j) took 0.438s for 1000000 calls
# 1.955s elapsed (0.509s cpu) 
#    user  system elapsed 
#   0.577   1.446   1.969

Can anyone comment on the difference between using clock() and wallclock()? If there aren't any issues with wallclock() I'll file a PR to refactor the current calls.
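
One rough way to check the per-call overhead guess outside data.table is a small C micro-benchmark. The sketch below is only an illustration under stated assumptions, not data.table's code: `wall_seconds()` is a hypothetical stand-in for a wallclock()-style timer built on `clock_gettime(CLOCK_REALTIME)`, and `cost_per_clock_call()` and `cost_per_wall_call()` are made-up helper names. Both helpers return the average seconds per timer call over `n` calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for a wallclock()-style timer (an assumption,
   not data.table's implementation): seconds read from CLOCK_REALTIME. */
static double wall_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_REALTIME, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Average elapsed seconds per clock() call, measured over n calls. */
static double cost_per_clock_call(long n) {
    volatile clock_t sink = 0;
    double start = wall_seconds();
    for (long i = 0; i < n; i++) sink = clock();
    (void)sink;
    return (wall_seconds() - start) / (double)n;
}

/* Average elapsed seconds per wall_seconds() call, measured over n calls. */
static double cost_per_wall_call(long n) {
    volatile double sink = 0.0;
    double start = wall_seconds();
    for (long i = 0; i < n; i++) sink = wall_seconds();
    (void)sink;
    return (wall_seconds() - start) / (double)n;
}
```

Calling both helpers with the same large `n` (say, one million, to mirror the one-million-group example) and comparing the results would show whether clock() is the more expensive call on a given platform. The numbers depend on the OS and C library, so none are quoted here.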

Anirban166 · 2024-07-16T22:50:34Z

Can reproduce. Adding a zero makes the difference more noticeable:

library(data.table) # 1.15.99
dt = data.table(a = 1:10000000)
system.time(copy(dt)[, 1, by = a])
#   user  system elapsed 
#  1.924   0.039   1.849 
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 10000000 rows and 1 columns
# 0.069s elapsed (0.155s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.034s elapsed (0.032s cpu)
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 4.140s for 10000000 groups
#   eval(j) took 4.099s for 10000000 calls
# 18.0s elapsed (6.423s cpu) 
#   user  system elapsed 
#  6.611  11.598  18.096

joshhwuu · 2024-07-16T23:08:39Z

Interesting, this doesn't happen with 1.15.4
I'll see if I can find the reason

library(data.table) # 1.15.4
dt = data.table(a = 1:10000000)
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 10000000 rows and 1 columns
# 0.370s elapsed (0.020s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.050s elapsed (0.050s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.258s for 10000000 groups
#   eval(j) took 0.234s for 10000000 calls
# 2.230s elapsed (2.150s cpu) 
#    user  system elapsed 
#    2.23    0.06    2.67

Anirban166 · 2024-07-16T23:17:19Z

> Can anyone comment on the difference between using clock() and wallclock()?

I reckon clock() is used for profiling CPU time and likely ends up making more system calls in the process (it has been found to be slower before, e.g. this comment).

wallclock() does seem to have a simpler implementation, so in theory the overhead difference is plausible.

> Interesting, this doesn't happen with 1.15.4
> I'll see if I can find the reason

Good find! Seems like a potential performance regression sneaked in.

joshhwuu · 2024-07-16T23:33:09Z

NVM, it seems that this only affects non-Windows platforms. AFAIK there are slight implementation differences between clock() on Linux and on Windows. I just tried to reproduce this on Windows R with 1.15.99, and there is no issue:

library(data.table)
# data.table 1.15.99 IN DEVELOPMENT built 2024-07-16 23:30:59 UTC; joshu using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
dt = data.table(a = 1:1000000)
system.time(copy(dt)[, 1, by = a])
#    user  system elapsed 
#    0.17    0.00    0.17 
system.time(copy(dt)[, 1, by = a, verbose=TRUE])
# Detected that j uses these columns: <none>
# Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
# 0.020s elapsed (0.000s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# Optimization is on but left j unchanged (single plain symbol): '1'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.029s for 1000000 groups
#   eval(j) took 0.022s for 1000000 calls
# 0.230s elapsed (0.220s cpu) 
#    user  system elapsed 
#    0.22    0.03    0.27

Found this SO thread

tdhock · 2024-07-17T00:53:35Z

I don't think anybody cares about performance when verbose=TRUE, since it is for debugging

joshhwuu · 2024-07-17T01:01:05Z

> I don't think anybody cares about performance when verbose=TRUE, since it is for debugging

Hm, I can understand that, but switching to wallclock() is trivial as it is already implemented and gives us consistent performance across platforms. Additionally, IMO a performance difference of over 10x seems a little egregious, even for a debug mode.

tdhock · 2024-07-17T01:17:44Z

ok if you want to make the change, then make sure to add an atime performance test

tdhock · 2024-07-17T11:51:48Z

https://github.com/Rdatatable/data.table/wiki/Performance-testing

jangorecki · 2024-07-21T14:19:12Z

I can imagine that for verbose=TRUE we could sometimes add extra forder calls (or potentially other heavy calls) just to report more information to the console. I would say this is acceptable. For performance, verbose needs to be FALSE.

aitap · 2024-08-25T08:47:07Z

Edit: let's keep a distinction between CPU time spent executing instructions by the current process (returned by clock on non-Windows, getrusage on POSIX, possibly clock_gettime(CLOCK_PROCESS_CPUTIME_ID) on some POSIX implementations; GetProcessTimes on Windows because clock doesn't conform there) and elapsed time that will include waiting for I/O and other processes (measured by clock_gettime(CLOCK_MONOTONIC) and sometimes (if not adjusted in the background) timespec_get/time/clock_gettime(CLOCK_REALTIME)/gettimeofday; QueryPerformanceCounter or clock on Windows). It looks like it usually costs a system call to know the CPU time, but there are faster ways of obtaining a monotonically increasing time value.
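
The CPU-time versus elapsed-time distinction can be seen by timing a sleep, during which the process waits but executes almost no instructions. This is a minimal POSIX sketch; `sleep_timing`, `monotonic_seconds()` and `time_a_sleep()` are hypothetical names introduced only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical result type: CPU seconds and elapsed seconds for one interval. */
typedef struct {
    double cpu;
    double elapsed;
} sleep_timing;

/* Elapsed-time reading in seconds from CLOCK_MONOTONIC. */
static double monotonic_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Sleep for ms milliseconds and report how much CPU time (via clock())
   and how much elapsed time (via CLOCK_MONOTONIC) the sleep consumed. */
static sleep_timing time_a_sleep(long ms) {
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    clock_t c0 = clock();
    double t0 = monotonic_seconds();
    nanosleep(&req, NULL);
    double t1 = monotonic_seconds();
    clock_t c1 = clock();
    sleep_timing out;
    out.cpu = (double)(c1 - c0) / CLOCKS_PER_SEC;
    out.elapsed = t1 - t0;
    return out;
}
```

On a POSIX system, `time_a_sleep(50)` should report an elapsed time of about 0.05 s and a CPU time close to zero. On Windows, as noted above, clock() does not conform and behaves like an elapsed timer, so this sketch does not apply there.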

Since the values of wallclock() are frequently subtracted to obtain durations, can we use clock_gettime(CLOCK_MONOTONIC) instead of CLOCK_REALTIME? Subtracting wall-clock values may cause some confusing bugs if the system time is adjusted in the background. POSIX requires both CLOCK_REALTIME and CLOCK_MONOTONIC.
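
A duration helper based on CLOCK_MONOTONIC might look like the following sketch. It is not the existing wallclock() implementation; `monotonic_now()` and `timespec_diff_seconds()` are hypothetical names, and the code assumes a POSIX system that provides `clock_gettime`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Read CLOCK_MONOTONIC, which is not affected by adjustments to the system time. */
static struct timespec monotonic_now(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return ts;
}

/* Seconds from start to end. The nanosecond difference may be negative;
   adding it to the whole-second difference handles the borrow. */
static double timespec_diff_seconds(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec)
         + 1e-9 * (double)(end.tv_nsec - start.tv_nsec);
}
```

Taking one `monotonic_now()` reading before and one after a block of work and passing both to `timespec_diff_seconds()` gives a duration that cannot go negative because of a background clock adjustment, which is the failure mode described above for CLOCK_REALTIME.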

Empirically, on amd64 Linux, both CLOCK_REALTIME and CLOCK_MONOTONIC use the vDSO and thus avoid the system call penalty, but clock() is required to measure the CPU time, so uses CLOCK_PROCESS_CPUTIME_ID and costs an extra system call, as does getrusage.

MichaelChirico · 2024-09-03T05:25:43Z

Thanks for pointing out CLOCK_MONOTONIC!

wallclock() (according to a comment that's unfortunately been lost) is adapted from R's own currentTime(), which uses CLOCK_REALTIME:

0b3cd65#diff-4052380a202e1373dcfdfa2fff9850d61b26758fc0475260b2871e70f0d153a1R129

https://github.com/r-devel/r-svn/blob/4778112deccaf909e93a0bec77d6c5ccd0e9155f/src/main/times.c#L105

CLOCK_REALTIME has been used ever since currentTime() was implemented (13 years ago); I don't see any discussion of the issue you raise on r-devel or bugzilla either:

r-devel/r-svn@8cf6136
https://github.com/search?q=repo%3AMichaelChirico%2Fr-mailing-list-archive%20%22CLOCK_REALTIME%22&type=code
long bugzilla link

I would suggest starting this as a discussion on r-devel first; at a minimum, it's out of scope here & gets its own issue.

This was referenced Jul 17, 2024

Master List of data.table Issues for GSoC '24 (Josh) joshhwuu/gsoc-2024#1

Open

Fix Performance of 'by' Operations when verbose=TRUE #6296

Merged

Anirban166 added a commit to Anirban166/data.table that referenced this issue Jul 18, 2024

rm commit SHAs for Slow/Fast since the comment (Rdatatable#6286 (comment)) is invalid or OS was the source of difference rather than the version

9854be6

MichaelChirico closed this as completed in #6296 Sep 4, 2024
