
Unable to allocate TMP for items in parallel batch counting #5169

Closed
matthewgson opened this issue Sep 20, 2021 · 4 comments
@matthewgson

I encountered an issue similar to #4295 but it seems slightly different.

I'm working with a data.table of 890 million rows and 114 columns.

When I group by hour and minute variables:

intraday <- dt[, .(
      Nobs = .N,
      col1 = mean(col1, na.rm = TRUE),
      col2 = mean(col2, na.rm = TRUE)
    ), keyby = .(hour(datetime), minute(datetime))]

the following error occurs:

Detected that j uses these columns: qtys_all,vols_all,qtys_f,vols_f,qtys_bd,vols_bd,qtys_mm,vols_mm,qtys_cu,vols_cu,qtys_pc,vols_pc
Finding groups using forderv ... forder.c received 890185979 rows and 2 columns
Error in forderv(byval, sort = keyby, retGrp = TRUE) :
  Unable to allocate TMP for my_n=890185979 items in parallel batch counting

I have successfully run this operation before; the only thing I added this time was the `Nobs = .N` part.
It worked once I removed that part:

intraday <- dt[, .(
      col1 = mean(col1, na.rm = TRUE),
      col2 = mean(col2, na.rm = TRUE)
    ), keyby = .(hour(datetime), minute(datetime))]
Detected that j uses these columns: qtys_all,vols_all,qtys_f,vols_f,qtys_bd,vols_bd,qtys_mm,vols_mm,qtys_cu,vols_cu,qtys_pc,vols_pc
Finding groups using forderv ... forder.c received 890185979 rows and 2 columns
3.230s elapsed (22.3s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'list(mean(qtys_all, na.rm = T), se_mean(qtys_all), mean(vols_all, na.rm = T), se_mean(vols_all), mean(qtys_f, na.rm = T), se_mean(qtys_f), mean(vols_f, na.rm = T), se_mean(vols_f), mean(qtys_bd, na.rm = T), '
GForce is on, left j unchanged
Old mean optimization changed j from 'list(mean(qtys_all, na.rm = T), se_mean(qtys_all), mean(vols_all,     na.rm = T), se_mean(vols_all), mean(qtys_f, na.rm = T), se_mean(qtys_f),     mean(vols_f, na.rm = T), se_mean(vols_f), mean(qtys_bd, na.rm = T),     se_mean(qtys_bd), mean(vols_bd, na.rm = T), se_mean(vols_bd),     mean(qtys_mm, na.rm = T), se_mean(qtys_mm), mean(vols_mm,         na.rm = T), se_mean(vols_mm), mean(qtys_cu, na.rm = T),     se_mean(qtys_cu), mean(vols_cu, na.rm = T), se_mean(vols_cu),     mean(qtys_pc, na.rm = T), se_mean(qtys_pc), mean(vols_pc,         na.rm = T), se_mean(vols_pc))' to 'list(.External(Cfastmean, qtys_all, T), se_mean(qtys_all), .External(Cfastmean, vols_all, T), se_mean(vols_all), .External(Cfastmean, qtys_f, T), se_mean(qtys_f), .External(Cfastmean, vols_f, T), se_mean(vols_f), '
Making each group and running j (GForce FALSE) ...

  collecting discontiguous groups took 1571.924s for 48 groups
  eval(j) took 279.962s for 48 calls
00:04:59 elapsed (00:20:27 cpu)

My server has 1 TB of memory, so I don't believe memory is the limiting factor here, though I haven't checked thoroughly.
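As a rough sanity check on the memory claim (a sketch assuming a Linux host, consistent with the Ubuntu sessionInfo below; field names come from /proc/meminfo, not from data.table):

```r
# Report total and currently available system memory on Linux
# by reading /proc/meminfo. Purely illustrative; not a data.table API.
mem <- readLines("/proc/meminfo")
grep("^(MemTotal|MemAvailable)", mem, value = TRUE)

# R's own view of allocations so far:
gc()
```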

Here's the sessionInfo() output:

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] matrixStats_0.59.0 tictoc_1.0.1       forcats_0.5.1      stringr_1.4.0
 [5] dplyr_1.0.7        purrr_0.3.4        readr_2.0.1        tidyr_1.1.3
 [9] tibble_3.1.4       ggplot2_3.3.5      tidyverse_1.3.1    fst_0.9.4
[13] data.table_1.14.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 haven_2.4.3      colorspace_2.0-2 vctrs_0.3.8
 [5] generics_0.1.0   utf8_1.2.2       rlang_0.4.11     pillar_1.6.2
 [9] glue_1.4.2       withr_2.4.2      DBI_1.1.1        dbplyr_2.1.1
[13] modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.0  munsell_0.5.0
[17] gtable_0.3.0     cellranger_1.1.0 rvest_1.0.1      tzdb_0.1.2
[21] parallel_4.0.5   fansi_0.5.0      broom_0.7.9      Rcpp_1.0.7
[25] scales_1.1.1     backports_1.2.1  jsonlite_1.7.2   fs_1.5.0
[29] hms_1.1.0        stringi_1.7.4    grid_4.0.5       cli_3.0.1
[33] tools_4.0.5      magrittr_2.0.1   crayon_1.4.1     pkgconfig_2.0.3
[37] ellipsis_0.3.2   xml2_1.3.2       reprex_2.0.1     lubridate_1.7.10
[41] assertthat_0.2.1 httr_1.4.2       rstudioapi_0.13  R6_2.5.1
[45] compiler_4.0.5
@matthewgson matthewgson changed the title Error Performing Aggregation on Large data.table Unable to allocate TMP for items in parallel batch counting Sep 20, 2021
@MichaelChirico
Member

Can you check whether it's related to #5077? If so, updating to the latest dev version should solve the issue.
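For reference, one way to get the development build and confirm the version (a sketch using the dev repository URL documented on the data.table wiki; verify against the project's current installation instructions):

```r
# Install the development version of data.table from the
# Rdatatable repository, then confirm the installed version.
install.packages("data.table",
                 repos = "https://Rdatatable.gitlab.io/data.table")
packageVersion("data.table")
```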

@matthewgson
Author

matthewgson commented Sep 21, 2021

Definitely; I'll update and run the code again to see whether 1.14.1 fixes this issue.

@matthewgson
Author

@MichaelChirico You're right, it works on version 1.14.1. Thanks!

@MichaelChirico
Member

Awesome, glad to hear it!
