
Failed to allocate counts or TMP when assigning g in gforce #4295

Closed
renkun-ken opened this issue Mar 10, 2020 · 27 comments · Fixed by #4297
Labels
bug GForce issues relating to optimized grouping calculations (GForce)
Milestone

Comments

@renkun-ken
Member

renkun-ken commented Mar 10, 2020

I'm working with a data.table of 1423657324 rows and 14 columns.

There's an integer group column (grp) with 3790 unique values.

When I run the following:

stats <- ft[, .(
      col1 = median(col1, na.rm = TRUE),
      col2 = median(col2, na.rm = TRUE),
      col3 = median(col3, na.rm = TRUE)
    ), keyby = grp]

The following error occurs:

Error in gforce(thisEnv, jsub, o__, f__, len__, irows) :
  Internal error: Failed to allocate counts or TMP when assigning g in gforce

This does not occur when the same aggregation is run on a shorter version of the data (1/3 the size).

@MichaelChirico
Member

I would imagine median could be very memory hungry; this is why you usually see approx_quantile in "big data" SQL while median itself is dropped entirely.

Do you get the same issue if you sort the data by col1 first? If not, it may be that gmedian is trying to do all three columns at once.

As a workaround, do ft[ , .N, keyby = .(grp, col1)] and get the median from the frequency table, since you said there are far fewer unique values.
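A sketch of that frequency-table route (col1_med is a made-up name, and for brevity this takes the lower weighted median rather than interpolating between the two middle values when the group's total count is even):

freq <- ft[, .N, keyby = .(grp, col1)]   # counts per (group, value); sorted by col1 within grp
med <- freq[, {
  cum <- cumsum(N)
  .(col1_med = col1[which(cum * 2 >= sum(N))[1]])   # first value whose cumulative count reaches half the total
}, keyby = grp]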

@shrektan
Member

Will changing median to stats::median help? It should prevent data.table from optimizing it with GForce.
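For reference, a sketch of two ways to keep GForce out of the picture (both per ?datatable.optimize; treat the exact option level as an assumption):

options(datatable.optimize = 1L)   # optimization level below 2 disables GForce globally
# or qualify the call so the gmedian substitution is not applied:
stats <- ft[, .(col1 = stats::median(col1, na.rm = TRUE)), keyby = grp]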

@renkun-ken
Member Author

Disabling GForce or using stats::median does not trigger this error, but I'm still curious why this occurs.

My server has 1 TB of memory, which should be enough for this.

@MichaelChirico
Member

How about stats::median(c(col1, col2, col3))? (Just to check whether tripling the memory footprint has an impact.)

@renkun-ken
Member Author

stats::median(c(col1, col2, col3)) works normally; nothing unusual happens.

@shrektan
Member

Well, the error is thrown from

data.table/src/gsumm.c

Lines 114 to 115 in b1b1832

int *counts = calloc(nBatch*highSize, sizeof(int)); // TODO: cache-line align and make highSize a multiple of 64
int *TMP = malloc(nrow*2*sizeof(int));

However, I also see that nrow is declared as an int:

static int nrow = 0; // length of underlying x; same as length(ghigh) and length(glow)

It looks like nrow*2 will overflow: 1423657324 * 2 = 2847314648 exceeds INT_MAX (2147483647), so the int product wraps before it is widened for the malloc...

@shrektan
Member

@renkun-ken Would you mind running a test? Just cut the row count of your data to 1073741823L and 1073741824L respectively and try your original code.

I'm expecting the first case (1073741823L rows) to work but the second to fail, since 1073741823 * 2 = 2147483646 still fits in an int, while 1073741824 * 2 = 2^31 does not.

@shrektan
Member

I believe that's the cause. The signed overflow wraps negative, and the conversion to size_t then turns it into an absurdly large allocation request...

Rcpp::cppFunction("size_t test(int x) {
                    return x*2*sizeof(int);
                  }")
test(1073741823L)
#> [1] 8589934584
test(1073741824L)
#> [1] 1.844674e+19
test(1423657324L)
#> [1] 1.844674e+19

Created on 2020-03-10 by the reprex package (v0.3.0)
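For contrast, widening to size_t before the multiplication avoids the wrap. A minimal sketch of the shape of the fix (assuming, per PR #4297, that the count is cast before multiplying):

Rcpp::cppFunction("size_t test_fixed(int x) {
                    // cast first so the arithmetic happens in 64-bit size_t
                    return (size_t)x*2*sizeof(int);
                  }")
test_fixed(1073741824L)
#> [1] 8589934592
test_fixed(1423657324L)
#> [1] 11389258592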

@renkun-ken
Member Author

Yes, the 1073741823L case works perfectly while the 1073741824L case fails, just as you expected.

@shrektan
Member

@renkun-ken It would be great if you could verify that PR #4297 does fix this issue, when you have time, of course.

@renkun-ken
Member Author

@shrektan Thanks! I'll verify it soon.

@renkun-ken
Member Author

@shrektan Sadly it's too hard to fetch the repo:

remote: Enumerating objects: 1431, done.
remote: Counting objects: 100% (1431/1431), done.
remote: Compressing objects: 100% (181/181), done.
Timeout, server github.com not responding. KiB | 2.00 KiB/s  
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

I tried many times but no luck.

@shrektan
Member

It's the network issue I encountered before... so I packaged the source and sent it to your email (I assume you are able to access your personal email).

@renkun-ken
Member Author

Thanks for your nice packaging! My git fetch worked magically while I was sleeping last night.

I retried and can confirm that the bug is fixed by the PR. Good work!

BTW, do you suspect there are other similar int overflow problems like this?

@MichaelChirico
Member

Here are the other 60 or so places with a similar pattern in the code (I haven't combed through them):

grep -Enr "[^.][0-9]+\s*[*]" src --include=*.c | grep -Ev "^src/[a-z]+[.]c:[0-9]+:\s*[/][/]"
src/forder.c:260:    memset(thiscounts, 0, 256*sizeof(int));
src/forder.c:380:  dmask = dround ? 1 << (8*dround-1) : 0;
src/forder.c:425:  memset(stat,   0, 257*sizeof(uint64_t));
src/forder.c:728:  if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
src/forder.c:1012:          memcpy(my_starts, my_starts_copy, 256*sizeof(uint16_t));  // restore starting offsets
src/forder.c:1051:  uint8_t  *ugrps =  malloc(nBatch*256*sizeof(uint8_t));
src/frank.c:93:          dans[xorder[j]-1] = (2*xstart[i]+xlen[i]-1)/2.0;
src/frank.c:126:        int offset = 2*xstart[i]+xlen[i]-2;
src/gsumm.c:115:    int *TMP   = malloc(nrow*2*sizeof(int));
src/gsumm.c:132:      int *restrict my_tmp = TMP + b*2*batchSize;
src/gsumm.c:135:        int *p = my_tmp + 2*my_counts[w]++;
src/gsumm.c:146:        const int *restrict p = TMP + b*2*batchSize + start*2;
src/fsort.c:125:  if (batchSize < 1024) batchSize = 1024; // simple attempt to work reasonably for short vector. 1024*8 = 2 4kb pages
src/fsort.c:178:                       (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(1024*1024)),
src/fsort.c:179:                       (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(4*1024*nBatch)),
src/assign.c:21:  SETLENGTH(x,50+n*2*sizeof(void *)/4);  // 1*n for the names, 1*n for the VECSXP itself (both are over allocated).
src/assign.c:575:        char *s5 = (char*) malloc(strlen(tc2) + 5); //4 * '_' + \0
src/bmerge.c:496:                 ival.d-xval.d == rollabs /*#1007*/))
src/bmerge.c:510:                   xval.d-ival.d == rollabs /*#1007*/))
src/fread.c:194:  char *ptr = buf + 501 * flip;
src/fread.c:282:  const char *mostConsumed = start; // tests 1550* includes both 'na' and 'nan' in nastrings. Don't stop after 'na' if 'nan' can be consumed too.
src/fread.c:375:    ans = (double) tp.tv_sec + 1e-9 * (double) tp.tv_nsec;
src/fread.c:379:    ans = (double) tv.tv_sec + 1e-6 * (double) tv.tv_usec;
src/fread.c:434:  mmp_copy = (char *)malloc((size_t)fileSize + 1/* extra \0 */);
src/fread.c:596:    acc = 10*acc + digit;
src/fread.c:628:    acc = 10*acc + digit;
src/fread.c:693:    acc = 10*acc + digit;
src/fread.c:727:      acc = 10*acc + digit;
src/fread.c:914:      E = 10*E + digit;
src/fread.c:1256:      int nbit = 8*sizeof(char *); // #nocov
src/fread.c:1655:    if (jump0size*100*2 < sz) nJumps=100;  // 100 jumps * 100 lines = 10,000 line sample
src/fread.c:1656:    else if (jump0size*10*2 < sz) nJumps=10;
src/fread.c:1663:    else DTPRINT(_("(%"PRIu64" bytes from row 1 to eof) / (2 * %"PRIu64" jump0size) == %"PRIu64"\n"),
src/fread.c:1664:                 (uint64_t)sz, (uint64_t)jump0size, (uint64_t)(sz/(2*jump0size)));
src/fread.c:1687:    if (ch<lastRowEnd) ch=lastRowEnd;  // Overlap when apx 1,200 lines (just over 11*100) with short lines at the beginning and longer lines near the end, #2157
src/fread.c:1823:    allocnrow = clamp_szt((size_t)(bytesRead / fmax(meanLineLen - 2*sd, minLen)),
src/fread.c:1824:                          (size_t)(1.1*estnrow), 2*estnrow);
src/fread.c:1833:      DTPRINT(_("  Initial alloc = %"PRIu64" rows (%"PRIu64" + %d%%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]\n"),
src/fread.c:1973:  size_t chunkBytes = umax((size_t)(1000*meanLineLen), 1ULL/*MB*/ *1024*1024);
src/fread.c:2030:      .buff8 = malloc(rowSize8 * myBuffRows + 8),
src/fread.c:2031:      .buff4 = malloc(rowSize4 * myBuffRows + 4),
src/fread.c:2032:      .buff1 = malloc(rowSize1 * myBuffRows + 1),
src/fread.c:2102:          ctx.buff8 = realloc(ctx.buff8, rowSize8 * myBuffRows + 8);
src/fread.c:2103:          ctx.buff4 = realloc(ctx.buff4, rowSize4 * myBuffRows + 4);
src/fread.c:2104:          ctx.buff1 = realloc(ctx.buff1, rowSize1 * myBuffRows + 1);
src/fread.c:2453:    DTPRINT(_("%8.3fs (%3.0f%%) Memory map %.3fGB file\n"), tMap-t0, 100.0*(tMap-t0)/tTot, 1.0*fileSize/(1024*1024*1024));
src/fread.c:2460:      tAlloc-tColType, 100.0*(tAlloc-tColType)/tTot, (uint64_t)allocnrow, ncol, DTbytes/(1024.0*1024*1024), (uint64_t)DTi, 100.0*DTi/allocnrow);
src/fread.c:2464:            tReread-tAlloc, 100.0*(tReread-tAlloc)/tTot, nJumps, nSwept, (double)chunkBytes/(1024*1024), (int)(DTi/nJumps), nth);
src/fifelse.c:167:    REPROTECT(cons = eval(SEXPPTR_RO(args)[2*i], rho), Icons);
src/fifelse.c:168:    REPROTECT(outs = eval(SEXPPTR_RO(args)[2*i+1], rho), Iouts);
src/fifelse.c:173:      error("Argument #%d must be logical.", 2*i+1);
src/fwrite.c:376:    ch += 7 + 2*!squashDateTime;
src/fwrite.c:389:    ch += 8 + 2*!squashDateTime;
src/fwrite.c:614:  size_t buffSize = (size_t)1024*1024*args.buffMB;
src/fwrite.c:645:  size_t maxLineLen = eolLen + args.ncol*(2*(doQuote!=0) + 1/*sep*/);
src/fwrite.c:648:    maxLineLen += 2*(doQuote!=0/*NA('auto') or true*/) + 1/*sep*/;
src/fwrite.c:782:  if (maxLineLen*2>buffSize) { buffSize=2*maxLineLen; rowsPerBatch=2; }
src/fwrite.c:910:          int used = 100*((double)(ch-myBuff))/buffSize;  // percentage of original buffMB

@renkun-ken
Member Author

Also, BTW: for bugs like this that require very large data to reproduce, do we need to build test cases for them?

For this bug, a minimal reproducible example is:

library(data.table)

n <- 1500000000
ngrp <- 4000
dt <- data.table(group = sample.int(ngrp, n, replace = TRUE), x = runif(n))
res <- dt[, .(
  xmedian = median(x, na.rm = TRUE)
), keyby = group]

but dt is 18 GB (1.5e9 rows times 12 bytes: a 4-byte integer plus an 8-byte double per row) and creating it takes 5-10 minutes.

@shrektan
Member

You can get a faster example by getting rid of the random functions... just use rep() and c(). I actually did try to test this.

The difficulty is the requirement for a very large amount of memory...

@renkun-ken
Member Author

I tried

dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = numeric(n))

and

dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = rep(rnorm(1000), each = n / 1000))

Both are much faster but neither produces the error. 👀

@ColeMiller1
Contributor

Instead of rep(..., each =), remove the each; see the sketch below. When the groupings are sorted, the part of the code that uses TMP isn't used.
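A sketch of what that suggests for the reproducer (untested; just dropping each = so the groups arrive unsorted and the batched counting path that allocates TMP is exercised):

dt <- data.table(group = rep(seq_len(ngrp), times = n / ngrp), x = numeric(n))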

@steffen-windmueller

I'm having a similar problem. When running my R script on a Windows SQL Server (R version 3.5.2 and data.table 1.12.0), the following error occurs:
Error in forderv(ans, cols, sort = TRUE, retGrp = FALSE, order = if (decreasing) -order else order, : Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=8 Call: source ... [ -> [.data.table -> eval -> eval -> forder -> forderv
which is thrown by the following line of forder.c:
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);

The script joins two data tables (X and Y) with an on=.(Id, date>=start_date, date<=end_date) statement and uses by=.EACHI for the operation. When running the same script in my local RStudio, there is no error. Do you think setting "Id" and "date" (resp. "start_date" and "end_date") as keys in X and Y would prevent the integer overflow? Alternatively, would changing by=.EACHI to keyby=.EACHI do the trick?

Thank you already in advance.

@shrektan
Member

shrektan commented May 5, 2020

Do you think setting "Id" and "date" resp. "start_date" and "end_date" as keys in X and Y will prevent the integer overload? Alternatively, would changing by=.EACHI to keyby=.EACHI do the thing?

I don't have experience with MS SQL Server and R, but I don't think it will solve your problem. Is it expensive to give it a try?

@steffen-windmueller

I gave it a try and it did not work out. Do you know what might help here? Could it have something to do with the dependencies between data.table and bit64, since dependencies are sometimes distorted in MSSQL?

@shrektan
Member

shrektan commented May 5, 2020

In my opinion, it should have nothing to do with bit64, because bit64 hasn't been updated in 3 years. I have some (limited) suggestions for you:

  • Turn off multithreading, i.e., data.table::setDTthreads(1L)
  • Use the dev version of data.table (albeit I doubt it will help)
  • Make a local reproducible example and report it here
  • Call for support from Microsoft if you have a business license

@steffen-windmueller

Turning off multithreading did not work. Sometimes the error changes to "invalid BXL stream", which can be attributed to not having enough memory.

Would you know if the error src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth); can be caused by a lack of RAM?

@jangorecki
Member

@scharlatan0139 yes, it does look exactly like an error caused by lack of RAM.

@jangorecki jangorecki added the GForce issues relating to optimized grouping calculations (GForce) label Dec 4, 2020
@Debasis5

Debasis5 commented Dec 9, 2020

@shrektan, is the right way to install your fix remotes::install_github("Rdatatable/data.table#fix4295")?

I tried this but got an invalid repo error.

@shrektan
Member

shrektan commented Dec 9, 2020

@Debasis5 It should be

remotes::install_github("rdatatable/data.table#4297")

or

remotes::install_github("Rdatatable/data.table@fix4295")

(In remotes, #4297 refers to the pull request number, while @fix4295 refers to the branch name.)
