fread crashes R #5041

Closed
e-nascimento opened this issue Jun 11, 2021 · 9 comments · Fixed by #5046

@e-nascimento

Hello,

I am facing the same issue as described in #2228, and I was able to isolate it.

RStudio crashes if you try to read row 183131. Reading up to row 183130 still works:
DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183130)

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0 tinytex_0.32 xfun_0.23

file.zip

@avimallu
Contributor

avimallu commented Jun 12, 2021

I can't seem to reproduce this on Windows (although on a slightly newer build than yours). Maybe it's something else?

Output
> DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183130)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183130) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183130) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821
Read 183130 rows x 14 columns from 40.53MB (42497961 bytes) file in 00:00.369 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         1 : float64   '7'
         2 : int32     'A'
         9 : string    'C'
=============================
   0.001s (  0%) Memory map 0.040GB file
   0.001s (  0%) sep=';' ncol=14 and header detection
   0.000s (  0%) Column type detection using 100 sample rows
   0.000s (  0%) Allocation of 183130 rows x 14 cols (0.000GB) of which 183130 (100%) rows used
   0.367s ( 99%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 183130 rows) using 1 threads
   +    0.224s ( 61%) Parse to row-major thread buffers (grown 38 times)
   +    0.128s ( 35%) Transpose
   +    0.015s (  4%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.369s        Total
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0

@e-nascimento
Author

Hi avimallu.

Please try the next row.
DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183131)

Row 183130 is the last one before the crash.

Thanks

@avimallu
Contributor

Gotcha. I am able to reproduce. Verbose until crash (on cmd on Windows):

Show output
> DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183$
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183131) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183131) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821

Interestingly, fread works fine without fill = TRUE. In RStudio it only throws a warning about improper quoting found out-of-sample. The crash might have something to do with that, I suppose?

Show output without fill specified
> DT <- fread(file = "file.csv", sep = ";", verbose = T, nrows = 183131)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 100 lines of 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183131) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183131) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Restarting team from jump 0. nSwept==0 quoteRule==1
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Restarting team from jump 0. nSwept==0 quoteRule==2
  jumps=[0..1), chunk_size=1048576, total_size=42497821
Read 183131 rows x 14 columns from 40.53MB (42497961 bytes) file in 00:00.360 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         1 : float64   '7'
         2 : int32     'A'
         9 : string    'C'
=============================
   0.001s (  0%) Memory map 0.040GB file
   0.000s (  0%) sep=';' ncol=14 and header detection
   0.000s (  0%) Column type detection using 100 sample rows
   0.001s (  0%) Allocation of 183131 rows x 14 cols (0.000GB) of which 183131 (100%) rows used
   0.358s ( 99%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 183131 rows) using 1 threads
   +    0.226s ( 63%) Parse to row-major thread buffers (grown 38 times)
   +    0.125s ( 35%) Transpose
   +    0.007s (  2%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.360s        Total
Warning message:
In fread(file = "file.csv", sep = ";", verbose = T, nrows = 183131) :
  Found and resolved improper quoting out-of-sample. First healed line 183132: <<76.535.764/0001-43;2018-03-31;1;OI S.A.;011312;DF Consolidado - Balanço Patrimonial Passivo;REAL;MIL;PENÚLTIMO;2017-12-31;2.03.02.13;"Senior Notes" Reestruturados;0.0000000000;N>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Session info is the same as before. The first command was run in R from the Command Prompt, the second in RStudio.

@e-nascimento
Author

Hmm, I had not tried it without fill = TRUE because I am always worried about unequal row lengths (especially when I am not sure about the schema).
It works fine except for the warning message.
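
The worry about unequal row lengths can be checked before deciding whether fill = TRUE is needed at all. A minimal sketch using base R's count.fields(), with quoting disabled so stray quote characters are counted as ordinary text. The temporary file and its contents are made up for illustration and are not the file from this issue:

```r
# Hypothetical sample file: one short row (line 3) and one row with a stray quoted token.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "1;2;3", "4;5", "6;\"x\" y;7"), tmp)

# Count fields per line; quote = "" and comment.char = "" make the count purely separator-based.
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

# Lines whose field count differs from the first line are the ragged ones.
ragged <- which(n_fields != n_fields[1])
print(table(n_fields))
print(ragged)
```

If no ragged lines are reported, the file is rectangular with respect to the separator and fill = TRUE may not be needed.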

The issue seems to be related to the quoting I highlighted here (row 183132):
<<76.535.764/0001-43;2018-03-31;1;OI S.A.;011312;DF Consolidado - Balanço Patrimonial Passivo;REAL;MIL;PENÚLTIMO;2017-12-31;2.03.02.13;"Senior Notes" Reestruturados;0.0000000000;N>>

I removed all quotes in the original file and successfully tried:

DT <- fread(file = "file.csv", sep = ";", fill = T)
So the issue must be related to the improper quoting.
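
For reference, removing the quotes as described above can be sketched in base R. This is only an illustration on a made-up temporary file (the file names and rows are hypothetical), and stripping quotes is only safe when the separator never appears inside a quoted field:

```r
# Hypothetical input with one improperly quoted field, similar in shape to the row above.
src <- tempfile(fileext = ".csv")
writeLines(c("id;label;value",
             "1;plain text;0.5",
             "2;\"Senior Notes\" Reestruturados;0.0"), src)

lines <- readLines(src)

# Report which lines contain a double quote before touching anything.
bad <- grep("\"", lines, fixed = TRUE)

# Drop every double quote and write the cleaned copy to a new file.
clean <- gsub("\"", "", lines, fixed = TRUE)
dst <- tempfile(fileext = ".csv")
writeLines(clean, dst)
```

Writing to a separate file keeps the original untouched, so the lines listed in bad can still be inspected afterwards.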

I also tried with the original file:

This works, with a warning (as you mentioned):

DT <- fread(file = "file.csv", sep = ";")

This works without a warning:

DT <- fread(file = "file.csv", sep = ";", quote = "", fill = T)

It seems that you have to explicitly disable quoting (which also suppresses the warning) for fill = T to work.
Apparently, improper quoting plus fill = T crashes R.

Any thoughts?

@tlapak tlapak self-assigned this Jun 14, 2021
@tlapak
Contributor

tlapak commented Jun 14, 2021

For now, in order to read the file, disable quoting as you did in your last line of code. The default value of quote is "\"" and fread will attempt to fix any improperly quoted fields. In combination with fill = TRUE this currently leads to a segfault if the field had already been guessed to be of type character.

Best practice is to escape all literal quote characters in your input file. Failing that, turn off quoting. Otherwise you may end up with unexpected results even if it doesn't crash.
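
A minimal sketch of the workaround described above, assuming data.table is installed. The inline input is modelled on the reprex in this thread but is hypothetical. With quote = "" the double quotes are treated as ordinary characters, so no quote healing is attempted while fill = TRUE is in effect:

```r
library(data.table)

# 100 well-formed rows followed by one row containing a stray, unescaped quoted token.
txt <- paste0(
  paste(rep("a;b", 100), collapse = "\n"),
  "\n\"a\" 2;b\n"
)

# Quoting disabled: the quote characters stay in the data as literal text.
DT <- fread(text = txt, sep = ";", fill = TRUE, quote = "", header = FALSE)

print(dim(DT))
print(DT[.N])
```

The trade-off is that the quote characters remain in the values, so they may need to be removed afterwards if they are not wanted.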

Simpler reprex:

fread(paste0(paste(rep(c('a; b'), 100), collapse = '\n'), c('\n"a" 2;b')), fill = TRUE, quote = '\"')

(I've located the bug but will need another day or two to push a fix, depending on when I find the time.)

@tlapak tlapak closed this as completed Jun 14, 2021
@tlapak tlapak reopened this Jun 14, 2021
@MichaelChirico
Member

MichaelChirico commented Jun 14, 2021 via email

@mattdowle mattdowle added this to the 1.14.1 milestone Jun 21, 2021
@trashbirdecology

Could someone dumb the issue and probable bug down a little? I am running into the same problem (a crash on using fill=TRUE). The reprex above still crashes the instance.

fread(paste0(paste(rep(c('1; 2'), 100), collapse = '\n'), c('\n"a";2\n1; 2\n'), paste(rep(c('1; 2'), 100), collapse = '\n')),
fill = TRUE, verbose = TRUE, header = FALSE)

@ben-schwen
Member

fread(paste0(paste(rep(c('1; 2'), 100), collapse = '\n'), c('\n"a";2\n1; 2\n'), paste(rep(c('1; 2'), 100), collapse = '\n')),
fill = TRUE, verbose = TRUE, header = FALSE)

@trashbirdecology The example works in the current dev version. You can update to it using data.table:::update.dev.pkg() (or data.table:::update_dev_pkg(), depending on which data.table version you have installed locally).
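
Before updating, it can help to check which version is installed. A small sketch, assuming data.table is installed. The 1.15.0 threshold is taken from the milestone recorded on this issue; that the fix actually shipped in that release is an assumption, not something this thread confirms:

```r
# Assumed threshold from the issue milestone; not a confirmed release note.
assumed_fixed_in <- package_version("1.15.0")

# Version of data.table currently installed in the library.
installed <- utils::packageVersion("data.table")

needs_update <- installed < assumed_fixed_in
if (needs_update) {
  message("data.table ", installed, " predates the assumed fix; consider updating.")
} else {
  message("data.table ", installed, " is at or above the assumed fix version.")
}
```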

@trashbirdecology

trashbirdecology commented Jan 10, 2023 via email

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023