fread crashes R #5041

e-nascimento · 2021-06-11T20:44:07Z

Hello,

I am facing the same issue as described in #2228, and I was able to isolate it.

RStudio crashes if you try to read row 183131. Reading up to row 183130 still works:
DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183130)

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0 tinytex_0.32 xfun_0.23

file.zip

avimallu · 2021-06-12T04:13:59Z

I don't seem to be able to reproduce on Windows (although a slightly newer version than yours). Maybe something else?

Output

> DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183130)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183130) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183130) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821
Read 183130 rows x 14 columns from 40.53MB (42497961 bytes) file in 00:00.369 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         1 : float64   '7'
         2 : int32     'A'
         9 : string    'C'
=============================
   0.001s (  0%) Memory map 0.040GB file
   0.001s (  0%) sep=';' ncol=14 and header detection
   0.000s (  0%) Column type detection using 100 sample rows
   0.000s (  0%) Allocation of 183130 rows x 14 cols (0.000GB) of which 183130 (100%) rows used
   0.367s ( 99%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 183130 rows) using 1 threads
   +    0.224s ( 61%) Parse to row-major thread buffers (grown 38 times)
   +    0.128s ( 35%) Transpose
   +    0.015s (  4%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.369s        Total

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0

e-nascimento · 2021-06-12T11:00:48Z

Hi avimallu.

Please try the next row.
DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183131)

Row 183130 is the last one before the crash.

Thanks

avimallu · 2021-06-12T12:38:28Z

Gotcha. I am able to reproduce. Verbose until crash (on cmd on Windows):

Show output

> DT <- fread(file = "file.csv", sep = ";", fill = T, verbose = T, nrows = 183$
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183131) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183131) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821

Interestingly, fread works fine without specifying fill = TRUE, only emitting a warning about improper quoting found out-of-sample (this run was in RStudio). Might have something to do with that, I suppose?

Show output without fill specified

> DT <- fread(file = "file.csv", sep = ";", verbose = T, nrows = 183131)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            12
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          12
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 6 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file file.csv
  File opened, size = 40.53MB (42497961 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ';'
  sep=';'  with 100 lines of 14 fields using quote rule 0
  Detected 14 columns on line 1. This line is either column names or first data row. Line starts as: <<CNPJ_CIA;DT_REFER;VERSAO;DENOM>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 14
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (183131) supplied
  Type codes (jump 000)    : CA5C5CCCCACC7C  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CA5C5CCCCACC7C
[10] Allocate memory for the datatable
  Allocating 14 column slots (14 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Too few rows allocated. Allocating additional 219656 rows (now nrows=183131) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Restarting team from jump 0. nSwept==0 quoteRule==1
  jumps=[0..1), chunk_size=1048576, total_size=42497821
  Restarting team from jump 0. nSwept==0 quoteRule==2
  jumps=[0..1), chunk_size=1048576, total_size=42497821
Read 183131 rows x 14 columns from 40.53MB (42497961 bytes) file in 00:00.360 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         1 : float64   '7'
         2 : int32     'A'
         9 : string    'C'
=============================
   0.001s (  0%) Memory map 0.040GB file
   0.000s (  0%) sep=';' ncol=14 and header detection
   0.000s (  0%) Column type detection using 100 sample rows
   0.001s (  0%) Allocation of 183131 rows x 14 cols (0.000GB) of which 183131 (100%) rows used
   0.358s ( 99%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 183131 rows) using 1 threads
   +    0.226s ( 63%) Parse to row-major thread buffers (grown 38 times)
   +    0.125s ( 35%) Transpose
   +    0.007s (  2%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.360s        Total
Warning message:
In fread(file = "file.csv", sep = ";", verbose = T, nrows = 183131) :
  Found and resolved improper quoting out-of-sample. First healed line 183132: <<76.535.764/0001-43;2018-03-31;1;OI S.A.;011312;DF Consolidado - Balanço Patrimonial Passivo;REAL;MIL;PENÚLTIMO;2017-12-31;2.03.02.13;"Senior Notes" Reestruturados;0.0000000000;N>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Session Info is the same as before - first command run on R on Command Prompt, second on RStudio.

e-nascimento · 2021-06-12T14:37:19Z

Hmm, I had not tried it without fill = TRUE because I am always worried about unequal row lengths (especially when I am not sure about the schema).
It works fine except for the warning message.

The issue seems to be related to the quoting I highlighted here (row 183132):
<<76.535.764/0001-43;2018-03-31;1;OI S.A.;011312;DF Consolidado - Balanço Patrimonial Passivo;REAL;MIL;PENÚLTIMO;2017-12-31;2.03.02.13;"Senior Notes" Reestruturados;0.0000000000;N>>

I removed all quotes from the original file, and this then ran successfully:

DT <- fread(file = "file.csv", sep = ";", fill = T)
So the issue must be related to the improper quoting.

I also tried with the original file.

This works, with a warning (as you mentioned):

DT <- fread(file = "file.csv", sep = ";")

This works fine:

DT <- fread(file = "file.csv", sep = ";", quote = "", fill = T)

It seems that you have to explicitly disable quoting (which also suppresses the warning) for fill = T to work.
Apparently, improper quoting plus fill = T crashes R.

Any thoughts?
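
One way to check which lines of a file carry this kind of quoting is to scan it in base R before calling fread. The sketch below is illustrative only: the sample lines are made up, the helper name has_stray_quote is hypothetical, and splitting on ";" assumes the separator never appears inside a field.

```r
# Made-up sample lines; only the third mimics the quoting seen in row 183132.
lines <- c(
  "id;label",
  "1;plain text",
  "2;\"Senior Notes\" Reestruturados",
  "3;\"fully quoted\""
)

# Naive split: assumes ';' never occurs inside a field.
fields <- strsplit(lines, ";", fixed = TRUE)

# Hypothetical helper: TRUE for a field that contains a double quote
# but is not wrapped in exactly one pair of quotes from start to end.
has_stray_quote <- function(f) {
  grepl("\"", f, fixed = TRUE) & !grepl("^\"[^\"]*\"$", f)
}

# Indices of lines with at least one such field.
bad_lines <- which(vapply(fields, function(f) any(has_stray_quote(f)), logical(1)))
print(bad_lines)
```

For a real file the lines could come from readLines("file.csv"); with accented characters an encoding argument may be needed, depending on the locale.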

tlapak · 2021-06-14T21:43:17Z

For now, in order to read the file, disable quoting as you did in your last line of code. The default value of quote is "\"" and fread will attempt to fix any improperly quoted fields. In combination with fill = TRUE this currently leads to a segfault if the field had already been guessed to be of type character.

Best practice is to escape all literal quote characters in your input file. Failing that, turn off quoting. Otherwise you may end up with unexpected results even if it doesn't crash.

Simpler reprex:

fread(paste0(paste(rep(c('a; b'), 100), collapse = '\n'), c('\n"a" 2;b')), fill = TRUE, quote = '\"')

(I've located the bug but will need another day or two to push a fix, depending on when I find the time.)
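
The workaround described above can be tried on a small stand-in file. This is a sketch under assumptions: the two-column sample is made up (it is not the attached file.csv), and it only shows that with quote = "" the quote characters are kept as literal text. It does not exercise the crash itself.

```r
library(data.table)

# Made-up sample: the last row has a field that starts with a quoted
# word followed by more text, like the row highlighted in the report.
lines <- c(
  "id;label",
  "1;plain text",
  "2;\"Senior Notes\" Reestruturados"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# quote = "" turns quote handling off, so fread does not try to repair
# the field and the double quotes stay in the value as ordinary characters.
DT <- fread(file = tmp, sep = ";", quote = "", fill = TRUE, header = TRUE)
print(DT$label)

unlink(tmp)
```

The trade-off is that fields which really are quoted keep their surrounding quote characters and would need to be cleaned up afterwards.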

MichaelChirico · 2021-06-14T21:46:47Z

Vaclav FYI if you've found a fix, please search the issue tracker for related issues -- I have a sense that there's another bug related to fread segfaulting with fill=TRUE. Would be great if multiple bugs are fixed together!


trashbirdecology · 2023-01-10T16:21:20Z

Could someone dumb the issue and probable bug down a little? I am running into the same problem (a crash on using fill=TRUE). The reprex above still crashes the instance.

fread(paste0(paste(rep(c('1; 2'), 100), collapse = '\n'), c('\n"a";2\n1; 2\n'), paste(rep(c('1; 2'), 100), collapse = '\n')),
fill = TRUE, verbose = TRUE, header = FALSE)

ben-schwen · 2023-01-10T18:53:55Z

fread(paste0(paste(rep(c('1; 2'), 100), collapse = '\n'), c('\n"a";2\n1; 2\n'), paste(rep(c('1; 2'), 100), collapse = '\n')),
fill = TRUE, verbose = TRUE, header = FALSE)

@trashbirdecology Example is working in current dev. You can update to dev version using data.table:::update.dev.pkg() (or data.table:::update_dev_pkg() depending on which data.table version you have installed locally)

trashbirdecology · 2023-01-10T22:06:09Z

Benjamin -- okay thanks, I will try that! Apologies if I totally missed that above.


jangorecki added the fread label Jun 12, 2021

tlapak self-assigned this Jun 14, 2021

tlapak added the segfault label Jun 14, 2021

tlapak closed this as completed Jun 14, 2021

tlapak reopened this Jun 14, 2021

tlapak mentioned this issue Jun 16, 2021

Fix fread crashes #5046

Merged

mattdowle added this to the 1.14.1 milestone Jun 21, 2021

mattdowle closed this as completed in #5046 Jun 21, 2021

MichaelChirico mentioned this issue Jul 2, 2021

fread seg fault #5062

Closed

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
