fread occasionally reads in differently rounded non-exact fp numbers than base R #4461

gmbecker · 2020-05-19T21:54:23Z

This is likely a non-issue (I do understand that these numbers are not meaningfully different), and apologies if I missed mention of this in the documentation or previous issues (I did look).

There are certain values (I ran into one in the wild) where fread and read.table (which agrees with R's parser) parse a string representing a floating point number into equivalent but non-identical byte-representations.

Note that this means cached results cannot be trusted to stay valid when read.table calls are upgraded to fread, even though the docs and a naive understanding of what is happening suggest they could.

Reproducible example:

library(data.table)
## data.table 1.12.8 using 12 threads (see ?getDTthreads).  Latest news: r-datatable.com
exchar = "0.8060667366"
exnum = 0.8060667366
rtres = read.table(text = exchar)
rtres
##          V1
## 1 0.8060667
rtval = rtres[1,1]

identical(rtval, exnum)
## [1] TRUE

frres = fread(text = c(exchar, exchar))
frres
##           V1
## 1: 0.8060667
## 2: 0.8060667
frval = frres[1,V1]

identical(frval, exnum)
## [1] FALSE

sprintf("%1.17f", rtval)
## [1] "0.80606673659999994"
sprintf("%1.17f", frval)
## [1] "0.80606673660000006"
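Where an explicit comparison is possible (a regression check, say, rather than a byte-level hash), a tolerance-based test sidesteps the mismatch. A minimal sketch in base R; `a` and `b` are hypothetical stand-ins for the read.table and fread results, with `b` constructed as the next double above `a` rather than read from a file:

```r
# Hypothetical stand-ins for rtval and frval: two doubles one ULP apart.
a <- 0.8060667366
b <- a + 2^(-53)         # doubles in [0.5, 1) are spaced 2^-53 apart

identical(a, b)          # bitwise comparison: FALSE
isTRUE(all.equal(a, b))  # default tolerance of about 1.5e-8: TRUE
```

A hash-based cache would still see different bytes, so this only helps where the comparison itself can be changed.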

> sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)

## Matrix products: default
## BLAS:   <snip>/R-3.6.1/lib64/R/lib/libRblas.so
## LAPACK: <snip>/R-3.6.1/lib64/R/lib/libRlapack.so

## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] data.table_1.12.8

## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 tools_3.6.1

MichaelChirico · 2020-05-20T06:45:38Z

Confirmed on master, which means that #4165 doesn't solve this

MichaelChirico · 2020-05-20T09:35:52Z

seven31 is helpful here -- there's an off-by-one error:

seven31::compare(frval, exnum)
0 01111111110 (-1) 1001110010110100110001111000000000101110010011110000 : frval 
0 01111111110 (-1) 1001110010110100110001111000000000101110010011101111 : exnum

Also, FWIW, I tried to figure out the closest representable number to 0.8060667366 and, IINM, the exnum value is the right one.
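The two bit patterns printed by seven31 can also be decoded by hand to confirm that they differ by exactly one unit in the last place. This is a sketch in base R; `bits_to_double` is a helper introduced here for illustration and is not part of seven31 or data.table:

```r
# Decode an IEEE 754 double from the 52 mantissa bits and the unbiased
# exponent shown in the seven31::compare() output (sign bit 0 in both rows).
bits_to_double <- function(mantissa, exponent) {
  stopifnot(nchar(mantissa) == 52L)
  on <- strsplit(mantissa, "", fixed = TRUE)[[1L]] == "1"
  weights <- 2^(-(1:52))
  # Implicit leading 1 plus the selected negative powers of two. Every
  # partial sum fits in 53 bits, so this arithmetic is exact.
  (1 + sum(weights[on])) * 2^exponent
}

fr_bits <- bits_to_double("1001110010110100110001111000000000101110010011110000", -1)
ex_bits <- bits_to_double("1001110010110100110001111000000000101110010011101111", -1)

fr_bits - ex_bits == 2^(-53)      # TRUE: the gap is one ULP for values in [0.5, 1)
identical(ex_bits, 0.8060667366)  # compares the decoded exnum pattern with the literal
```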

MichaelChirico · 2020-05-20T10:35:01Z

The discrepancy happens here:

#fread.c:parse_double_regular:788
# before: r=8060667366.000000, pow10lookup[e] = 1.0E-10L
#   double double multiplication is breaking the output
r *= pow10lookup[e];
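The mechanism can be sketched outside fread. Multiplying by 1e-10 rounds twice, once when 1e-10 itself is stored and once for the product, whereas dividing by 1e10 rounds only once because both operands are exact. Note the assumption: this uses plain R doubles, while the line above shows a long double constant (1.0E-10L), so it illustrates the idea rather than reproducing fread's arithmetic.

```r
acc <- 8060667366        # the digits of "0.8060667366" as an exact integer

by_mult <- acc * 1e-10   # 1e-10 is already rounded; the product is rounded again
by_div  <- acc / 1e10    # acc and 1e10 are both exact, so only one rounding occurs

sprintf("%.17f", by_mult)
sprintf("%.17f", by_div)

identical(by_div, 0.8060667366)   # TRUE: division matches R's parser here
abs(by_mult - by_div) <= 2^(-52)  # TRUE: the two routes differ by at most 2 ULP
```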

MichaelChirico · 2020-05-20T10:57:17Z

The parser seems to be accurate about 99.99% of the time:

isequal = matrix(NA, 1e5, 2L)
for (ii in 0:99999) {
  exchar = sprintf("0.80606%04d", ii)
  exnum = eval(parse(text=exchar))
  rtval = read.table(text = exchar)[1L, 1L]
  isequal[ii+1L, 1L] = identical(rtval, exnum)
  
  frval = fread(text = c(exchar, exchar))[1, V1]
  isequal[ii+1L, 2L] = identical(frval, exnum)
}

table(isequal[ , 2L])
# FALSE  TRUE 
#    13 99987

And there's something of a pattern in the erroneous cases:

diff(which(!isequal[ , 2]))
#  [1] 13635  9017  5327  3690  5327  9017 11231  9017  5327  3690  5327  9017

mattdowle · 2021-08-27T18:46:27Z

Great investigation and fix @MichaelChirico!

Btw, I would have thought positive powers of 10 up to 10^15 could be stored precisely because 2^52 ~ 4.5e15 (all integers up to that value can be stored precisely). But apparently it's up to 10^22 : https://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/
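The limit of 10^22 follows from writing 10^k as 2^k * 5^k: the factor 2^k only shifts the binary exponent, so 10^k is exact whenever 5^k fits in the 53-bit significand. A quick check in base R, where `pow5` is a name introduced here for illustration:

```r
# 5^1, 5^2, ..., 5^30; each product is exact until it no longer fits in 53 bits.
pow5 <- cumprod(rep(5, 30))

# Largest k with 5^k < 2^53, i.e. the largest k for which 10^k is exact.
max(which(pow5 < 2^53))  # 22
```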

jangorecki added the fread label May 19, 2020

MichaelChirico added the consistency label May 20, 2020

MichaelChirico mentioned this issue May 20, 2020: fix floating point parsing precision in some rare cases #4463 (Merged)

mattdowle added this to the 1.14.1 milestone Aug 27, 2021

mattdowle closed this as completed in #4463 Aug 27, 2021

SimonCMills mentioned this issue Jun 17, 2022: Differences in numeric representation between read.csv and data.table::fread #5406 (Open)

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
