Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread occasionally reads in differently rounded non-exact fp numbers than base R #4461

Closed
gmbecker opened this issue May 19, 2020 · 5 comments · Fixed by #4463
Closed

fread occasionally reads in differently rounded non-exact fp numbers than base R #4461

gmbecker opened this issue May 19, 2020 · 5 comments · Fixed by #4463

Comments

@gmbecker
Copy link

This is likely a non-issue (I do understand that these numbers are not meaningfully different). And Apologies if I missed mention of this in the documentation or prevoius issues (I did look).

There are certain values (I ran into one in the wild) where fread and read.table (which agrees with R's parser) parse a string representing a floating point number into equivalent but non-identical byte-representations.

Note this will mean that caching cannot be trusted to stay non-stale when upgrading read.table calls to fread, where the docs and a naive-understanding of what is happening would suggest they could.

Reproducible example:

library(data.table)
## data.table 1.12.8 using 12 threads (see ?getDTthreads).  Latest news: r-datatable.com
exchar = "0.8060667366"
exnum = 0.8060667366
rtres = read.table(text = exchar)
rtres
##          V1
## 1 0.8060667
rtval = rtres[1,1]

identical(rtval, exnum)
## [1] TRUE

frres = fread(text = c(exchar, exchar))
frres
##           V1
## 1: 0.8060667
## 2: 0.8060667
frval = frres[1,V1]

identical(frval, exnum)
## [1] FALSE

sprintf("%1.17f", rtval)
## [1] "0.80606673659999994"
sprintf("%1.17f", frval)
## [1] "0.80606673660000006"

> sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)

## Matrix products: default
## BLAS:   <snip>/R-3.6.1/lib64/R/lib/libRblas.so
## LAPACK: <snip>/R-3.6.1/lib64/R/lib/libRlapack.so

## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] data.table_1.12.8

## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 tools_3.6.1   
@MichaelChirico
Copy link
Member

Confirmed on master, which means that #4165 doesn't solve this

@MichaelChirico
Copy link
Member

MichaelChirico commented May 20, 2020

seven31 is helpful here -- there's an off-by-one error:

seven31::compare(frval, exnum)
0 01111111110 (-1) 1001110010110100110001111000000000101110010011110000 : frval 
0 01111111110 (-1) 1001110010110100110001111000000000101110010011101111 : exnum 

Screenshot 2020-05-20 at 5 35 26 PM

Also FWIW i tried to figure out the closest representable number to 0.8060667366 and IINM the exnum value is the right one

@MichaelChirico
Copy link
Member

MichaelChirico commented May 20, 2020

The discrepancy happens here:

#fread.c:parse_double_regular:788
# before: r=8060667366.000000, pow10lookup[e] = 1.0E-10L
#   double double multiplication is breaking the output
r *= pow10lookup[e];

@MichaelChirico
Copy link
Member

MichaelChirico commented May 20, 2020

The parser seems to be accurate about 99.99% of the time:

isequal = matrix(NA, 1e5, 2L)
for (ii in 0:99999) {
  exchar = sprintf("0.80606%04d", ii)
  exnum = eval(parse(text=exchar))
  rtval = read.table(text = exchar)[1L, 1L]
  isequal[ii+1L, 1L] = identical(rtval, exnum)
  
  frval = fread(text = c(exchar, exchar))[1, V1]
  isequal[ii+1L, 2L] = identical(frval, exnum)
}

table(isequal[ , 2L])
# FALSE  TRUE 
#    13 99987

And there's something of a pattern in the erroneous cases:

diff(which(!isequal[ , 2]))
#  [1] 13635  9017  5327  3690  5327  9017 11231  9017  5327  3690  5327  9017

@mattdowle
Copy link
Member

Great investigation and fix @MichaelChirico!

Btw, I would have thought positive powers of 10 up to 10^15 could be stored precisely because 2^52 ~ 4.5e15 (all integers up to that value can be stored precisely). But apparently it's up to 10^22 : https://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants