fread only allows URLs on input, not file #4952

Closed
pnacht opened this issue Apr 13, 2021 · 3 comments · Fixed by #5097
pnacht commented Apr 13, 2021

The documentation for fread's argument file states

[...] a URL starting http:// [...]

However, if I try reading a file via HTTP, I can only get it to work by passing it to the generic input, not file.

library(data.table)
packageVersion("data.table")
#> [1] ‘1.14.0’

x <- "http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202012.csv"

fread(x, sep = ";")
#>         TP_FUNDO         CNPJ_FUNDO  DT_COMPTC VL_TOTAL  VL_QUOTA VL_PATRIM_LIQ
#>      1:       FI 00.017.024/0001-53 2020-12-01  1099408 27.493767       1098370
#>      2:       FI 00.017.024/0001-53 2020-12-02  1099497 27.493801       1098371
#>      [...]

fread(file = x, sep = ";")
#> Error in fread(file = x, sep = ";"): File 'http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202012.csv' does not exist or is non-readable. getwd()=='[...]'

Created on 2021-04-12 by the reprex package (v1.0.0)

# Output of sessionInfo()

sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] fansi_0.4.2       crayon_1.4.1      utf8_1.2.1        digest_0.6.27    
#>  [5] backports_1.2.1   lifecycle_1.0.0   reprex_1.0.0      magrittr_2.0.1   
#>  [9] evaluate_0.14     pillar_1.5.1      highr_0.8         stringi_1.5.3    
#> [13] rlang_0.4.10      fs_1.5.0          vctrs_0.3.7       ellipsis_0.3.1   
#> [17] rmarkdown_2.7     styler_1.3.2      tools_4.0.5       stringr_1.4.0    
#> [21] glue_1.4.2        purrr_0.3.4       xfun_0.22         yaml_2.2.1       
#> [25] compiler_4.0.5    pkgconfig_2.0.3   htmltools_0.5.1.1 knitr_1.31       
#> [29] tibble_3.1.0
@avimallu
Contributor

The problem lies in the conditional branch for the input argument in fread.R:

data.table/R/fread.R

Lines 50 to 117 in ec1259a

  if (input=="" || length(grep('\\n|\\r', input))) {
    # input is data itself containing at least one \n or \r
  } else {
    if (substring(input,1L,1L)==" ") {
      stop("input= contains no \\n or \\r, but starts with a space. Please remove the leading space, or use text=, file= or cmd=")
    }
    str6 = substring(input,1L,6L) # avoid grepl() for #2531
    str7 = substring(input,1L,7L)
    str8 = substring(input,1L,8L)
    if (str7=="ftps://" || str8=="https://") {
      # nocov start
      if (!requireNamespace("curl", quietly = TRUE))
        stop("Input URL requires https:// connection for which fread() requires 'curl' package which cannot be found. Please install 'curl' using 'install.packages('curl')'.") # nocov
      tmpFile = tempfile(fileext = paste0(".",tools::file_ext(input)), tmpdir=tmpdir) # retain .gz extension in temp filename so it knows to be decompressed further below
      curl::curl_download(input, tmpFile, mode="wb", quiet = !showProgress)
      file = tmpFile
      on.exit(unlink(tmpFile), add=TRUE)
      # nocov end
    }
    else if (str6=="ftp://" || str7== "http://" || str7=="file://") {
      # nocov start
      method = if (str7=="file://") "internal" else getOption("download.file.method", default="auto")
      # force "auto" when file:// to ensure we don't use an invalid option (e.g. wget), #1668
      tmpFile = tempfile(fileext = paste0(".",tools::file_ext(input)), tmpdir=tmpdir)
      download.file(input, tmpFile, method=method, mode="wb", quiet=!showProgress)
      # In text mode on Windows-only, R doubles up \r to make \r\r\n line endings. mode="wb" avoids that. See ?connections:"CRLF"
      file = tmpFile
      on.exit(unlink(tmpFile), add=TRUE)
      # nocov end
    }
    else if (length(grep(' ', input, fixed = TRUE)) && !file.exists(input)) { # file name or path containing spaces is not a command
      cmd = input
      if (input_has_vars && getOption("datatable.fread.input.cmd.message", TRUE)) {
        message("Taking input= as a system command ('",cmd,"') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.")
      }
    }
    else {
      file = input # filename
    }
  }
}
if (!is.null(cmd)) {
  (if (.Platform$OS.type == "unix") system else shell)(paste0('(', cmd, ') > ', tmpFile<-tempfile(tmpdir=tmpdir)))
  file = tmpFile
  on.exit(unlink(tmpFile), add=TRUE)
}
if (!is.null(file)) {
  file_info = file.info(file)
  if (is.na(file_info$size)) stop("File '",file,"' does not exist or is non-readable. getwd()=='", getwd(), "'")
  if (isTRUE(file_info$isdir)) stop("File '",file,"' is a directory. Not yet implemented.") # dir.exists() requires R v3.2+, #989
  if (!file_info$size) {
    warning("File '", file, "' has size 0. Returning a NULL ",
      if (data.table) 'data.table' else 'data.frame', ".")
    return(if (data.table) data.table(NULL) else data.frame(NULL))
  }
  ext2 = substring(file, nchar(file)-2L, nchar(file)) # last 3 characters ".gz"
  ext3 = substring(file, nchar(file)-3L, nchar(file)) # last 4 characters ".bz2"
  if (ext2==".gz" || ext3==".bz2") {
    if (!requireNamespace("R.utils", quietly = TRUE))
      stop("To read gz and bz2 files directly, fread() requires 'R.utils' package which cannot be found. Please install 'R.utils' using 'install.packages('R.utils')'.") # nocov
    FUN = if (ext2==".gz") gzfile else bzfile
    R.utils::decompressFile(file, decompFile<-tempfile(tmpdir=tmpdir), ext=NULL, FUN=FUN, remove=FALSE) # ext is not used by decompressFile when destname is supplied, but isn't optional
    file = decompFile # don't use 'tmpFile' symbol again, as tmpFile might be the http://domain.org/file.csv.gz download
    on.exit(unlink(decompFile), add=TRUE)
  }
  file = enc2native(file) # CfreadR cannot handle UTF-8 if that is not the native encoding, see #3078.
  input = file

It looks like the pattern matching using substring for URLs (https://, ftp://, http:// etc.) is performed only for the input argument and not done for file.

Judging by the code, I think the intent was for input to handle URLs and for file not to, so the mention of URLs under the file argument is perhaps an oversight in the documentation.
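To make the prefix check concrete, here is a minimal sketch of the same substring-based test pulled out into a standalone function. The name is_url is hypothetical and not part of data.table; the list of schemes is taken from the excerpt above. It only illustrates the kind of check that currently runs on input and not on file.

```r
# Hypothetical helper (not part of data.table): TRUE when `x` is a single
# string starting with one of the URL schemes checked in the excerpt above.
is_url <- function(x) {
  schemes <- c("ftp://", "ftps://", "http://", "https://", "file://")
  if (!is.character(x) || length(x) != 1L || is.na(x)) return(FALSE)
  # Compare the leading characters of `x` against each scheme, mirroring
  # the substring() comparisons in fread.R.
  any(substring(x, 1L, nchar(schemes)) == schemes)
}

is_url("http://example.org/data.csv")  # a URL
is_url("data.csv")                     # a plain file name
```

A check like this could in principle be applied to either argument before the file.info() test, but how the real code should be restructured is for the maintainers to decide.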

To others: should I file a pull request to correct the documentation reference? That's something within my capability.

@pnacht
Author

pnacht commented Apr 13, 2021

I noticed that in the source code, but I'm not sure if that is the actual intent or if it was an oversight when writing the code. After all, as I understand it, input is simply meant to be a delegator between text and file, with no special features of its own.

To be defined by those in the know, I suppose.

@mattdowle
Member

mattdowle commented May 26, 2021

Great spot! I agree it was an oversight and file= should accept URLs; it's a bug.
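For anyone on an affected version (such as 1.14.0 in the report above) who wants to keep using file= explicitly, one possible workaround is to download the URL to a temporary file first and pass that path to fread. The sketch below is an assumption about how one might do this, not the fix that was merged; download_to_temp is a hypothetical name, and the example URL in the usage comment is a placeholder.

```r
# Hypothetical helper (not part of data.table): download `url` to a temporary
# file, keeping the file extension, and return the local path.
download_to_temp <- function(url, quiet = TRUE) {
  tmp <- tempfile(fileext = paste0(".", tools::file_ext(url)))
  # mode = "wb" avoids line-ending translation on Windows, as noted in fread.R
  status <- utils::download.file(url, tmp, mode = "wb", quiet = quiet)
  if (status != 0L) stop("Download failed for '", url, "'")
  tmp
}

# Usage (requires network access and data.table; not run here):
# path <- download_to_temp("http://example.org/data.csv")
# dt <- data.table::fread(file = path, sep = ";")
# unlink(path)
```

The caller is responsible for removing the temporary file once the data has been read.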

@mattdowle mattdowle added this to the 1.14.1 milestone May 26, 2021
@mattdowle mattdowle added the bug label May 26, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023