fread only allows URLs on input, not file #4952

pnacht opened this issue Apr 13, 2021 · 3 comments · Fixed by #5097

pnacht commented Apr 13, 2021

The documentation for fread's file argument states:

[...] a URL starting http:// [...]

However, if I try reading a file via HTTP, I can only get it to work by passing it to the generic input, not file.

#> [1] ‘1.14.0’

x <- ""

fread(x, sep = ";")
#>      1:       FI 00.017.024/0001-53 2020-12-01  1099408 27.493767       1098370
#>      2:       FI 00.017.024/0001-53 2020-12-02  1099497 27.493801       1098371
#>      [...]

fread(file = x, sep = ";")
#> Error in fread(file = x, sep = ";"): File '' does not exist or is non-readable. getwd()=='[...]'

Created on 2021-04-12 by the reprex package (v1.0.0)

# Output of sessionInfo()

#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)

The problem lies in the conditional branch for the input argument in fread.R:


Lines 50 to 117 in ec1259a

    if (input=="" || length(grep('\\n|\\r', input))) {
      # input is data itself containing at least one \n or \r
    } else {
      if (substring(input,1L,1L)==" ") {
        stop("input= contains no \\n or \\r, but starts with a space. Please remove the leading space, or use text=, file= or cmd=")
      }
      str6 = substring(input,1L,6L) # avoid grepl() for #2531
      str7 = substring(input,1L,7L)
      str8 = substring(input,1L,8L)
      if (str7=="ftps://" || str8=="https://") {
        # nocov start
        if (!requireNamespace("curl", quietly = TRUE))
          stop("Input URL requires https:// connection for which fread() requires 'curl' package which cannot be found. Please install 'curl' using 'install.packages('curl')'.") # nocov
        tmpFile = tempfile(fileext = paste0(".",tools::file_ext(input)), tmpdir=tmpdir) # retain .gz extension in temp filename so it knows to be decompressed further below
        curl::curl_download(input, tmpFile, mode="wb", quiet = !showProgress)
        file = tmpFile
        on.exit(unlink(tmpFile), add=TRUE)
        # nocov end
      }
      else if (str6=="ftp://" || str7== "http://" || str7=="file://") {
        # nocov start
        method = if (str7=="file://") "internal" else getOption("download.file.method", default="auto")
        # force "auto" when file:// to ensure we don't use an invalid option (e.g. wget), #1668
        tmpFile = tempfile(fileext = paste0(".",tools::file_ext(input)), tmpdir=tmpdir)
        download.file(input, tmpFile, method=method, mode="wb", quiet=!showProgress)
        # In text mode on Windows-only, R doubles up \r to make \r\r\n line endings. mode="wb" avoids that. See ?connections:"CRLF"
        file = tmpFile
        on.exit(unlink(tmpFile), add=TRUE)
        # nocov end
      }
      else if (length(grep(' ', input, fixed = TRUE)) && !file.exists(input)) { # file name or path containing spaces is not a command
        cmd = input
        if (input_has_vars && getOption("datatable.fread.input.cmd.message", TRUE)) {
          message("Taking input= as a system command ('",cmd,"') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.")
        }
      }
      else {
        file = input # filename
      }
    }
  } # closes the enclosing block, which opens above this excerpt
  if (!is.null(cmd)) {
    (if (.Platform$OS.type == "unix") system else shell)(paste0('(', cmd, ') > ', tmpFile<-tempfile(tmpdir=tmpdir)))
    file = tmpFile
    on.exit(unlink(tmpFile), add=TRUE)
  }
  if (!is.null(file)) {
    file_info = file.info(file)
    if (is.na(file_info$size)) stop("File '",file,"' does not exist or is non-readable. getwd()=='", getwd(), "'")
    if (isTRUE(file_info$isdir)) stop("File '",file,"' is a directory. Not yet implemented.") # dir.exists() requires R v3.2+, #989
    if (!file_info$size) {
      warning("File '", file, "' has size 0. Returning a NULL ",
              if (data.table) 'data.table' else 'data.frame', ".")
      return(if (data.table) data.table(NULL) else data.frame(NULL))
    }
    ext2 = substring(file, nchar(file)-2L, nchar(file)) # last 3 characters ".gz"
    ext3 = substring(file, nchar(file)-3L, nchar(file)) # last 4 characters ".bz2"
    if (ext2==".gz" || ext3==".bz2") {
      if (!requireNamespace("R.utils", quietly = TRUE))
        stop("To read gz and bz2 files directly, fread() requires 'R.utils' package which cannot be found. Please install 'R.utils' using 'install.packages('R.utils')'.") # nocov
      FUN = if (ext2==".gz") gzfile else bzfile
      R.utils::decompressFile(file, decompFile<-tempfile(tmpdir=tmpdir), ext=NULL, FUN=FUN, remove=FALSE) # ext is not used by decompressFile when destname is supplied, but isn't optional
      file = decompFile # don't use 'tmpFile' symbol again, as tmpFile might be the download
      on.exit(unlink(decompFile), add=TRUE)
    }
    file = enc2native(file) # CfreadR cannot handle UTF-8 if that is not the native encoding, see #3078.
    input = file
  }

It looks like the pattern matching using substring for URLs (https://, ftp://, http:// etc.) is performed only for the input argument and not done for file.
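
To make the prefix check concrete, here is a minimal sketch of the same substring() comparison pulled out into a standalone function. The name is_url_like() is hypothetical and is not part of data.table; it only illustrates a test that could, in principle, be applied to file as well as input.

```r
# Hypothetical helper (an assumption for illustration, not data.table API):
# TRUE when `x` starts with one of the URL schemes that the quoted branch
# checks for with substring().
is_url_like = function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  str6 = substring(x, 1L, 6L)
  str7 = substring(x, 1L, 7L)
  str8 = substring(x, 1L, 8L)
  str6 == "ftp://" ||
    str7 %in% c("ftps://", "http://", "file://") ||
    str8 == "https://"
}

is_url_like("http://example.com/data.csv")  # TRUE
is_url_like("data.csv")                     # FALSE
```

In the quoted code this comparison only runs on the input branch, so a URL passed as file goes straight to the file.info() check and fails there.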

Judging by the code, I think the intent was for input to handle URLs rather than file, so this is perhaps an oversight in the documentation.

To others: should I file a pull request to correct the documentation reference? That's something within my capability.

pnacht commented Apr 13, 2021

I noticed that in the source code, but I'm not sure if that is the actual intent or if it was an oversight when writing the code. After all, as I understand it, input is simply meant to be a delegator between text and file, with no special features of its own.

To be defined by those in the know, I suppose.

mattdowle commented May 26, 2021

Great spot! I agree it was an oversight and file= should accept URLs; it's a bug.
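
For anyone on an affected version who wants to keep using file=, one possible workaround (besides passing the URL through input, as in the report) is to download to a temporary file first. This is only a sketch: fread_url() is a hypothetical wrapper, not part of data.table, the URL in the usage comment is a placeholder, and it assumes download.file() can handle the scheme on your platform.

```r
library(data.table)

# Hypothetical wrapper (an assumption for illustration, not data.table API):
# fetch a URL into a temporary file, then pass that local path to fread(file=).
fread_url = function(url, ...) {
  # keep the extension so compressed files (e.g. .gz) are still recognised
  tmp = tempfile(fileext = paste0(".", tools::file_ext(url)))
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" avoids line-ending translation on Windows
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  fread(file = tmp, ...)
}

# Usage (placeholder URL):
# dt = fread_url("http://example.com/data.csv", sep = ";")
```

This mirrors what the quoted input branch does internally, just performed by the caller before fread sees the path.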

mattdowle added this to the 1.14.1 milestone May 26, 2021
mattdowle added the bug label May 26, 2021
jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
