'Stopped early on line...' when data is large #5378
Comments
Ty for the report. I just ran your example on current dev. The good news: it seems that the issue is fixed in current dev, while it still stops early on
|
Ok bisection shows that #4802 fixed your example. Does the error appear on reading from a file or using |
Checking the line in tmpfile with using Seems like you found a mysterious bug in
N_large = 5000000
ii = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\tbb\tccccccccc\tdddddd\tee\tffff\tg\thhhhhh\tiiii\tjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj\tkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk\tllllll\tmmmmmmmmmmmmmm\tnnnnnnnnn\toooooooo\tppppppp\tqqqqqqq'
tmp = tempfile()
cat(rep(ii,N_large), file=tmp, sep="\n")
system(command = sprintf("cat %s | uniq -c", tmp))
# 4836674 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bb ccccccccc dddddd ee ffff g hhhhhh iiii jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk llllll mmmmmmmmmmmmmm nnnnnnnnn oooooooo ppppppp qqqqqqq
# 1
# 163326 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bb ccccccccc dddddd ee ffff g hhhhhh iiii jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk llllll mmmmmmmmmmmmmm nnnnnnnnn oooooooo ppppppp qqqqqqq |
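The same run-length check can be done from within R, without shelling out to uniq -c. This is a minimal sketch, assuming a much smaller row count (n_small) and a shortened placeholder line (both hypothetical, chosen only so the example runs quickly). rle() on the lines read back reports consecutive identical lines and their counts, so a correctly written file should show exactly one run.

```r
# Hypothetical small-scale version of the check; n_small and line are
# illustrative stand-ins for N_large and ii.
n_small <- 1000L
line <- paste(c("aaa", "bb", "ccc"), collapse = "\t")

tmp <- tempfile()
cat(rep(line, n_small), file = tmp, sep = "\n")

# rle() collapses consecutive identical lines into (count, value) pairs,
# which is the information uniq -c prints.
runs <- rle(readLines(tmp, warn = FALSE))
print(data.frame(count = runs$lengths, line = runs$values))
```

If more than one run shows up, the file on disk already differs from what was intended, which points at the writing step rather than at the reader.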
This is fixed in current devel of R. There is also a separate fix for Feel free to reopen if the issue reoccurs. |
This can still happen when the number of columns is highly variable. For instance, this file has some rows with 3 columns and some with >3000. Is this still the base R issue?
> annot = fread("AdultBrain.genes.annot", header=F, fill=T)
Warning message:
In fread("AdultBrain.genes.annot", header = F, fill = T) :
Stopped early on line 496. Expected 1008 fields but found 1074. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<ENSG00000036448 8:1993155:2093380 rs13272630 rs149509713 rs112324702 rs13264775 rs55913329 rs66739903 rs35389636 rs151271118 rs1562925 rs62488537 rs9650539 rs115417135 rs11778096 rs17740775 rs17740799 rs17668154 rs17668166 rs6983981 rs187829154 rs7000148 rs7004042 rs2600487 rs2701901 rs12677463 rs6559207 rs187395437 rs4735980 rs4735981rs71516150 rs6559208 rs34510005 rs143661164 rs2701899 rs62487807 rs62487810 rs145992668 rs138692710 rs142828803 rs146094290 rs187725070 rs193298979 rs75669442 rs7>>
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.7.12 data.table_1.14.8
loaded via a namespace (and not attached):
[1] compiler_4.2.2 tools_4.2.2 R.methodsS3_1.8.2 R.utils_2.12.2 R.oo_1.25.0 |
@rbutleriii thanks for your example, which also does not work on current dev. There is already a PR, #5119, tackling this issue. Using the version of the PR I can read your file with
Warning message:
In fread("~/Downloads/Adult_brain.genes.annot", fill = TRUE) :
Stopped early on line 496. Expected 1008 fields but found 1074. Consider fill=1074 or even higher ncol estimate. First discarded non-empty line:
Warning message:
In fread("~/Downloads/Adult_brain.genes.annot", fill = 1074) :
Stopped early on line 517. Expected 1074 fields but found 1358. Consider fill=1358 or even higher ncol estimate. First discarded non-empty line:
Warning message:
In fread("~/Downloads/Adult_brain.genes.annot", fill = 1358) :
Stopped early on line 542. Expected 1358 fields but found 1525. Consider fill=1525 or even higher ncol estimate. First discarded non-empty line:
Warning message:
In fread("~/Downloads/Adult_brain.genes.annot", fill = 1525) :
Stopped early on line 968. Expected 1525 fields but found 3435. Consider fill=3435 or even higher ncol estimate. First discarded non-empty line: |
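Rather than raising the column estimate one warning at a time, the widest row can be found up front. The following is a minimal sketch using only base R, on a small made-up ragged file (the file contents and the names n_fields, max_fields and wide are illustrative assumptions, not taken from the thread). count.fields() reports the number of fields on each line, and passing that many col.names to read.table() with fill = TRUE makes it allocate enough columns for the widest row.

```r
# Build a small ragged, tab-separated file: rows with 3, 5 and 2 fields.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "d\te\tf\tg\th", "i\tj"), tmp)

# Count the fields on every line. Quote and comment handling are disabled
# so the counts reflect raw tab-separated fields.
n_fields <- utils::count.fields(tmp, sep = "\t", quote = "", comment.char = "")
max_fields <- max(n_fields)

# Supplying col.names for the widest row makes read.table() create that
# many columns; shorter rows are padded with blank fields.
wide <- utils::read.table(
  tmp, sep = "\t", quote = "", comment.char = "", fill = TRUE,
  header = FALSE, col.names = paste0("V", seq_len(max_fields)),
  stringsAsFactors = FALSE
)
print(dim(wide))
```

On a data.table version where fill accepts an integer, as the warnings from the PR branch above suggest, max_fields could presumably be passed as the fill value instead; that is an assumption about the installed version rather than something this sketch verifies.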
is it possible to prevent |
I have used |
Thanks for your reply. I uploaded a file, Products.txt, downloaded from Drugs@FDA. However, I cannot use a branch that exists only on GitHub, because I intend to submit my package to Bioconductor, which requires dependencies hosted on CRAN or Bioconductor's repositories. I finally chose import
data.table::fread("Products.txt", fill = TRUE)
#> Warning in data.table::fread("Products.txt", fill = TRUE): Stopped early on
#> line 35466. Expected 8 fields but found 9. Consider fill=TRUE and
#> comment.char=. First discarded non-empty line: <<206029 001 TABLET;ORAL EQ 1MG
#> BASE 0 PITAVASTATIN CALCIUM PITAVASTATIN CALCIUM >>
#> ApplNo ProductNo Form Strength
#> <int> <int> <char> <char>
#> 1: 4 4 SOLUTION/DROPS;OPHTHALMIC 1%
#> 2: 159 1 TABLET;ORAL 500MG
#> 3: 552 1 INJECTABLE;INJECTION 20,000 UNITS/ML
#> 4: 552 2 INJECTABLE;INJECTION 40,000 UNITS/ML
#> 5: 552 3 INJECTABLE;INJECTION 5,000 UNITS/ML
#> ---
#> 35460: 206026 1 GAS;INHALATION N/A
#> 35461: 206028 1 CAPSULE, EXTENDED RELEASE;ORAL 7MG
#> 35462: 206028 2 CAPSULE, EXTENDED RELEASE;ORAL 14MG
#> 35463: 206028 3 CAPSULE, EXTENDED RELEASE;ORAL 21MG
#> 35464: 206028 4 CAPSULE, EXTENDED RELEASE;ORAL 28MG
#> ReferenceDrug DrugName ActiveIngredient
#> <int> <char> <char>
#> 1: 0 PAREDRINE HYDROXYAMPHETAMINE HYDROBROMIDE
#> 2: 0 SULFAPYRIDINE SULFAPYRIDINE
#> 3: 0 LIQUAEMIN SODIUM HEPARIN SODIUM
#> 4: 0 LIQUAEMIN SODIUM HEPARIN SODIUM
#> 5: 0 LIQUAEMIN SODIUM HEPARIN SODIUM
#> ---
#> 35460: 0 HELIUM, USP HELIUM
#> 35461: 0 MEMANTINE HYDROCHLORIDE MEMANTINE HYDROCHLORIDE
#> 35462: 0 MEMANTINE HYDROCHLORIDE MEMANTINE HYDROCHLORIDE
#> 35463: 0 MEMANTINE HYDROCHLORIDE MEMANTINE HYDROCHLORIDE
#> 35464: 0 MEMANTINE HYDROCHLORIDE MEMANTINE HYDROCHLORIDE
#> ReferenceStandard
#> <int>
#> 1: 0
#> 2: 0
#> 3: 0
#> 4: 0
#> 5: 0
#> ---
#> 35460: NA
#> 35461: 0
#> 35462: 0
#> 35463: 0
#> 35464: 0
Created on 2023-11-07 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.1 (2023-06-16)
#> os Ubuntu 22.04.3 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz Asia/Shanghai
#> date 2023-11-07
#> pandoc 2.9.2.1 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
#> data.table 1.14.9 2023-11-03 [1] Github (Rdatatable/data.table@e6076b0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.1)
#> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.1)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1)
#> rmarkdown 2.23 2023-07-01 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
#> styler 1.10.1 2023-06-05 [1] CRAN (R 4.3.1)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.1)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.1)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.3.1)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.1)
#>
#> [1] /home/yun/Rlibrary/4.3
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ────────────────────────────────────────────────────────────────────────────── |
@Yunuuuu Ty. Your example works with the branch in #5119 with
In fread("Products.txt", fill = TRUE) :
Stopped early on line 35466. Expected 8 fields but found 9. Consider fill=9 or even higher ncol estimate. First discarded non-empty line: <<206029 001 TABLET;ORAL EQ 1MG BASE 0 PITAVASTATIN CALCIUM PITAVASTATIN CALCIUM >> |
Thank you, @ben-schwen, for your assistance! I appreciate your help in testing the example using the branch in #5119. It's great to hear that it works and fills the purpose well. I understand that it may take some time for this feature to be included in the version available on CRAN. Nevertheless, your contribution is valuable, and I look forward to seeing this feature in a future release. For now, I will use vroom so that the package can be accepted into Bioconductor, although I would prefer to rely solely on data.table for this purpose. |
Hi,
Here is a weird issue (probably a bug) when using fread on a large data set. I have reproduced the issue on different computers and systems.
This is my code:
I always get a warning message:
It stopped on line 4836675 and the following lines were not read in. If I use fill=TRUE, the warning message disappears, but an empty line is inserted in the data table and the total number of rows becomes N_large + 1. I also tried read.table() and I didn't see such an issue.
However, when the data is small, e.g. N_large = 500, there are no errors and no warnings, and fread works well.
Here is my R info:
Here is the fread verbose output: