read fallback to ntasks=1 not always working #1157

Open
Crefok opened this issue Mar 10, 2025 · 0 comments

Hey Julia Community,
I am very new to Julia, but what I have seen so far is very good. In a project I read some CSV files with CSV.jl, with Threads.nthreads() set to 60.
Some CSV files could not be read, and the program exited with the following error message:

ERROR: TaskFailedException

    nested task error: thread = 38 fatal error, encountered an invalidly quoted field while parsing around row = 127, col = 1: ""outcviusqgvvejbjwrbumoedfhtdyiorvqueekyfhwzegowxkzomzskinamwxiimajggitwcymyxnjtpuhtbwngpunlwelyfkpfo
    vqosvsysvoqkxgzaepzvrbrbneqpidrcrhgsmglapotilebnkntoecqywbxwaiiticlzbpslhkkyjvujwddoduzmixjpznipcptb
    
    ", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself
    
    Stacktrace:
     [1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:596
     [2] parsevalue!(::Type{…}, buf::Vector{…}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:804
     [3] parserow
       @ ~/.julia/packages/CSV/XLcqT/src/file.jl:646 [inlined]
     [4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{…}, ::Type{…})
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:556
     [5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{…}, rowchunkguess::Int64, i::Int64, rows::Vector{…}, wholecolumnslock::ReentrantLock)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:366
     [6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
       @ CSV ~/.julia/packages/WorkerUtilities/ey0fP/src/WorkerUtilities.jl:384
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:455
 [2] macro expansion
   @ ./task.jl:487 [inlined]
 [3] CSV.File(ctx::CSV.Context, chunking::Bool)
   @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:240
 [4] File
   @ ~/.julia/packages/CSV/XLcqT/src/file.jl:227 [inlined]
 [5] #File#32
   @ ~/.julia/packages/CSV/XLcqT/src/file.jl:223 [inlined]
 [6] #read#118
   @ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:117 [inlined]
 [7] read
   @ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:113 [inlined]
 [8] top-level scope
   @ ./REPL[208]:3
Some type information was truncated. Use `show(err)` to see complete types.

The data I am using here is generated by the following code:

using Random
using CSV

factor = 100
open(joinpath(@__DIR__, "test.csv"), "w") do file
    write(file, "a;b;c;d\n")
    write(file, randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    for i in 1:1000
        write(file, "\""*randstring('a':'z', 1*factor)*"\n"*randstring('a':'z', 1*factor)*"\n"*"\n"*randstring('a':'z', 1*factor)*";"*randstring('a':'z', 1*factor)*"\""
        *";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    end
end

I tried to generate a CSV file that looks similar to the real-world data I am facing. The real data has many more columns, but that doesn't matter here. The problem appears to be caused by splitting the input file into several chunks and reading them in parallel. That parallel reading is a big advantage of this library and makes reading CSV files very fast. A simple workaround is to set ntasks to 1, so that the file can be read and parsed as a DataFrame.
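To illustrate why a chunk boundary is hard to place in such a file, here is a toy sketch. It only shows the general difficulty with newlines inside quoted fields and is not CSV.jl's actual chunking logic; `inside_quoted` is a hypothetical helper made up for this example.

```julia
# Toy input: a header row, then a row whose first field is quoted and spans two lines.
data = "a;b\n\"x\ny\";z\n"

# Hypothetical helper (not part of CSV.jl): a position lies inside a quoted field
# when an odd number of quote characters has been seen up to that position.
inside_quoted(s::AbstractString, pos::Integer) =
    isodd(count(==('"'), SubString(s, 1, pos)))

# Every newline is a candidate chunk boundary, but not every newline ends a row.
newlines = [i for (i, c) in pairs(data) if c == '\n']

for pos in newlines
    label = inside_quoted(data, pos) ? "inside a quoted field" : "real row end"
    println("newline at index ", pos, ": ", label)
end
```

A parser that starts in the middle of the file cannot know this parity without scanning from the beginning, so it has to guess where a row starts.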

On one execution I got the following message:

┌ Error: Multithreaded parsing failed and fell back to single-threaded parsing. This can happen if the input contains multi-line fields; otherwise, please report this issue.
└ @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:579

After I saw this message, I wanted to share my findings and ask why this fallback isn't used every time.
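The fallback can also be forced by hand. The following is a minimal sketch, assuming that any failure of the first attempt justifies one single-threaded retry; `read_with_fallback` is a hypothetical helper and not part of CSV.jl.

```julia
using CSV, DataFrames

# Hypothetical helper (not part of CSV.jl): try the normal read first,
# then retry once with ntasks=1 if anything goes wrong.
function read_with_fallback(path::AbstractString; kwargs...)
    try
        # First attempt: CSV.jl picks the number of tasks (or uses the caller's ntasks).
        return CSV.read(path, DataFrame; kwargs...)
    catch err
        # Assumption: every error is treated as a reason to retry single-threaded.
        # A second failure is not caught and propagates to the caller.
        @warn "Parsing failed; retrying with ntasks=1" exception = err
        return CSV.read(path, DataFrame; kwargs..., ntasks=1)
    end
end
```

With the test file from above it would be called as `read_with_fallback(joinpath(@__DIR__, "test.csv"); quotechar='"', escapechar='"', delim=';')`. The price is that a failing file gets parsed twice.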

My code to read the csv file:

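using CSV, DataFrames

# Try every ntasks value from 1 to 60 and print it, so the last printed number is the one that fails.
for i in 1:60
    println(i)
    CSV.read(joinpath(@__DIR__, "test.csv"), DataFrame; quotechar='"', escapechar='"', delim=';', ntasks=i)
end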

I do this in a for loop to find the ntasks value that crashes. Currently it is 8, but I would guess that depends on the input data.

I am currently using Julia version 1.10.7
with the CSV (v0.10.15) and DataFrames (v1.7.0) packages, which have the following UUIDs:
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
