Package files are sometimes truncated #795

Closed
ararslan opened this issue Oct 4, 2018 · 23 comments

@ararslan
Member

ararslan commented Oct 4, 2018

We (Invenia) have often run into an issue where, when installing packages on Julia 0.7 and 1.0, random files in a particular package end up truncated at random points. We've seen this most often with AWSSDK, which has a lot of very large files, though the issue is not limited to that package. We noticed it because we've gotten errors on using AWSSDK reporting that, for example, a docstring or module is unterminated.

Some data points:

  • When we set the default concurrency in Context to 1 instead of 8, the problem disappears.

  • We know that the files are intact after the tarballs are extracted.

  • When we replace the mv from the extracted tarball location to the package tree location with Julia's cp, the problem persists, but if we replace it with a run(`cp ...`), the problem goes away.

  • We have only observed this on Debian Stretch runners on our internal GitLab CI, on both 32- and 64-bit systems, but more often on 32-bit. We have not seen it on our Mac CI or on Amazon Linux.

  • We've seen one instance of it happening locally to @rofinn on Elementary OS (based on Ubuntu which is in turn based on Debian) when installing GitHub.jl.

  • The @asyncs in apply_versions are never @synced, and the Channels are never closed (see the sketch after this list).

  • We use multiple registries: the public General registry and a private one for internal packages.
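
To make the apply_versions point above concrete, here is a minimal sketch of that pattern. It is an illustration under assumptions, not Pkg's actual apply_versions code: the worker tasks are started with @async and never waited on, and the job Channel is never closed, so once the real work runs out the workers stay blocked on an empty Channel.

function apply_like_pkg(paths, dest; concurrency = 8)
    jobs    = Channel{String}(concurrency)
    results = Channel{String}(concurrency)
    @async for p in paths                # producer; `jobs` is never closed
        put!(jobs, p)
    end
    for _ in 1:concurrency               # workers; started but never @sync'd
        @async for p in jobs
            out = joinpath(dest, basename(p))
            cp(p, out; force = true)
            put!(results, out)
        end
    end
    return [take!(results) for _ in paths]   # one result per input file
end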

@ararslan ararslan added the bug label Oct 4, 2018
@StefanKarpinski
Member

How large are the files? Any chance that JuliaLang/julia#14574 is related?

@ararslan
Member Author

ararslan commented Oct 4, 2018

The smallest file we've seen truncated was supposed to be 108,741 bytes and was cut off at 65,536 bytes (exactly 64 KiB). This is likely unrelated to that issue, since the files are well under 2 GB.

@KristofferC
Member

When we replace the mv from the extracted tarball location to the package tree location with Julia's cp, the problem persists, but if we replace it with a run(`cp ...`), the problem goes away.

Seems like the most damning to me.

Since it only happens with multiple tasks, perhaps a task switch inside Julia's cp or mv corrupts the file somehow (doesn't finish the transfer)?

@KristofferC
Member

A potential repro is to start a bunch of tasks that do cp and see if the files all arrive as they should.

@ararslan
Member Author

ararslan commented Oct 4, 2018

We tried that; this was our attempt to reproduce:

using Random

function demo(dest; num=10, size=100_000)
    files = 1:num
    num_concurrent_downloads = 8
    channel = Channel(num_concurrent_downloads)
    results = Channel(num_concurrent_downloads)
    # Producer task: feed file names into the channel (never closed, so the
    # worker loops below never terminate on their own)
    @async for i in files
        file = "$(i).txt"
        put!(channel, file)
    end

    isdir(dest) || mkdir(dest)

    # Worker tasks: write each file in pwd, then cp it across to dest
    for i in 1:num_concurrent_downloads
        @async begin
            for file in channel
                # Write `size` random characters to a file in pwd
                open(file, "w+") do io
                    for i in 1:size
                        write(io, randstring(1))
                    end
                end
                output = joinpath(dest, file)
                cp(file, output; force=true)
                put!(results, output)
            end
            println("Game over")
        end
    end

    # Take one result per file and report the size of the copy
    for i in eachindex(files)
        path = take!(results)
        file_size = lstat(path).size
        # if file_size != size
        println("$path: $file_size")
        # end
    end
end

@StefanKarpinski
Member

and... does it reproduce?

@omus
Member

omus commented Oct 4, 2018

No, we couldn't get it to reproduce the issue.

@omus
Member

omus commented Oct 4, 2018

I can reproduce the issue on Debian Stretch using:

# Setup Julia
if [ "$(uname -m)" = "x86_64" ]; then
    curl -L https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.1-linux-x86_64.tar.gz > julia.tar.gz
else
    curl -L https://julialang-s3.julialang.org/bin/linux/x86/1.0/julia-1.0.1-linux-i686.tar.gz > julia.tar.gz
fi
tar xvf julia.tar.gz &> /dev/null
export PATH="$(pwd)/julia-1.0.1/bin:${PATH}"

# Install packages into a fresh depot
export JULIA_DEPOT_PATH="$(pwd)/depot"
rm -rf "$JULIA_DEPOT_PATH"
julia -e 'using Pkg; Pkg.add(["AWSSDK", "FilePaths"]); using AWSSDK'

@omus
Member

omus commented Oct 4, 2018

It doesn't seem to be reproducible on Docker.

@KristofferC
Member

Didn't repro on my Mac.

@ararslan
Member Author

ararslan commented Oct 4, 2018

I'll note that we only seem to be able to reproduce this when writing across filesystem boundaries: Pkg downloads and extracts in /tmp, then moves the files to where our CI builds happen in /mnt, which is on a separate volume. When we have it move within the same volume, it seems to work.
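
For anyone trying to reproduce this, a quick way to confirm the boundary is to compare device IDs from stat; the /mnt path below is just the example from this comment:

# Two paths live on the same volume iff stat reports the same device ID.
same_volume(a, b) = stat(a).device == stat(b).device

same_volume(tempdir(), "/mnt")   # expected to be false on these runners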

@KristofferC
Member

KristofferC commented Oct 4, 2018

Seems like an upstream (Julia) bug ;)

@omus
Member

omus commented Oct 4, 2018

We'll try to work around this by setting TMPDIR so that Pkg writes to the same volume as our CI build.
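
Roughly along these lines; a sketch only, with a hypothetical path, and it assumes tempdir() picks up TMPDIR by the time Pkg runs:

# Point TMPDIR at a directory on the same volume as the CI build before any
# Pkg operation, so download, extraction, and the final move stay together.
ENV["TMPDIR"] = "/mnt/ci/tmp"   # hypothetical path on the build volume
mkpath(ENV["TMPDIR"])

using Pkg
Pkg.add("AWSSDK")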

@ararslan
Member Author

ararslan commented Oct 4, 2018

We haven't tried this yet, but it might be possible to reproduce this by creating separate temporary volumes in a 32-bit Debian VM and setting TMPDIR to point to one of them.

@ararslan
Member Author

ararslan commented Oct 4, 2018

Seems like an upstream (Julia) bug ;)

Yeah, perhaps. In particular, it sounds like libuv.

@KristofferC
Member

JuliaLang/julia#26907 also occurred when crossing filesystem boundaries.

@staticfloat
Member

When crossing filesystem boundaries, there's some logic to switch from e.g. rename() to cp() since you can't rename a file across a filesystem boundary. Seems like a natural place to look for a bad code path; but from what I understand most of that code is within libuv itself, and I have no idea why you'd only get the first 64K of a file.
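
For illustration, here is a hedged, user-level sketch of that fallback pattern; it is not the actual libuv or Base code path:

# rename(2) fails with EXDEV across filesystems, so a cross-volume "move"
# degrades into a full copy followed by deleting the source. A bug in the
# copy path would therefore only show up when crossing volumes.
function move_with_fallback(src::String, dst::String)
    if ccall(:rename, Cint, (Cstring, Cstring), src, dst) != 0
        cp(src, dst; force = true)   # slow path: copy the bytes over
        rm(src; recursive = true)    # then remove the original
    end
    return dst
end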

Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?

@ararslan
Member Author

ararslan commented Oct 5, 2018

Do the truncated lengths follow a particular pattern? E.g. does it look like you always get multiples of 16K or something like that?

That's very interesting. The most files we've seen cut off in a single run is 5, and when we recorded the truncated sizes, they were all multiples of 64K (and thus also of 16K):

File                             Size (bytes)
AWSSDK/dAby7/src/EC2.jl          65536
AWSSDK/dAby7/src/GameLift.jl     65536
AWSSDK/dAby7/src/Greengrass.jl   65536
AWSSDK/dAby7/src/IAM.jl          196608
AWSSDK/dAby7/src/OpsWorks.jl     131072

@KristofferC
Member

KristofferC commented Oct 5, 2018

@vtjnash, sorry for the ping but perhaps you have an idea of what is going on here.

@ararslan
Member Author

ararslan commented Oct 5, 2018

Our current working theory is that the libuv async machinery is interfering with the libuv copy operation on slower filesystems. If the copy proceeds in chunks of a fixed size (apparently 64K), a task switch in the middle could mean the remaining chunks never get transferred (see the sketch after the list below).

That would explain:

  • The stochastic nature of it (the task switch needs to happen during a copy)
  • It happens to us on Debian (our runners happen to have slower disks)
  • When we reduce the concurrency to 1, the problem goes away (no task switch during copy)
  • Using the system's cp doesn't have this problem (libuv can't interrupt it on task switch)
  • Only files larger than 64K are truncated (as far as we've seen)
  • Files are truncated to multiples of 64K
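
As a purely schematic illustration of that theory (not libuv's implementation), a copy done in 64 KiB chunks that stops partway leaves exactly the truncation pattern shown in the table above:

# Copy in 64 KiB chunks; stopping early (here, deliberately) leaves the
# destination truncated at a multiple of the chunk size.
const CHUNK = 65_536

function chunked_copy(src, dst; stop_after::Integer = typemax(Int))
    open(src, "r") do from
        open(dst, "w") do to
            chunks = 0
            while !eof(from) && chunks < stop_after
                write(to, read(from, CHUNK))   # transfer one chunk
                chunks += 1
            end
        end
    end
    return filesize(dst)
end

# Hypothetical call: chunked_copy("EC2.jl", "EC2.copy.jl"; stop_after = 1) == 65536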

@StefanKarpinski
Member

That seems like a very likely analysis. So now the question is why the libuv copy work isn't getting continued. @vtjnash and @Keno are the resident libuv experts. Any ideas, guys?

bors bot added a commit that referenced this issue Oct 8, 2018
797: Add an environment variable to control concurrency r=fredrikekre a=ararslan

This adds an environment variable, tentatively called `JULIA_PKG_CONCURRENCY`, which allows the user to control the number of concurrent downloads that Pkg uses when installing packages, using the previous default of 8 if the variable is not specified.

Setting the number of concurrent downloads to 1 using this variable provides a temporary workaround for #795.

Currently this PR does not document the variable. Suggestions for where that documentation should live would be welcome.

Co-authored-by: Alex Arslan <[email protected]>
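
A short usage sketch of that workaround (the variable name comes from the PR description above; setting it before Pkg does any work is an assumption about when it is read):

ENV["JULIA_PKG_CONCURRENCY"] = "1"   # serial downloads, per the workaround

using Pkg
Pkg.add("AWSSDK")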
@ararslan
Member Author

ararslan commented Nov 6, 2018

I've reported the issue upstream: JuliaLang/julia#29944. Feel free to close this issue here.

@StefanKarpinski
Member

I suppose so, since we don't need to do anything here: once this is fixed upstream, we'll get the fix automatically.
