CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953
Hey @ladar.
So I don't know definitively what caused the breakage between the 20220717 and 20220722 images. Nothing in the repo commit log looked like a culprit, so the issue could be at a lower level. I also don't know which change fixed the problem, but I wasn't able to get the workflow to run reliably until I increased the swap size from 4096 to 8192. Some of my other changes may have also helped, in particular those that reduced the memory footprint. The entire fix:

    - name: Increase Limits
      run: |
        sudo sysctl -q vm.overcommit_ratio=100
        sudo sysctl -q net.unix.max_dgram_qlen=64
        sudo prlimit --pid $$ --nproc=65536:65536
        sudo prlimit --pid $$ --nofile=500000:500000
        printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
        printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
        printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
        printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
        sudo systemctl daemon-reload
        systemctl --user daemon-reload

    - name: Increase Swap
      run: |
        sudo dd if=/dev/zero of=/swap bs=1M count=4096 status=none
        sudo chmod 600 /swap
        sudo mkswap /swap
        sudo swapon /swap

I had to rerun the prlimit increase commands for each build step, like so:

    - name: Demonstrate Limit Increases for a Build Step
      env:
        GOGC: 50
        GOMAXPROCS: 1
      run: |
        date +"%nStarting a sample/example build step at %r on %x%n"
        sudo prlimit --pid $$ --nproc=65536:65536
        sudo prlimit --pid $$ --nofile=500000:500000
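To confirm the increases actually took effect inside a later step, it is enough to print the limits and the swap state from that step. The step below is only a minimal verification sketch, not part of the original workflow; it simply echoes the numbers used above.

    - name: Verify Limits and Swap
      run: |
        # Soft and hard open-file limits for this step's shell
        ulimit -Sn
        ulimit -Hn
        # Per-process view of the NOFILE and NPROC limits
        prlimit --pid $$ --nofile --nproc
        # Confirm the swap file is active
        swapon --show
        free -h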
@al-cheb I saw you closed the bug. Were you able to figure out what changed between the 20220717 and 20220721 images?
I may be wrong, but I came across this issue while trying to remember how I previously helped a co-worker correct a problem with the open files soft limit. What I recall is that while the kernel was configured to support a sufficiently large ulimit maximum, the GitHub runner was spawned from a process under some ancestor systemd unit that was either explicitly setting, or implicitly accepting, a reduced soft limit. The way these things are inherited, you can always opt into a smaller limit, but you cannot raise your limit higher than what you inherit at process creation time. In order for our own sufficiently large ulimit -n setting to be effective, we had to modify that ancestor unit so it did not restrict itself and its children quite so much. I don't have that running example environment, or access to one like it, just now; so while I am sure we fixed this by raising the soft ulimit for open files in some ancestor systemd unit file, I cannot recall which one we had to target for GitHub Actions to inherit the correct limit all the way down from PID 1. Anyhow, I see this thread is relatively recent, so I thought I'd share what I had retained in case it is of some help.
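For what it's worth, the usual way to raise the limit for one specific ancestor unit (rather than changing the global defaults) is a systemd drop-in override. The unit name below is purely a placeholder, since the comment above could not recall which unit was involved; this is a sketch of the mechanism, not the exact fix that was applied.

    # Drop-in override for the (hypothetical) unit that ends up spawning the runner
    sudo mkdir -p /etc/systemd/system/SOME-RUNNER-UNIT.service.d
    printf "[Service]\nLimitNOFILE=500000\n" | sudo tee /etc/systemd/system/SOME-RUNNER-UNIT.service.d/limits.conf > /dev/null
    sudo systemctl daemon-reload
    sudo systemctl restart SOME-RUNNER-UNIT.service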
This can be a hard issue to track down, because it runs contrary to how *nix systems historically handled the open file limit (i.e., a hard maximum enforced by the kernel and adjusted via the ulimit interface). It's also an issue that gets buried inside the morass that is systemd, which is still an opaque MacGuffin to the average system administrator. The quickest fix is:
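(The snippet that followed here appears to have been dropped from this copy of the thread. Based on the commands shown earlier, the global fix would look something like this; the 500000:500000 value simply mirrors the one used above.)

    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/system.conf > /dev/null
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/user.conf > /dev/null
    sudo systemctl daemon-reload
    systemctl --user daemon-reload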
This will increase the default limits for everything, and save you lots of grief. Of course, this assumes you're not on a resource-constrained system shared by many; if that is the case, you'll need a more granular approach. To see where the limits currently stand, run this:

    echo 'user' ; systemctl --user show | grep NOFILE
    echo 'system' ; sudo systemctl show | grep NOFILE

And don't forget: the kernel maximum and the ulimit values still apply, so update those as well, if necessary.
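For completeness, checking and adjusting the kernel-level and per-shell values mentioned in that last sentence looks roughly like this. It is a generic sketch rather than something from the original thread; the 500000 figure again just reuses the number from the earlier workflow.

    # Kernel-wide cap on open file handles, and the per-process ceiling
    sysctl fs.file-max
    sysctl fs.nr_open

    # Raise them if they are below what you need
    sudo sysctl -w fs.file-max=500000
    sudo sysctl -w fs.nr_open=500000

    # Current soft/hard limits for this shell, and raising the soft limit
    # (raising the soft limit only succeeds if the hard limit allows it)
    ulimit -Sn
    ulimit -Hn
    ulimit -Sn 500000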
Description
My GitHub Actions workflows started failing on 7/22/22. After lots of troubleshooting, I realized a new virtual environment image was pushed between my successful run on the 21st and the failures starting on the 22nd.
I can't check what the limits were on the previous Ubuntu image, but for the current one, the build steps are launching with systemd limits of:
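(The output listing that originally appeared here is missing from this copy. It can be regenerated from inside a build step with the same commands used later in this thread; per the issue title, the relevant line reported DefaultLimitNOFILESoft=1024.)

    systemctl show | grep NOFILE
    systemctl --user show | grep NOFILE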
Note the soft limit of 1024. After lots of troubleshooting, I believe it's that limit which is causing my jobs to fail. I'm testing Packer templates, and that tool is written in Go. Every step launches a new process, and because my templates have gotten rather large, this results in several hundred processes, all of which use unix sockets to communicate with each other. The result is thousands of open file handles. I believe that limit is what is causing the templates to fail, since 1024 is far below what is needed.
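A quick way to see this kind of descriptor pressure while a build is running is to count the open file descriptors of the processes involved. This is just an illustrative check, not something from the original report; the packer process name is the only assumption.

    # Count open descriptors for every running packer process
    for pid in $(pgrep packer); do
      echo "$pid: $(ls /proc/$pid/fd | wc -l) open descriptors"
    done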
Platforms affected
Virtual environments affected
Image version and build link
Environment: ubuntu-20.04
Version: 20220717.1
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220717.1/images/linux/Ubuntu2004-Readme.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220717.1
Is it regression?
Yes. 20220717.1
Expected behavior
My GitHub actions should pass!
Actual behavior
My GitHub actions now fail!
Repro steps
This example would need to check out the lavabit/robox repository.