CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953

ladar · 2022-07-23T11:57:19Z

Description

My GitHub actions started failing on 7/22/22... after lots of troubleshooting, I realized a new virtual environment was pushed between my successful run on the 21st and the failures starting on the 22nd.

I can't check what the limits were on the previous Ubuntu image, but for the current one, the build steps are launching with systemd limits of:

DefaultLimitNOFILE=65536
DefaultLimitNOFILESoft=1024

Note the soft limit of 1024. After lots of troubleshooting, I believe it's that limit which is causing my jobs to fail. I'm testing Packer templates, and that tool uses Go. Every step launches a new process. Because my templates have gotten rather larger, this results in several hundred processes, all of which use unix sockets ot communicate with each other. The result is thousands of open file handles. I believe the limiter is what is causing the templates to fail, since 1024 is far below what is needed.

Platforms affected

Azure DevOps
GitHub Actions

Virtual environments affected

Image version and build link

Environment: ubuntu-20.04
Version: 20220717.1
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220717.1/images/linux/Ubuntu2004-Readme.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220717.1

Is it regression?

Yes. 20220717.1

Expected behavior

My GitHub actions should pass!

Actual behavior

My GitHub actions now fail!

Repro steps

This example would need to checkout the lavabit/robox repository.

name: Robox Validate

jobs:
  Build:
    runs-on: ubuntu-20.04
    env:
        LANG: en_US.UTF-8
        LANGUAGE: en_US:en
        LC_ALL: en_US.UTF-8
    steps:
    - uses: actions/checkout@master
    - name: Install Dependencies
      env: 
        DEBIAN_FRONTEND: noninteractive
        DEBCONF_NONINTERACTIVE_SEEN: true
      run: |
        curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
        sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
        sudo apt-get update
        sudo apt-get --assume-yes install packer
    - name: Validate Generic Box Configurations
      env:
        GOGC: 50
        PACKER_LOG: 1
        GOMAXPROCS: 1
        VERSION: 1.0.0
      run: |
        date +"%nStarting generic box validation at %r on %x%n"
        export PACKER_LOG_PATH=generic-docker.txt ; packer validate generic-docker.json &>> packer-validate.txt && printf "File  + generic-docker.json\n" || { printf "File  - generic-docker.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-hyperv.txt ; packer validate generic-hyperv.json &>> packer-validate.txt && printf "File  + generic-hyperv.json\n" || { printf "File  - generic-hyperv.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-libvirt.txt ; packer validate generic-libvirt.json &>> packer-validate.txt && printf "File  + generic-libvirt.json\n" || { printf "File  - generic-libvirt.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-parallels.txt ; packer validate generic-parallels.json &>> packer-validate.txt && printf "File  + generic-parallels.json\n" || { printf "File  - generic-parallels.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-virtualbox.txt ; packer validate generic-virtualbox.json &>> packer-validate.txt && printf "File  + generic-virtualbox.json\n" || { printf "File  - generic-virtualbox.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-vmware.txt ; packer validate generic-vmware.json &>> packer-validate.txt && printf "File  + generic-vmware.json\n" || { printf "File  - generic-vmware.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-libvirt-x32.txt ; packer validate generic-libvirt-x32.json &>> packer-validate.txt && printf "File  + generic-libvirt-x32.json\n" || { printf "File  - generic-libvirt-x32.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-virtualbox-x32.txt ; packer validate generic-virtualbox-x32.json &>> packer-validate.txt && printf "File  + generic-virtualbox-x32.json\n" || { printf "File  - generic-virtualbox-x32.json\n" ; exit 1 ; }
        date +"%nFinished generic box validation at %r on %x%n"

The text was updated successfully, but these errors were encountered:

al-cheb · 2022-07-23T12:53:21Z

Hey @ladar.
You can change this limit:
#4683 (comment)
#3738 (comment)

Default settings:

- name: ulimit
  run: |
      sudo prlimit --pid $$ --nofile=500000:500000
      ulimit -n

ladar · 2022-07-23T14:28:57Z

So I don't know definitively what caused the breakage between the 20220717 and 20220722 images. Nothing in the repo commit commit log looked like a culprit. So the issue could be at a lower level.

I don't know which change fixed the problem, but I wasn't able to get the workflow to run reliability until I increased the swap size from 4096 to 8192. Some of my other changes may have also helped, as they reduced the memory footprint. In particular net.unix.max_dgram_qlen combines with net.core.wmem_default and net.core.wmem_max to limit the backlog for unix sockets. Critically though, the Linux kernel requires the entire memory block (qlen*wmem) to be a contiguous for unix sockets. If the host running the image faces memory contention, it might not be able to find a contiguous block, which can cause the symptoms I saw.

The entire fixes:

- name: Increase Limits
  run: |
    sudo sysctl -q vm.overcommit_ratio=100
    sudo sysctl -q net.unix.max_dgram_qlen=64
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    sudo systemctl daemon-reload
    systemctl --user daemon-reload
- name: Increase Swap
  run: |
    sudo dd if=/dev/zero of=/swap bs=1M count=4096 status=none
    sudo chmod 600 /swap
    sudo mkswap /swap
    sudo swapon /swap

I had to rerun the prlimit increase commands for each build step like so:

- name: Demonstrate Limit Increases for a Build Step
  env:
    GOGC: 50
    GOMAXPROCS: 1
  run: |
    date +"%nStarting a sample/example build step at %r on %x%n"
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000

ladar · 2022-07-29T07:50:17Z

@al-cheb I saw you closed the bug. Were you able to figure out what changed between the 20220717 and 20220721 images?

john-heinnickel · 2023-01-12T21:56:31Z

I may be wrong, but I came across this issue while trying to remember how I helped a co-worker correct a problem with the open files soft limit previously. What I recall was that while the kernel was configured to support a sufficiently large ulimit maximum, the GitHub Runner Action was spawned from a process in some ancestor unit file that was either explicitly setting or implicitly accepting a reduced soft limit. The way these things inherit, you can always opt into a smaller limit, but you cannot raise your limit higher than what you inherit at process creation time.

In order for our own set of sufficiently large ulimit -nfiles to be effective, we had to modify this ancestor unit file to not restrict itself and its children quite as much.

I don't have that running example environment or access to one similar just now, so while I am sure we fixed his by raising the soft ulimit for open files in some ancestor systemd unit file, I cannot seem to recall which one we had to target for GitHub actions to inherit the correctly all the way back to PID=1.

Anyhow, I see this thread is relatively recent so thought I'd share what I had retained in case it is of some help/value.

ladar · 2023-02-23T06:29:21Z

This can be a hard issue to track down, because it runs contrary to how nix systems historically handled the open file limit (aka a hard max for the kernel and via the ulimit interface). It's also an issue that gets buried inside the morass that is systemd, which is still an opaque MacGuffin to the average system administrator.

The quickest fix is:

sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/user.conf
printf "DefaultLimitNOFILE=65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null

sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/system.conf
printf "DefaultLimitNOFILE=65536:524288\n" | sudo tee -a /etc/systemd/system.conf > /dev/null

This will increase the default limits for everything, and save you lots of grief. Of course this assumes your not on a system shared by many, that is resource constricted. If that is the case, you'll need a more granular approach.

To see where the limits are, run this:

 echo 'user'; systemctl --user show | grep NOFILE ; echo 'system' ;sudo sudo systemctl show  | grep NOFILE

And don't forget, the kernel maximum, and the ulimit values still apply, so update those as well, if necessary.

ladar added bug report needs triage labels Jul 23, 2022

al-cheb added OS: Ubuntu Area: Image administration investigate Collect additional information, like space on disk, other tool incompatibilities etc. and removed needs triage labels Jul 23, 2022

al-cheb closed this as completed Jul 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953

CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953

ladar commented Jul 23, 2022

al-cheb commented Jul 23, 2022

ladar commented Jul 23, 2022

ladar commented Jul 29, 2022

john-heinnickel commented Jan 12, 2023

ladar commented Feb 23, 2023

CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953

CRITICAL BUG // DefaultLimitNOFILESoft=1024 #5953

Comments

ladar commented Jul 23, 2022

Description

Platforms affected

Virtual environments affected

Image version and build link

Is it regression?

Expected behavior

Actual behavior

Repro steps

al-cheb commented Jul 23, 2022

ladar commented Jul 23, 2022

ladar commented Jul 29, 2022

john-heinnickel commented Jan 12, 2023

ladar commented Feb 23, 2023