[AIX] Fix hangs during testing #137967

Merged on Mar 11, 2025 (1 commit)

Conversation

mustartt
Contributor

@mustartt mustartt commented Mar 3, 2025

Fixes all current test hangs experienced during CI runs.

  1. IPv6 link-local (the loopback device) gets assigned an automatic zone ID of 1, causing the assert to fail and the test to hang in library/std/src/net/udp/tests.rs
  2. Const alloc does not fail gracefully
  3. The debuginfo test has a problem with the gdb auto-load safe path

@rustbot
Collaborator

rustbot commented Mar 3, 2025

r? @ChrisDenton

rustbot has assigned @ChrisDenton.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the S-waiting-on-review, T-compiler, and T-libs labels Mar 3, 2025
Contributor

@daltenty daltenty left a comment

LGTM from the AIX perspective. These test cases hang the test run indefinitely at the moment, so this unblocks regular runs.

@@ -2,6 +2,7 @@
// on 32bit and 16bit platforms it is plausible that the maximum allocation size will succeed
// FIXME (#135952) In some cases on AArch64 Linux the diagnostic does not trigger
//@ ignore-aarch64-unknown-linux-gnu
//@ ignore-aix: FIXME(#137966)
Member

@workingjubilee workingjubilee Mar 3, 2025

If the system behaves badly on large allocations, then there is nothing to fix here.

Suggested change
- //@ ignore-aix: FIXME(#137966)
+ //@ ignore-aix: alloc failure on AIX can result in SIGKILL instead of nullptr

@workingjubilee
Member

@daltenty Do you have any idea why it sometimes hangs and sometimes SIGKILLs?

@mustartt
Copy link
Contributor Author

mustartt commented Mar 4, 2025

> @daltenty Do you have any idea why it sometimes hangs and sometimes SIGKILLs?

It's not exactly a "hang". The mmap call and zero-initializing the mapped region take quite a while on our dev machines, so the test either times out our CI or gets SIGKILL'd after a very long time.

@workingjubilee
Member

...Is the problem that you literally have 128TiB of RAM?

@workingjubilee
Member

workingjubilee commented Mar 4, 2025

Hm, wait... laziness in paging due to overcommit, resulting in the system accepting an allocation that can't possibly be respected if called but assuming that no one will actually call that bluff?

@@ -2,6 +2,9 @@
// on 32bit and 16bit platforms it is plausible that the maximum allocation size will succeed
// FIXME (#135952) In some cases on AArch64 Linux the diagnostic does not trigger
//@ ignore-aarch64-unknown-linux-gnu
// AIX will allow allow the allocation to go through, and get SIGKILL when zero initializing
Member

Suggested change
- // AIX will allow allow the allocation to go through, and get SIGKILL when zero initializing
+ // AIX will allow the allocation to go through, and get SIGKILL when zero initializing

@@ -2,6 +2,9 @@
// on 32bit and 16bit platforms it is plausible that the maximum allocation size will succeed
// FIXME (#135952) In some cases on AArch64 Linux the diagnostic does not trigger
//@ ignore-aarch64-unknown-linux-gnu
// AIX will allow allow the allocation to go through, and get SIGKILL when zero initializing
// the overcommited page.
Member

Suggested change
- // the overcommited page.
+ // the overcommitted page.

Comment on lines 8 to 9
// AIX will allow allow the allocation to go through, and get SIGKILL when zero initializing
// the overcommited page.
Member

Suggested change
- // AIX will allow allow the allocation to go through, and get SIGKILL when zero initializing
- // the overcommited page.
+ // AIX will allow the allocation to go through, and get SIGKILL when zero initializing
+ // the overcommitted page.

@workingjubilee
Member

address nits, squash, and then r=me

@mustartt mustartt force-pushed the fix-aix-test-hangs branch from bcf78ad to 2a7ad95 Compare March 4, 2025 15:07
@mustartt
Contributor Author

mustartt commented Mar 4, 2025

Yes, the SIGKILLs come from lazy paging and the OS overcommitting.

Addressed nit and squashed.
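
As a rough runtime analogy for the distinction discussed in this thread (this is not code from the PR, and the helper name `try_reserve_bytes` is made up for illustration), the sketch below shows an allocation request that fails gracefully up front. Under overcommit, a large but representable request may instead be accepted and only fail once the pages are touched, which is why the sketch never writes to a large reservation.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: tries to reserve `bytes` of capacity without
/// writing to the memory, and reports the capacity obtained.
fn try_reserve_bytes(bytes: usize) -> Result<usize, TryReserveError> {
    let mut buf: Vec<u8> = Vec::new();
    // `try_reserve_exact` returns an error instead of aborting the process.
    buf.try_reserve_exact(bytes)?;
    Ok(buf.capacity())
}

fn main() {
    // A request above `isize::MAX` bytes is rejected as a capacity overflow,
    // so this failure is graceful on every platform.
    assert!(try_reserve_bytes(usize::MAX).is_err());

    // A small request succeeds. A huge request below `isize::MAX` is left out
    // on purpose: with overcommit the OS may accept it and defer the failure
    // to the first write, which is the behaviour described in this thread.
    let cap = try_reserve_bytes(4096).expect("small reservation should succeed");
    assert!(cap >= 4096);
}
```

The point of the sketch is only that an error returned at request time can be handled, while a failure deferred to page-touch time cannot.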

@mustartt mustartt requested a review from workingjubilee March 4, 2025 16:38
Member

Hm, wait... what are the actual mismatches on the scope ID, exactly?

They have the same scope, but a different zone index? Because it's assigned the 0 ZoneID...? But we create the IP with a scope_id of 0...

Contributor Author

@mustartt mustartt Mar 5, 2025

I think I included the wrong scope id in the post. The peer we get back contains a scope id of 1 for the loopback.
(edited the PR description to reflect the correct scope id)

(gdb) list
41              });
42      
43              let server = t!(UdpSocket::bind(&server_ip));
44              tx1.send(()).unwrap();
45              let mut buf = [0];
46              let (nread, src) = t!(server.recv_from(&mut buf));
47              assert_eq!(nread, 1);
48              assert_eq!(buf[0], 99);
49              assert_eq!(compare_ignore_zoneid(&src, &client_ip), true);
50              rx2.recv().unwrap();
(gdb) p src
$2 = core::net::socket_addr::SocketAddr::V6(core::net::socket_addr::SocketAddrV6 {ip: core::net::ip_addr::Ipv6Addr {octets: [0 <repeats 15 times>, 1]}, port: 19603, flowinfo: 0, scope_id: 1})
(gdb) p client_ip
$3 = core::net::socket_addr::SocketAddr::V6(core::net::socket_addr::SocketAddrV6 {ip: core::net::ip_addr::Ipv6Addr {octets: [0 <repeats 15 times>, 1]}, port: 19603, flowinfo: 0, scope_id: 0})

bash-5.2$ cat /etc/hosts | grep ::1
::1                     loopback localhost      # IPv6 loopback (lo0) name/address
bash-5.2$ ibm-clang++_r test.cpp -o test
bash-5.2$ cat test.cpp
#include <net/if.h>
#include <iostream>
#include <sysexits.h>

int main(void)
{
        auto scope_id = if_nametoindex("lo0");
        std::cout << scope_id << std::endl;
        return EX_OK;
}
bash-5.2$ ./test 
1
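
The mismatch in the gdb output above can be reproduced in isolation. The following is a minimal sketch, assuming a comparison that ignores only the `scope_id` field; `same_addr_ignoring_scope_id` is a hypothetical stand-in and not the actual `compare_ignore_zoneid` helper from the test file, whose body is not shown in this thread.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

/// Hypothetical stand-in for the test helper: compares two socket addresses
/// while ignoring the IPv6 scope (zone) id.
fn same_addr_ignoring_scope_id(a: &SocketAddr, b: &SocketAddr) -> bool {
    match (a, b) {
        (SocketAddr::V6(a), SocketAddr::V6(b)) => {
            a.ip() == b.ip() && a.port() == b.port() && a.flowinfo() == b.flowinfo()
        }
        // IPv4 addresses, and mixed pairs, fall back to plain equality.
        _ => a == b,
    }
}

fn main() {
    // Values taken from the gdb session: same address and port,
    // but the peer reported by recv_from carries scope_id 1.
    let client_ip = SocketAddr::V6(SocketAddrV6::new(Ipv6Addr::LOCALHOST, 19603, 0, 0));
    let src = SocketAddr::V6(SocketAddrV6::new(Ipv6Addr::LOCALHOST, 19603, 0, 1));

    // Plain equality includes scope_id, so the two addresses differ.
    assert_ne!(src, client_ip);
    // Ignoring scope_id, they refer to the same endpoint.
    assert!(same_addr_ignoring_scope_id(&src, &client_ip));
}
```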

Member

Aha.

We probably should be creating these with a scope ID of 1 for the loopback.

Member

For now this change is fine though.

@mustartt
Contributor Author

Fixed nits
@bors r=workingjubilee

@bors
Contributor

bors commented Mar 10, 2025

@mustartt: 🔑 Insufficient privileges: Not in reviewers

@workingjubilee
Member

Oh, sorry

@bors r+ rollup

@bors
Contributor

bors commented Mar 11, 2025

📌 Commit 2a7ad95 has been approved by workingjubilee

It is now in the queue for this repository.

@bors bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Mar 11, 2025
jieyouxu added a commit to jieyouxu/rust that referenced this pull request Mar 11, 2025
…kingjubilee
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Rollup of 18 pull requests

Successful merges:

 - rust-lang#126856 (remove deprecated tool `rls`)
 - rust-lang#137314 (change definitely unproductive cycles to error)
 - rust-lang#137504 (Move methods from Map to TyCtxt, part 4.)
 - rust-lang#137701 (Convert `ShardedHashMap` to use `hashbrown::HashTable`)
 - rust-lang#137967 ([AIX] Fix hangs during testing)
 - rust-lang#138002 (Disable CFI for weakly linked syscalls)
 - rust-lang#138052 (strip `-Wlinker-messages` wrappers from `rust-lld` rmake test)
 - rust-lang#138063 (Improve `-Zunpretty=hir` for parsed attrs)
 - rust-lang#138109 (make precise capturing args in rustdoc Json typed)
 - rust-lang#138147 (Add maintainers for powerpc64le-unknown-linux-gnu)
 - rust-lang#138245 (stabilize `ci_rustc_if_unchanged_logic` test for local environments)
 - rust-lang#138296 (Remove `AdtFlags::IS_ANONYMOUS` and `Copy`/`Clone` condition for anonymous ADT)
 - rust-lang#138300 (add tracking issue for unqualified_local_imports)
 - rust-lang#138307 (Allow specifying glob patterns for try jobs)
 - rust-lang#138313 (Update books)
 - rust-lang#138315 (use next_back() instead of last() on DoubleEndedIterator)
 - rust-lang#138318 (Rustdoc: remove a bunch of `@ts-expect-error` from main.js)
 - rust-lang#138330 (Remove unnecessary `[lints.rust]` sections.)

Failed merges:

 - rust-lang#137147 (Add exclude to config.toml)

r? `@ghost`
`@rustbot` modify labels: rollup
Kobzol added a commit to Kobzol/rust that referenced this pull request Mar 11, 2025
…kingjubilee
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Rollup of 11 pull requests

Successful merges:

 - rust-lang#135987 (Clarify iterator by_ref docs)
 - rust-lang#137967 ([AIX] Fix hangs during testing)
 - rust-lang#138063 (Improve `-Zunpretty=hir` for parsed attrs)
 - rust-lang#138147 (Add maintainers for powerpc64le-unknown-linux-gnu)
 - rust-lang#138288 (Document -Z crate-attr)
 - rust-lang#138300 (add tracking issue for unqualified_local_imports)
 - rust-lang#138307 (Allow specifying glob patterns for try jobs)
 - rust-lang#138315 (use next_back() instead of last() on DoubleEndedIterator)
 - rust-lang#138330 (Remove unnecessary `[lints.rust]` sections.)
 - rust-lang#138335 (Fix post-merge workflow)
 - rust-lang#138343 (Enable `f16` tests for `powf`)

r? `@ghost`
`@rustbot` modify labels: rollup
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Rollup of 10 pull requests

Successful merges:

 - rust-lang#135987 (Clarify iterator by_ref docs)
 - rust-lang#137967 ([AIX] Fix hangs during testing)
 - rust-lang#138063 (Improve `-Zunpretty=hir` for parsed attrs)
 - rust-lang#138147 (Add maintainers for powerpc64le-unknown-linux-gnu)
 - rust-lang#138288 (Document -Z crate-attr)
 - rust-lang#138300 (add tracking issue for unqualified_local_imports)
 - rust-lang#138307 (Allow specifying glob patterns for try jobs)
 - rust-lang#138315 (use next_back() instead of last() on DoubleEndedIterator)
 - rust-lang#138330 (Remove unnecessary `[lints.rust]` sections.)
 - rust-lang#138335 (Fix post-merge workflow)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 95d9ade into rust-lang:master Mar 11, 2025
6 checks passed
@rustbot rustbot added this to the 1.87.0 milestone Mar 11, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 11, 2025
Rollup merge of rust-lang#137967 - mustartt:fix-aix-test-hangs, r=workingjubilee