Problems with CITGM job on AIX #1798
machine 3 has the least disk usage of any of the three test machines... however, it also has the smallest ramdisk of any of the three machines. I recommend not running citgm on that machine.
The second thing is a (non-fatal) bug with the citgm job. It's calling |
The GNU assumption looks like it's in the "Post build task":
I tried to figure out how machines are selected. It looks like all machines with the … @targos do you have permissions to fix those 2 things given that information? If not, I'll do it if someone more familiar with citgm than me gives a thumbs up. Aside: am I the only one who finds that attempting to scroll while viewing a Jenkins job configuration page in Chrome causes Chrome to max out the CPU and freeze for a minute or so? |
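The exact command in the post-build task is elided above, so the following is only a hypothetical illustration of the kind of GNU-only xargs assumption that bites on AIX (GNU xargs has a `-r`/`--no-run-if-empty` extension that AIX's xargs does not):

```sh
# Hypothetical example -- not the actual command from the job's post-build task.
# GNU-only: -r suppresses running the command when stdin is empty; AIX xargs has no -r.
find "$WORKSPACE" -name 'core*' -print | xargs -r rm -f

# Portable alternative: let find invoke rm itself, which also handles the empty case.
find "$WORKSPACE" -name 'core*' -exec rm -f {} +
```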
No, I don't have permission to modify citgm jobs. |
@richardlau WDYT? Should I make one or both of the config changes I suggest above? |
We use labels instead of node names generally for Jenkins jobs as it allows us to add/remove nodes and things still work (provided the correct labels are applied). It also allows for redundancy. If we limit citgm to just one AIX node in Jenkins then things are going to queue up. Things like the version selector groovy script and select compiler scripts are also based on labels -- I think they might still work with the node names but we'd need to double check. I'm okay with removing the |
The problem with the labels is that they currently label nodes that can do Node.js builds but don't have sufficient ramdisk space for citgm builds. So, do we need a citgm-specific label? Or do we need to increase the ramdisk size for test-osuosl-aix61-ppc64_be-3? Assuming that is even possible. I don't know if it would require more physical RAM installed. |
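As an aside, two stock AIX commands should answer the "would it need more physical RAM?" question (a small sketch; the mount point is the one discussed later in this thread):

```sh
# physical memory of the LPAR, in KB
lsattr -El sys0 -a realmem

# size and usage of the ramdisk-backed filesystem, in GB blocks
df -g /home/iojs/build
```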
Hm. I take it back, I can edit the script box with the xargs call, but there is no save button. Maybe I don't have permissions to change this? |
And, fwiw, I looked at the disk space usage, and as far as I can tell it's all reasonable; there are no huge temp files or anything. Mostly it's the remains of node builds in /home/iojs/build//workspace/node-test-commit-aix, and while |
¯\_(ツ)_/¯ I'll defer to the Build WG (@nodejs/build) as to whether we should introduce labels for things like CITGM. For this specific issue, is the smaller ramdisk actually a problem? Issue 1 in the OP is a failure to git clone into the workspace which, AFAIK, isn't on the ramdisk? (The ramdisk is used by citgm as the temporary working space for module install/tests.) I've removed the non-standard arguments to |
@richardlau Could you be talking about a different ramdisk? See #1798 (comment), |
@mhdawson Opinions? Do you know if it's possible to increase the size of /home/iojs/build (aka /dev/ramdisk1)? |
ah sorry, yes, I was referring to |
I was not in the loop on setting up the ram disks. It was Gibson/George. If the machine has the same amount of memory as the others then we should be able to... How it is set up should be in the manual part of the instructions. |
It seems no CITGM run completed properly on AIX recently. There seem to be multiple issues. In the most recent run, all tests time out https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1854/nodes=aix61-ppc64/. |
In between all the timeouts there might be one hint:
Does anyone know if there was a change in tar semantics in CITGM/npm/yarn? |
@refack that is likely nodejs/citgm#688 and only about the specific module. |
There's no change. The |
With respect to the yarn issues, I think we need to just exclude the modules that need yarn on AIX until we can get 7.1 systems in place. Open to better ideas if anybody has any. @sam-github ? |
Skipping sounds reasonable to me. |
The machines all have the same RAM available, so they should all have the same RAMDISK size. Once I get back from summit, I'll rebuild the ramdisks to be the same size, which should make them work for citgm-smoker. |
A snippet I cobbled up from old …:

```sh
mkramdisk 10000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk0
mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /ramdisk0
chown iojs:staff /ramdisk0/
mkramdisk 23000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk1
mount -V jfs2 -o log=/dev/ramdisk1 /dev/ramdisk1 /home/iojs/build
chown iojs:staff /home/iojs/build
```
But if we're doing a re-think, seems like they could both be part of a single 30_000_000 block ramdisk... |
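A rough sketch of what that single-ramdisk layout might look like, reusing the commands from the snippet above (the size and the location of the citgm temp space are assumptions; nothing here has been applied anywhere):

```sh
# one 30 000 000-block ramdisk instead of the 10M + 23M pair (sketch only)
mkramdisk 30000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk0
mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /home/iojs/build
chown iojs:staff /home/iojs/build
# the old /ramdisk0 temp area could then live under it, e.g. as a subdirectory
mkdir -p /home/iojs/build/tmp && chown iojs:staff /home/iojs/build/tmp
```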
The 3 AIX nodes all have different setups; I just changed -3 to be the same as -2. For the record, the initial state and what I did are below. I'll now re-enable the node and see how it goes.
@devsnek is (inadvertently) the first tester of this node, though I'm confused: it looks to me like his job started before I brought the node back online. Anyhow, I'll watch https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/builds for a while, and https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/23922/ specifically. @targos If you'd like to run a citgm-smoker, we could see if this has resolved its ramdisk size issues. Once things are stable for a few days, I will reorganize the -1 machine's ramdisks as well, so all machines will have the same setup. |
citgm-smoker run: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1881/ |
OK, not going well. Examined the last dozen failures in https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/builds, seeing a lot of:
Maybe lack of space is causing linker failures. Will go check. |
The ramdisks are no longer mounted!? Will take offline and try again. |
Remounted them. No idea what is going on here, but I will reenable machine and watch it closely. |
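(For anyone following along: re-mounting here presumably comes down to the mount lines from the earlier snippet, assuming the ramdisk devices themselves still exist; after a reboot they would have to be recreated with mkramdisk/mkfs first.)

```sh
# re-mount the existing ramdisk devices at their usual locations
mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /ramdisk0
mount -V jfs2 -o log=/dev/ramdisk1 /dev/ramdisk1 /home/iojs/build
df -g /ramdisk0 /home/iojs/build    # confirm both are back
```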
There have been a couple of green builds since, the ramdisks are still mounted, and I kicked off a citgm-smoker. https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1884/ I'm absolutely baffled as to how the ramdisks became unmounted last time. Will keep watching. |
I think node 3 is OK, but the citgm-smoker job is not doing so well. From https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1884/nodes=aix61-ppc64/console
This is node 1; it has a smaller ramdisk for /home/iojs/build than the other two. I'll up it. I'll also look at the config scripts,
I mirrored it locally; it has an 81 MB pack file. Not sure if that's an issue. |
You can try setting |
@richardlau yes, that fixed the out-of-memory error |
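The exact setting suggested above is elided, so purely as an illustration, the standard git knobs for capping memory use on low-RAM machines with large pack files look like this:

```sh
# Illustration only -- not necessarily the setting used here.
# These limit how much memory git uses when handling/repacking large packs.
git config --global pack.windowMemory  "256m"
git config --global pack.packSizeLimit "256m"
git config --global pack.threads      "1"
```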
Should I add that to … ? |
I'm updating the ramdisk config for test-osuosl-aix61-ppc64_be-1, taking it offline. |
Bringing online, will kick some jobs off: https://ci.nodejs.org/job/citgm-smoker/1885/nodes=aix61-ppc64/ |
If we're seeing out of memory errors using |
From #node-build:
03:51:42 | <Trott> | Consistent failures on AIX which seem to be build related. Anyone AIX-y around to look at it?
03:52:18 | <Trott> | https://www.irccloud.com/pastebin/jYsdwJJj/
03:52:34 | <Trott> | That's on test-osuosl-aix61-ppc64_be-3.
03:53:04 | <Trott> | But it seems to be happening on other hosts too, like be-1.
03:59:44 | <Trott> | I tried restarting the Jenkins client with `systemctl restart jenkins` on test-softlayer-ubuntu1604-x64-1.
04:00:23 | <Trott> | Re-running a CI now to see if that fixes it or not...
04:00:38 | <Trott> | Ugh. Wait, wrong host....sheesh....
04:01:39 | <Trott> | I was wondering why systemctl was working on AIX... <facepalm>
04:02:35 | <Trott> | All right, I'm going to reboot the two AIX hosts and hope for the best. |
I fell asleep, woke up, checked things...
@joyeecheung You took https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-2/ offline, why? There is no build history anymore (maybe because it's offline?), so I can't look to see what the problems may have been. @Trott Which hosts did you reboot? Without knowing what is wrong, I'm not sure what to fix. If I don't hear back, I'll just re-enable the host and watch it for problems. |
test-osuosl-aix61-ppc64_be-3 and test-osuosl-aix61-ppc64_be-1 |
Makes sense. Those are the hosts that lost their ramdisks. I'll set them up again, and figure out how to make them get set up on restart. Also, we've been told AIX is the kind of big old Unix that doesn't like to be restarted: it should run for years or decades, so let's not restart them. I'll add to the login message to remind us about that. |
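One way to make them get set up on restart would be a boot-time script; the rc2.d location and script name below are assumptions, not something that has actually been deployed:

```sh
#!/bin/sh
# /etc/rc.d/rc2.d/S99ramdisks -- hypothetical boot script (name and location assumed).
# Recreates and mounts the ramdisks from the snippet earlier in this thread; ramdisks
# do not survive a reboot, so mkramdisk/mkfs must be run again each time.
mkramdisk 10000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk0
mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /ramdisk0
chown iojs:staff /ramdisk0
mkramdisk 23000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk1
mount -V jfs2 -o log=/dev/ramdisk1 /dev/ramdisk1 /home/iojs/build
chown iojs:staff /home/iojs/build
```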
FWIW, I don't usually read the login message carefully, but I do check https://github.com/nodejs/build/blob/e389bafd235baca950e356ba35e9509f540f4aca/doc/jenkins-guide.md#restart-the-machine when I go to restart a machine, so maybe adding a note there might be a good idea? It's probably a good idea to add lots of AIX info there. For example, I don't think the documented suggestions for how to restart the Jenkins agent (the section directly above the one I linked to) apply to AIX. |
For that matter, that content might be better moved to a TROUBLESHOOTING.md doc. I always have to search and then be surprised that the info is in jenkins-guide.md. And yes, I can probably do some of these changes myself. 😀 |
https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-1/ is back online with ramdisks |
https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ is back online with ramdisks |
@sam-github Nice timing! We're probably about to get a whole bunch of CI runs from the NodeConf Colombia Code & Learn event. |
Ouph. On the plus side, I'll have lots of CI jobs to check for progress. |
I had looked at the passing/failing jobs on be-3: the step that was timing out ran for just over 9 mins in the passing job, and the failing one timed out at 10. This would seem to suggest the problem was just the job taking longer than normal. Since this could be explained by the lack of ramdisks, I'm optimistic that adding the ramdisks back in will address the timeouts. |
On the last run on be-1, the step that was timing out only took seconds... |
Build on be-3: if this one is fast/does not time out, I think we can conclude the ramdisks were the cause of the timeouts: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/23992/ |
OK, so the test run I started on be-3 did not hit the persistent timeout, so I think the reboot/lack of ramdisks was the cause of most of the recent red. It did, however, hit the intermittent tty issue. Depending on how frequent that failure is, we might want to consider marking it as flaky while we investigate. |
I kicked off a couple more aix-ppc only builds. I think this is fixed: https://ci.nodejs.org/job/citgm-smoker/1891/nodes=aix61-ppc64/ ran without disk failures |
Run: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1848/nodes=aix61-ppc64/console
Issue 1:
Issue 2: