Problems with CITGM job on AIX #1798

Closed
targos opened this issue May 14, 2019 · 52 comments

Comments

@targos
Member

targos commented May 14, 2019

Run: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1848/nodes=aix61-ppc64/console

Issue 1:

Receiving objects:  58% (293725/506422), 311.83 MiB | 15.06 MiB/s   
fatal: write error: There is not enough space in the file system.

Issue 2:

01:02:52 + ps awwx
01:02:52 + grep citgm
01:02:52 + grep -v grep
01:02:52 + awk '{print $1}'
10:02:52 + xargs -rl kill -9
10:02:52 xargs: The -r flag is not valid.
10:02:52 Usage: xargs [-ptx] [-e[EndOfFileString]] [-E EndOfFileString]
10:02:52         [-i[ReplacementString]] [-I ReplacementString | -L Number |-n Number ]
10:02:52        [-l[Number]] [-s Size] [Command [Argument ...]]
@sam-github
Contributor

sam-github commented May 14, 2019

Machine 3 has the least disk usage of any of the three test machines. However, it also has the smallest ramdisk of the three. I recommend not running citgm on that machine.

% parallel-ssh -ih hosts.test-osuosl-aix61-ppc64_be du -sk /home/iojs/build/workspace                                                        
[1] 12:26:23 [SUCCESS] test-osuosl-aix61-ppc64_be-3
2032864 /home/iojs/build/workspace
[3] 12:26:30 [SUCCESS] test-osuosl-aix61-ppc64_be-2
4736800 /home/iojs/build/workspace
[2] 12:26:26 [SUCCESS] test-osuosl-aix61-ppc64_be-1
2696936 /home/iojs/build/workspace
% parallel-ssh -ih hosts.test-osuosl-aix61-ppc64_be "df|grep iojs"                                                                           
[1] 12:27:05 [SUCCESS] test-osuosl-aix61-ppc64_be-3
/dev/ramdisk1   15000000    584008   97%   162889    18% /home/iojs/build
[2] 12:27:05 [SUCCESS] test-osuosl-aix61-ppc64_be-2
/dev/ramdisk0   23000000   9157912   61%   470210    21% /home/iojs/build
[3] 12:27:05 [SUCCESS] test-osuosl-aix61-ppc64_be-1
/dev/ramdisk0   20000000  11570360   43%   204105    10% /home/iojs/build
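A job could refuse to start when the build filesystem is nearly full, instead of failing partway through a clone. The following is a minimal sketch of such a pre-flight check. The helper names `free_kb` and `has_space` and the threshold in the example are hypothetical and are not part of the actual job configuration.

```shell
#!/bin/sh
# free_kb <path>: print the free space, in KiB, of the filesystem holding <path>.
# Uses POSIX `df -P -k`; the second output line carries the available space in column 4.
free_kb() {
  df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# has_space <path> <min_kb>: succeed only if at least <min_kb> KiB are free.
has_space() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

# Example (the threshold is illustrative only, not a measured requirement):
# has_space /home/iojs/build 2000000 || { echo "not enough space" >&2; exit 1; }
```

Whether such a check belongs in the job or in node selection is a separate question; this only shows the measurement.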

@sam-github
Contributor

The second thing is a (non-fatal) bug with the citgm job. It's calling xargs with the non-standard -r option, a GNU extension. That won't work on non-GNU systems, like AIX or OS X. Without -r, the only difference is that when there are no citgm processes, kill -9 runs with no arguments and fails with an error. The -r should probably just be removed from that script.
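If the "do nothing when there is no input" behaviour of -r is still wanted, it can be approximated without GNU xargs. This is a minimal sketch using only POSIX shell features; the function name `kill_if_any` is hypothetical and is not what the job uses.

```shell
#!/bin/sh
# kill_if_any: read PIDs from stdin and kill them, doing nothing if there are none.
# This stands in for GNU `xargs -r kill -9` without relying on the -r extension.
kill_if_any() {
  pids=$(cat)
  if [ -n "$pids" ]; then
    # Unquoted on purpose so that each PID becomes a separate argument.
    kill -9 $pids
  fi
}

# Intended use in the post-build step (shown as a comment, not executed here):
# ps awwx | grep citgm | grep -v grep | awk '{print $1}' | kill_if_any || true
```

Simply dropping -r, as suggested above, is the smaller change; this variant only avoids the harmless error from an argument-less kill.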

@sam-github
Contributor

The GNU assumption looks like it's in the "Post build task":

#!/bin/bash -x

ps awwx | grep citgm | grep -v grep | awk '{print $1}' | xargs -rl kill -9 || true
rm -Rf node || true

I tried to figure out how machines are selected. It looks like all machines with the aix61-ppc64 label are selected for the build now. I could unselect the label and select the specific AIX node that has the most free disk space, but it would be my first change to a Jenkins config page. I'd like someone else to confirm that is reasonable.

@targos do you have permissions to fix those 2 things given that information? If not, I'll do it if someone more familiar with citgm than me gives a thumbs up.

Aside: am I the only one who finds that scrolling while viewing a Jenkins job configuration page in Chrome causes Chrome to max out the CPU and freeze for a minute or so?

@targos
Member Author

targos commented May 14, 2019

No, I don't have permission to modify citgm jobs.

@sam-github
Contributor

@richardlau WDYT? Should I make one or both of the config changes I suggest above?

@richardlau
Member

We use labels instead of node names generally for Jenkins jobs as it allows us to add/remove nodes and things still work (provided the correct labels are applied). It also allows for redundancy. If we limit citgm to just one AIX node in Jenkins then things are going to queue up. Things like the version selector groovy script and select compiler scripts are also based on labels -- I think they might still work with the node names but we'd need to double check.

I'm okay with removing the -r from xargs.

@sam-github
Contributor

@richardlau

The problem with the labels is they currently label nodes that can do node.js builds, but don't have sufficient ramdisk space for citgm builds.

So, do we need a citgm specific label? Or do we need to increase the ramdisk size for test-osuosl-aix61-ppc64_be-3? Assuming that is even possible. I don't know if it would require more physical RAM installed.

@sam-github
Contributor

Hm. I take it back, I can edit the script box with the xargs call, but there is no save button. Maybe I don't have permissions to change this?

@sam-github
Contributor

And, fwiw, I looked at the disk space usage, and as far as I can tell it's all reasonable; there are no huge temp files or anything. Mostly it's the remains of node builds in /home/iojs/build//workspace/node-test-commit-aix, and while 175668 /home/iojs/build//workspace/citgm-smoker-nobuild is largish, it's only a tenth the size of node-test-commit-aix.
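For anyone repeating this kind of survey, a minimal sketch of ranking workspace directories by size follows. The helper name `largest_dirs` is hypothetical; the sizes are in KiB, as with the `du -sk` figures quoted in this thread.

```shell
#!/bin/sh
# largest_dirs <dir> <n>: list the <n> largest entries directly under <dir>,
# sizes in KiB, largest last.
largest_dirs() {
  du -sk "$1"/* 2>/dev/null | sort -n | tail -n "$2"
}

# Example:
# largest_dirs /home/iojs/build/workspace 5
```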

@richardlau
Member

@richardlau

The problem with the labels is they currently label nodes that can do node.js builds, but don't have sufficient ramdisk space for citgm builds.

So, do we need a citgm specific label? Or do we need to increase the ramdisk size for test-osuosl-aix61-ppc64_be-3? Assuming that is even possible. I don't know if it would require more physical RAM installed.

¯\_(ツ)_/¯ I'll defer to the Build WG (@nodejs/build) as to whether we should introduce labels for things like CITGM.

For this specific issue is the smaller ramdisk actually a problem? Issue 1 in the OP is a failure to git clone into the workspace which, AFAIK, isn't on the ramdisk? (The ramdisk is used by citgm as the temporary working space for module install/tests.)

I've removed the non-standard arguments to xargs from the job.

@sam-github
Contributor

@richardlau Could you be talking about a different ramdisk? See #1798 (comment), /home/iojs/build is on /dev/ramdisk1, according to df.

@sam-github
Contributor

@mhdawson Opinions? Do you know if it's possible to increase the size of /home/iojs/build (aka /dev/ramdisk1)?

@richardlau
Member

@richardlau Could you be talking about a different ramdisk? See #1798 (comment), /home/iojs/build is on /dev/ramdisk1, according to df.

ah sorry, yes, I was referring to /ramdisk0/ (referenced by the job).

@mhdawson
Member

I was not in the loop on setting up the ramdisks; it was Gibson/George. If the machine has the same amount of memory as the others then we should be able to... How it is set up should be in the manual part of the instructions.

@BridgeAR
Member

It seems no CITGM run completed properly on AIX recently. There seem to be multiple issues. In the most recent run, all tests time out https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1854/nodes=aix61-ppc64/.

@refack
Contributor

refack commented May 21, 2019

In between all the timeouts there might be one hint:

09:52:27 warn: pug yarn-install:   | error https://registry.yarnpkg.com/argparse/-/argparse-1.0.10.tgz: Extracting tar content of undefined failed, the file appears to be corrupt: "ENOSYS: function not implemented, futime"

Does anyone know if there was a change in tar semantics in CITGM/npm/yarn?

@BridgeAR
Member

@refack that is likely nodejs/citgm#688 and only about the specific module.

@richardlau
Member

In between all the timeouts there might be one hint:

09:52:27 warn: pug yarn-install:   | error https://registry.yarnpkg.com/argparse/-/argparse-1.0.10.tgz: Extracting tar content of undefined failed, the file appears to be corrupt: "ENOSYS: function not implemented, futime"

Does anyone know if there was a change in tar semantics in CITGM/npm/yarn?

There's no change. The ENOSYS error referenced is, as pointed out, nodejs/citgm#688 and only affects the modules being tested in CITGM that require yarn. The issue is that we're actually running on an unsupported level of AIX (Node.js 8 bumped the minimum supported level of AIX to 7.1) and yarn uses an API (fs.futimes) that only works on AIX 7.1 and later. It should not be affecting the installation of the other modules that do not use yarn.

@mhdawson
Member

With respect to the yarn issues, I think we need to just exclude the modules that need yarn on AIX until we can get 7.1 systems in place. Open to better ideas if anybody has any. @sam-github ?

@sam-github
Contributor

Skipping sounds reasonable to me.

@sam-github
Contributor

core/build (centos7-python-3.7 % u=) % parallel-ssh -i -h hosts/hosts.test-osuosl-aix61-ppc64_be "lsattr -El mem0"                                                           
[1] 11:41:06 [SUCCESS] test-osuosl-aix61-ppc64_be-1
ent_mem_cap          I/O memory entitlement in Kbytes           False
goodsize       32768 Amount of usable physical memory in Mbytes False
mem_exp_factor       Memory expansion factor                    False
size           32768 Total amount of physical memory in Mbytes  False
var_mem_weight       Variable memory capacity weight            False
[2] 11:41:06 [SUCCESS] test-osuosl-aix61-ppc64_be-2
ent_mem_cap          I/O memory entitlement in Kbytes           False
goodsize       32768 Amount of usable physical memory in Mbytes False
mem_exp_factor       Memory expansion factor                    False
size           32768 Total amount of physical memory in Mbytes  False
var_mem_weight       Variable memory capacity weight            False
[3] 11:41:06 [SUCCESS] test-osuosl-aix61-ppc64_be-3
ent_mem_cap          I/O memory entitlement in Kbytes           False
goodsize       32768 Amount of usable physical memory in Mbytes False
mem_exp_factor       Memory expansion factor                    False
size           32768 Total amount of physical memory in Mbytes  False
var_mem_weight       Variable memory capacity weight            False

The machines all have the same RAM available, so they should all have the same ramdisk size. Once I get back from the summit, I'll rebuild the ramdisks to be the same size, which should make them work for citgm-smoker.

@refack
Contributor

refack commented Jun 2, 2019

A snippet I cobbled together from old history files on the three machines:

mkramdisk 10000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk0
mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /ramdisk0
chown iojs:staff /ramdisk0/

mkramdisk 23000000
mkfs -V jfs2 -o log=INLINE /dev/ramdisk1
mount -V jfs2 -o log=/dev/ramdisk1 /dev/ramdisk1 /home/iojs/build
chown iojs:staff /home/iojs/build
  • /ramdisk0 is used by the CITGM job
  • /home/iojs/build is used by all other jobs

But if we're doing a re-think, seems like they could both be part of a single 30_000_000 block ramdisk...
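For a sense of scale, the sizes passed to mkramdisk in the snippet appear to be counts of 512-byte blocks (the df header elsewhere in this thread reads "512-blocks"). A minimal sketch of the conversion to MiB follows, so the figures can be compared with the 32768 MB of RAM reported by lsattr. The helper name `blocks_to_mib` is hypothetical.

```shell
#!/bin/sh
# blocks_to_mib <blocks>: convert a count of 512-byte blocks to whole MiB (rounded down).
# 1 MiB = 1048576 bytes = 2048 blocks of 512 bytes.
blocks_to_mib() {
  echo $(( $1 / 2048 ))
}

blocks_to_mib 10000000   # /ramdisk0 in the snippet: prints 4882
blocks_to_mib 23000000   # /home/iojs/build in the snippet: prints 11230
blocks_to_mib 30000000   # the single combined ramdisk suggested above: prints 14648
```

On these numbers, either layout takes roughly half of the physical memory.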

@sam-github
Contributor

The 3 AIX nodes all have different setups. I just changed -3 to be the same as -2. For the record, the initial state and what I did are below. I'll now reenable the node and see how it goes.

  node       mounted        mounted over    vfs       date        options
-------- ---------------  ---------------  ------ ------------ ---------------
test-osuosl-aix61-ppc64_be-1:
         /dev/ramdisk0    /home/iojs/build jfs2   Feb 18 10:05 rw,log=/dev/ramdisk0

test-osuosl-aix61-ppc64_be-2:
         /dev/ramdisk0    /ramdisk0        jfs2   Jun 02 14:20 rw,log=/dev/ramdisk0
         /dev/ramdisk1    /home/iojs/build jfs2   Jun 02 14:20 rw,log=/dev/ramdisk1

test-osuosl-aix61-ppc64_be-3:
         /dev/ramdisk0    /home/iojs/build jfs2   May 18 10:23 rw,log=/dev/ramdisk0
         /dev/ramdisk1    /ramdisk0        jfs2   May 18 10:25 rw,log=NULL

Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
test-osuosl-aix61-ppc64_be-1:
/dev/ramdisk0   20000000   2675936   87%   385410    28% /home/iojs/build

test-osuosl-aix61-ppc64_be-2:
/dev/ramdisk0   10000000   9699344    4%        5     1% /ramdisk0
/dev/ramdisk1   23000000  17368832   25%   140356     7% /home/iojs/build

test-osuosl-aix61-ppc64_be-3:
/dev/ramdisk0   15000000    267320   99%   285838    42% /home/iojs/build
/dev/ramdisk1    8388608   8012200    5%      225     1% /ramdisk0


ls -ld /ramdisk0                                                                              
test-osuosl-aix61-ppc64_be-1:
drwxr-xr-x    3 iojs     staff           256 Mar 14 18:53 /ramdisk0
test-osuosl-aix61-ppc64_be-2:
drwxr-xr-x    4 iojs     staff           256 Jun 03 07:39 /ramdisk0
test-osuosl-aix61-ppc64_be-3:
drwxr-xr-x    4 iojs     staff           256 May 18 10:28 /ramdisk0


# mv /home/iojs/build/tools /home/iojs/build.tools
# umount /home/iojs/build
# umount /ramdisk0
# rmramdisk rramdisk0
# rmramdisk rramdisk1
# ls -l /dev/ramdisk*
ls: 0653-341 The file /dev/ramdisk* does not exist.
# mkramdisk 10000000
/dev/rramdisk0
# mkfs -V jfs2 -o log=INLINE /dev/ramdisk0
mkfs: destroy /dev/ramdisk0 (yes)? y
logform: Format inline log for  <y>?
File system created successfully.
4979164 kilobytes total disk space.
Device /dev/ramdisk0:
  Standard empty filesystem
  Size:           9958328 512-byte (DEVBLKSIZE) blocks
# mount -V jfs2 -o log=/dev/ramdisk0 /dev/ramdisk0 /ramdisk0
# chown iojs:staff /ramdisk0/
# mkramdisk 23000000
/dev/rramdisk1
# mkfs -V jfs2 -o log=INLINE /dev/ramdisk1
mkfs: destroy /dev/ramdisk1 (yes)?
logform: Format inline log for  <y>?
File system created successfully.
11454388 kilobytes total disk space.
Device /dev/ramdisk1:
  Standard empty filesystem
  Size:           22908776 512-byte (DEVBLKSIZE) blocks
# mount -V jfs2 -o log=/dev/ramdisk1 /dev/ramdisk1 /home/iojs/build
# chown iojs:staff /home/iojs/build
# ls -l /dev/*ramdisk*
brw-------    1 root     system       36,  0 Jun 18 12:13 /dev/ramdisk0
brw-------    1 root     system       36,  1 Jun 18 12:14 /dev/ramdisk1
crw-------    1 root     system       36,  0 Jun 18 12:13 /dev/rramdisk0
crw-------    1 root     system       36,  1 Jun 18 12:14 /dev/rramdisk1
# df | grep ram
/dev/ramdisk0   10000000   9956864    1%        4     1% /ramdisk0
/dev/ramdisk1   23000000  22905728    1%        4     1% /home/iojs/build
# mount | grep ram
         /dev/ramdisk0    /ramdisk0        jfs2   Jun 18 12:13 rw,log=/dev/ramdisk0
         /dev/ramdisk1    /home/iojs/build jfs2   Jun 18 12:14 rw,log=/dev/ramdisk1

@sam-github
Contributor

@devsnek is (inadvertently) the first tester of this node, though I'm confused: it looks to me like his job started before I brought the node back online. Anyhow, I'll watch https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/builds for a while, and https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/23922/ specifically.

@targos If you'd like to run a citgm-smoker, we could see if this has resolved its ramdisk size issues.

Once things are stable for a few days, I will reorganize the -1 machine's ramdisks as well, so all machines will have the same setup.

@targos
Member Author

targos commented Jun 18, 2019

citgm-smoker run: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1881/

@sam-github
Contributor

OK, not going well. Examined the last dozen failures in https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/builds, seeing a lot of:

10:27:00 g++: internal compiler error: Terminated (program collect2)

collect2: error: ld returned 8 exit status

02:40:14 as: There is not enough space in the file system.

Maybe lack of space is causing linker failures. Will go check.

@sam-github
Contributor

The ramdisks are no longer mounted!? Will take offline and try again.

@sam-github
Contributor

bash-4.3# df | grep ram; mount | grep ram
/dev/ramdisk0   10000000   9956864    1%        4     1% /ramdisk0
/dev/ramdisk1   23000000  22905728    1%        4     1% /home/iojs/build
         /dev/ramdisk0    /ramdisk0        jfs2   Jun 19 14:50 rw,log=/dev/ramdisk0
         /dev/ramdisk1    /home/iojs/build jfs2   Jun 19 16:02 rw,log=/dev/ramdisk1

Remounted them. No idea what is going on here, but I will reenable machine and watch it closely.
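Since a missing mount leaves the directory in place on the underlying filesystem, builds keep running and only fail later with space errors. A minimal sketch of a check that could catch this earlier is below. The function names are hypothetical, and the `/dev/ramdisk*` naming is taken from the df output in this thread.

```shell
#!/bin/sh
# device_for <path>: print the device backing the filesystem that holds <path>.
device_for() {
  df -P "$1" | awk 'NR == 2 { print $1 }'
}

# is_ramdisk <device>: succeed if the device name looks like an AIX ramdisk.
is_ramdisk() {
  case "$1" in
    /dev/ramdisk*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_ramdisk_mounts <path>...: warn about every path that is not on a ramdisk.
# Returns non-zero if any path fails the check.
check_ramdisk_mounts() {
  status=0
  for path in "$@"; do
    if ! is_ramdisk "$(device_for "$path")"; then
      echo "WARNING: $path is not on a ramdisk" >&2
      status=1
    fi
  done
  return "$status"
}

# Example:
# check_ramdisk_mounts /ramdisk0 /home/iojs/build
```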

@sam-github
Contributor

Been a couple green builds since, they are still mounted, and I kicked off a citgm-smoker.

https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1884/

Absolutely baffled how the ramdisks became dismounted last time. Will keep watching.

@sam-github
Contributor

I think node 3 is OK, but the citgm-smoker job is not doing so well. From https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1884/nodes=aix61-ppc64/console

21:02:12 + mkdir /ramdisk0/citgm
21:02:12 mkdir: 0653-358 Cannot create /ramdisk0/citgm.
21:02:12 /ramdisk0/citgm: Do not specify an existing file.
21:02:12 + true
21:02:12 + '[' false == true ']'
21:02:12 + eval citgm-all -J '--nodedir=/home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/node -v warn -x /home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/smoker/report.xml -q error --tmpDir /ramdisk0/citgm'
21:02:12 ++ citgm-all -J --nodedir=/home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/node -v warn -x /home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/smoker/report.xml -q error --tmpDir /ramdisk0/citgm
21:03:35 warn: acorn npm-install:  | npm ERR! code       
21:03:35 warn: acorn npm-install:  | 128                                                                                                                                                                                        
21:03:35 warn:                     | npm ERR! Command failed: git clone --mirror -q https://github.com/tc39/test262.git /home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/npm_cache/_cacache/tmp/git-clone-978df336/.git
21:03:35 warn:                     | npm ERR! warning: templates not found /ramdisk0/citgm/282fb81e-b51b-4355-9e6d-8fcb97830558/npm_config_tmp/pacote-git-template-tmp/git-clone-803d9067                                       
21:03:35 warn:                     | npm ERR! fatal: write error: There is not enough space in the file system.                                                                                                                 
21:03:35 warn:                     | npm ERR! fatal: index-pack failed                                                  

This is node 1, it has a smaller ramdisk for /home/iojs/build than the other two. I'll up it.

I will also look at the config scripts: rm -rf '/ramdisk0/citgm/*' looks wrong (it's quoting the glob, so I don't think it will get expanded). Also, I can't reproduce the git clone failure; I get a different one:

$ git clone --mirror -q https://github.com/tc39/test262.git /home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/npm_cache/_cacache/tmp/git-clone-978df336/.git
fatal: Out of memory, malloc failed (tried to allocate 11008943 bytes)
fatal: index-pack failed

I mirrored locally, it has an 81MB pack file. Not sure if that's an issue.
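On the quoted glob in the rm command mentioned above: a glob inside quotes is passed to rm literally, so nothing is removed and -f hides the error. Here is a minimal sketch that shows the difference in a scratch directory, assuming `mktemp -d` is available. It uses double quotes so that `$tmp` expands; globs are not expanded inside either kind of quote.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is deleted.
tmp=$(mktemp -d)
mkdir "$tmp/citgm"
touch "$tmp/citgm/a" "$tmp/citgm/b"

# Quoted: rm is asked to delete a file literally named '*'. Nothing matches,
# and -f suppresses the error, so both files survive.
rm -rf "$tmp/citgm/*"
left_after_quoted=$(ls "$tmp/citgm" | wc -l | tr -d ' ')

# Unquoted glob: the shell expands it to the existing entries before rm runs.
rm -rf "$tmp/citgm"/*
left_after_unquoted=$(ls "$tmp/citgm" | wc -l | tr -d ' ')

echo "after quoted rm: $left_after_quoted, after unquoted rm: $left_after_unquoted"
rm -rf "$tmp"
```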

@richardlau
Member

Also, I can't reproduce the git clone failure; I get a different one:

$ git clone --mirror -q https://github.com/tc39/test262.git /home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64/npm_cache/_cacache/tmp/git-clone-978df336/.git
fatal: Out of memory, malloc failed (tried to allocate 11008943 bytes)
fatal: index-pack failed

I mirrored locally, it has an 81MB pack file. Not sure if that's an issue.

You can try setting export LDR_CNTRL=MAXDATA=0x80000000@DSA to see if that fixes the out of memory.

@sam-github
Contributor

@richardlau yes, that fixed out of memory

@sam-github
Contributor

Should I add that to

if [ `uname -s` = "AIX" ]; then
    temp=/ramdisk0/citgm
    export npm_config_tmp='/ramdisk0/citgm'
#    unset LIBPATH
    echo $LIBPATH
fi

?

@sam-github
Contributor

I'm updating the ramdisk config for test-osuosl-aix61-ppc64_be-1, taking it offline.

@sam-github
Contributor

sam-github commented Jun 20, 2019

Bringing online, will kick some jobs off: https://ci.nodejs.org/job/citgm-smoker/1885/nodes=aix61-ppc64/

@richardlau
Member

Should I add that to

if [ `uname -s` = "AIX" ]; then
    temp=/ramdisk0/citgm
    export npm_config_tmp='/ramdisk0/citgm'
#    unset LIBPATH
    echo $LIBPATH
fi

?

If we're seeing out of memory errors using git (which must be 32-bit if it's affected by the environment variable) on the CI then yes.
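A minimal sketch of what the AIX branch might look like with the variable added. It is abbreviated from the block quoted above and is not the actual job configuration. Exporting the variable at this level affects every process the job starts, and whether that scope is right is an assumption here.

```shell
#!/bin/sh
# AIX-only settings for the job; other platforms are left untouched.
if [ "$(uname -s)" = "AIX" ]; then
  export npm_config_tmp='/ramdisk0/citgm'
  # Value suggested earlier in this thread for the git "Out of memory" failure.
  export LDR_CNTRL=MAXDATA=0x80000000@DSA
fi
```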

@richardlau
Member

From #node-build

03:51:42 | <Trott> | Consistent failures on AIX which seem to be build related. Anyone AIX-y around to look at it?
03:52:18 | <Trott> | https://www.irccloud.com/pastebin/jYsdwJJj/
03:52:34 | <Trott> | That's on test-osuosl-aix61-ppc64_be-3.
03:53:04 | <Trott> | But it seems to be happening on other hosts too, like be-1.
03:59:44 | <Trott> | I tried restarting the Jenkins client with `systemctl restart jenkins` on test-softlayer-ubuntu1604-x64-1.
04:00:23 | <Trott> | Re-running a CI now to see if that fixes it or not...
04:00:38 | <Trott> | Ugh. Wait, wrong host....sheesh....
04:01:39 | <Trott> | I was wondering why systemctl was working on AIX... <facepalm>
04:02:35 | <Trott> | All right, I'm going to reboot the two AIX hosts and hope for the best.

@Trott
Member

Trott commented Jun 21, 2019

From #node-build

03:51:42 | <Trott> | Consistent failures on AIX which seem to be build related. Anyone AIX-y around to look at it?
03:52:18 | <Trott> | https://www.irccloud.com/pastebin/jYsdwJJj/
03:52:34 | <Trott> | That's on test-osuosl-aix61-ppc64_be-3.
03:53:04 | <Trott> | But it seems to be happening on other hosts too, like be-1.
03:59:44 | <Trott> | I tried restarting the Jenkins client with `systemctl restart jenkins` on test-softlayer-ubuntu1604-x64-1.
04:00:23 | <Trott> | Re-running a CI now to see if that fixes it or not...
04:00:38 | <Trott> | Ugh. Wait, wrong host....sheesh....
04:01:39 | <Trott> | I was wondering why systemctl was working on AIX... <facepalm>
04:02:35 | <Trott> | All right, I'm going to reboot the two AIX hosts and hope for the best.

I fell asleep, woke up, checked things...

That fixed it for one run on one machine and zero runs on the other and now the error is different.
I guess this will wait for when mhdawson___ and Sam are awake. Apologies if I made things worse.

@sam-github
Contributor

@joyeecheung You took https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-2/ offline, why? There is no build history anymore (maybe because its offline?), so I can't look to see what the problems may have been.

@Trott Which hosts did you reboot?

Without knowing what is wrong, I'm not sure what to fix. If I don't hear back, I'll just reenable the host and watch it for problems.

@Trott
Member

Trott commented Jun 21, 2019

@Trott Which hosts did you reboot?

test-osuosl-aix61-ppc64_be-3 and test-osuosl-aix61-ppc64_be-1

@sam-github
Contributor

Makes sense. Those are the hosts that lost their ramdisks.

I'll set them up again, and figure out how to make them get set up on restart.

Also, we've been told AIX is the kind of big old Unix that doesn't like to be restarted; it should run for years or decades, so let's not restart them. I'll add to the login message to remind us about that.

@Trott /cc @mhdawson
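A minimal sketch of what a re-creation script for the ramdisks could look like, built from the commands in the transcript earlier in this thread. The helpers `run` and `setup_ramdisk` are hypothetical. The script defaults to a dry run that only prints commands, and it assumes mkramdisk hands out ramdisk0 and ramdisk1 in order after a reboot. In the transcript, mkfs asked for confirmation, so a real run would need to handle that prompt. How the script would be hooked into startup is not shown.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# setup_ramdisk <size in 512-byte blocks> <ramdisk index> <mount point>
# The index is an assumption: it must match the device mkramdisk creates.
setup_ramdisk() {
  size=$1
  idx=$2
  mnt=$3
  run mkramdisk "$size"
  run mkfs -V jfs2 -o log=INLINE "/dev/ramdisk$idx"
  run mount -V jfs2 -o "log=/dev/ramdisk$idx" "/dev/ramdisk$idx" "$mnt"
  run chown iojs:staff "$mnt"
}

# Sizes and mount points as used on the -2 and -3 machines in this thread.
setup_ramdisk 10000000 0 /ramdisk0
setup_ramdisk 23000000 1 /home/iojs/build
```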

@Trott
Member

Trott commented Jun 21, 2019

I'll add to the login message to remind us about that.

FWIW, I don't usually read the login message carefully, but I do check https://github.com/nodejs/build/blob/e389bafd235baca950e356ba35e9509f540f4aca/doc/jenkins-guide.md#restart-the-machine when I go to restart a machine, so maybe adding a note there would be a good idea? It's probably also a good idea to add lots of AIX info there. For example, I don't think the documented suggestions for how to restart the Jenkins agent (the section directly above the one I linked to) apply to AIX.

@Trott
Member

Trott commented Jun 21, 2019

For that matter, that content might be better moved to a TROUBLESHOOTING.md doc. I always have to search for it and am surprised that the info is in jenkins-guide.md.

And yes, I can probably do some of these changes myself. 😀

@sam-github
Contributor

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-1/ is back online with ramdisks

@sam-github
Contributor

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ is back online with ramdisks

@Trott
Member

Trott commented Jun 21, 2019

@sam-github Nice timing! We're probably about to get a whole bunch of CI runs from the NodeConf Colombia Code & Learn event.

@sam-github
Contributor

Oof. On the plus side, I'll have lots of CI jobs to check for progress.

@mhdawson
Member

I had looked at the passing/failing jobs on be-3, and the job that was timing out ran for just over 9 mins and the failing one timed out at 10. This would seem to suggest the problem was just the job taking longer than normal. Since this could be explained by the lack of ramdisks, I'm optimistic that adding the ramdisks back in will address the timeouts.

@mhdawson
Member

Last run on be-1 the step that was timing out only took seconds...

@mhdawson
Member

Build on be-3: if this one is fast and does not time out, I think we can conclude the ramdisks were the cause of the timeouts: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/23992/

@mhdawson
Member

OK, so the test run I started on be-3 did not hit the persistent timeout, so I think the reboot/lack of ramdisks was the cause of most of the recent red. It did, however, hit the intermittent tty issue. Depending on how frequent that failure is, we might want to consider marking it as flaky while we investigate.

@sam-github
Contributor

I kicked off a couple more aix-ppc only builds.

I think this is fixed: https://ci.nodejs.org/job/citgm-smoker/1891/nodes=aix61-ppc64/ ran without disk failures
