test-requireio-osx1010-x64-1 is borked #861
fwiw the lack of a git mirror isn't essential and should just be a quiet fail; it'll slow things down if it's not in the right place, but it should still be fine. I'm still messing with this machine, which has some weird network problems: I can't even update the git mirror or clone a new one. I would try updating Java, but a recent VMware Fusion upgrade on that machine messed up access to the consoles, so I only have black screens and have to use SSH to do everything like everyone else. I'm out for 12+ hours but will look back at this soon. I'm not going to leave this machine offline because failures are better than a big queue in Jenkins. If anyone else wants to tinker on the machine, please do so. |
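(For context, refreshing a local git mirror is normally just a fetch run inside the mirror directory; a minimal sketch, assuming the mirror path mentioned later in this thread and a remote named origin:)

```sh
# Assumed mirror location (from later in this thread); remote name "origin" is an assumption.
cd /Users/iojs/git/io.js.reference

# Refresh all refs in the mirror and drop branches that were deleted upstream.
git fetch origin --prune
```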
@nodejs/build did anyone install a new jenkins plugin recently? |
ok, really heading off now. My remaining theory is that a plugin was installed recently that's interfering. If someone installed one in the last week or so, I'd appreciate it if you'd either drop a note in here or experiment with removing it and running node-test-commit-osx. |
The fact that there are only two updates available suggests that the plugins were updated recently, so it might have been an update that changed things. I haven't installed anything since the rebuild plugin (which was apparently in July #672). |
The node-test-commit-osx jobs seem to be trying to build on test-packetnet-ubuntu1604-x64-1, which seems to be at minimum a labeling error, but maybe a larger problem? See https://ci.nodejs.org/job/node-test-commit-osx/12132/ for an example. |
@Trott is that a problem? The parent job is running there because it has the |
Nope, not a problem if it works. :-D |
@gibfahn I updated the plugins while trying to sort this out, which is why they look like that. Didn't help. Interestingly there have been a few greens today and the last run only disconnected as it was already into the test running. I'll see if I can figure out how to determine which plugins were most recently installed and start removing them to see if that makes a difference. Another alternative might be to try and get a different JVM installed but as I've already mentioned that's going to be tricky without a GUI! |
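(One low-tech way to see which plugins changed recently is to sort the plugin files in the Jenkins home directory by modification time; a sketch, assuming a conventional $JENKINS_HOME layout rather than whatever this master actually uses:)

```sh
# JENKINS_HOME location is an assumption; plugins live there as .jpi/.hpi archives.
JENKINS_HOME=/var/lib/jenkins

# Newest-modified plugin archives first; recently installed or updated ones show at the top.
ls -lt "$JENKINS_HOME"/plugins/*.jpi "$JENKINS_HOME"/plugins/*.hpi 2>/dev/null | head -20
```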
some more weird https errors on my local infra, this time on pi1's building binaries: https://ci-release.nodejs.org/job/iojs+release/1990/nodes=pi1-raspbian-wheezy/console (only accessible to people with release access)
I haven't experienced anything similar on my own use of this network (I'm on it now) but this is very similar to what I was experiencing on the macOS box on the same network! I've done a cleanup on those two release machines and will keep an eye on those builds, I could conceive of a way this might be local to those machines but it's pretty suspicious on the face of it. Re macOS, I've downloaded the latest Oracle JRE macOS tarball and unpacked it into /usr/local/lib with a /usr/local/bin/java symlink and changed the slave.jar startup to use that binary and it's running now. It's going @ https://ci.nodejs.org/job/node-test-commit-osx/12159/nodes=osx1010/console but taking a very long time to even get started which makes me still suspect a problematic plugin. Will keep an eye on all of this and continue to work out what on earth is up. Again, any help or suggestions would be appreciated! |
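(Roughly, the JRE swap described above looks like the following; this is a sketch only, since the exact tarball name, extracted directory, slave.jar location, and JNLP secret are assumptions:)

```sh
# Unpack the downloaded Oracle JRE tarball (file and directory names below are assumptions).
sudo tar -xzf jre-8uXX-macosx-x64.tar.gz -C /usr/local/lib

# Point /usr/local/bin/java at the unpacked JRE.
sudo ln -sf /usr/local/lib/jre1.8.0_XX.jre/Contents/Home/bin/java /usr/local/bin/java

# Relaunch the Jenkins agent with the new binary (agent jar path, JNLP URL pattern and secret are assumptions).
/usr/local/bin/java -jar /Users/iojs/slave.jar \
  -jnlpUrl https://ci.nodejs.org/computer/test-requireio-osx1010-x64-1/slave-agent.jnlp \
  -secret "<secret>"
```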
@rvagg is there anything in the master jenkins logs? If it's to do with the git clone it might be an issue moving clones around. Not sure what else could be going on here, haven't been seeing this on our internal Jenkins instance at all. |
Hm, yeah, a bunch of these just now for that host only. Maybe I just need to shut everything down, including the internet connection, and start it all up again slowly. |
RE GUI, did you try to VNC to the VM? I have some success with that... |
@refack got some command-line-fu to enable screen sharing? it's not enabled on the VMs unfortunately. |
@joaocgreis can you look at the shared binary repo and see if you can make it prune any more aggressively? It's at 3.2G at the moment and may be contributing somewhat to network problems (not likely the cause of this particular problem though). |
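(For reference, a more aggressive prune of a shared repo like that would probably boil down to deleting stale branches and repacking; a sketch, with the repo path and the branch to keep as assumptions:)

```sh
# Repo path is an assumption; so is the idea that stale job branches are plain local branches.
cd /path/to/shared-binary-repo

# Delete every local branch except master (the branch to keep is an assumption).
git for-each-ref --format='%(refname:short)' refs/heads/ | while read -r branch; do
  [ "$branch" = "master" ] && continue
  git branch -D "$branch"
done

# Drop unreachable objects and repack to shrink the on-disk size.
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```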
✋ beaut! that worked nicely, I'm on the console now. |
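(For the archives, the usual command-line way to switch on macOS Screen Sharing / Remote Management is Apple's kickstart tool; this is a guess at the kind of command used, not necessarily the exact incantation:)

```sh
# Enables Remote Management (which VNC clients can connect to) for all users and restarts the agent.
sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart \
  -activate -configure -access -on -restart -agent -privs -all
```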
https://ci.nodejs.org/job/node-test-commit-osx/12160/nodes=osx1010/console More of the same, those errors are in jenkins.log in the master so I'm assuming they come from there rather than being reported from the slave. So, on the slave we have new java (same on master and slave), restarted (a few times), completely cycled all network hardware (none of the Pi's are even running as I'm writing this). Other than rogue plugins, the only other thing I can think of that might coincide with this is the updated SSL cert on master. The timing might explain this and the errors are SSL related. But why just this host? Not even the OSX host on ci-release is behaving this way. |
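(For anyone following along on the master, watching jenkins.log is roughly the following; the log path is an assumption and depends on how this master is started:)

```sh
# Log path is an assumption; adjust to however the Jenkins master is run on this host.
tail -f /var/log/jenkins/jenkins.log
```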
I've tried removing, reconfiguring, resetting network hardware in vmware for that VM. I've tried using NAT on the host instead of bridged direct onto the network. I've tried making new nodes on Jenkins and pointing it to them (test-requireio-osx1010-x64-2, test-requireio-osx1010-x64-3) just in case there's latent config garbage on the master. Same. |
Seeing similar errors now on almost all of the machines hosted on the same network, most of the Pi's are offline at the moment due to this and are labelled as "Ping response time is too long or timed out.". I did tweak the ping and timeout to lengthen it but apparently this hasn't helped matters, maybe even made it worse!? |
okie dokie, making progress finally. Installed an entirely new VM, from scratch, OSX 10.10, and we now have a string of greens on https://ci.nodejs.org/job/node-test-commit-osx/nodes=osx1010/ with no disconnects (yet). Crossing fingers that the problems were isolated to that VM and will go away now 🤞. I'm now seeing disconnection messages from other hosts similar to the ones I was worried about in my last message, so I think it's just a new form of Jenkins message I'm not used to seeing; for now I'll avoid assuming there's something bigger at play on the local network. While installing the VM I came up with a new script that could be ansiblised. I avoided the UI as much as possible for this. These instructions will need to be used (and maybe improved) by whoever gets far enough with the MacStadium hosts that replace this single machine. (Note for people not in the know: the MacStadium setup is super complicated, vSphere stuff that we're having to level up on; it's stalled because none of us have had the extra time needed to get further than we have. I believe IBM may be assigning some staff to help with this.) Steps listed below; I need to put these into the new doc/process/non-ansible-configuration-notes.md doc (or someone else is welcome to).
(edit: these two were run later as a result of @refack's comment below)
start.sh
/Library/LaunchDaemons/org.nodejs.osx.jenkins.plist
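(A hedged sketch of what wiring up that launch daemon typically involves; the plist contents and ownership details are assumptions since the actual files aren't reproduced here:)

```sh
# Install the daemon definition and hand it to launchd (plist contents not shown here).
sudo cp org.nodejs.osx.jenkins.plist /Library/LaunchDaemons/
sudo chown root:wheel /Library/LaunchDaemons/org.nodejs.osx.jenkins.plist

# Load it so the Jenkins agent (via start.sh) starts now and comes back up at boot.
sudo launchctl load -w /Library/LaunchDaemons/org.nodejs.osx.jenkins.plist
```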
|
An old favorite is raising its head:
Need to disable the crash reporter nodejs/node#13527 (comment)
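(The usual way to quiet the macOS crash reporter dialog, along the lines discussed in the linked issue, is something like the following; treat it as a sketch:)

```sh
# Stop the crash reporter from popping a dialog for every crashing test process.
defaults write com.apple.CrashReporter DialogType none

# Optionally stop ReportCrash itself from running for the login session.
launchctl unload -w /System/Library/LaunchAgents/com.apple.ReportCrash.plist
```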
|
P.S. it's rejecting the test key:
|
@rvagg I cleaned the binary repo (I can only do this manually, because there's no way to tell whether a branch is in use or not). Many branches were left behind when jobs were aborted, so I took a look and was able to change the way branches are deleted in |
@refack thanks, just ran those commands via ssh (no security complaints, didn't need to do it via UI!), this re-run of a previously failed job (due to that error) has an unrelated failure: https://ci.nodejs.org/job/node-test-commit-osx/12210/nodes=osx1010/console, so I think we might be good now. Will watch today's daily jobs to see if we get more green. I've updated my scripted list above to include these. 🤞 |
1 green job https://ci.nodejs.org/job/node-test-commit-osx/12211/ |
2nd green job in a row https://ci.nodejs.org/job/node-test-commit-osx/12212/ |
P.S. Let's get those commands added to ansible or whatever, yeah? |
Something's still up. The 3 failures in that list of most recent builds from today are disconnects during compile. Not super-early like before but still the same kind of disconnect. Still mostly green but regular disconnects it seems. I haven't looked any further back at the failures other than today's, I only just caught the machine offline so went for a quick dig. If anyone else wants to, go to https://ci.nodejs.org/computer/test-requireio-osx1010-x64-1/builds, wait for it to load and click through to each of them and look at the console output to see if it's a test failure or disconnection. I don't have an answer here except that we need to get on with our mac IaaS work and have more of these for redundancy and hopefully stability. |
@rvagg @joaocgreis |
I can't connect either. Missing |
yes, sorry, new rebuild missed out on key, fixed now, reopen if it doesn't work for y'all |
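(For completeness, re-adding the test CI key after a rebuild is just an authorized_keys entry for the iojs user; a sketch, with a hypothetical key file name:)

```sh
# "nodejs_test_ci.pub" is a hypothetical file name for the test CI public key.
mkdir -p ~iojs/.ssh
cat nodejs_test_ci.pub >> ~iojs/.ssh/authorized_keys
chmod 700 ~iojs/.ssh
chmod 600 ~iojs/.ssh/authorized_keys
```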
I took it offline for now https://ci.nodejs.org/computer/test-requireio-osx1010-x64-1/
It started with perma:
I logged in and made the mistake of rimrafing the workspace. The job setup is broken as it refers to
/home/iojs/io.js.reference
when the local mirror is actually at /Users/iojs/git/io.js.reference.
So I cloned the local mirror manually and got one step further.
But there I'm stuck...
/cc @rvagg @joaocgreis
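(A sketch of one way to do that manual clone, using the existing local mirror as a reference so the full history doesn't have to be refetched; the workspace path is an assumption:)

```sh
# Clone upstream but borrow objects from the local mirror to avoid a full refetch.
# The target workspace path is an assumption.
git clone --reference /Users/iojs/git/io.js.reference \
  https://github.com/nodejs/node.git /Users/iojs/build/workspace/node
```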