CI needs upgrading #200
Definitely no objections if that's the easiest path for you. How much of the UI sugar that Jenkins offers are you considering exposing? Looking at what firedrake currently offers, there are things I'd request with e.g. the test logging, although I realise everything has security implications. |
I think it probably is - I've done three Jenkins deployments recently and not looked much at buildbot for the last six months, so I'm much more up to speed with Jenkins. I was planning to be as bells-and-whistles as possible for this so open to anything people would like including. |
Sounds like the best way forward indeed. Also much prefer to stick with something we (and in particular you :-)) are already familiar with if it is up to the job. |
Making progress here: I think this is probably not a bad point to move the information in the docker repo to within Fluidity and simplify the builds a bit. |
When you reach the stage that it's building, then if you're willing to add |
Updating on this - I think the Jenkins setup is mostly done now, running through some tidying up now and extending to cover the set of builds we're interested in. |
Is now a good time for feedback/opinions/feature requests on the setup, or do you still have tweaks you know you need to do? |
Now is a great time :-) I'm working on speeding it up a bit now - things like starting the mediumtests at the same time as the unit and short tests. Also adding in a parallel test stream for make makefiles, packaging, etc. And need to add the OpenMP and debug builders. Time at the moment looks about the same as buildbot was giving. |
I think my points are all UI/UX related, and several of them would need consensus.
|
@jrper: Could you clarify no. 1? What do you mean by "not successful vs. failed", and how do you want them marked? Agree with the other points, in particular no. 2. @tmbgreaves: for speeding up, I think it might really be worth it to improve the dumb scheduling testharness does with parallel tests. See my comments on #182. This might also improve reliability, as I suspect some intermittent failures may be due to being unlucky, in the sense of testharness attempting to run all the most beefy parallel jobs at the same time. |
Jenkins has a fairly fuzzy but extensible model of "code health", which is connected to, but not exactly the same as, the build result. If we want the buildbot behaviour back, then Jenkins can be made to understand the test results and mark a build in which one test failed as failed itself. Currently, it's marking any build in which the tests complete as successful, regardless of test failures. |
Am I interpreting correctly that the way to deal with test failures being passed back up to Jenkins is to call testharness with --exit-failure-count ? |
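As a sketch of how that could be wired up in a declarative Jenkinsfile - the pipeline skeleton, stage name, and script invocation below are illustrative; only the --exit-failure-count flag comes from the comments here:
pipeline {
    agent any
    stages {
        stage('Short tests') {
            steps {
                // A nonzero exit status from testharness (requested via
                // --exit-failure-count) fails this sh step, and hence the build.
                sh 'python tools/testharness.py --exit-failure-count'
            }
        }
    }
}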
Re. splitting up into stages - I agree with @jrper and am figuring out the neatest way to split up the config/build/test. |
There is also the question of how useful or not it is to rebuild the base container from scratch on a per-build basis. It's a fairly high cost thing to do which is why I'd extracted it from the routine build on buildbot; I'm tempted to just have one sitting on dockerhub updated by cron on the jenkins master rather than wrapping it into Jenkins. |
As far as I remember, on the last iteration we made it that
in the Jenkinsfile. |
Ah ok, thanks, that clarifies things - yes, I agree any single test failure should be marked as an overall failure, and yes, if Jenkins decides that based on a nonzero exit code, then --exit-failure-count should work. Is rebuilding the container the bit where it installs required packages starting from a bare image? If so then, yes, I don't see a need to redo that every time, as long as fluidity itself gets rebuilt. Then if the container is rebuilt, triggered by cron, maybe it's a good idea to have that trigger a master rebuild with the new container - and also have a way of identifying, in the output of any build, which version of the container has been used for it. |
Regarding rebuilding the containers, it's definitely fair to only update periodically. One idea that springs to mind is making the rebuild a separate (minimally wrapped, non-pipeline) Jenkins project, rather than a pure cron task. That way it might be easier to trigger by hand on the odd occasion that it's needed and @tmbgreaves is indisposed. |
Missing a trick here - pulling the base image and then updating it will, I think, both keep us up to date and make the build a lot more efficient. |
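A sketch of that workflow, with illustrative image names - keep a published base image, pull it so the local copy is current, and build Fluidity on top of it rather than rebuilding the base from scratch:
# Image and tag names below are placeholders, not the real repository layout.
docker pull fluidity/baseimage:xenial
# The Fluidity Dockerfile is assumed to start 'FROM fluidity/baseimage:xenial',
# so only the Fluidity layers are rebuilt here.
docker build -t fluidity/fluidity:xenial .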
Brain dumping into the issue here to make sure I don't forget - need to ensure that if the pipeline repeatedly forks and rejoins, workspaces stay separate and consistently connected to the right OS for multiple-OS builds. |
Additional note on this - trying to get internally parallel builds on top of parallel OS builds is causing all kinds of problems; it might be easier to have multiple Jenkins projects, one for each OS? ... which looks to be possible, with some hackery, to do from a single Jenkinsfile ... |
There is a completed run on xenial now. |
Do you want eyes on the test failures it's thrown up, or are they known issues with the environment? Regarding the log files, it looks like adding |
Also, for the convenience of those watching part time, the new link is https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/activity |
Just a little note: |
Regarding point 3) in my wishlist above, and as a point of negotiation, in 7ab620a I've added a commit which updates testharness to have a new option, -a or --archive-failures, which takes a filename and writes the directories of failing tests to a gzipped tarball under that name. It also modifies the root makefile, adding a new variable TESTHARNESS_OPTIONS to expose this. Usage would be
make TESTHARNESS_OPTIONS=short_test_failures.tar.gz test
make TESTHARNESS_OPTIONS=medium_test_failures.tar.gz medium test
with the failure tarballs then archived from the Jenkinsfile via
post {
    always {
        archiveArtifacts '*test_failures.tar.gz'
    }
}
I believe there is finer-scale control available over how much and how long things get kept for, but I'm not up to full speed with pipeline builds. |
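For reference, declarative pipeline does expose that sort of retention control through the options block; a minimal sketch with arbitrary values:
options {
    // Keep the last 20 build records, but only retain archived artifacts
    // (e.g. the failure tarballs) for the most recent 5 builds.
    buildDiscarder(logRotator(numToKeepStr: '20', artifactNumToKeepStr: '5'))
}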
Just jumping on the wish-list bandwagon here… The first is more of a testharness request, but as these seem to have been thrown into this issue, I thought, why not…
Request 1: Currently, testharness assumes mpiexec as the MPI launcher. This is fine for most systems (it clearly works on the systems at IC and on local machines). However, a number of systems (including those here in Australia - for example magnus - https://www.pawsey.org.au/our-systems/magnus-technical-specifications/) use aprun or srun (there are probably others too). Would it be possible to update testharness such that a user-defined launcher could be used (with mpiexec being the default)?
Request 2: Would it be possible to add builds on the Australian supercomputers (both Raijin, which is in Canberra, and Magnus, which is in Perth) to the Jenkins manifesto? I think that would be quite useful in the long term, given the growing user-base. Happy to liaise with the relevant people on this side to get that sorted - I just need to be told what is required at the Jenkins end!
Best wishes,
Rhod
|
Having looked at the threading in testharness a bit, I'm not sure that, as written, it actually does much except speed up I/O, due to the Python global interpreter lock. I'm having a go at testing a migration from the threading module to multiprocessing. @drhodrid - for pure testharness runs, I'm confident your request 1) is doable. The code already replaces the |
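A minimal illustration of the threads-versus-processes point, not the actual testharness code - the test directories and the 'make run' command are hypothetical:
from multiprocessing import Pool
import subprocess

def run_test(test_dir):
    # Each test is an external command, so the child process does the real work.
    result = subprocess.run(["make", "run"], cwd=test_dir)
    return test_dir, result.returncode

if __name__ == "__main__":
    test_dirs = ["tests/example_one", "tests/example_two"]  # hypothetical
    # A process pool sidesteps any GIL contention in the Python-side
    # bookkeeping; plain threads mostly just overlap the subprocess I/O.
    with Pool(processes=4) as pool:
        for name, code in pool.map(run_test, test_dirs):
            print(name, "passed" if code == 0 else "failed")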
As far as I can tell from the documentation, switching to a managed queue only makes sense if there's a desire to test using a cluster in multi-node configuration. This isn't really the use case anyone has put testharness to so far, which is basically single machine, single process (due to exactly the issues you note). In testing, I found that the @drhodrid - see afa8f96 for the changes to make the choice of MPI launcher a variable. This adds a new option, |
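The option name is truncated in the comment above, so the flag in this usage sketch is a placeholder rather than the one actually added in afa8f96; the point is simply that the launcher becomes user-selectable, with mpiexec as the default:
# '--mpi-launcher' is a hypothetical placeholder for the real option name.
python tools/testharness.py --mpi-launcher=srun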
Coming back to this now after getting derailed onto opesci for a while. I think this is down to two areas of test failures, both of which pass when run in the same container but interactively, so probably some artifact of the Jenkins environment. Poking at it ... |
Looks like these are OpenMPI issues with /proc/mounts containing very long entries as a result of the Jenkins overlays, and the OpenMPI error is polluting tests which check error files. |
Attempting a workaround for this with some creative use of sed in the test run command. |
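Roughly the shape of that workaround - the warning pattern and file name are placeholders, since the exact OpenMPI message isn't quoted here:
# PLACEHOLDER_OPENMPI_WARNING stands in for the real warning text emitted
# when OpenMPI trips over the very long /proc/mounts entries created by
# the Jenkins overlay filesystems; stripping it keeps tests that inspect
# their error files from seeing spurious content.
sed -i '/PLACEHOLDER_OPENMPI_WARNING/d' test_case.err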
That 'fixes' the netcdf test. |
Working my way back up the comments - @drhodrid - very happy to work on getting some testing on the ANU systems. I will think through what's involved with that, as it sounds like it could be easier for you to set up something local which interfaces with GitHub to post build outcomes, as opposed to trying to tie it in to the existing IC-based master, which would need to get through the ANU firewall to talk to your builders. |
Completely baffled for the moment by the three remaining failures on Ubuntu, which I've failed to replicate outside Jenkins - so have disabled them for the moment pending more detailed debugging. If anyone has ideas, please take a look at the run which focuses on these errors - https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/detail/jenkins-fix-failing-tests/4/pipeline/30 |
@tmbgreaves |
Hi Tim,
The systems I’m referring to are not ANU systems. They’re national clusters and, accordingly, can be accessed outside of the ANU (Steph already accesses Raijin from Imperial, for example). My motivation for having them as part of the main testing repository, rather than something stand-alone within Australia, is that:
1. They provide system environments that are very different to those of cx1/cx2 and others in the UK.
2. We’ve set things up to run with different compiler versions, different versions of external libraries etc…
Taken together, this would allow us to flag issues prior to merges that would not be picked up otherwise. As an example, see the issues caused by James’ recent change to VTK, which we had to fix post-merge.
A further bonus is that it would allow us to tie in the vast expertise at the NCI and Pawsey facilities within Australia, when it comes to compiler issues, debugging etc… I’m already aware that these people are more than happy to help us in this process.
Best wishes,
Rhod
|
@tmbgreaves Looking at the results of the failing tests, it looks like the C++ code to write the backtrace has flushed to disk, but for some reason the Fortran code before it hasn't. Investigating deeper, the assumption appears to have been made that the Fortran error unit is 0, which I guess might not be true inside Docker inside Jenkins. With recent versions of gfortran we can get these units from the iso_fortran_env intrinsic module. The other conceivable possibility is that a buffer isn't getting flushed somewhere, but that gets more involved. |
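A minimal sketch of the standard-module route (plain Fortran 2003, not Fluidity code):
program error_unit_demo
    ! error_unit and output_unit come from the intrinsic module, so no
    ! assumption about units 0 and 6 is needed.
    use iso_fortran_env, only: error_unit, output_unit
    implicit none
    write(error_unit, '(a)') 'this goes to the error stream'
    write(output_unit, '(a)') 'this goes to standard output'
end program error_unit_demo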
@drhodrid - That sounds great. Once we have a more solid setup on Jenkins I am very happy to work on making it happen. As a side note - I'm not sure if/how we'll get ICHPC tied into the new Jenkins setup, as our buildbot methods really don't suit the Jenkins configuration. I'm considering leaving a skeleton buildbot up to keep track of ICHPC builds and the longtests. @jrper - thanks for looking at it - how recent does gfortran need to be to get that update? I guess since we're seeing issues with post-gcc5 in other areas, this is something that will stay on the wish-list for the moment? |
I've filed a PR for the initial Jenkins setup - if there aren't any major objections to how I've set it up I'd be keen to get it into master so we have some CI in place, after which I'll re-branch and set up a more comprehensive testing suite. |
@Patol75 - many thanks for that tip, has cleaned up the 'missing infiniband' output :-) |
Digging further, that seems to be one of the bits of Fortran 2003 which actually made it into gfortran 4, so I think we're ok for any compiler we'd like to claim we support. Regardless, if you're willing to test whether that change alone is enough to sort out that problem, I'm willing to harden the changes against incompatible compilers. More importantly, if it doesn't fix the issue, then we have to treat the error logs on the Jenkins build as always potentially incomplete. |
Brief update: I can get the same test failures locally in a Jenkins + Docker setup, so that seems to be the necessary combination. Unfortunately the |
FWIW: I can reproduce the failure locally if I redirect the output of testharness (i.e.
right after the ewrites in flexit_pinpoint and flabort_pinpoint. |
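The exact snippet is lost in the quoting above, but here is a sketch of the kind of change being suggested, assuming the iso_fortran_env units from earlier - an explicit flush after the final ewrites so buffered output reaches the log before the process exits:
subroutine report_and_flush(msg)
    ! Sketch only, not the actual flexit_pinpoint/flabort_pinpoint code.
    use iso_fortran_env, only: error_unit
    implicit none
    character(len=*), intent(in) :: msg
    write(error_unit, '(a)') trim(msg)
    flush(error_unit)   ! make sure the message is on disk before aborting
end subroutine report_and_flush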
That seems to be enough inside Jenkins as well, and is probably A Good Thing™ to do. The downside is that it would only get called on an abort/exit call, unless we modify the ewrite macro itself. |
Two minutes of googling suggests that we also want to set the environment variable |
Thanks for the suggestions - I'll act on them for the follow-up Jenkins tweaking PR. |
@tmbgreaves Since you've announced CI as open via email, is there anything we should do to get PRs migrated? |
It might need the Jenkinsfile adding to the branch, but I was hoping it wouldn't, as it tests the branch-into-master merged source tree, which will contain a Jenkinsfile. It won't trigger a build immediately, since there's no change to the PR, but I suspect a repo rescan will sort that out - I'll do one now. |
Having poked this, it looks like the Jenkinsfile needs to be merged into a branch before it will trigger a build of the associated PR. |
Ok that sounds reasonable. PRs should ideally be more or less up-to-date with master anyway. |
That seems to have worked automatically, using #198 as a guinea pig. Unfortunately, but correctly, it appears the merge invalidates code review checks, so @stephankramer, is there any chance you can say yes again to it? Assuming it passes, is it worth me wrapping up the changeset to make testharness properly parallel into a fresh PR next, and if so, how many serial threads do we want to target on buildbot? |
I think running four at a time is fairly safe. I would be up for trying eight and seeing if it fell over; the nodes we're testing on are reasonably solid, so here's hoping.... |
@tmbgreaves I'm not sure if this is me misunderstanding something, but it appears that the docker agents on Jenkins are all set up as single physical core / dual logical core containers. This would explain why switching to the multiprocessing module wasn't actually giving any speed up with large numbers of threads, since it's just oversubscribing. I'm trying again, letting testharness run two serial tests at once to see if our scaling curve gets back to what I expected. |
Also, at some point it would be good to have an (offline) discussion about what testing protocols we want to run, since I'm not sure the current one is optimal. |
I'll have a look at the system spec for the workers - that does sound about right, as we're not provisioning that high for the docker host. We can easily add higher-spec nodes which run the tests in a shorter time and so cost about the same, I think. I'll have a look at options for that. |
Buildbot 0.8.x isn't playing happily with CentOS 7 any more, and if we want to carry on using buildbot the config needs a rewrite to the 1.x syntax. Since most of the other projects we work with have moved over to using Jenkins, and Jenkins integrates far better with GitHub, I think attempting a migration to Jenkins is probably the best way forward.
I would welcome comments and feedback on this!