
CI needs upgrading #200

Closed
tmbgreaves opened this issue May 17, 2018 · 57 comments

Comments

@tmbgreaves
Contributor

tmbgreaves commented May 17, 2018

Buildbot 0.8.x isn't playing happily with CentOS 7 any more, and if we want to carry on using buildbot the config needs a rewrite to the 1.x syntax. Since most of the other projects we work with have moved over to using Jenkins, and Jenkins integrates far better with GitHub, I think attempting a migration to Jenkins is probably the best way forward.

I would welcome comments and feedback on this!

@jrper
Contributor

jrper commented May 17, 2018

Definitely no objections if that's the easiest path for you. How much of the UI sugar that Jenkins offers are you considering exposing? Looking at what firedrake currently offers, there are things I'd request around e.g. the test logging, although I realise everything has security implications.

@tmbgreaves
Contributor Author

I think it probably is - I've done three Jenkins deployments recently and not looked much at buildbot for the last six months, so I'm much more up to speed with Jenkins. I was planning to be as bells-and-whistles as possible for this, so I'm open to anything people would like included.

@stephankramer
Contributor

Sounds like the best way forward indeed. I'd also much prefer to stick with something we (and in particular you :-)) are already familiar with, if it is up to the job.

@tmbgreaves
Contributor Author

Making progress here:
https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity/activity

I think this is probably not a bad point to move the information in the docker repo to within Fluidity and simplify the builds a bit.

@tmbgreaves tmbgreaves self-assigned this May 17, 2018
@jrper
Contributor

jrper commented May 17, 2018

When you reach the stage that it's building, if you're willing to add python-junit.xml (not the almost identically named python-junitxml) to the packages being installed, then testharness should spit out JUnit-compatible .xml files summarising the results of the tests.

@tmbgreaves
Contributor Author

Updating on this - I think the Jenkins setup is mostly done now; I'm running through some tidying up and extending it to cover the set of builds we're interested in.

@jrper
Contributor

jrper commented May 23, 2018

Is now a good time for feedback/opinions/feature requests on the setup, or do you still have tweaks you know you need to do?

@tmbgreaves
Contributor Author

Now is a great time :-) I'm working on speeding it up a bit - things like starting the mediumtests at the same time as the unit and short tests. I'm also adding a parallel test stream for make makefiles, packaging, etc., and need to add the OpenMP and debug builders. Build time at the moment looks about the same as buildbot was giving.
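
For reference, a minimal sketch of that kind of fan-out in declarative pipeline syntax - the stage and make target names here are illustrative, not the actual Fluidity Jenkinsfile:

    // Sketch only: run the test streams as parallel stages so the medium tests
    // start at the same time as the unit and short tests.
    stage('Tests') {
        parallel {
            stage('Unit tests')   { steps { sh 'make unittest' } }
            stage('Short tests')  { steps { sh 'make test' } }
            stage('Medium tests') { steps { sh 'make mediumtest' } }
        }
    }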

@jrper
Contributor

jrper commented May 23, 2018

I think my points are all UI/UX related, and several of them would need consensus.

  1. Given our previous pattern on buildbot, we probably want runs where tests fail to be marked more clearly as not successful. Possibly even as failed.
  2. If it fits in with the structure of a parallel pipeline build, it would be nice to split setup, configuration, building and testing into separate Jenkins stages, so that the points of failure are more obvious on the cross tables.
  3. If enough disk space is available, it would be nice to automatically artifact out failing test directories; if not, could we have a readily downloadable version of the individual stage logs.
  4. If we're putting Jenkinsfiles into the git repository, it might be good to have a single user, local system setup as well, to trivialise the process of people rolling their own projects.

@stephankramer
Contributor

@jrper: Could you clarify no. 1? What do you mean by "not successful" vs. "failed", and how do you want them marked? Agree with the other points, in particular no. 2.

@tmbgreaves : for speeding up, I think it might really be worth improving the dumb scheduling testharness does with parallel tests. See my comments on #182. This might also improve reliability, as I suspect some intermittent failures may come down to bad luck, with testharness attempting to run all of the beefiest parallel jobs at the same time.

@jrper
Contributor

jrper commented May 23, 2018

Jenkins has a fairly fuzzy but extensible model of "code health", which is connected to, but not exactly the same as, the build result. If we want the buildbot behaviour back, then Jenkins can be made to understand the test results and mark a build in which one test failed as failed itself. Currently, it's marking any build in which the tests complete as successful, regardless of test failures.

@tmbgreaves
Contributor Author

Am I interpreting correctly that the way to deal with test failures being passed back up to Jenkins is to call testharness with --exit-failure-count ?

@tmbgreaves
Contributor Author

Re. splitting up into stages - I agree with @jrper and am figuring out the neatest way to split up the config/build/test.

@tmbgreaves
Contributor Author

There is also the question of how useful or not it is to rebuild the base container from scratch on a per-build basis. It's a fairly high-cost thing to do, which is why I'd extracted it from the routine build on buildbot; I'm tempted to just have one sitting on dockerhub, updated by cron on the Jenkins master, rather than wrapping it into Jenkins.

@jrper
Contributor

jrper commented May 23, 2018

As far as I remember, on the last iteration we made it so that --exit-failure-count means the testharness return code is the number of test failures. That provides coarse-grained coupling, but enough to fail the build when tests fail. If we want finer-grained coupling, you'd need to install python-junit.xml inside the containers and the JUnit plugin on the Jenkins host, copy the test_results.xml and test_results_medium.xml back to the host at the end of the test script, and include a

post {
    always {
        junit 'test_results.xml'
    }
}

in the Jenkinsfile.
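
For what it's worth, a sketch of how the two pieces fit together in a single test stage - the stage name, the testharness invocation and the result filename location are assumptions, not the project's actual Jenkinsfile:

    // Sketch only: coarse failure via the exit code, per-test results via the JUnit plugin.
    stage('Short tests') {
        steps {
            // --exit-failure-count makes the testharness exit code equal the number of
            // failed tests, so any failure marks the stage (and the build) as failed.
            sh './tools/testharness.py --exit-failure-count'
        }
        post {
            always {
                // Needs python-junit.xml inside the container so testharness writes the XML,
                // and the JUnit plugin on the Jenkins host to parse it.
                junit 'test_results.xml'
            }
        }
    }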

@stephankramer
Contributor

Ah OK, thanks, that clarifies things - yes, I agree any single test failure should be marked as an overall failure, and yes, if Jenkins decides that based on a nonzero exit code, then --exit-failure-count should work.

Is rebuilding the container the bit where it installs required packages starting from a bare image? If so, then yes, I don't see a need to redo that every time, as long as fluidity itself gets rebuilt. If the container is rebuilt, triggered by cron, maybe it's a good idea to have that trigger a master rebuild with the new container - and also to have a way of identifying, in the output of any build, which version of the container was used for it.

@jrper
Contributor

jrper commented May 23, 2018

Regarding rebuilding the containers, it's definitely fair to only update periodically. One idea that springs to mind is making the rebuild a separate (minimally wrapped, non-pipeline) Jenkins project, rather than a pure cron task. That way it might be easier to trigger by hand on the odd occasion that it's needed and @tmbgreaves is indisposed.
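
A rough sketch of what such a wrapper project could run, whether as a small pipeline or a freestyle job - the image name, tag scheme and schedule here are placeholders, not an existing setup:

    // Sketch of a standalone image-refresh job: nightly via cron, or by hand via "Build Now".
    pipeline {
        agent any
        triggers { cron('H 2 * * *') }
        stages {
            stage('Refresh base image') {
                steps {
                    // --pull picks up the latest upstream base layers; the date tag gives
                    // downstream builds a way to identify which container version they used.
                    sh '''
                       TAG=$(date +%Y%m%d)
                       docker build --pull -t fluidity/baseimage:$TAG docker/
                       docker push fluidity/baseimage:$TAG
                    '''
                }
            }
        }
    }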

@tmbgreaves
Contributor Author

Missing a trick here - pulling the base image and then updating it will, I think, both keep us up to date and make the build a lot more efficient.

@tmbgreaves
Contributor Author

Brain-dumping into the issue here to make sure I don't forget - if the pipeline repeatedly forks and rejoins, we need to ensure that workspaces are kept separate and consistently connected to the right OS for multi-OS builds.

@tmbgreaves
Contributor Author

tmbgreaves commented May 23, 2018

Additional note on this - trying to get internally parallel builds on parallel OS builds is causing all kinds of problems; it might be easier to have multiple Jenkins projects, one for each OS?

... which looks to be possible, with some hackery, to do from a single Jenkinsfile ...
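
For the single-Jenkinsfile route, the shape would be roughly as follows, with each OS pinned to its own docker agent and hence its own workspace - the image names are placeholders:

    // Sketch only: per-OS parallel stages, each with its own docker agent and workspace.
    pipeline {
        agent none
        stages {
            stage('Build and test') {
                parallel {
                    stage('Xenial') {
                        agent { docker { image 'fluidity/xenial-build' } }
                        steps { sh 'make && make test' }
                    }
                    stage('CentOS 7') {
                        agent { docker { image 'fluidity/centos7-build' } }
                        steps { sh 'make && make test' }
                    }
                }
            }
        }
    }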

@tmbgreaves
Contributor Author

There is a completed run on xenial now.

@jrper
Contributor

jrper commented May 24, 2018

Do you want eyes on the test failures it's thrown up, or are they known issues with the environment?

Regarding the log files, it looks like adding OMPI_MCA_btl=^openib to the shell environment will prevent OpenMPI from trying to use InfiniBand when it can't. If anyone knows a good way to silence the warnings due to the length of /proc/mounts in docker, then I'd like to hear it.

@jrper
Contributor

jrper commented May 24, 2018

Also, for the convenience of those watching part time, the new link is https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/activity

@stephankramer
Contributor

Just a little note: make -j8 test and make -j8 mediumtest are probably not doing what you expect. The -j8 just parallelises over targets as make sees them, but from make's perspective there is only a single testharness call to be executed. If you want multiple tests to run simultaneously you need make THREADS=8 test, but that comes with the dumb-scheduling problem in testharness: it will run 8 tests simultaneously regardless of how many cores each one of them uses.

@jrper
Contributor

jrper commented May 24, 2018

Regarding point 3) in my wishlist above, and as a point of negotiation, in 7ab620a I've added a commit which updates testharness with a new option, -a or --archive-failures, which takes a filename and writes the directories of failing tests to a gzipped tarball of that name. It also modifies the root makefile, adding a new variable, TESTHARNESS_OPTIONS, to expose this.

Usage would be

make TESTHARNESS_OPTIONS="-a short_test_failures.tar.gz" test
make TESTHARNESS_OPTIONS="-a medium_test_failures.tar.gz" mediumtest

post {
    always {
        archiveArtifacts '*test_failures.tar.gz'
    }
}

I believe there is finer scale control available over how much and how long things get kept for, but I'm not up to full speed with pipeline builds.

@drhodrid
Contributor

drhodrid commented May 24, 2018 via email

@jrper
Contributor

jrper commented May 25, 2018

Having looked at the threading in testharness a bit, I'm not sure that, as written, it actually does much except speed up I/O, due to the Python global interpreter lock. I'm having a go at testing a migration from the threading module to multiprocessing, and it's promising, but it might need another pair of eyes or two.

@drhodrid for pure testharness runs, I'm confident your request 1) is doable. The code already replaces the mpiexec in tests with mpiexec -n <nprocs>, so provided the other launcher has an equivalent syntax, this is just a matter of adding another option. I'm willing to attempt the job if you'll undertake to test it. On the other hand, if you're using PBS scripts, it's your own problem.

@jrper
Contributor

jrper commented May 28, 2018

As far as I can tell from the documentation, switching to a managed queue only makes sense if there's a desire to test using a cluster in multi-node configuration. This isn't really the use case anyone has put testharness to so far, which is basically single machine, single process (due to exactly the issues you note). In testing, I found that the multiprocessing.Pool object just wasn't versatile enough for the problem we have here, where the subprocesses themselves can each spin off multiple MPI processes. The remaining difficulty is mostly scheduling. It's possible the right answer is to start with the parallel jobs and then spin up a pool of serial workers as they complete, but that needs a load of testing I'm not sure I have time for.

@drhodrid See afa8f96 for the changes to make the choice of MPI launcher a variable. This adds a new option, --mpi-launcher or -m, used like `testharness -m "srun -n"`. In general, the launcher being substituted has to take a space and then an integer specifying the number of processes.

@tmbgreaves
Contributor Author

Coming back to this now after getting derailed onto opesci for a while. I think this is down to two areas of test failures, both of which pass when run interactively in the same container, so it is probably some artifact of the Jenkins environment. Poking at it ...

@tmbgreaves
Contributor Author

Looks like these are OpenMPI issues with /proc/mounts containing very long entries as a result of the Jenkins overlays, and the OpenMPI error is polluting tests which check error files.

@tmbgreaves
Contributor Author

Attempting a workaround for this with some creative use of sed in the test run command.

@tmbgreaves
Contributor Author

That 'fixes' the netcdf test.

@tmbgreaves
Contributor Author

Working my way back up the comments - @drhodrid - very happy to work on getting some testing on the ANU systems. I will think through what's involved; it sounds like it could be easier for you to set up something local which interfaces with GitHub to post build outcomes, as opposed to trying to tie it into the existing IC-based master, which would need to get through the ANU firewall to talk to your builders.

@tmbgreaves
Contributor Author

tmbgreaves commented May 31, 2018

Completely baffled by the three remaining failures on Ubuntu, which I've failed to replicate outside Jenkins - so I have disabled them for the moment pending more detailed debugging. If anyone has ideas, please take a look at the run which focuses on these errors - https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/detail/jenkins-fix-failing-tests/4/pipeline/30
.... any suggestions gratefully received.

@Patol75
Contributor

Patol75 commented Jun 1, 2018

@drhodrid
Contributor

drhodrid commented Jun 1, 2018 via email

@jrper
Contributor

jrper commented Jun 1, 2018

@tmbgreaves Looking at the results of the failing tests, it looks like the C++ code to write the backtrace has flushed to disk, but for some reason the Fortran code before it hasn't. Digging deeper, the assumption appears to have been made that the Fortran error unit is 0, which I guess might not be true inside docker inside Jenkins. With most recent versions of gfortran we can get these units from iso_fortran_env; see d177b25.

The other conceivable possibility is that a buffer isn't getting flushed somewhere, but that gets more involved.

@tmbgreaves
Contributor Author

@drhodrid - That sounds great. Once we have a more solid setup on Jenkins I am very happy to work on making it happen. As a side note - I'm not sure if/how we'll get ICHPC tied into the new Jenkins setup, as our buildbot methods really don't suit the Jenkins configuration. I'm considering leaving a skeleton buildbot up to keep track of ICHPC builds and the longtests.

@jrper - thanks for looking at it - how recent does gfortran need to be to get that update? I guess since we're seeing issues with post-gcc5 in other areas this is something that will stay on the wish-list for the moment?

@tmbgreaves
Contributor Author

I've filed a PR for the initial Jenkins setup - if there aren't any major objections to how I've set it up, I'd be keen to get it into master so we have some CI in place, after which I'll re-branch and set up a more comprehensive testing suite.

@tmbgreaves
Contributor Author

@Patol75 - many thanks for that tip; it has cleaned up the 'missing infiniband' output :-)

@jrper
Contributor

jrper commented Jun 3, 2018

Digging further, that seems to be one of the bits of Fortran 2003 which actually made it into gfortran 4, so I think we're OK for any compiler we'd like to claim we support. Regardless, if you're willing to test whether that change alone is enough to sort out that problem, I'm willing to harden the changes against incompatible compilers. More importantly, if it doesn't fix the issue, then we have to treat the error logs on the Jenkins builds as always potentially incomplete.

@jrper
Contributor

jrper commented Jun 4, 2018

Brief update: I can get the same test failures locally in a Jenkins + docker setup, so that seems to be the necessary combination. Unfortunately the iso_fortran_env change isn't enough to clear the error.

@stephankramer
Contributor

stephankramer commented Jun 4, 2018

FWIW: I can reproduce the failure locally if I redirect the output of testharness (i.e. ../tools/testharness.py -f unresolvable_mesh_dependency.xml >& out). The problem for me seems to go away if I put in

  flush(debug_error_unit)
  flush(debug_log_unit)

right after the ewrites in flexit_pinpoint and flabort_pinpoint.

@jrper
Contributor

jrper commented Jun 4, 2018

That seems to be enough inside Jenkins as well and is probably A Good Thing™ to do. The downside is that it would only get called on an abort/exit call, unless we modify the ewrite macro itself.

@jrper
Contributor

jrper commented Jun 4, 2018

Two minutes of googling suggests that we also want to set the environment variable GFORTRAN_UNBUFFERED_ALL=1 to force unbuffered I/O in all situations.
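
Between this and the OMPI_MCA_btl suggestion above, both workarounds could be set once for every stage via an environment block - a sketch, assuming a declarative top level; the agent and stage shown are placeholders:

    // Sketch: set both workaround variables for all stages of a declarative pipeline.
    pipeline {
        agent any
        environment {
            // Stop OpenMPI probing for InfiniBand inside the containers.
            OMPI_MCA_btl = '^openib'
            // Force gfortran to run unbuffered so error logs are complete when a test aborts.
            GFORTRAN_UNBUFFERED_ALL = '1'
        }
        stages {
            stage('Test') {
                steps { sh 'make test' }
            }
        }
    }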

@tmbgreaves
Contributor Author

Thanks for the suggestions - I'll act on them for the follow-up Jenkins tweaking PR.

@jrper
Contributor

jrper commented Jun 6, 2018

@tmbgreaves Since you've announced CI as open via email, is there anything we should do to get PRs migrated?

@tmbgreaves
Contributor Author

It might need the Jenkinsfile added to the branch, but I was hoping it wouldn't, as Jenkins tests the branch-into-master merged source tree, which will contain a Jenkinsfile. It won't trigger a build immediately, since there's no change to the PR, but I suspect a repo rescan will sort that out - I'll do one now.

@tmbgreaves
Contributor Author

Having poked this, it looks like the Jenkinsfile needs to be merged into a branch before it will trigger a build of the associated PR.

@stephankramer
Contributor

OK, that sounds reasonable. PRs should ideally be more or less up to date with master anyway.

@jrper
Contributor

jrper commented Jun 6, 2018

That seems to have worked automatically using #198 as a guinea pig. Unfortunately, but correctly, it appears the merge invalidates code review checks, so @stephankramer is there any chance you can say yes to it again? Assuming it passes, is it worth me wrapping up the changeset to make testharness properly parallel into a fresh PR next, and if so, how many serial threads do we want to target on buildbot?

@tmbgreaves
Contributor Author

I think running four at a time is fairly safe. I would be up for trying eight and seeing if it fell over; the nodes we're testing on are reasonably solid, so here's hoping....

tmbgreaves added a commit that referenced this issue Jun 6, 2018
Applying suggestion from @jrper in issue #200
@jrper
Contributor

jrper commented Jun 7, 2018

@tmbgreaves I'm not sure if this is me misunderstanding something, but it appears that the docker agents on Jenkins are all set up as single physical core/dual logical core containers. This would explain why switching to the multiprocessing module wasn't actually giving any speed up with large numbers of threads, since it's just oversubscribing. I'm trying again, letting test harness run two serial tests at once to see if our scaling curve gets back to what I expected.

@jrper
Contributor

jrper commented Jun 7, 2018

Also, at some point it would be good to have an (offline) discussion about what testing protocols we want to run, since I'm not sure the current one is optimal.

@tmbgreaves
Contributor Author

I'll have a look at the system spec for the workers - that does sound about right, as we're not provisioning that high for the docker host. We can easily add higher-spec nodes which run the tests in a shorter time and so cost about the same, I think. I'll have a look at options for that.
