
CI needs upgrading #200

Closed
tmbgreaves opened this issue May 17, 2018 · 57 comments

Comments

@tmbgreaves
Contributor

tmbgreaves commented May 17, 2018

Buildbot 0.8.x isn't playing happily with CentOS 7 any more, and if we want to carry on using buildbot the config needs a rewrite to the 1.x syntax. Since most of the other projects we work with have moved over to using Jenkins, and Jenkins integrates far better with GitHub, I think attempting a migration to Jenkins is probably the best way forward.

I would welcome comments and feedback on this!

@jrper
Contributor

jrper commented May 17, 2018

Definitely no objections if that's the easiest path for you. How much of the UI sugar that Jenkins offers are you considering exposing? Looking at what firedrake currently offers, there are things I'd request around e.g. the test logging, although I realise everything has security implications.

@tmbgreaves
Contributor Author

I think it probably is - I've done three Jenkins deployments recently and not looked much at buildbot for the last six months, so I'm much more up to speed with Jenkins. I was planning to be as bells-and-whistles as possible for this, so I'm open to anything people would like included.

@stephankramer
Contributor

Sounds like the best way forward indeed. I'd also much prefer to stick with something we (and in particular you :-)) are already familiar with, if it is up to the job.

@tmbgreaves
Contributor Author

Making progress here:
https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity/activity

I think this is probably not a bad point to move the information in the docker repo to within Fluidity and simplify the builds a bit.

@tmbgreaves tmbgreaves self-assigned this May 17, 2018
@jrper
Contributor

jrper commented May 17, 2018

When you reach the stage that it's building, if you're willing to add python-junit.xml (not the almost identically named python-junitxml) to the packages being installed, then testharness should spit out JUnit-compatible .xml files summarising the results of the tests.

@tmbgreaves
Contributor Author

Updating on this - I think the Jenkins setup is mostly done now; I'm running through some tidying up and extending it to cover the set of builds we're interested in.

@jrper
Contributor

jrper commented May 23, 2018

Is now a good time for feedback/opinions/feature requests on the setup, or do you still have tweaks you know you need to do?

@tmbgreaves
Contributor Author

Now is a great time :-) I'm working on speeding it up a bit - things like starting the mediumtests at the same time as the unit and short tests. I'm also adding a parallel test stream for make makefiles, packaging, etc., and need to add the OpenMP and debug builders. Build time at the moment looks about the same as buildbot was giving.
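
For reference, a minimal sketch of that kind of fan-out in declarative pipeline syntax - the stage and make target names here are illustrative, not the actual Fluidity Jenkinsfile:

    // Sketch only: run the test streams as parallel stages so the medium tests
    // start at the same time as the unit and short tests.
    stage('Tests') {
        parallel {
            stage('Unit tests')   { steps { sh 'make unittest' } }
            stage('Short tests')  { steps { sh 'make test' } }
            stage('Medium tests') { steps { sh 'make mediumtest' } }
        }
    }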

@jrper
Contributor

jrper commented May 23, 2018

I think my points are all UI/UX related, and several of them would need consensus.

  1. Given our previous pattern on buildbot, we probably want runs where tests fail to be marked more clearly as not successful. Possibly even as failed.
  2. If it fits in with the structure of a parallel pipeline build, it would be nice to split setup, configuration, building and testing into separate Jenkins stages, so that the points of failure are more obvious on the cross tables.
  3. If enough disk space is available, it would be nice to automatically artifact out failing test directories; if not, could we have a readily downloadable version of the individual stage logs.
  4. If we're putting Jenkinsfiles into the git repository, it might be good to have a single user, local system setup as well, to trivialise the process of people rolling their own projects.

@stephankramer
Contributor

@jrper: Could you clarify no. 1? What do you mean by "not successful" vs. "failed", and how do you want them marked? Agree with the other points, in particular no. 2.

@tmbgreaves : for speeding up, I think it might really be worth improving the dumb scheduling testharness does with parallel tests. See my comments on #182. This might also improve reliability, as I suspect some intermittent failures may come down to bad luck, with testharness attempting to run all of the beefiest parallel jobs at the same time.

@jrper
Contributor

jrper commented May 23, 2018

Jenkins has a fairly fuzzy but extensible model of "code health", which is connected to, but not exactly the same as, the build result. If we want the buildbot behaviour back, then Jenkins can be made to understand the test results and mark a build in which one test failed as failed itself. Currently, it's marking any build in which the tests complete as successful, regardless of test failures.

@tmbgreaves
Contributor Author

Am I interpreting correctly that the way to deal with test failures being passed back up to Jenkins is to call testharness with --exit-failure-count ?

@tmbgreaves
Contributor Author

Re. splitting up into stages - I agree with @jrper and am figuring out the neatest way to split up the config/build/test.

@tmbgreaves
Contributor Author

There is also the question of how useful or not it is to rebuild the base container from scratch on a per-build basis. It's a fairly high-cost thing to do, which is why I'd extracted it from the routine build on buildbot; I'm tempted to just have one sitting on dockerhub, updated by cron on the Jenkins master, rather than wrapping it into Jenkins.

@jrper
Contributor

jrper commented May 23, 2018

As far as I remember, on the last iteration we made it so that --exit-failure-count means the testharness return code is the number of test failures. That provides coarse-grained coupling, but enough to fail the build when tests fail. If we want finer-grained coupling, you'd need to install python-junit.xml inside the containers and the JUnit plugin on the Jenkins host, copy the test_results.xml and test_results_medium.xml back to the host at the end of the test script, and include a

post {
    always {
        junit 'test_results.xml'
    }
}

in the Jenkinsfile.
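
For what it's worth, a sketch of how the two pieces fit together in a single test stage - the stage name, the testharness invocation and the result filename location are assumptions, not the project's actual Jenkinsfile:

    // Sketch only: coarse failure via the exit code, per-test results via the JUnit plugin.
    stage('Short tests') {
        steps {
            // --exit-failure-count makes the testharness exit code equal the number of
            // failed tests, so any failure marks the stage (and the build) as failed.
            sh './tools/testharness.py --exit-failure-count'
        }
        post {
            always {
                // Needs python-junit.xml inside the container so testharness writes the XML,
                // and the JUnit plugin on the Jenkins host to parse it.
                junit 'test_results.xml'
            }
        }
    }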

@stephankramer
Contributor

Ah OK, thanks, that clarifies things - yes, I agree any single test failure should be marked as an overall failure, and yes, if Jenkins decides that based on a nonzero exit code, then --exit-failure-count should work.

Is rebuilding the container the bit where it installs required packages starting from a bare image? If so, then yes, I don't see a need to redo that every time, as long as fluidity itself gets rebuilt. If the container is rebuilt, triggered by cron, maybe it's a good idea to have that trigger a master rebuild with the new container - and also to have a way of identifying, in the output of any build, which version of the container was used for it.

@jrper
Contributor

jrper commented May 23, 2018

Regarding rebuilding the containers, it's definitely fair to only update periodically. One idea that springs to mind is making the rebuild a separate (minimally wrapped, non-pipeline) Jenkins project, rather than a pure cron task. That way it might be easier to trigger by hand on the odd occasion that it's needed and @tmbgreaves is indisposed.
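
A rough sketch of what such a wrapper project could run, whether as a small pipeline or a freestyle job - the image name, tag scheme and schedule here are placeholders, not an existing setup:

    // Sketch of a standalone image-refresh job: nightly via cron, or by hand via "Build Now".
    pipeline {
        agent any
        triggers { cron('H 2 * * *') }
        stages {
            stage('Refresh base image') {
                steps {
                    // --pull picks up the latest upstream base layers; the date tag gives
                    // downstream builds a way to identify which container version they used.
                    sh '''
                       TAG=$(date +%Y%m%d)
                       docker build --pull -t fluidity/baseimage:$TAG docker/
                       docker push fluidity/baseimage:$TAG
                    '''
                }
            }
        }
    }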

@tmbgreaves
Contributor Author

Missing a trick here - pulling the base image and then updating it will, I think, both keep us up to date and make the build a lot more efficient.

@tmbgreaves
Contributor Author

Brain-dumping into the issue here to make sure I don't forget - if the pipeline repeatedly forks and rejoins, we need to ensure that workspaces are kept separate and consistently connected to the right OS for multi-OS builds.

@tmbgreaves
Contributor Author

tmbgreaves commented May 23, 2018

Additional note on this - trying to get internally parallel builds on parallel OS builds is causing all kinds of problems; it might be easier to have multiple Jenkins projects, one for each OS?

... which looks to be possible, with some hackery, to do from a single Jenkinsfile ...
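
For the single-Jenkinsfile route, the shape would be roughly as follows, with each OS pinned to its own docker agent and hence its own workspace - the image names are placeholders:

    // Sketch only: per-OS parallel stages, each with its own docker agent and workspace.
    pipeline {
        agent none
        stages {
            stage('Build and test') {
                parallel {
                    stage('Xenial') {
                        agent { docker { image 'fluidity/xenial-build' } }
                        steps { sh 'make && make test' }
                    }
                    stage('CentOS 7') {
                        agent { docker { image 'fluidity/centos7-build' } }
                        steps { sh 'make && make test' }
                    }
                }
            }
        }
    }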

@tmbgreaves
Contributor Author

There is a completed run on xenial now.

@jrper
Contributor

jrper commented May 24, 2018

Do you want eyes on the test failures it's thrown up, or are they known issues with the environment?

Regarding the log files, it looks like adding OMPI_MCA_btl=^openib to the shell environment will prevent OpenMPI from trying to use InfiniBand when it can't. If anyone knows a good way to silence the warnings due to the length of /proc/mounts in docker, then I'd like to hear it.

@jrper
Contributor

jrper commented May 24, 2018

Also, for the convenience of those watching part time, the new link is https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/activity

@stephankramer
Contributor

Just a little note: make -j8 test and make -j8 mediumtest are probably not doing what you expect. The -j8 just parallelises over targets as make sees them, but from make's perspective there is only a single testharness call to be executed. If you want multiple tests to run simultaneously you need make THREADS=8 test, but that comes with the dumb-scheduling problem in testharness: it will run 8 tests simultaneously regardless of how many cores each one of them uses.

@jrper
Contributor

jrper commented May 24, 2018

Regarding point 3) in my wishlist above, and as a point of negotiation, in 7ab620a I've added a commit which updates testharness with a new option, -a or --archive-failures, which takes a filename and writes the directories of failing tests to a gzipped tarball of that name. It also modifies the root makefile, adding a new variable, TESTHARNESS_OPTIONS, to expose this.

Usage would be

make TESTHARNESS_OPTIONS="-a short_test_failures.tar.gz" test
make TESTHARNESS_OPTIONS="-a medium_test_failures.tar.gz" mediumtest

post {
    always {
        archiveArtifacts '*test_failures.tar.gz'
    }
}

I believe there is finer scale control available over how much and how long things get kept for, but I'm not up to full speed with pipeline builds.

@drhodrid
Contributor

drhodrid commented May 24, 2018 via email

@jrper
Contributor

jrper commented May 25, 2018

Having looked at the threading in testharness a bit, I'm not sure that, as written, it actually does much except speed up I/O, due to the Python global interpreter lock. I'm having a go at testing a migration from the threading module to multiprocessing, and it's promising, but it might need another pair of eyes or two.

@drhodrid for pure testharness runs, I'm confident your request 1) is doable. The code already replaces the mpiexec in tests with mpiexec -n <nprocs>, so provided the other launcher has an equivalent syntax, this is just a matter of adding another option. I'm willing to attempt the job if you'll undertake to test it. On the other hand, if you're using PBS scripts, it's your own problem.

@jrper
Contributor

jrper commented May 28, 2018

As far as I can tell from the documentation, switching to a managed queue only makes sense if there's a desire to test using a cluster in multi-node configuration. This isn't really the use case anyone has put testharness to so far, which is basically single machine, single process (due to exactly the issues you note). In testing, I found that the multiprocessing.Pool object just wasn't versatile enough for the problem we have here, where the subprocesses themselves can each spin off multiple MPI processes. The remaining difficulty is mostly scheduling. It's possible the right answer is to start with the parallel jobs and then spin up a pool of serial workers as they complete, but that needs a load of testing I'm not sure I have time for.

@drhodrid See afa8f96 for the changes to make the choice of MPI launcher a variable. This adds a new option, --mpi-launcher or -m, used like `testharness -m "srun -n"`. In general, the launcher being substituted has to take a space and then an integer specifying the number of processes.

@tmbgreaves
Contributor Author

Coming back to this now after getting derailed onto opesci for a while. I think this is down to two areas of test failures, both of which pass when run interactively in the same container, so it is probably some artifact of the Jenkins environment. Poking at it ...

@tmbgreaves
Contributor Author

Looks like these are OpenMPI issues with /proc/mounts containing very long entries as a result of the Jenkins overlays, and the OpenMPI error is polluting tests which check error files.

@tmbgreaves
Contributor Author

Attempting a workaround for this with some creative use of sed in the test run command.

@tmbgreaves
Contributor Author

That 'fixes' the netcdf test.

@tmbgreaves
Contributor Author

Working my way back up the comments - @drhodrid - very happy to work on getting some testing on the ANU systems. I will think through what's involved; it sounds like it could be easier for you to set up something local which interfaces with GitHub to post build outcomes, as opposed to trying to tie it into the existing IC-based master, which would need to get through the ANU firewall to talk to your builders.

@tmbgreaves
Contributor Author

tmbgreaves commented May 31, 2018

Completely baffled by the three remaining failures on Ubuntu, which I've failed to replicate outside Jenkins - so I have disabled them for the moment pending more detailed debugging. If anyone has ideas, please take a look at the run which focuses on these errors - https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/detail/jenkins-fix-failing-tests/4/pipeline/30
.... any suggestions gratefully received.

@Patol75
Contributor

Patol75 commented Jun 1, 2018

@drhodrid
Contributor

drhodrid commented Jun 1, 2018 via email

@jrper
Contributor

jrper commented Jun 1, 2018

@tmbgreaves Looking at the results of the failing tests, it looks like the C++ code to write the backtrace has flushed to disk, but for some reason the Fortran code before it hasn't. Digging deeper, the assumption appears to have been made that the Fortran error unit is 0, which I guess might not be true inside docker inside Jenkins. With most recent versions of gfortran we can get these units from iso_fortran_env; see d177b25.

The other conceivable possibility is that a buffer isn't getting flushed somewhere, but that gets more involved.

@tmbgreaves
Contributor Author

@drhodrid - That sounds great. Once we have a more solid setup on Jenkins I am very happy to work on making it happen. As a side note - I'm not sure if/how we'll get ICHPC tied into the new Jenkins setup, as our buildbot methods really don't suit the Jenkins configuration. I'm considering leaving a skeleton buildbot up to keep track of ICHPC builds and the longtests.

@jrper - thanks for looking at it - how recent does gfortran need to be to get that update? I guess since we're seeing issues with post-gcc5 in other areas this is something that will stay on the wish-list for the moment?

@tmbgreaves
Contributor Author

I've filed a PR for the initial Jenkins setup - if there aren't any major objections to how I've set it up, I'd be keen to get it into master so we have some CI in place, after which I'll re-branch and set up a more comprehensive testing suite.

@tmbgreaves
Contributor Author

@Patol75 - many thanks for that tip; it has cleaned up the 'missing infiniband' output :-)

@jrper
Contributor

jrper commented Jun 3, 2018

Digging further, that seems to be one of the bits of Fortran 2003 which actually made it into gfortran 4, so I think we're OK for any compiler we'd like to claim we support. Regardless, if you're willing to test whether that change alone is enough to sort out that problem, I'm willing to harden the changes against incompatible compilers. More importantly, if it doesn't fix the issue, then we have to treat the error logs on the Jenkins builds as always potentially incomplete.

@jrper
Contributor

jrper commented Jun 4, 2018

Brief update: I can get the same test failures locally in a Jenkins + docker setup, so that seems to be the necessary combination. Unfortunately the iso_fortran_env change isn't enough to clear the error.

@stephankramer
Contributor

stephankramer commented Jun 4, 2018

FWIW: I can reproduce the failure locally if I redirect the output of testharness (i.e. ../tools/testharness.py -f unresolvable_mesh_dependency.xml >& out). The problem for me seems to go away if I put in

  flush(debug_error_unit)
  flush(debug_log_unit)

right after the ewrites in flexit_pinpoint and flabort_pinpoint.

@jrper
Contributor

jrper commented Jun 4, 2018

That seems to be enough inside Jenkins as well and is probably A Good Thing™ to do. The downside is that it would only get called on an abort/exit call, unless we modify the ewrite macro itself.

@jrper
Contributor

jrper commented Jun 4, 2018

Two minutes of googling suggests that we also want to set the environment variable GFORTRAN_UNBUFFERED_ALL=1 to force unbuffered I/O in all situations.
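
Between this and the OMPI_MCA_btl suggestion above, both workarounds could be set once for every stage via an environment block - a sketch, assuming a declarative top level; the agent and stage shown are placeholders:

    // Sketch: set both workaround variables for all stages of a declarative pipeline.
    pipeline {
        agent any
        environment {
            // Stop OpenMPI probing for InfiniBand inside the containers.
            OMPI_MCA_btl = '^openib'
            // Force gfortran to run unbuffered so error logs are complete when a test aborts.
            GFORTRAN_UNBUFFERED_ALL = '1'
        }
        stages {
            stage('Test') {
                steps { sh 'make test' }
            }
        }
    }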

@tmbgreaves
Contributor Author

Thanks for the suggestions - I'll act on them for the follow-up Jenkins tweaking PR.

@jrper
Contributor

jrper commented Jun 6, 2018

@tmbgreaves Since you've announced CI as open via email, is there anything we should do to get PRs migrated?

@tmbgreaves
Contributor Author

It might need the Jenkinsfile added to the branch, but I was hoping it wouldn't, as Jenkins tests the branch-into-master merged source tree, which will contain a Jenkinsfile. It won't trigger a build immediately, since there's no change to the PR, but I suspect a repo rescan will sort that out - I'll do one now.

@tmbgreaves
Contributor Author

Having poked this, it looks like the Jenkinsfile needs to be merged into a branch before it will trigger a build of the associated PR.

@stephankramer
Contributor

OK, that sounds reasonable. PRs should ideally be more or less up to date with master anyway.

@jrper
Contributor

jrper commented Jun 6, 2018

That seems to have worked automatically using #198 as a guinea pig. Unfortunately, but correctly, it appears the merge invalidates code review checks, so @stephankramer is there any chance you can say yes to it again? Assuming it passes, is it worth me wrapping up the changeset to make testharness properly parallel into a fresh PR next, and if so, how many serial threads do we want to target on buildbot?

@tmbgreaves
Contributor Author

I think running four at a time is fairly safe. I would be up for trying eight and seeing if it fell over; the nodes we're testing on are reasonably solid, so here's hoping....

tmbgreaves added a commit that referenced this issue Jun 6, 2018
Applying suggestion from @jrper in issue #200
@jrper
Contributor

jrper commented Jun 7, 2018

@tmbgreaves I'm not sure if this is me misunderstanding something, but it appears that the docker agents on Jenkins are all set up as single physical core/dual logical core containers. This would explain why switching to the multiprocessing module wasn't actually giving any speed up with large numbers of threads, since it's just oversubscribing. I'm trying again, letting test harness run two serial tests at once to see if our scaling curve gets back to what I expected.

@jrper
Contributor

jrper commented Jun 7, 2018

Also, at some point it would be good to have an (offline) discussion about what testing protocols we want to run, since I'm not sure the current one is optimal.

@tmbgreaves
Contributor Author

I'll have a look at the system spec for the workers - that does sound about right, as we're not provisioning that high for the docker host. We can easily add higher-spec nodes which run the tests in a shorter time and so cost about the same, I think. I'll have a look at options for that.
