Improvements to longruns pipeline #634

Merged
merged 5 commits into from
Feb 22, 2024

Sbozzolo
Member

This PR:

  • Removes the longrun depot. This avoids potential depot corruption.
  • Switches from mpiexec to srun for a more reliable MPI experience.
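
As a rough illustration of the second bullet, a Buildkite step launched through srun might look like the sketch below. The step label, the paths, and the `slurm_ntasks` agent key are assumptions made for illustration; they are not copied from this PR's diff.

```yaml
steps:
  - label: "example MPI longrun"  # hypothetical step name
    # Launch the MPI ranks with srun instead of mpiexec.
    # Assumes the step runs inside a Slurm allocation that defines the task count.
    command: "srun julia --project=path/to/env path/to/driver.jl"  # hypothetical paths
    agents:
      slurm_ntasks: 4  # assumed agent tag requesting 4 tasks
```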

Collaborator

@LenkaNovak LenkaNovak left a comment

@LenkaNovak
Collaborator

LGTM, just checked this fix here: https://buildkite.com/clima/climacoupler-longruns/builds/448

Actually, even though the init is now passing, it looks like we need to constrain the Plots dependency again?

@Sbozzolo
Member Author

I cleaned up a couple of things in the manifest and it looks like it's working:

https://buildkite.com/clima/climacoupler-longruns/builds/449

@szy21
Member

szy21 commented Feb 21, 2024

There seem to be some artifact errors: https://buildkite.com/clima/climacoupler-longruns/builds/449#018dc853-7e02-4d95-a6d0-8e9acaeb0b47/219-8134

@szy21
Member

szy21 commented Feb 21, 2024

Also all the breaking simulations hang. Maybe we need to add SLURM_KILL_BAD_EXIT: 1 to the pipeline?

@Sbozzolo
Member Author

> Also all the breaking simulations hang. Maybe we need to add SLURM_KILL_BAD_EXIT: 1 to the pipeline?

Added
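
For context, `SLURM_KILL_BAD_EXIT` is the environment-variable form of srun's `--kill-on-bad-exit` option. A minimal sketch of where it could sit in a Buildkite pipeline file, assuming a top-level `env` block (the conversation does not show where it was actually placed):

```yaml
env:
  # Ask srun to terminate the remaining tasks of a step as soon as one task
  # exits with a non-zero code, so a crashed simulation fails instead of hanging.
  SLURM_KILL_BAD_EXIT: 1
```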

> There seem to be some artifact errors: https://buildkite.com/clima/climacoupler-longruns/builds/449#018dc853-7e02-4d95-a6d0-8e9acaeb0b47/219-8134

This seems like a random filesystem failure. I am trying again.

@Sbozzolo
Member Author

ArgumentError: '/central/scratch/esm/slurm-buildkite/climacoupler-longruns/450/depot/default/artifacts/845995bb777cf5a3920541585d62c087f62e5cb5

I think this is a race condition on downloading artifacts :(

https://buildkite.com/clima/climacoupler-longruns/builds/450#018dcc4a-c9ef-4d0d-84b0-2c7ec9ccddb5

@LenkaNovak
Collaborator

> ArgumentError: '/central/scratch/esm/slurm-buildkite/climacoupler-longruns/450/depot/default/artifacts/845995bb777cf5a3920541585d62c087f62e5cb5
>
> I think this is a race condition on downloading artifacts :(
>
> https://buildkite.com/clima/climacoupler-longruns/builds/450#018dcc4a-c9ef-4d0d-84b0-2c7ec9ccddb5

Hmm, I haven't seen this one before. Maybe we should deal with the topo files before running the job like the rest of the artifacts here, rather than on the fly when initiating atmos?

@Sbozzolo
Member Author

> > ArgumentError: '/central/scratch/esm/slurm-buildkite/climacoupler-longruns/450/depot/default/artifacts/845995bb777cf5a3920541585d62c087f62e5cb5
> > I think this is a race condition on downloading artifacts :(
> > https://buildkite.com/clima/climacoupler-longruns/builds/450#018dcc4a-c9ef-4d0d-84b0-2c7ec9ccddb5
>
> Hmm, I haven't seen this one before. Maybe we should deal with the topo files before running the job like the rest of the artifacts here, rather than on the fly when initiating atmos?

Ideally we should fix that (xref: CliMA/ClimaLand.jl#467), but yes, best to ensure that all the artifacts are downloaded before starting the simulations for the time being.
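
One way to arrange this in a Buildkite pipeline is an init step followed by a `wait` barrier, so that downloads finish before any simulation starts. This is only a sketch: the step labels and paths are hypothetical, and it assumes the required artifacts are non-lazy, so that `Pkg.instantiate()` fetches them. Lazy artifacts would need an explicit download in the init step.

```yaml
steps:
  - label: "init: instantiate and download artifacts"  # hypothetical step name
    command:
      # Runs once, before any simulation, so parallel jobs do not race on downloads.
      # Assumes non-lazy artifacts, which Pkg.instantiate() installs.
      - "julia --project=path/to/env -e 'using Pkg; Pkg.instantiate(; verbose = true)'"

  # Barrier: the steps below start only after the init step has succeeded.
  - wait

  - label: "example simulation"  # hypothetical step name
    command: "srun julia --project=path/to/env path/to/driver.jl"  # hypothetical paths
```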

@Sbozzolo
Member Author

Okay, this time everything is looking good:

https://buildkite.com/clima/climacoupler-longruns/builds/451

@Sbozzolo Sbozzolo merged commit 27b07fa into main Feb 22, 2024
8 of 9 checks passed