add GPU DYAMOND runs #659

Merged
merged 1 commit into from
Mar 9, 2024

Member

@juliasloan25 juliasloan25 commented Mar 1, 2024

Purpose

closes #658

This PR adds only a longrun, not a shortrun.
The run exceeds the memory available on P100s. According to https://www.hpc.caltech.edu/resources, Caltech's V100s come in 16 GB and 32 GB variants, neither of which is large enough for this job. Instead of running on central like the other longruns, this job will run on clima, which has A100s with 80 GB of memory. I've opened #683 to address the allocations seen in this run.

View the run on Buildkite: https://buildkite.com/clima/climacoupler-longruns/builds/480#_

Content

  • add a longrun config file based on config/longrun_configs/dyamond_target.yml
    • set anim: false for GPU compatibility
  • add a longrun that uses the GPU
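
As a rough illustration of the config change listed above, the new longrun config might contain entries like the following. This is a sketch only: the four quoted keys are the ones visible in the review diff of this PR and anim: false comes from the list above, while everything inherited from dyamond_target.yml is elided and not reproduced here.

```yaml
# Sketch of the GPU longrun config; keys inherited from
# config/longrun_configs/dyamond_target.yml are elided.
anim: false                      # animations disabled for GPU compatibility
# ... remaining keys as in dyamond_target.yml ...
monthly_checkpoint: false
run_name: "gpu_dyamond_target"
start_date: "19790301"
t_end: "1days"
```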

  • I have read and checked the items on the review checklist.

@LenkaNovak LenkaNovak self-requested a review March 4, 2024 18:33
Collaborator

@LenkaNovak LenkaNovak left a comment


Is there a way we could request an H100 for this job only? I don't think the allocation enhancements will be addressed anytime soon. If that's possible, I would suggest commenting out the regular CI job for now, but retaining the longrun one, which we only run once a week on Sundays.

@juliasloan25
Member Author

ClimaAtmos has a separate Buildkite pipeline that runs target GPU simulations on clima (see the runs and the pipeline.yml itself). I can implement the same thing for us.

@LenkaNovak
Collaborator

LenkaNovak commented Mar 5, 2024

Does this allow us to specify the hardware for just one run though?

@juliasloan25
Member Author

No, it would be a separate pipeline in which this job runs. I think this will be useful for GPU scaling runs too.
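
For illustration, a step in such a separate pipeline could look roughly like the sketch below. It is not the ClimaAtmos or ClimaCoupler pipeline.yml. The label and key are made up, the command is a placeholder, and the `slurm_gpus` agent tag is an assumption about how the cluster's Buildkite agents are configured. Only the `steps`, `label`, `key`, `command`, and `agents` fields are standard Buildkite pipeline syntax.

```yaml
# Hypothetical sketch of a dedicated GPU pipeline step.
steps:
  - label: "GPU DYAMOND target longrun"   # illustrative label
    key: "gpu_dyamond_target"
    # Placeholder: replace with the real command that launches the run.
    command: "<command that runs the coupled simulation with the GPU config>"
    agents:
      queue: clima       # run on clima (A100s with 80 GB) instead of central
      slurm_gpus: 1      # assumed agent tag for requesting one GPU
```

Because the hardware is selected per pipeline here rather than per run, every step added to this pipeline would land on the same cluster, which matches the answer above.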

@juliasloan25 juliasloan25 force-pushed the js/gpu-dyamond branch 2 times, most recently from 4feeb75 to 2b7b7c1 Compare March 6, 2024 02:02
Collaborator

@LenkaNovak LenkaNovak left a comment


LGTM, thank you, @juliasloan25. Just had a question about the sim length.

monthly_checkpoint: false
run_name: "gpu_dyamond_target"
start_date: "19790301"
t_end: "1days"
Collaborator

If it's a long run, could we run it for longer (e.g. 50 days) or does the simulation crash? 👀

@juliasloan25 juliasloan25 merged commit f784726 into main Mar 9, 2024
9 checks passed
@juliasloan25 juliasloan25 deleted the js/gpu-dyamond branch March 9, 2024 04:51
Development

Successfully merging this pull request may close these issues.

run DYAMOND on GPU
2 participants