add coupler summary table #706

Merged
merged 3 commits from js/table into main on May 11, 2024

Conversation

@juliasloan25 (Member) commented Mar 20, 2024

Purpose

closes #705

Status as of 4/23:

View the passing build in the coupler benchmarks pipeline here
View the output sent to Slack here

To do

  • get CPU runs working
  • check Slack output
  • verify SYPD calculations (make sure all units are seconds, compare to estimate from buildkite timing)
  • add individual rows for CPU allocations vs GPU allocations (shouldn't be compared directly)
  • include buildkite id in output
  • use 4 GPUs

To do in separate PR(s)

  • save output as .pdf rather than .txt - not a priority
    • use the LaTeX backend for PrettyTables.jl? (see the sketch after this list)
  • add atmos-only run without diagnostic edmf
  • add cpu/gpu state comparison columns
  • pass in config file to benchmark script
    • use this to get run names (instead of passing these in), resolution & dt (instead of hardcoding these in table)
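
For reference, here is a minimal sketch of what the PrettyTables.jl output could look like, including the possible LaTeX backend mentioned above. The file names, table contents, and column choices are placeholders for illustration only, not the actual benchmark output:

    import PrettyTables

    # placeholder summary data, for illustration only
    header = ["run name", "SYPD", "max RSS (GB)"]
    data = ["example GPU run" 1.23 45.6;
            "example CPU run" 0.12 34.5]

    # current approach: plain-text table written to a .txt file
    open("benchmark_summary.txt", "w") do io
        PrettyTables.pretty_table(io, data; header = header)
    end

    # possible follow-up: LaTeX backend output, which could be compiled to a PDF
    open("benchmark_summary.tex", "w") do io
        PrettyTables.pretty_table(io, data; header = header, backend = Val(:latex))
    end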

Note: I was using CUDA.memory_status to get the GPU memory usage, but this varied wildly between runs (3-30 GB). I'm not sure why, but it doesn't seem to be a reliable metric, so I won't be using it going forward. Here is the code block calling that function, in case we want to return to it in the future:

    if comms_ctx.device isa ClimaComms.CUDADevice
        # If `io` is `stdout`, this prints the memory status directly;
        # otherwise `io` is an `IOBuffer` and the output is captured for parsing below
        CUDA.memory_status(io)
        if io != stdout
            # `CUDA.memory_status` reports usage as "... (<used>/<total>)";
            # grab the used-memory portion by splitting on '(' and then '/'
            str = String(take!(io))
            println(str)
            gpu_allocs_GB = "GPU: " * split(split(str, '(')[2], '/')[1]
        end
    end

@juliasloan25 (Member Author)

In some runs, atmos outputs an SYPD that's much larger than the SYPD calculated here by timing the coupling loop. For example, in the "GPU slabplanet: albedo from function" run here, atmos prints an SYPD around 17, and we calculate it to be 8.15. I'm not sure where this difference is coming from, since I'm using CA.EfficiencyStats and CA.simulated_years_per_day to calculate SYPD in the coupler driver.

@LenkaNovak (Collaborator) left a comment

Thanks so much for implementing this, @juliasloan25 ! A super useful PR! 🎉 I just had a few comments / suggestions.

Outdated review comments (resolved) on:
  • .buildkite/benchmarks/pipeline.yml
  • config/model_configs/amip_diagedmf.yml
  • config/atmos_configs/aquaplanet_diagedmf.yml
@LenkaNovak (Collaborator) commented Apr 24, 2024

I'm also a little surprised by the SYPD for all the runs reported on Slack. For example, this build finished running 0.25 simulated years in ~0.25 wallclock days (including initialization), so the SYPD should be around 1. I think the config should be the same, but the Slack results imply that the benchmark runs are 100x slower. 🤔

@juliasloan25 (Member Author)

> In some runs, atmos outputs an SYPD that's much larger than the SYPD calculated here by timing the coupling loop. For example, in the "GPU slabplanet: albedo from function" run here, atmos prints an SYPD around 17, and we calculate it to be 8.15. I'm not sure where this difference is coming from, since I'm using CA.EfficiencyStats and CA.simulated_years_per_day to calculate SYPD in the coupler driver.

After speaking with Charlie, it sounds like this might be because we're using Base.@elapsed to time the coupling loop, which doesn't account for asynchronous CUDA kernels. I've changed it to use ClimaComms.@elapsed, which calls CUDA.@elapsed when we're running on GPU and should give more accurate numbers.
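
For context, here is a minimal sketch of the timing change, assuming ClimaComms' `@elapsed device expr` form, a `comms_ctx` with a `device` field, and the `solve_coupler!(cs)` entry point that appears later in this conversation. The SYPD arithmetic is written out by hand for illustration; the PR itself uses CA.EfficiencyStats and CA.simulated_years_per_day:

    import ClimaComms

    # `ClimaComms.@elapsed` dispatches on the device: on a `CUDADevice` it calls
    # `CUDA.@elapsed`, which synchronizes the GPU so asynchronous kernels are
    # included in the measured time (unlike `Base.@elapsed`)
    walltime = ClimaComms.@elapsed comms_ctx.device solve_coupler!(cs)

    # illustrative SYPD (simulated years per wallclock day) calculation,
    # with all quantities in seconds
    simulated_seconds = cs.tspan[2] - cs.tspan[1]
    sypd = (simulated_seconds / (365 * 86400)) / (walltime / 86400)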

@juliasloan25 force-pushed the js/table branch 5 times, most recently from 58e514c to d893b17 on April 25, 2024 at 20:19
@LenkaNovak (Collaborator) left a comment

Looks good, thank you! 🚀 I have a few comments, and the parsed args need a fix. Then, presuming all CI runs pass without changes from main, I think we can merge this.

Outdated review comment (resolved) on config/benchmark_configs/amip_diagedmf.yml
    ## run the coupled simulation
    solve_coupler!(cs);
    ## run the coupled simulation for one timestep to precompile everything before timing
    cs.tspan[2] = Δt_cpl
Collaborator

I presume Base.precompile(solve_coupler!, (typeof(cs),)) doesn't work here?

Member Author

I haven't tried it! I can give that a go.

Member Author

It works :)

Collaborator

Although I've just noticed that the latest run still reports the older SYPD estimate (which includes the precompilation), so we may need to revert to the previous version.

Member Author

I looked into this some more and chatted with Charlie (see the conversation on Slack here), and it sounds like precompile only works if it can infer all the types, and even then it may not actually compile all the code. I opened an issue for us to fix the type inference and switch to using precompile (#775), but for now I'll keep running for 2 timesteps before running the full simulation.
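
As a rough sketch of that two-timestep warm-up, building on the snippet above; how the full timespan is restored afterwards is an assumption here, not necessarily what the driver does:

    ## run the coupled simulation for two timesteps to compile everything before timing
    t_end = cs.tspan[2]              # remember the original end time (assumed layout)
    cs.tspan[2] = 2Δt_cpl
    solve_coupler!(cs);

    ## restore the full timespan, then run and time the real coupling loop
    cs.tspan[1] = 2Δt_cpl
    cs.tspan[2] = t_end
    walltime = ClimaComms.@elapsed comms_ctx.device solve_coupler!(cs)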

@juliasloan25 force-pushed the js/table branch 3 times, most recently from f11da89 to b7a8beb on May 3, 2024 at 06:00
@LenkaNovak self-requested a review on May 4, 2024 at 03:45
@LenkaNovak (Collaborator) left a comment

I'm approving this since the table works, but please do check the discrepancy between atmos/coupler allocations and that we're not seeing any behavioral changes on Buildkite once CI is fixed. Thanks for the great work, @juliasloan25 ! 🚀

@juliasloan25 (Member Author) commented May 10, 2024

> I'm approving this since the table works, but please do check the discrepancy between atmos/coupler allocations and that we're not seeing any behavioral changes on Buildkite once CI is fixed. Thanks for the great work, @juliasloan25 ! 🚀

I fixed CI and all the plots look the same as main 😄

I also changed the functions we're using to calculate allocations. Now there's one metric that's the same between CPU and GPU runs: the maximum CPU memory used over the course of the simulation (max RSS). The numbers don't look exactly like what I would expect (e.g. there are more allocations for the atmos-only run than the coupled run), but I think this is something we can easily change later on if we decide it isn't a reliable metric, and I don't want it to hold up this whole PR. I'll see if I can figure out what's going on with the allocation measurement tomorrow, but if nothing comes up I'm inclined to merge this as-is and investigate later as we trigger more runs.

Update: I added a GC.gc() call right before the coupling loop. Atmos has a GC call right before its solve, so this should produce more comparable results between the coupler and atmos. Calling GC.gc() may decrease the reported Sys.maxrss (see the Slack thread). It looks like this does produce allocation results more like what we would expect, but there's still some variation in allocations between runs.
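
For illustration, a minimal sketch of the max-RSS measurement described above, reusing the assumed `solve_coupler!(cs)` entry point; the real driver may arrange this differently:

    # collect garbage before the coupling loop, mirroring the GC call atmos makes
    # right before its solve, so the two max-RSS numbers are more comparable
    GC.gc()

    solve_coupler!(cs);

    # `Sys.maxrss()` returns the peak resident set size of the whole process in
    # bytes, so the same metric can be reported for CPU and GPU runs
    max_rss_GB = Sys.maxrss() / 1024^3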

@juliasloan25 force-pushed the js/table branch 3 times, most recently from 156e40d to 828487b on May 10, 2024 at 22:04
@juliasloan25 merged commit 98694c3 into main on May 11, 2024
8 of 9 checks passed
@juliasloan25 deleted the js/table branch on May 11, 2024 at 07:40