add coupler summary table #706

Merged
merged 3 commits from js/table into main on May 11, 2024

Conversation

@juliasloan25 (Member) commented Mar 20, 2024

Purpose

closes #705

Status as of 4/23:

View the passing build in the coupler benchmarks pipeline here
View the output sent to Slack here

To do

  • get CPU runs working
  • check Slack output
  • verify SYPD calculations (make sure all units are seconds, compare to estimate from buildkite timing)
  • add individual rows for CPU allocations vs GPU allocations (shouldn't be compared directly)
  • include buildkite id in output
  • use 4 GPUs

To do in separate PR(s)

  • save output as .pdf rather than .txt - not a priority
    • use the LaTeX backend for PrettyTables.jl? (see the sketch after this list)
  • add atmos-only run without diagnostic edmf
  • add cpu/gpu state comparison columns
  • pass in config file to benchmark script
    • use this to get run names (instead of passing these in), resolution & dt (instead of hardcoding these in table)
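
For reference, here is a minimal sketch of what the PrettyTables.jl output could look like, including the possible LaTeX backend mentioned above. The file names, table contents, and column choices are placeholders for illustration only, not the actual benchmark output:

    import PrettyTables

    # placeholder summary data, for illustration only
    header = ["run name", "SYPD", "max RSS (GB)"]
    data = ["example GPU run" 1.23 45.6;
            "example CPU run" 0.12 34.5]

    # current approach: plain-text table written to a .txt file
    open("benchmark_summary.txt", "w") do io
        PrettyTables.pretty_table(io, data; header = header)
    end

    # possible follow-up: LaTeX backend output, which could be compiled to a PDF
    open("benchmark_summary.tex", "w") do io
        PrettyTables.pretty_table(io, data; header = header, backend = Val(:latex))
    end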

Note: I was using CUDA.memory_status to get the GPU memory usage, but this varied wildly between runs (3-30 GB). I'm not sure why, but it doesn't seem to be a reliable metric, so I won't be using it going forward. Here is the code block calling that function, in case we want to return to it in the future:

    if comms_ctx.device isa ClimaComms.CUDADevice
        # If `io` is `stdout`, this prints the memory status directly;
        # otherwise `io` is an `IOBuffer` and the output is captured for parsing below
        CUDA.memory_status(io)
        if io != stdout
            # `CUDA.memory_status` reports usage as "... (<used>/<total>)";
            # grab the used-memory portion by splitting on '(' and then '/'
            str = String(take!(io))
            println(str)
            gpu_allocs_GB = "GPU: " * split(split(str, '(')[2], '/')[1]
        end
    end

@juliasloan25 (Member Author)

In some runs, atmos outputs an SYPD that's much larger than the SYPD calculated here by timing the coupling loop. For example, in the "GPU slabplanet: albedo from function" run here, atmos prints an SYPD around 17, and we calculate it to be 8.15. I'm not sure where this difference is coming from, since I'm using CA.EfficiencyStats and CA.simulated_years_per_day to calculate SYPD in the coupler driver.

@LenkaNovak (Collaborator) left a comment

Thanks so much for implementing this, @juliasloan25 ! A super useful PR! 🎉 I just had a few comments / suggestions.

Outdated review comments (resolved) on:
  • .buildkite/benchmarks/pipeline.yml
  • config/model_configs/amip_diagedmf.yml
  • config/atmos_configs/aquaplanet_diagedmf.yml
@LenkaNovak (Collaborator) commented Apr 24, 2024

I'm also a little surprised by the SYPD for all the runs reported on Slack. For example, this build finished running 0.25 simulated years in ~0.25 wallclock days (including initialization), so the SYPD should be around 1. I think the config should be the same, but the Slack results imply that the benchmark runs are 100x slower. 🤔

@juliasloan25 (Member Author)

> In some runs, atmos outputs an SYPD that's much larger than the SYPD calculated here by timing the coupling loop. For example, in the "GPU slabplanet: albedo from function" run here, atmos prints an SYPD around 17, and we calculate it to be 8.15. I'm not sure where this difference is coming from, since I'm using CA.EfficiencyStats and CA.simulated_years_per_day to calculate SYPD in the coupler driver.

After speaking with Charlie, it sounds like this might be because we're using Base.@elapsed to time the coupling loop, which doesn't account for asynchronous CUDA kernels. I've changed it to use ClimaComms.@elapsed, which calls CUDA.@elapsed when we're running on GPU and should give more accurate numbers.
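
For context, here is a minimal sketch of the timing change, assuming ClimaComms' `@elapsed device expr` form, a `comms_ctx` with a `device` field, and the `solve_coupler!(cs)` entry point that appears later in this conversation. The SYPD arithmetic is written out by hand for illustration; the PR itself uses CA.EfficiencyStats and CA.simulated_years_per_day:

    import ClimaComms

    # `ClimaComms.@elapsed` dispatches on the device: on a `CUDADevice` it calls
    # `CUDA.@elapsed`, which synchronizes the GPU so asynchronous kernels are
    # included in the measured time (unlike `Base.@elapsed`)
    walltime = ClimaComms.@elapsed comms_ctx.device solve_coupler!(cs)

    # illustrative SYPD (simulated years per wallclock day) calculation,
    # with all quantities in seconds
    simulated_seconds = cs.tspan[2] - cs.tspan[1]
    sypd = (simulated_seconds / (365 * 86400)) / (walltime / 86400)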

@juliasloan25 force-pushed the js/table branch 5 times, most recently from 58e514c to d893b17 on April 25, 2024 at 20:19
@LenkaNovak (Collaborator) left a comment

Looks good, thank you! 🚀 I have a few comments, and the parsed args need a fix. Then, presuming all CI runs pass without changes from main, I think we can merge this.

Outdated review comment (resolved) on config/benchmark_configs/amip_diagedmf.yml
    ## run the coupled simulation
    solve_coupler!(cs);
    ## run the coupled simulation for one timestep to precompile everything before timing
    cs.tspan[2] = Δt_cpl
Collaborator

I presume Base.precompile(solve_coupler!, (typeof(cs),)) doesn't work here?

Member Author

I haven't tried it! I can give that a go.

Member Author

It works :)

Collaborator

Although I've just noticed that the latest run still reports the older SYPD estimate (which includes the precompilation), so we may need to revert to the previous version.

Member Author

I looked into this some more and chatted with Charlie (see the conversation on Slack here), and it sounds like precompile only works if it can infer all the types, and even then it may not actually compile all the code. I opened an issue for us to fix the type inference and switch to using precompile (#775), but for now I'll keep running for 2 timesteps before running the full simulation.
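
As a rough sketch of that two-timestep warm-up, building on the snippet above; how the full timespan is restored afterwards is an assumption here, not necessarily what the driver does:

    ## run the coupled simulation for two timesteps to compile everything before timing
    t_end = cs.tspan[2]              # remember the original end time (assumed layout)
    cs.tspan[2] = 2Δt_cpl
    solve_coupler!(cs);

    ## restore the full timespan, then run and time the real coupling loop
    cs.tspan[1] = 2Δt_cpl
    cs.tspan[2] = t_end
    walltime = ClimaComms.@elapsed comms_ctx.device solve_coupler!(cs)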

@juliasloan25 force-pushed the js/table branch 3 times, most recently from f11da89 to b7a8beb on May 3, 2024 at 06:00
@LenkaNovak self-requested a review on May 4, 2024 at 03:45
@LenkaNovak (Collaborator) left a comment

I'm approving this since the table works, but please do check the discrepancy between atmos/coupler allocations and that we're not seeing any behavioral changes on Buildkite once CI is fixed. Thanks for the great work, @juliasloan25 ! 🚀

@juliasloan25 (Member Author) commented May 10, 2024

> I'm approving this since the table works, but please do check the discrepancy between atmos/coupler allocations and that we're not seeing any behavioral changes on Buildkite once CI is fixed. Thanks for the great work, @juliasloan25 ! 🚀

I fixed CI and all the plots look the same as main 😄

I also changed the functions we're using to calculate allocations. Now there's one metric that's the same between CPU and GPU runs: the maximum CPU memory used over the course of the simulation (max RSS). The numbers don't look exactly like what I would expect (e.g. there are more allocations for the atmos-only run than the coupled run), but I think this is something we can easily change later on if we decide it isn't a reliable metric, and I don't want it to hold up this whole PR. I'll see if I can figure out what's going on with the allocation measurement tomorrow, but if nothing comes up I'm inclined to merge this as-is and investigate later as we trigger more runs.

Update: I added a GC.gc() call right before the coupling loop. Atmos has a GC call right before its solve, so this should produce more comparable results between the coupler and atmos. Calling GC.gc() may decrease the reported Sys.maxrss (see the Slack thread). It looks like this does produce allocation results more like what we would expect, but there's still some variation in allocations between runs.
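
For illustration, a minimal sketch of the max-RSS measurement described above, reusing the assumed `solve_coupler!(cs)` entry point; the real driver may arrange this differently:

    # collect garbage before the coupling loop, mirroring the GC call atmos makes
    # right before its solve, so the two max-RSS numbers are more comparable
    GC.gc()

    solve_coupler!(cs);

    # `Sys.maxrss()` returns the peak resident set size of the whole process in
    # bytes, so the same metric can be reported for CPU and GPU runs
    max_rss_GB = Sys.maxrss() / 1024^3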

@juliasloan25 force-pushed the js/table branch 3 times, most recently from 156e40d to 828487b on May 10, 2024 at 22:04
@juliasloan25 merged commit 98694c3 into main on May 11, 2024
8 of 9 checks passed
@juliasloan25 deleted the js/table branch on May 11, 2024 at 07:40