
Add mesh connectivity in atlas #983

Merged
travissluka merged 20 commits into develop from feature/atlas on Dec 12, 2023

Conversation

@travissluka
Contributor

@travissluka travissluka commented Dec 6, 2023

Description

"evaporates the point cloud functionspace" 🌧️

Builds a proper atlas functionspace with mesh connectivity.

  • The Geometry constructor has been changed to pull the mesh information from the Fortran side, then pass the constructed functionspace back to Fortran
  • Removed code that is no longer needed (e.g. toFieldSetAD(), passing lonlat from Fortran to C++, and the workarounds previously needed to avoid duplicate halo points)
  • Added an option to save gmsh data when the grid is generated so that it can be viewed with gmsh (see images below, and gridgen.yml)
  • Various other minor changes required by the first two bullet points
  • Reference files change because the interpolation is slightly different

Testing

Note: spack-stack 1.5.1 is required for these branches to work.

Here is what the 5 deg grid looks like on 16 PEs:
[image]

And if you're curious how weird the tripolar fold looks, here is one PE with halos
[image]

  • ctests pass
  • tested with regional HAT10 on Orion
  • tested with global 1/4 deg

Dependencies

build-group=https://github.com/JCSDA-internal/oops/pull/2465
build-group=https://github.com/JCSDA-internal/saber/pull/729

@travissluka travissluka added the waiting for another PR waiting on another pull request to be merged first label Dec 6, 2023
@travissluka travissluka self-assigned this Dec 6, 2023
@kbhargava
Contributor

I was able to build this on Orion after soca-science/#625, which updates the spack-stack version. All the soca ctests pass. I am testing with HAT10 now and will update soon.

Contributor

@fmahebert fmahebert left a comment


Mine is not an expert review, but these code changes look as expected to me

@travissluka
Contributor Author

@hga007 FYI, in case any of this is useful to you. Yours should be more straightforward since you don't have the northern tripolar fold or cyclic E/W boundaries to deal with. I have code here that does extra checking to see which duplicate halo points need to be removed before being passed to atlas... but you shouldn't have any duplicates to worry about.

@hga007

hga007 commented Dec 7, 2023

@travissluka, Thank you! Indeed, it will be easier. I looked at your code and think it will be easiest if I follow your approach. You did excellent work implementing this feature. Cheers.

@travissluka
Contributor Author

Note: tests pass, except for clang. Not sure why; I don't have clang... ignore that one for now.

@travissluka
Contributor Author

Note to self: I disabled the FPE trapping on the diffusion dirac test. I'm not entirely sure why that's failing on clang only (see https://cdash.jcsda.org/test/186729). Odds are, like the other FPE issues, it's something weird going on in a region of the halo that I don't care about, or on land. I'm not opening an issue because it's already on my todo list to someday go through and fix and remove as many NOTRAPFPE directives as possible in our ctests... someday.

@kbhargava
Contributor

The regional experiment fails with the following error

Assertion failed: Assertion [part(209) < proc.size()] failed
part(209) = 1079164896
proc.size() = 480 in build_nodes_remote_idx, line 311 of /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-ecmwf-atlas-0.35.0-mkl256cnvwvp6xcrywhjqsnofzkarieb/spack-src/src/atlas/mesh/actions/BuildParallelFields.cc
Assertion failed: Assertion [part(15) < proc.size()] failed
part(15) = 1062232653
proc.size() = 480 in build_nodes_remote_idx, line 311 of /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-ecmwf-atlas-0.35.0-mkl256cnvwvp6xcrywhjqsnofzkarieb/spack-src/src/atlas/mesh/actions/BuildParallelFields.cc
[Orion-19-69:335639:0:335639] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffebe91d174)

@travissluka
Contributor Author

The regional experiment fails with the following error

Assertion failed: Assertion [part(209) < proc.size()] failed
part(209) = 1079164896
proc.size() = 480 in build_nodes_remote_idx, line 311 of /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-ecmwf-atlas-0.35.0-mkl256cnvwvp6xcrywhjqsnofzkarieb/spack-src/src/atlas/mesh/actions/BuildParallelFields.cc
Assertion failed: Assertion [part(15) < proc.size()] failed
part(15) = 1062232653
proc.size() = 480 in build_nodes_remote_idx, line 311 of /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-ecmwf-atlas-0.35.0-mkl256cnvwvp6xcrywhjqsnofzkarieb/spack-src/src/atlas/mesh/actions/BuildParallelFields.cc
[Orion-19-69:335639:0:335639] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffebe91d174)

That's weird, me thinking "eh, it should work, I don't need to test it with regional!" is usually never wrong!!

@travissluka
Contributor Author

@fmahebert I could look at your code, but it's probably faster for me to ask you: for the regional FV3, with the "halo" points that aren't owned by anyone because they are at the boundary of the regional domain, do you just assign them to the PE for the patch that is sitting closest to those "boundary halo-ish" points?

TEST_DEPENDS test_soca_gridgen
test_soca_static_socaerror_init)

# NOTE the following test throws an FPE only on clang
Contributor


In the past I've seen clang-specific FPEs arising from clang's choice to optimize by speculatively evaluating possible branches, ignoring the fact that evaluating floating-point operations can have side effects like FPEs. As an example of what I mean, you could hide a division under if (denominator != 0) but clang might evaluate the branch anyway just in case it can use that early evaluation to optimize the code.

Could be what's happening here.

One workaround, if this is the problem and after it is localized, is to insert something the optimizer can't optimize into the if block, like an assembly-level instruction.

I say all this not as a change request but to share the info...

Contributor Author


Ah, thanks for the info, that's good to know. That sounds like a likely cause; I believe the error is happening inside a gpnorm function where divide-by-0 is possible, so there should be an if checking for that.

Someday I'll get clang working on my linux machine to debug things like this.

@fmahebert
Contributor

@fmahebert I could look at your code, but it's probably faster for me to ask you: for the regional FV3, with the "halo" points that aren't owned by anyone because they are at the boundary of the regional domain, do you just assign them to the PE for the patch that is sitting closest to those "boundary halo-ish" points?

@travissluka that's correct. My rationale was that if we later want to apply a proper boundary condition, that would likely be populated into the fv3 halos on that PE anyway.

@travissluka
Contributor Author

@kbhargava it should work for HAT10 now. I only ran a check on the gridgen, not a full cycle, but I have changed the logic so that the nodes/cells representing the halo outside the domain to the north and east are removed. (Note to self: in the future we'll want to change that logic when/if we remove the "virtual land" that surrounds our domain.)

here is the mesh with HAT10 on 16 PEs
[image]

and as a sanity check, a PE in the middle of the domain has halos on all sides:
[image]

but the PE on the NE corner does not have halos above and to the right
[image]

Contributor

@guillaumevernieres guillaumevernieres left a comment


Did you test at "high-res", @travissluka or @kbhargava? Did you notice a change in the time it takes to run the geometry constructor?
Thanks for the pretty figures ...

@kbhargava
Contributor

prep.soca step passes now with the regional case. 🎉

Contributor

@kbhargava kbhargava left a comment


Thanks Travis

@travissluka
Contributor Author

Did you test at "high-res", @travissluka or @kbhargava? Did you notice a change in the time it takes to run the geometry constructor? Thanks for the pretty figures ...

@guillaumevernieres you doubt my mental ctests??? Fine, let me test with 1/4 deg global.

oh wait...

Node uid: 2918430278073000000   0 xy(-999.0000000000,-999.0000000000) has already been added as node 0 (-999.0000000000,-999.0000000000)
Node uid: 2918430278073000000   0 xy(-999.0000000000,-999.0000000000) has already been added as node 0 (-999.0000000000,-999.0000000000)
Node uid: 2918430278073000000   0 xy(-999.0000000000,-999.0000000000) has already been added as node 0 (-999.0000000000,-999.0000000000)
Node uid: 2918430278073000000   0 xy(-999.0000000000,-999.0000000000) has already been added as node 0 (-999.0000000000,-999.0000000000)
backtrace [1] stack has 13 addresses
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/eckit-1.24.4-tk6ui5d/lib/libeckit.so+eckit::BackTrace::dump[abi:cxx11]())0x1db 
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/eckit-1.24.4-tk6ui5d/lib/libeckit.so+eckit::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, eckit::CodeLocation const&, bool))0xb3 
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/ecmwf-atlas-0.35.0-tez5dnr/lib/libatlas.so.0.35+atlas::throw_Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, eckit::CodeLocation const&))0x29 
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/ecmwf-atlas-0.35.0-tez5dnr/lib/libatlas.so.0.35+atlas::mesh::actions::build_lookup_uid2node(atlas::Mesh&, std::unordered_map<long, int, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, int> > >&))0xdca 
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/ecmwf-atlas-0.35.0-tez5dnr/lib/libatlas.so.0.35+atlas::mesh::actions::increase_halo_interior(atlas::mesh::actions::BuildHaloHelper&))0x3d8 
(/home/tsluka/work/spack-stack/envs/skylab.1.5.1_gcc.12.3.0/install/gcc/12.3.0/ecmwf-atlas-0.35.0-tez5dnr/lib/libatlas.so.0.35+atlas::mesh::actions::BuildHalo::operator()(int))0x53b 
(/home/tsluka/work/soca/build_gcc11/bin/../lib/libsoca.so+soca::Geometry::Geometry(eckit::Configuration const&, eckit::mpi::Comm const&, bool))0xfbe 
(./soca_gridgen.x+)0x268c 
(/home/tsluka/work/soca/build_gcc11/bin/../lib/liboops.so+oops::Run::execute(oops::Application const&, eckit::mpi::Comm const&))0x2c2 
(./soca_gridgen.x+main)0x5a 
(/lib/x86_64-linux-gnu/libc.so.6+)0x280d0 
(/lib/x86_64-linux-gnu/libc.so.6+__libc_start_main)0x89 
(./soca_gridgen.x+)0x2365 

@travissluka
Contributor Author

@guillaumevernieres @kbhargava never mind, false alarm!

I was trying to test 1/4 deg but my MOM_input file did not have TRIPOLAR_N=True so bad things were happening. We'll have to update the static files for skylab because I don't think MOM_input has that set.

I only tested the grid generation, I did not test a full cycle (I leave that as an exercise for the user!). Here is 1/4 deg with 32 PEs
[image]

and the last PE along the tripolar fold showing halos
[image]

and @guillaumevernieres there is no noticeable difference in run time, creating the atlas mesh is fast.

Contributor

@fmahebert fmahebert left a comment


Thanks @travissluka!

@fmahebert
Contributor

The OOPS PR has been merged; this PR can be merged as well.

(Unless the clang CI needs more attention?)

@travissluka
Contributor Author

The OOPS PR has been merged; this PR can be merged as well.

(Unless the clang CI needs more attention?)

@fmahebert the clang CI was apparently breaking with NaNs before this PR... I just never noticed it. I'll deal with that later.

@travissluka travissluka merged commit 15eab96 into develop Dec 12, 2023
@travissluka travissluka deleted the feature/atlas branch December 12, 2023 17:15

Labels

waiting for another PR waiting on another pull request to be merged first


Development

Successfully merging this pull request may close these issues.

Update ATLAS interface stuff

5 participants