Merged
42 commits
6d78659
Initialize GeosDycoreWrapper with bdt (timestep)
pchakraborty Jan 27, 2023
0a3e857
Use GEOS version of constants
pchakraborty Jan 27, 2023
0a8d705
1. Add qcld to the list of tracers beings advected
pchakraborty Jan 27, 2023
3b73d71
Accumulate diss_est
pchakraborty Jan 27, 2023
87e2e4a
Fix & extend debug pass. Add progress pass. Add PACE_DACE_DEBUG to tr…
FlorianDeconinck Feb 13, 2023
392745d
Merge branch 'feature/f32_precision' into feature/dace_debug
FlorianDeconinck Feb 14, 2023
8a25702
Merge branch 'feature/f32_precision' into feature/dace_debug
FlorianDeconinck Feb 16, 2023
027ee8f
Fix intermediate field required at 64 bit precision
FlorianDeconinck Feb 22, 2023
a68d160
Allow GEOS_WRAPPER to process device data
FlorianDeconinck Feb 24, 2023
33ba53f
Add clear to collector for 3rd party use. GEOS pass down timings to c…
FlorianDeconinck Feb 28, 2023
57313e1
Merge branch 'geos/main' into feature/dace_debug
FlorianDeconinck Mar 1, 2023
8968698
Merge branch 'geos/main' into opt_geos_wrapper_bridge
FlorianDeconinck Mar 1, 2023
d36ffec
Update orchestration primer
FlorianDeconinck Mar 1, 2023
2327cbe
Make kernel analysis run a copy stencil to compute local bandwith
FlorianDeconinck Mar 3, 2023
cb4ec5f
Move constant on a env var
FlorianDeconinck Mar 3, 2023
7348922
lint
FlorianDeconinck Mar 3, 2023
e234d16
lint
FlorianDeconinck Mar 3, 2023
131a2af
More linting
FlorianDeconinck Mar 3, 2023
dce3fb7
Merge branch 'opt_geos_wrapper_bridge' into debug/pchakrab/aquaplanet…
FlorianDeconinck Mar 3, 2023
8982542
Remove unused if leading to empty code block
FlorianDeconinck Mar 6, 2023
da2f902
Restrict dace to 0.14.1 due to a parsing bug
FlorianDeconinck Mar 6, 2023
f2799d8
Merge branch 'feature/dace_debug' into debug/pchakrab/aquaplanet/root…
FlorianDeconinck Mar 6, 2023
27fae1c
Add guard for bdt==0
FlorianDeconinck Mar 7, 2023
9c8eb2f
Verbose documentation of DaCe debug
FlorianDeconinck Mar 7, 2023
2f8ebac
Remove unused code
FlorianDeconinck Mar 7, 2023
5d9e0a0
Merge branch 'feature/dace_debug' into opt_geos_wrapper_bridge
FlorianDeconinck Mar 27, 2023
b8edbf2
Merge remote-tracking branch 'nasa/feature/kernel_bw_tool' into opt_g…
FlorianDeconinck Mar 27, 2023
f54b231
Merge branch 'opt_geos_wrapper_bridge' into debug/pchakrab/aquaplanet…
FlorianDeconinck Mar 27, 2023
81d00ce
Fix theroritical timings
FlorianDeconinck Mar 28, 2023
4891d56
Fixed a bug where pkz was being calculated twice, and the second calc…
pchakraborty Apr 7, 2023
fafbfc7
Downgrade DaCe to 0.14.0 pending array aliasing fix
FlorianDeconinck Apr 10, 2023
4fc5b4d
Set default cache path for orchestrated DaCe to respect GT_CACHE_* env
FlorianDeconinck Apr 10, 2023
2245027
Remove previous per stencil override of default_build_folder
FlorianDeconinck Apr 11, 2023
4f8fdc3
Revert "Set default cache path for orchestrated DaCe to respect GT_CA…
FlorianDeconinck Apr 11, 2023
47421a0
Revert "Remove previous per stencil override of default_build_folder"
FlorianDeconinck Apr 11, 2023
d51bc11
Read cache_root in default dace backend
FlorianDeconinck Apr 11, 2023
6bdd595
Document faulty behavior with GT_CACHE_DIR_NAME
FlorianDeconinck Apr 11, 2023
80cbb01
Fix bad requirements syntax
FlorianDeconinck Apr 13, 2023
40f2440
Check for the string value of CONST_VERSION directly instead of enum
pchakraborty Apr 14, 2023
cae25a9
Protect constant selection more rigorusly.
FlorianDeconinck Apr 20, 2023
915993e
Log constants selection
FlorianDeconinck Apr 20, 2023
c3e355c
Refactor NQ to constants.py
FlorianDeconinck Apr 20, 2023
2 changes: 1 addition & 1 deletion constraints.txt
@@ -87,7 +87,7 @@ cytoolz==0.11.2
# via
# gt4py
# gt4py (external/gt4py/setup.cfg)
dace==0.14.1
dace==0.14.0
# via
# -r requirements_dev.txt
# pace-dsl
58 changes: 33 additions & 25 deletions doc_primer_orchestration.md
@@ -1,7 +1,6 @@
DaCe Orchestration in Pace: a primer
====================================


Fundamentals
------------

@@ -10,12 +9,14 @@ Full program optimziation with DaCe is the process of turning all Python and GT4
_Orchestration_ is our own wording for full program optimization. We only _orchestrate_ the runtime code of the model, e.g. everything in the `__call__` method of the module. All code in `__init__` is executed like a normal gt backend.

At the highest level in Pace, to turn on orchestration you need to flip the `FV3_DACEMODE` to an orchestrated options _and_ run a `dace:*` backend (it will error out if run anything else). Option for `FV3_DACEMODE` are:

- _Python_: default, turns orchestration off.
- _Build_: build the SDFG then exit without running. See Build for limitation of build strategy.
- _BuildAndRun_: as above, but distribute the build and run.
- _Run_: tries to execute, errors out if the cache don't exists.
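
As a rough illustration of the four modes listed above, the sketch below maps the `FV3_DACEMODE` string onto an enum and rejects unknown values. `OrchestrationMode` and `mode_from_env` are hypothetical names made up for this example; they are not Pace's actual `DaCeOrchestration` implementation. Only the option names and the `Python` default come from the text.

```python
import enum
import os


class OrchestrationMode(enum.Enum):
    """Hypothetical stand-in for the modes described in the primer."""

    Python = "Python"  # default, orchestration off
    Build = "Build"  # build the SDFG, then exit
    BuildAndRun = "BuildAndRun"  # build, then run
    Run = "Run"  # run from an existing cache


def mode_from_env(environ=os.environ) -> OrchestrationMode:
    """Read FV3_DACEMODE, falling back to Python when it is unset."""
    name = environ.get("FV3_DACEMODE", "Python")
    try:
        return OrchestrationMode[name]
    except KeyError:
        valid = ", ".join(mode.name for mode in OrchestrationMode)
        raise ValueError(
            f"FV3_DACEMODE={name!r} is not one of: {valid}"
        ) from None
```

Passing the environment as an argument keeps the sketch testable without touching the real process environment.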

Code is orchestrated two ways:

- functions are orchestrated via `orchestrate_function` decorator,
- methods are orchestrate via the `orchestrate` function (e.g. `pace.driver.Driver._critical_path_step_all`)

@@ -29,28 +30,30 @@ File structure
--------------

`pace.dsl.dace.*` carries the structure for orchestration.
* `build.py`: tooling for distributed build & SDFG load.
* `dace_config.py`: DaCeConfig & DaCeOrchestration enum.
* `orchestration.py`: main code, takes care of orchestration .scaffolding, build pipeline (including parsing) and execution.
* `sdfg_opt_passes.py`: custom optimization pass for Pace, used in the build pipeline.
* `utils.py`: as every "utils" or "misc" or "common" file, this should not exists and collect tools & functions I lazily didn't put in a proper place.
* `wrapped_halo_exchange.py`: a callback-ready halo exchanger, which is our current solution for keeping the Halo Exchange in python (because of prior optimization) in orchestration.

- `build.py`: tooling for distributed build & SDFG load.
- `dace_config.py`: DaCeConfig & DaCeOrchestration enum.
- `orchestration.py`: main code, takes care of orchestration .scaffolding, build pipeline (including parsing) and execution.
- `sdfg_opt_passes.py`: custom optimization pass for Pace, used in the build pipeline.
- `utils.py`: as every "utils" or "misc" or "common" file, this should not exists and collect tools & functions I lazily didn't put in a proper place.
- `wrapped_halo_exchange.py`: a callback-ready halo exchanger, which is our current solution for keeping the Halo Exchange in python (because of prior optimization) in orchestration.

DaCe Config
-----------

DaCe has many configuration options. When executing, it drops or reads a `dace.conf` to get/set options for execution. Because this is a performance-portable model and not a DaCe model, decision has been taken to freeze the options.

`pace.dsl.dace.dace_config` carries a set of tested options for DaCe, with doc. It also takes care of removing the `dace.conf` that will be generated automatically when using DaCe. Documentation should be self-explanatory but a good one to remember is:
`pace.dsl.dace.dace_config` carries a set of tested options for DaCe, with doc. It also takes care of removing the `dace.conf` that will be generated automatically when using DaCe.

```python
# Enable to debug GPU failures
dace.config.Config.set("compiler", "cuda", "syncdebug", value=False)
```
Orchestration can be debugged by using the env var `PACE_DACE_DEBUG`.
When set to `True`, this will drop a few checks:
- `sdfg_nan_checker`, which drops a NaN check after _every_ computation on field _written_.
- `negative_qtracers_checker` drops a check for `tracer < -1e8` for every written field named one of the tracers
- `negative_delp_checker` drops a check for `delp < -1e8` for every written field named `delp*`

- `sdfg_nan_checker`, which drops a NaN check after _every_ computation on field _written_,
- `negative_qtracers_checker` drops a check for `tracer < -1e8` for every written field named one of the tracers,
- `negative_delp_checker` drops a check for `delp < -1e8` for every written field named `delp*`,
- `trace_all_outputs_at_index` drops a print on every variable at a given index to track numerical protection,
- `sdfg_execution_progress`, drops a print after each kernel. Useful when encurring bad crash with no stacktrace,
- insert a CUDA_ERROR_CHECK in C after each kernel.
See `dsl/pace/dsl/dace/utils.py` for details.
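
The `dace_config.py` change in this PR reads the variable with `os.getenv("PACE_DACE_DEBUG", "False") == "True"`. The sketch below isolates that comparison to show its consequence: only the exact string `True` turns the debug passes on. The helper name `dace_debug_enabled` is an assumption for illustration and does not exist in Pace.

```python
import os


def dace_debug_enabled(environ=os.environ) -> bool:
    """Mirror the string comparison used for PACE_DACE_DEBUG.

    Hypothetical helper: the comparison is case-sensitive, so values
    such as "true" or "1" leave debugging off.
    """
    return environ.get("PACE_DACE_DEBUG", "False") == "True"
```

With this reading, `PACE_DACE_DEBUG=true` or `PACE_DACE_DEBUG=1` would silently leave the checks disabled.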

Build
@@ -59,19 +62,20 @@ Build
Orchestrated code won't build the same way the gt backend builds. The build pipeline will lead to a single folder with code & `.so`. In the case of the driver main call, this would be in `.gt_cache_*/dacecache/pace_driver_driver_Driver__critical_path_step_all`.

Code goes through phases before being ready to execute:
* stencils are `parsed` into non-expanded SDFG (gt4py takes care of this),
* all code is `parsed` into a single SDFG with stencils' SDFG included (dace takes care of this and the following steps),
* a first `simplify` is applied to the SDFG to optimize the memory flow,
* we apply the custom `splittable_region_expansion` which optimize small regions (_major_ speed up),
* `expand` will expand all the stencils to a fully workable SDFG (with tasklet filled)
* another `simplify` is applied,
* the memory that can is flagged to be `pooled`,
* [OPTIONAL] Insert debugging passes
* `code generation` into a single file for CPU or two for GPU (a `.cpp` and a `.cu`),
* the SDFG is analysed for memory consumption.

- stencils are `parsed` into non-expanded SDFG (gt4py takes care of this),
- all code is `parsed` into a single SDFG with stencils' SDFG included (dace takes care of this and the following steps),
- a first `simplify` is applied to the SDFG to optimize the memory flow,
- we apply the custom `splittable_region_expansion` which optimize small regions (_major_ speed up),
- `expand` will expand all the stencils to a fully workable SDFG (with tasklet filled)
- another `simplify` is applied,
- the memory that can is flagged to be `pooled`,
- [OPTIONAL] Insert debugging passes
- `code generation` into a single file for CPU or two for GPU (a `.cpp` and a `.cu`),
- the SDFG is analysed for memory consumption.

Orchestration comes with it's own distributed compilation (could be merged with gt). It compiles the top tile and distriubutes the results to other ranks. This uses a couple of hypothesis that limits how to build/execute. The major one is that any decomposition from `(3,3)` upward will require the following workflow:

- compile on `(3,3)`,
- copy caches 0 to 8 (top tile) to target decomposition run dir,
- execute (`FV3_DACEMODE=Run`) target decompoposition.
@@ -100,6 +104,10 @@ _Parsing errors_

DaCe cannot parse _any_ dynamic Python and any code that allocates memory on the fly (think list creation). It will also complain about any arguments it can't memory describe (remember `dace_compiletime_args` ).

_GT_CACHE_DIR_NAME_

We do not honor the `GT_CACHE_DIR_NAME` with orchestration. `GT_CACHE_ROOT` is respected.

Conclusion
----------

2 changes: 1 addition & 1 deletion driver/pace/driver/grid.py
@@ -215,5 +215,5 @@ def _transform_horizontal_grid(
grid.data[:, :, 0] = lon_transform[:]
grid.data[:, :, 1] = lat_transform[:]

metric_terms._grid.data[:] = grid.data[:]
metric_terms._grid.data[:] = grid.data[:] # type: ignore[attr-defined]
metric_terms._init_agrid()
4 changes: 4 additions & 0 deletions driver/pace/driver/performance/collector.py
@@ -66,6 +66,10 @@ def __init__(self, experiment_name: str, comm: pace.util.Comm):
self.experiment_name = experiment_name
self.comm = comm

def clear(self):
self.times_per_step = []
self.hits_per_step = []

def collect_performance(self):
"""
Take the accumulated timings and flush them into a new entry
38 changes: 36 additions & 2 deletions driver/pace/driver/tools.py
@@ -27,14 +27,48 @@
type=click.STRING,
)
@click.option("--report_detail", is_flag=True, type=click.BOOL, default=False)
def command_line(action: str, sdfg_path: Optional[str], report_detail: Optional[bool]):
@click.option(
"--hardware_bw_in_gb_s",
required=False,
type=click.FLOAT,
default=0.0,
)
@click.option(
"--output_format",
required=False,
type=click.STRING,
default=None,
)
@click.option(
"--backend",
required=False,
type=click.STRING,
default="dace:gpu",
)
def command_line(
action: str,
sdfg_path: Optional[str],
report_detail: Optional[bool],
hardware_bw_in_gb_s: Optional[float],
output_format: Optional[str],
backend: Optional[str],
):
"""
Run tooling.
"""
if action == ACTION_SDFG_MEMORY_STATIC_ANALYSIS:
print(memory_static_analysis_from_path(sdfg_path, detail_report=report_detail))
elif action == ACTION_SDFG_KERNEL_THEORETICAL_TIMING:
print(kernel_theoretical_timing_from_path(sdfg_path))
print(
kernel_theoretical_timing_from_path(
sdfg_path,
hardware_bw_in_GB_s=(
None if hardware_bw_in_gb_s == 0 else hardware_bw_in_gb_s
),
backend=backend,
output_format=output_format,
)
)


if __name__ == "__main__":
15 changes: 13 additions & 2 deletions dsl/pace/dsl/dace/dace_config.py
@@ -6,6 +6,7 @@
from dace.frontend.python.parser import DaceProgram

from pace.dsl.gt4py_utils import is_gpu_backend
from pace.util._optional_imports import cupy as cp
from pace.util.communicator import CubedSphereCommunicator


@@ -82,6 +83,10 @@ def __init__(
else:
self._orchestrate = orchestration

# Debugging Dace orchestration deeper can be done by turning on `syncdebug`
# We control this Dace configuration below with our own override
dace_debug_env_var = os.getenv("PACE_DACE_DEBUG", "False") == "True"

# Set the configuration of DaCe to a rigid & tested set of divergence
# from the defaults when orchestrating
if orchestration != DaCeOrchestration.Python:
@@ -112,7 +117,11 @@ def __init__(
"args",
value="-std=c++14 -Xcompiler -fPIC -O3 -Xcompiler -march=native",
)
dace.config.Config.set("compiler", "cuda", "cuda_arch", value="60")

cuda_sm = 60
if cp:
cuda_sm = cp.cuda.Device(0).compute_capability
dace.config.Config.set("compiler", "cuda", "cuda_arch", value=f"{cuda_sm}")
dace.config.Config.set(
"compiler", "cuda", "default_block_size", value="64,8,1"
)
@@ -155,7 +164,9 @@ def __init__(
)

# Enable to debug GPU failures
dace.config.Config.set("compiler", "cuda", "syncdebug", value=False)
dace.config.Config.set(
"compiler", "cuda", "syncdebug", value=dace_debug_env_var
)

# attempt to kill the dace.conf to avoid confusion
if dace.config.Config._cfg_filename:
5 changes: 2 additions & 3 deletions dsl/pace/dsl/dace/orchestration.py
@@ -26,7 +26,7 @@
from pace.dsl.dace.sdfg_debug_passes import (
negative_delp_checker,
negative_qtracers_checker,
trace_all_outputs_at_index,
sdfg_nan_checker,
)
from pace.dsl.dace.sdfg_opt_passes import splittable_region_expansion
from pace.dsl.dace.utils import (
@@ -180,8 +180,7 @@ def _build_sdfg(
# is turned on.
if config.get_sync_debug():
with DaCeProgress(config, "Tooling the SDFG for debug"):
# sdfg_nan_checker(sdfg) # TODO (florian): segfault - bad range?
trace_all_outputs_at_index(sdfg, 0, 0, 60)
sdfg_nan_checker(sdfg)
negative_delp_checker(sdfg)
negative_qtracers_checker(sdfg)
