Release v2.7.0 with new features and bug fixes

@khuck released this 29 Oct 18:02

This release contains quite a lot of new functionality, and refactored untied task support. Here's a
list of the new features included in this release:

  • Updated Kokkos autotuning support, with new search strategies genetic_search, nelder_mead, and automatic. The complete list is exhaustive, random, simulated_annealing, genetic_search, nelder_mead, and automatic.
  • Nested Kokkos autotuning support enables complex search strategies when Kokkos search contexts are nested. For example, a search can choose between two execution policies while autotuning the internals of each.
  • NVTX pass-through support allows APEX timers to be fed to NVIDIA performance tools if desired.
  • dladdr support for symbol name resolution when binutils are not available.
  • Robust tracking of pthreads without needing to wrap all pthread functions.
  • Added support for the TaskStubs library, a "PerfStubs"-like library for instrumenting task-based runtimes such as Iris, PaRSEC, and StarPU. This support includes new events for scheduling and data transfer as well as execution. Task arguments are included in the Google Trace Events output.
  • Added support for complex MxN parent-child task dependencies, not just 1xN. This provides complete support for the above runtimes.
  • All timers are now treated as untied tasks, even those in standard callpath timer stacks. This allows for complex task dependency graphs that combine asynchronous tasks and callpath timer stacks.
  • Enabled measurement of HIP, CUDA, and SYCL in the same executable for Iris support. Will also support OpenCL (Intel) if needed.
  • Added OpenCL support for Iris on Intel GPUs/FPGAs.
  • Added Python support with updated PerfStubs.
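One way to picture nested search contexts is as a recursive convergence check: an outer context, such as the choice between two execution policies, only counts as converged once the inner searches for each policy have also converged. This is a minimal sketch under stated assumptions; `SearchContext`, `own_search_converged`, and `children` are hypothetical names, not part of APEX or the Kokkos Tools tuning interface.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical model of a tuning search context; these names are illustrative
// and are not taken from the APEX or Kokkos Tools APIs.
struct SearchContext {
    // True once this context's own search strategy reports convergence.
    bool own_search_converged = false;
    // Inner (nested) contexts, e.g. one per execution policy being compared.
    std::vector<std::shared_ptr<SearchContext>> children;

    // An outer context only counts as converged after every nested
    // context beneath it has converged as well.
    bool converged() const {
        if (!own_search_converged) {
            return false;
        }
        for (const auto& child : children) {
            assert(child != nullptr);
            if (!child->converged()) {
                return false;
            }
        }
        return true;
    }
};
```

With only two outer choices, the outer search may settle after a handful of samples, but in this sketch `converged()` keeps returning false until both inner searches have finished.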
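MxN dependencies turn the task tree into a directed acyclic graph in which a child can have several parents. The sketch below is a hypothetical illustration, not APEX's internal types: `TaskNode`, `add_dependency`, and `visit_once` are made-up names. Children are held by `std::shared_ptr` and parents by `std::weak_ptr`, so a parent and child never keep each other alive, and the traversal records visited nodes so that a task reachable through several parents is processed only once.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical task node supporting many parents and many children.
struct TaskNode {
    std::string name;
    std::vector<std::weak_ptr<TaskNode>> parents;     // non-owning back edges
    std::vector<std::shared_ptr<TaskNode>> children;  // owning forward edges
};

// Records a dependency edge in both directions.
void add_dependency(const std::shared_ptr<TaskNode>& parent,
                    const std::shared_ptr<TaskNode>& child) {
    assert(parent && child);
    parent->children.push_back(child);
    child->parents.push_back(parent);
}

// Depth-first traversal that emits each task once, even when it is
// reachable through several parents.
void visit_once(const std::shared_ptr<TaskNode>& node,
                std::unordered_set<const TaskNode*>& visited,
                std::vector<std::string>& order) {
    if (!visited.insert(node.get()).second) {
        return;  // already reached through another parent
    }
    order.push_back(node->name);
    for (const auto& child : node->children) {
        visit_once(child, visited, order);
    }
}
```

In a diamond `root -> {a, b} -> c`, a traversal without the visited set would walk `c` twice; with many parents per task, that repeated work grows quickly.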

Complete list of commits in this release:

  • view commit • Updating kokkos to 4.2.01
  • view commit • Re-enabling kokkos allocation tracking. When enabled, APEX will keep track of allocations through Kokkos and ensure they are all freed before exit.
  • view commit • Trying to clean up memory allocation tracking. When tracking allocations on the host, everything seems to be working correctly, but on occasion we see allocation amounts changing on the stack in gdb on Frontier. We can't explain it yet, but some fixes are included in this commit.
  • view commit • Updating roofline stats to use new CSV output
  • view commit • Adding NVTX pass-through support. As requested for the pika project, this adds the ability to pass APEX timers through to NVTX. This is not compatible with APEX CUDA support, since it implements the NVTX API. However, it should work with an application linked with APEX if the APEX_ENABLE_NVTX_HANDOFF environment variable is set.
  • view commit • Merge branch 'develop' of git.nic.uoregon.edu:/gitroot/xpress-apex into develop
  • view commit • Fixing errors in unit tests on apple
  • view commit • Removing debug print message
  • view commit • Fixing builds with and without kokkos. If APEX is configured without Kokkos it should build now. And APEX will build examples correctly with kokkos, and will build Kokkos only if the Kokkos examples/tests are requested. Otherwise it just uses the headers.
  • view commit • Fixing .dylib/.so for apple configurations
  • view commit • Forgot to update an enum name for CUDA
  • view commit • Adding OpenMP library when linking against Kokkos unit test with OpenMP back end.
  • view commit • A few updates to help with periodic sampling of ROCM metrics on Frontier.
  • view commit • Added env var for specifying libnvToolsExt.so. When using APEX as an NVTX pass-through with the APEX_ENABLE_NVTX_HANDOFF variable, NVIDIA doesn't automatically load the library with NVTX support. Adding APEX_NVTX_LIBRARY, which defaults to "libnvToolsExt.so" but can be overridden with a full path; alternatively, you can add the path to LD_LIBRARY_PATH.
  • view commit • Merge branch 'develop' of git.nic.uoregon.edu:/gitroot/xpress-apex into develop
  • view commit • Changing from backtrace_symbols to dladdr to resolve addresses when BFD is not available.
  • view commit • Removing sleep during shutdown if the background thread is already done.
  • view commit • Removing sleeps from unit tests, exit should be clean now even for very short programs.
  • view commit • Changing NVTX pass-through to use different colors for up to 25 different timers.
  • view commit • Adding first version of genetic search algorithm based on random
  • view commit • Fixing osx build errors
  • view commit • First genetic search implementation.
  • view commit • Debugging genetic search
  • view commit • Debugging autotuning of kokkos kernels - problems with genetic search and nested contexts should now be working.
  • view commit • Fixing cached variable usage. When using cached tunings from Kokkos, the variables might not be declared or provided in the same order, and depending on the execution may not be given at all. So we ignore the IDs, except for mapping from the old context to the old variables or from the new contexts to the new variables. We now match the variable names between context hashes, which should be correct in the future.
  • view commit • Enhancing autotuning search with nested contexts. When doing autotuning with Kokkos kernels, it is possible to have nested search contexts. When that happens, we want to make sure that we explore all branches of all possible scenarios. For example, the "idk_jmm" test in the https://github.com/khuck/apex-kokkos-tuning repository has one context to choose between a team policy and an mdrange policy, and each of those has tunable parameters. Because the outer context has only two choices, it can converge very quickly unless we prevent it from converging until the search has completed for both the team and mdrange policy implementations. That's what this change does: if we have nested search contexts, the outer context(s) won't converge until the inner context searches converge. I also reduced the max_iterations limit for random, genetic_search, and simulated_annealing from 1000 to 500.
  • view commit • Fixing output for exhaustive tuning
  • view commit • Fixing output for tuning strategies to be more helpful.
  • view commit • debugging simulated annealing with one variable of only a few values.
  • view commit • Minor changes to fix thread tracing. When threads are created before main (looking at you, CUDA) or are not terminated before main terminates (looking at you, Iris), we need to make sure that the registered new threads have a thread ID. To do that, the top-level timers for those threads should have synchronous start/stop events; otherwise they get a single "unified" event at the end with the thread ID belonging to the main thread (because it is the one cleaning up at the end).
  • view commit • Minor bug fix for flow event timestamp. The Google Trace Event format leaves a little bit to be desired, in that the start of a flow event has to be between the start and end of the parent task, not equal to the start. So we take the start timestamp of the parent and add 0.250 microseconds, the smallest increment we can use, to make sure the timestamp falls during the parent but not after the parent stopped.
  • view commit • Adding environment variables to tests. Disabling the asynchronous thread for the apex_version test; there is a race condition somewhere that causes APEX to crash 1% of the time for this short test that prints the APEX version. I thought I had fixed it multiple times, but I am tired of playing whack-a-mole for this short test.
  • view commit • Minor gtrace timestamp changes. Perfetto seems to have changed how it interprets flow event timestamps. This was detected by testing with the new support for the Iris tasking runtime. For that reason, we now add 0.250 us to the start of the flow event and subtract 0.250 us from the end of the flow event. The timestamps are adjusted to meet Perfetto's definition of "enclosing slice" and "begin >= timestamp of the flow".
  • view commit • Adding support for taskstubs submodule
  • view commit • Improving auto-tuning output messages
  • view commit • Initial changes for multiple-parents
  • view commit • Adding initial support for multiple parents for Iris, PaRSEC
  • view commit • Adding new test
  • view commit • Fixing multi-parent OTF2 bug
  • view commit • Fixing multiple parent build error
  • view commit • Some updates for handling Iris commands - generate unique IDs.
  • view commit • Adding custom debugger executable parameter to the apex_exec script. This allows calling rocgdb, cuda-gdb, or a different gdb.
  • view commit • Adding fixes when testing with PaRSEC and MPI
  • view commit • Adding fixes when testing with PaRSEC and MPI
  • view commit • Updating counter scatterplot script to handle new counter set of columns.
  • view commit • Updating counter scatterplot script to handle new counter set of columns.
  • view commit • Fixing null (unknown) parent pointers.
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Include a flow event when using untied tasks, even when the task dependency is on the same OS thread.
  • view commit • Script optimization to mark a dataframe row as already visited, which is needed in the case of multiple parents; otherwise we traverse the tree a stupid number of times for all possible parents.
  • view commit • For now, don't delete tasks from the guid map. It's possible that the parent has been completed and destroyed before the child has even been created.
  • view commit • Initial changes for multiple-parents
  • view commit • Adding initial support for multiple parents for Iris, PaRSEC
  • view commit • Adding new test
  • view commit • Fixing multi-parent OTF2 bug
  • view commit • Fixing multiple parent build error
  • view commit • Some updates for handling Iris commands - generate unique IDs.
  • view commit • Adding fixes when testing with PaRSEC and MPI
  • view commit • Adding fixes when testing with PaRSEC and MPI
  • view commit • Include a flow event when using untied tasks, even when the task dependency is on the same OS thread.
  • view commit • Script optimization to mark a dataframe row as already visited, which is needed in the case of multiple parents; otherwise we traverse the tree a stupid number of times for all possible parents.
  • view commit • For now, don't delete tasks from the guid map. It's possible that the parent has been completed and destroyed before the child has even been created.
  • view commit • debugging multiple parents support
  • view commit • Fixing untied tasks with correct task dependency. Only remaining problem is that the HPX direct action support could be affected when untied tasks are enabled, but that shouldn't happen. To make it future-proof, we should make sure that direct actions are correctly supported with untied tasks.
  • view commit • Merge branch 'multiple-parents' of git.nic.uoregon.edu:/gitroot/xpress-apex into multiple-parents
  • view commit • Adding stacks header
  • view commit • Debugging untied task support. Untied tasks now work in all situations except for direct actions that have their parents yielded while executing. That will be supported later. In the meantime, this commit includes fixed support for untied tasks and debugged multiple-parent support. Support for the taskstubs API calls add_parents and add_children has been added and tested with tasktree and trace output. Preload support has been debugged and refactored. An apex_taskstubs_cpp test was added to test the taskstubs API implementations.
  • view commit • Fixing exhaustive search when the best setting is the last setting tested.
  • view commit • Stop evaluating results after convergence.
  • view commit • Fixing colorscale for contribution to total time
  • view commit • Cleaning up memory leaks, and converting the dependency tree to shared pointers due to multiple parents; cleaning it up becomes a nightmare otherwise. Also adding some utilities to the apex_exec script to help with testing.
  • view commit • Merge branch 'multiple-parents' of git.nic.uoregon.edu:/gitroot/xpress-apex into multiple-parents
  • view commit • Cleaning up destruction of task_wrapper objects, don't delete the task_identifier objects because they live until program exit. Also cleaning up lock usage for the task identifier map.
  • view commit • Lots of debugging for untied tasks and validating task state. Added lots of assertions for debug build checking. Still have some issues with direct actions, but those should be fixed soon, then untied_tasks will be the default.
  • view commit • Untied tasks working everywhere, even with direct actions. Next step is to refactor and remove all references to the per-thread timer stacks.
  • view commit • found logic bug where CUDA and HIP were both enabled for the other when tracking gpu memory usage.
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Removing debug messages
  • view commit • Fixing bug in kokkos counter name generation
  • view commit • Adding --apex:kokkos-counters flag
  • view commit • Cleaning up usage message
  • view commit • Adding native nelder-mead search. Due to ongoing technical difficulties with the Active Harmony integration, we've implemented our own nelder mead search. It still needs some debugging but appears to be working for discretized input sets. Still needs refactoring to handle continuous values from the Kokkos tuning interface. That's the next step.
  • view commit • Adding automatic search strategy.
  • view commit • Don't continue to evaluate after convergence
  • view commit • Debugging automatic search strategy, tweaking Nelder-Mead convergence criteria.
  • view commit • Cleaning up compiler warnings
  • view commit • Man, compilers are just garbage.
  • view commit • Fixing bug in determining tree node for kokkos tuning
  • view commit • Allowing APEX to measure CUDA and HIP and SYCL at the same time for Iris support
  • view commit • Moving status message to verbose only
  • view commit • Don't call cuptiFinalize for cupti versions between 18 and 21.
  • view commit • Removing superfluous cmake_policy call
  • view commit • Debugging nelder mead and removing print statements
  • view commit • Intel compiler used to automatically include -lstdc++ but no longer. Have to add it explicitly.
  • view commit • Fixing initial values for search strategies
  • view commit • Make sure -lstdc++ comes last in link order
  • view commit • Updating level0 support, still need to investigate task dependency issue that just popped up.
  • view commit • Debugging autotuning searches on sunspot
  • view commit • Debugging simulated annealing on frontier
  • view commit • Modifying search criteria for kokkos autotuning to be in seconds, not nanoseconds. This helps the math in the nelder mead search strategy, and doesn't hurt any other searches.
  • view commit • Tweaking nelder mead tolerances, still not happy
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Fixing bug when threads call APEX API functions without first registering with APEX. That should now be handled correctly. Also debugging issues in the taskstubs API implementation, and making the printf statements controllable with the APEX verbose option.
  • view commit • Fixing bug in verbose output
  • view commit • More verbose output fixes
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Massive refactoring. Removed all per-thread timer stacks. Now, profiler objects will maintain a reference to the running timer when they were started, which allows for "timer stack" behavior when dealing with "direct actions" i.e. timed direct function calls from timed asynchronous tasks. This now makes "untied timers" the default and only behavior for maintaining a "timer stack", and it works fine for conventional timer stacks, too.
  • view commit • Fixing a bug when apex::dump is called multiple times. The tasktree hierarchy has a static set that contains the processed tree nodes. That set needs to be reset each time the CSV tree writer is called; otherwise subsequent writes will produce an empty file.
  • view commit • Adding python support with perfstubs 3.12+ support.
  • view commit • Add flow events for tasks on the same thread when fed by the taskStubs API
  • view commit • No reason to search for the RapidJSON header; it was causing problems on some systems due to CMake being less than awesome. Just set the include path and assume it's correct because we put it there.
  • view commit • Revising existence and executable tests for apex_exec
  • view commit • Minor fix for apex_exec and python checking if file exists and is executable
  • view commit • Merge branch 'multiple-parents' into python_and_multi_parents
  • view commit • Fixing puts in preload code
  • view commit • Fixing usage message in apex_exec
  • view commit • Removing debug print message when tasks created by one thread and started by another
  • view commit • Removing debug print message when tasks created by one thread and started by another
  • view commit • Removing exclamation point
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Removing verbose output in preload
  • view commit • Removing debug messages from OTF2 listener
  • view commit • Merge branch 'develop' into multiple-parents
  • view commit • Adding support for handling the taskstubs schedule event. The schedule event includes arguments to the task, and those are now propagated to the gtrace output. Also, the memory transfer event includes the source and destination information in the trace data. This requires the latest version of the taskstubs API headers.
  • view commit • Forgot to update the test for taskstubs
  • view commit • Disabling usage of task argument name for now. The only check is whether the argument name is non-null, and in cases when it is never set, it can create garbage strings in the trace. Disabled for now.
  • view commit • Removing active harmony from the kitchen sink build.
  • view commit • Adding first half of OpenCL host API
  • view commit • Finished the host side API support for OpenCL
  • view commit • Adding opencl circleci test, if it is available
  • view commit • Minor fixes to opencl from testing on apple
  • view commit • adding profiling property to command queues
  • view commit • Merge branch 'opencl' of git.nic.uoregon.edu:/gitroot/xpress-apex into opencl
  • view commit • OpenCL async activity working with profiling
  • view commit • Working opencl support?
  • view commit • Fixed timestamp offset for tracing
  • view commit • Fixed flow events for opencl
  • view commit • Fixing opencl on apple
  • view commit • Changed to asynchronous processing for opencl device activity
  • view commit • Merge branch 'opencl' of git.nic.uoregon.edu:/gitroot/xpress-apex into opencl
  • view commit • Fixing async processing of OpenCL queues
  • view commit • Cannot reliably process the opencl events asynchronously. So we have to wait after each enqueued event. Bummer.
  • view commit • Adding all features needed for iris opencl support.
  • view commit • Removing debug message
  • view commit • Updating kokkos version.
  • view commit • Minor fixes before release
  • view commit • Updating version number
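The Kokkos allocation tracking mentioned in the commit list amounts to bookkeeping of live allocations keyed by address. The following is a minimal, hypothetical sketch of such a tracker (`AllocationTracker` and its methods are made-up names, not APEX's implementation). In a Kokkos Tools plugin, the two hooks would typically be driven from the `kokkosp_allocate_data` and `kokkosp_deallocate_data` callbacks, and anything still outstanding at shutdown can be reported as not freed.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical tracker of live allocations, keyed by address.
class AllocationTracker {
public:
    void on_allocate(const void* ptr, std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        // If the address is somehow still recorded, drop the stale size first.
        auto it = live_.find(ptr);
        if (it != live_.end()) {
            total_ -= it->second;
        }
        live_[ptr] = bytes;
        total_ += bytes;
    }

    void on_deallocate(const void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = live_.find(ptr);
        if (it == live_.end()) {
            return;  // not tracked (e.g. allocated before tracking began)
        }
        assert(total_ >= it->second);
        total_ -= it->second;
        live_.erase(it);
    }

    // Bytes still allocated; a non-zero value at exit indicates unfreed memory.
    std::size_t outstanding_bytes() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return total_;
    }

    // Number of allocations that have not been freed yet.
    std::size_t outstanding_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<const void*, std::size_t> live_;
    std::size_t total_ = 0;
};
```

The mutex is there because allocation callbacks can arrive from several host threads; a production tool would likely also record the memory space and label of each allocation.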
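The cached-tuning fix in the commit list matches variables by name rather than by numeric ID, because IDs and declaration order can differ between runs. A minimal sketch of that lookup follows; the container layouts and the `remap_cached_values` function are hypothetical, not APEX's or Kokkos's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical: cached best values keyed by variable name, and the current
// context's variables keyed by the IDs assigned in this run.
using CachedByName = std::unordered_map<std::string, double>;
using IdToName = std::unordered_map<std::size_t, std::string>;

// Builds a map from the current run's variable IDs to cached values by
// matching names. Variables with no cached entry are left out, so the
// caller can fall back to searching for them.
std::unordered_map<std::size_t, double> remap_cached_values(
    const CachedByName& cached, const IdToName& current_ids) {
    std::unordered_map<std::size_t, double> result;
    for (const auto& [id, name] : current_ids) {
        auto it = cached.find(name);
        if (it != cached.end()) {
            result.emplace(id, it->second);
        }
    }
    // Every resolved entry corresponds to one of the current IDs.
    assert(result.size() <= current_ids.size());
    return result;
}
```

Because the lookup goes through names, the result is the same regardless of the order in which the variables were declared or which IDs they received.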
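The flow-event timestamp fix in the commit list is simple arithmetic on trace timestamps, which the Google Trace Event format expresses in microseconds. The helper below is a hypothetical sketch of the start-side adjustment only: it shifts the flow start 0.250 us past the parent's start so that it falls inside the parent slice. The clamp for parents shorter than 0.250 us is an assumption for illustration; the release notes do not say how APEX handles that case.

```cpp
#include <algorithm>
#include <cassert>

// Offset quoted in the release notes, in microseconds.
constexpr double kFlowOffsetUs = 0.250;

// Hypothetical helper: returns a flow-start timestamp (microseconds) within
// the parent slice [parent_start_us, parent_end_us] instead of coinciding
// with the parent's start.
double flow_start_us(double parent_start_us, double parent_end_us) {
    assert(parent_end_us >= parent_start_us);
    // Assumption: for a parent shorter than the offset, stop at the parent's
    // end rather than landing after the parent has stopped.
    return std::min(parent_start_us + kFlowOffsetUs, parent_end_us);
}
```

For a parent slice from 100.0 us to 200.0 us, this sketch places the flow start at 100.25 us.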