Commit d55b666
authored
Topic/monitoring (#3109)
Add a monitoring PML, OSC and IO. They track all data exchanges between processes,
with capability to include or exclude collective traffic. The monitoring infrastructure is
driven using MPI_T, and can be tuned of and on any time o any communicators/files/windows.
Documentations and examples have been added, as well as a shared library that can be
used with LD_PRELOAD and that allows the monitoring of any application.
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Clement Foyer <[email protected]>
* add ability to querry pml monitorinting results with MPI Tools interface
using performance variables "pml_monitoring_messages_count" and
"pml_monitoring_messages_size"
Signed-off-by: George Bosilca <[email protected]>
* Fix a convertion problem and add a comment about the lack of component
retain in the new component infrastructure.
Signed-off-by: George Bosilca <[email protected]>
* Allow the pvar to be written by invoking the associated callback.
Signed-off-by: George Bosilca <[email protected]>
* Various fixes for the monitoring.
Allocate all counting arrays in a single allocation
Don't delay the initialization (do it at the first add_proc as we
know the number of processes in MPI_COMM_WORLD)
Add a choice: with or without MPI_T (default).
Signed-off-by: George Bosilca <[email protected]>
* Cleanup for the monitoring module.
Fixed few bugs, and reshape the operations to prepare for
global or communicator-based monitoring. Start integrating
support for MPI_T as well as MCA monitoring.
Signed-off-by: George Bosilca <[email protected]>
* Adding documentation about how to use pml_monitoring component.
Document present the use with and without MPI_T.
May not reflect exactly how it works right now, but should reflects
how it should work in the end.
Signed-off-by: Clement Foyer <[email protected]>
* Change rank into MPI_COMM_WORLD and size(MPI_COMM_WORLD) to global variables in pml_monitoring.c.
Change mca_pml_monitoring_flush() signature so we don't need the size and rank parameters.
Signed-off-by: George Bosilca <[email protected]>
* Improve monitoring support (including integration with MPI_T)
Use mca_pml_monitoring_enable to check status state. Set mca_pml_monitoring_current_filename iif parameter is set
Allow 3 modes for pml_monitoring_enable_output: - 1 : stdout; - 2 : stderr; - 3 : filename
Fix test : 1 for differenciated messages, >1 for not differenciated. Fix output.
Add documentation for pml_monitoring_enable_output parameter. Remove useless parameter in example
Set filename only if using mpi tools
Adding missing parameters for fprintf in monitoring_flush (for output in std's cases)
Fix expected output/results for example header
Fix exemple when using MPI_Tools : a null-pointer can't be passed directly. It needs to be a pointer to a null-pointer
Base whether to output or not on message count, in order to print something if only empty messages are exchanged
Add a new example on how to access performance variables from within the code
Allocate arrays regarding value returned by binding
Signed-off-by: Clement Foyer <[email protected]>
* Add overhead benchmark, with script to use data and create graphs out of the results
Signed-off-by: Clement Foyer <[email protected]>
* Fix segfault error at end when not loading pml
Signed-off-by: Clement Foyer <[email protected]>
* Start create common monitoring module. Factorise version numbering
Signed-off-by: Clement Foyer <[email protected]>
* Fix microbenchmarks script
Signed-off-by: Clement Foyer <[email protected]>
* Improve readability of code
NULL can't be passed as a PVAR parameter value. It must be a pointer to NULL or an empty string.
Signed-off-by: Clement Foyer <[email protected]>
* Add osc monitoring component
Signed-off-by: Clement Foyer <[email protected]>
* Add error checking if running out of memory in osc_monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Resolve brutal segfault when double freeing filename
Signed-off-by: Clement Foyer <[email protected]>
* Moving to ompi/mca/common the proper parts of the monitoring system
Using common functions instead of pml specific one. Removing pml ones.
Signed-off-by: Clement Foyer <[email protected]>
* Add calls to record monitored data from osc. Use common function to translate ranks.
Signed-off-by: Clement Foyer <[email protected]>
* Fix test_overhead benchmark script distribution
Signed-off-by: Clement Foyer <[email protected]>
* Fix linking library with mca/common
Signed-off-by: Clement Foyer <[email protected]>
* Add passive operations in monitoring_test
Signed-off-by: Clement Foyer <[email protected]>
* Fix from rank calculation. Add more detailed error messages
Signed-off-by: Clement Foyer <[email protected]>
* Fix alignments. Fix common_monitoring_get_world_rank function. Remove useless trailing new lines
Signed-off-by: Clement Foyer <[email protected]>
* Fix osc_monitoring mget_message_count function call
Signed-off-by: Clement Foyer <[email protected]>
* Change common_monitoring function names to respect the naming convention. Move to common_finalize the common parts of finalization. Add some comments.
Signed-off-by: Clement Foyer <[email protected]>
* Add monitoring common output system
Signed-off-by: Clement Foyer <[email protected]>
* Add error message when trying to flush to a file, and open fails. Remove erroneous info message when flushing wereas the monitoring is already disabled.
Signed-off-by: Clement Foyer <[email protected]>
* Consistent output file name (with and without MPI_T).
Signed-off-by: Clement Foyer <[email protected]>
* Always output to a file when flushing at pvar_stop(flush).
Signed-off-by: Clement Foyer <[email protected]>
* Update the monitoring documentation.
Complete informations from HowTo. Fix a few mistake and typos.
Signed-off-by: Clement Foyer <[email protected]>
* Use the world_rank for printf's.
Fix name generation for output files when using MPI_T. Minor changes in benchmarks starting script
Signed-off-by: Clement Foyer <[email protected]>
* Clean potential previous runs, but keep the results at the end in order to potentially reprocess the data. Add comments.
Signed-off-by: Clement Foyer <[email protected]>
* Add security check for unique initialization for osc monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Clean the amout of symbols available outside mca/common/monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Remove use of __sync_* built-ins. Use opal_atomic_* instead.
Signed-off-by: Clement Foyer <[email protected]>
* Allocate the hashtable on common/monitoring component initialization. Define symbols to set the values for error/warning/info verbose output. Use opal_atomic instead of built-in function in osc/monitoring template initialization.
Signed-off-by: Clement Foyer <[email protected]>
* Deleting now useless file : moved to common/monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Add histogram ditribution of message sizes
Signed-off-by: Clement Foyer <[email protected]>
* Add histogram array of 2-based log of message sizes. Use simple call to reset/allocate arrays in common_monitoring.c
Signed-off-by: Clement Foyer <[email protected]>
* Add informations in dumping file. Separate per category (pt2pt/osc/coll (to come)) monitored data
Signed-off-by: Clement Foyer <[email protected]>
* Add coll component for collectives communications monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Fix warning messages : use c_name as the magic id is not always defined. Moreover, there was a % missing. Add call to release underlying modules. Add debug info messages. Add warning which may lead to further analysis.
Signed-off-by: Clement Foyer <[email protected]>
* Fix log10_2 constant initialization. Fix index calculation for histogram array.
Signed-off-by: Clement Foyer <[email protected]>
* Add debug info messages to follow more easily initialization steps.
Signed-off-by: Clement Foyer <[email protected]>
* Group all the var/pvar definitions to common_monitoring. Separate initial filename from the current on, to ease its lifetime management. Add verifications to ensure common is initialized once only. Move state variable management to common_monitoring.
monitoring_filter only indicates if filtering is activated.
Fix out of range access in histogram.
List is not used with the struct mca_monitoring_coll_data_t, so heritate only from opal_object_t.
Remove useless dead code.
Signed-off-by: Clement Foyer <[email protected]>
* Fix invalid memory allocation. Initialize initial_filename to empty string to avoid invalid read in mca_base_var_register.
Signed-off-by: Clement Foyer <[email protected]>
* Don't install the test scripts.
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Clement Foyer <[email protected]>
* Fix missing procs in hashtable. Cache coll monitoring data.
* Add MCA_PML_BASE_FLAG_REQUIRE_WORLD flag to the PML layer.
* Cache monitoring data relative to collectives operations on creation.
* Remove double caching.
* Use same proc name definition for hash table when inserting and
when retrieving.
Signed-off-by: Clement Foyer <[email protected]>
* Use intermediate variable to avoid invalid write while retrieving ranks in hashtable.
Signed-off-by: Clement Foyer <[email protected]>
* Add missing release of the last element in flush_all. Add release of the hashtable in finalize.
Signed-off-by: Clement Foyer <[email protected]>
* Use a linked list instead of a hashtable to keep tracks of communicator data. Add release of the structure at finalize time.
Signed-off-by: Clement Foyer <[email protected]>
* Set world_rank from hashtable only if found
Signed-off-by: Clement Foyer <[email protected]>
* Use predefined symbol from opal system to print int
Signed-off-by: Clement Foyer <[email protected]>
* Move collective monitoring data to a hashtable. Add pvar to access the monitoring_coll_data. Move functions header to a private file only to be used in ompi/mca/common/monitoring
Signed-off-by: Clement Foyer <[email protected]>
* Fix pvar registration. Use OMPI_ERROR isntead of -1 as returned error value. Fix releasing of coll_data_t objects. Affect value only if data is found in the hashtable.
Signed-off-by: Clement Foyer <[email protected]>
* Add automated check (with MPI_Tools) of monitoring.
Signed-off-by: Clement Foyer <[email protected]>
* Fix procs list caching in common_monitoring_coll_data_t
* Fix monitoring_coll_data type definition.
* Use size(COMM_WORLD)-1 to determine max number of digits.
Signed-off-by: Clement Foyer <[email protected]>
* Add linking to Fortran applications for LD_PRELOAD usage of monitoring_prof
Signed-off-by: Clement Foyer <[email protected]>
* Add PVAR's handles. Clean up code (visibility, add comments...). Start updating the documentation
Signed-off-by: Clement Foyer <[email protected]>
* Fix coll operations monitoring. Update check_monitoring accordingly to the added pvar. Fix monitoring array allocation.
Signed-off-by: Clement Foyer <[email protected]>
* Documentation update.
Update and then move the latex and README documentation to a more logical place
Signed-off-by: Clement Foyer <[email protected]>
* Aggregate monitoring COLL data to the generated matrix. Update documentation accordingly.
Signed-off-by: Clement Foyer <[email protected]>
* Fix monitoring_prof (bad variable.vector used, and wrong array in PMPI_Gather).
Signed-off-by: Clement Foyer <[email protected]>
* Add reduce_scatter and reduce_scatter_block monitoring. Reduce memory footprint of monitoring_prof. Unify OSC related outputs.
Signed-off-by: Clement Foyer <[email protected]>
* Add the use of a machine file for overhead benchmark
Signed-off-by: Clement Foyer <[email protected]>
* Check for out-of-bound write in histogram
Signed-off-by: Clement Foyer <[email protected]>
* Fix common_monitoring_cache object init for MPI_COMM_WORLD
Signed-off-by: Clement Foyer <[email protected]>
* Add RDMA benchmarks to test_overhead
Add error file output. Add MPI_Put and MPI_Get results analysis. Add overhead computation for complete sending (pingpong / 2).
Signed-off-by: Clement Foyer <[email protected]>
* Add computation of average and median of overheads. Add comments and copyrigths to the test_overhead script
Signed-off-by: Clement Foyer <[email protected]>
* Add technical documentation
Signed-off-by: Clement Foyer <[email protected]>
* Adapt to the new definition of communicators
Signed-off-by: Clement Foyer <[email protected]>
* Update expected output in test/monitoring/monitoring_test.c
Signed-off-by: Clement Foyer <[email protected]>
* Add dumping histogram in edge case
Signed-off-by: Clement Foyer <[email protected]>
* Adding a reduce(pml_monitoring_messages_count, MPI_MAX) example
Signed-off-by: Clement Foyer <[email protected]>
* Add consistency in header inclusion.
Include ompi/mpi/fortran/mpif-h/bindings.h only if needed.
Add sanity check before emptying hashtable.
Fix typos in documentation.
Signed-off-by: Clement Foyer <[email protected]>
* misc monitoring fixes
* test/monitoring: fix test when weak symbols are not available
* monitoring: fix a typo and add a missing file in Makefile.am
and have monitoring_common.h and monitoring_common_coll.h included in the distro
* test/monitoring: cleanup all tests and make distclean a happy panda
* test/monitoring: use gettimeofday() if clock_gettime() is unavailable
* monitoring: silence misc warnings (#3)
Signed-off-by: Gilles Gouaillardet <[email protected]>
* Cleanups.
Signed-off-by: George Bosilca <[email protected]>
* Changing int64_t to size_t.
Keep the size_t used accross all monitoring components.
Adapt the documentation.
Remove useless MPI_Request and MPI_Status from monitoring_test.c.
Signed-off-by: Clement Foyer <[email protected]>
* Add parameter for RMA test case
Signed-off-by: Clement Foyer <[email protected]>
* Clean the maximum bound computation for proc list dump.
Use ptrdiff_t instead of OPAL_PTRDIFF_TYPE to reflect the changes from commit fa5cd0d.
Signed-off-by: Clement Foyer <[email protected]>
* Add communicator-specific monitored collective data reset
Signed-off-by: Clement Foyer <[email protected]>
* Add monitoring scripts to the 'make dist'
Also install them in the build and the install directories.
Signed-off-by: George Bosilca <[email protected]>1 parent b1e639e commit d55b666
File tree
65 files changed
+8216
-684
lines changed- ompi/mca
- coll
- base
- monitoring
- common/monitoring
- osc/monitoring
- pml/monitoring
- opal/mca/base
- test/monitoring
Some content is hidden
Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
65 files changed
+8216
-684
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1409 | 1409 | | |
1410 | 1410 | | |
1411 | 1411 | | |
| 1412 | + | |
| 1413 | + | |
| 1414 | + | |
| 1415 | + | |
1412 | 1416 | | |
1413 | 1417 | | |
1414 | 1418 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
5 | | - | |
| 5 | + | |
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
50 | | - | |
51 | | - | |
52 | 49 | | |
53 | 50 | | |
54 | 51 | | |
| |||
105 | 102 | | |
106 | 103 | | |
107 | 104 | | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
108 | 119 | | |
109 | 120 | | |
110 | 121 | | |
| |||
138 | 149 | | |
139 | 150 | | |
140 | 151 | | |
141 | | - | |
142 | | - | |
143 | | - | |
144 | | - | |
145 | | - | |
146 | | - | |
147 | | - | |
148 | | - | |
149 | | - | |
150 | | - | |
151 | | - | |
152 | | - | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
153 | 156 | | |
154 | 157 | | |
155 | 158 | | |
156 | 159 | | |
157 | | - | |
158 | | - | |
159 | | - | |
160 | | - | |
161 | | - | |
162 | | - | |
163 | | - | |
164 | | - | |
165 | | - | |
166 | | - | |
167 | | - | |
168 | | - | |
169 | | - | |
170 | | - | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
0 commit comments