fabtests: multinode test updates #10344

Merged
merged 1 commit into ofiwg:main
Sep 4, 2024

Conversation

amirshehataornl
Contributor

Made a few updates to the multinode test:

1. accept a -x flag to turn off setting the service/node/flags
   - this is needed to work with CXI
2. accept a -u flag to set a process manager: pmi or pmix
3. modify the code to get the rank from the appropriate environment
   variable if a process manager is specified (a sketch of this follows
   the usage text below).
4. Add a runmultinode.py script which enables users to run the test
   using a backing process manager. The python script takes a YAML
   configuration file which defines the environment and test. An example
   YAML configuration file:

multinode:
    environment:
        FI_MR_CACHE_MAX_SIZE: -1
        FI_MR_CACHE_MAX_COUNT: 524288
        FI_SHM_USE_XPMEM: 1
        FI_LOG_LEVEL: info
    bind-to: core
    map-by-count: 1
    map-by: l3cache
    pattern: full_mesh

Script Usage:
usage: runmultinode.py [-h] [--dry-run] [--ci CI] [-C CAPABILITY]
                       [-i ITERATIONS] [-l {internal,srun,mpirun}]
                       [-p PROVIDER] [-np NUM_PROCS] [-c CONFIG]
                       [-t PROCS_PER_NODE]

libfabric multinode test with slurm

optional arguments:
  -h, --help            show this help message and exit
  --dry-run             Perform a dry run without making any changes.
  --ci CI               Commands to prepend to test call. Only used with the
                        internal launcher option
  -C CAPABILITY, --capability CAPABILITY
                        libfabric capability
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations
  -l {internal,srun,mpirun}, --launcher {internal,srun,mpirun}
                        launcher to use for running job. If nothing is
                        specified, test manages processes internally.
                        Available options: internal, srun and mpirun

Required arguments:
  -p PROVIDER, --provider PROVIDER
                        libfabric provider
  -np NUM_PROCS, --num-procs NUM_PROCS
                        Map process by node, l3cache, etc
  -c CONFIG, --config CONFIG
                        Test configuration

Required if using srun:
  -t PROCS_PER_NODE, --procs-per-node PROCS_PER_NODE
                        Number of procs per node

Running the script:

runmultinode.py -p cxi -i 1 --procs-per-node 8 --num-procs 8 -l srun -c mn.yaml
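
A minimal sketch of item 3 above, assuming the conventional PMIX_RANK and
PMI_RANK environment variables; the variables the patch actually reads are
not shown in this excerpt, so treat the names as assumptions:

#include <stdlib.h>

enum multi_scheduler_type {
	SCHED_PMIX,
	SCHED_PMI,
	SCHED_UNDEF,
};

/* Hypothetical helper: derive the process rank from the process
 * manager's environment.  PMIX_RANK/PMI_RANK are common conventions
 * and an assumption here, not taken from the patch itself. */
static int pm_get_rank(enum multi_scheduler_type sched)
{
	const char *rank = NULL;

	switch (sched) {
	case SCHED_PMIX:
		rank = getenv("PMIX_RANK");
		break;
	case SCHED_PMI:
		rank = getenv("PMI_RANK");
		break;
	default:
		break;
	}

	return rank ? atoi(rank) : -1;
}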

Comment on lines 62 to 66
enum multi_scheduler_type {
SCHED_PMIX,
SCHED_PMI,
SCHED_UNDEF,
};
Contributor

Not sure if scheduler is the most suitable term here. Might just call it "pm". Also I would move the undefined item to the beginning and call it "PM_NONE" if it means no PM at all or "PM_UNSPEC" if it means "don't care".
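
Sketched, the suggestion might look like this (the names are the reviewer's
proposals, not code from the patch):

/* "pm" instead of "scheduler", with the unspecified value first. */
enum multi_pm_type {
	PM_NONE,	/* or PM_UNSPEC, depending on the intended meaning */
	PM_PMI,
	PM_PMIX,
};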

Comment on lines 277 to 278
/* Rank 0 should be the server and the only one that should call
* bind() and server_connect() */
Contributor

This comment is unnecessary since this is explained by the code itself.

ret = server_connect();
bound = true;
} else if (sched) {
ret = -FI_EINVAL;
Contributor

Why overwrite the error code? The original error code could be more helpful in diagnosing the connection error.
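
A sketch of the point, assuming surrounding control flow not shown in this
excerpt: return whatever server_connect() produced instead of replacing it.

ret = server_connect();
if (ret) {
	/* Propagate the original error code rather than overwriting it
	 * with a generic -FI_EINVAL, so the real cause of the connection
	 * failure stays visible to the caller. */
	return ret;
}
bound = true;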

Comment on lines 398 to 393
case 'a':
opts.options |= FT_OPT_ADDR_IS_OOB;
break;
Contributor

Could you just use the existing -E option?

Comment on lines 131 to 137
#define PRINTF(fmt, args...) log_print(stdout, fmt, ## args)
#define EPRINTF(fmt, args...) log_print(stderr, fmt, ## args)

#else

#define PRINTF(fmt, args...) fprintf(stdout, fmt, ## args)
#define EPRINTF(fmt, args...) fprintf(stderr, fmt, ## args)
Contributor

MSVC doesn't like this definition, and that's why the AppVeyor tests are failing.

A standard variadic macro looks like this:

#define MACRO(s, ...) printf(s, __VA_ARGS__)
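
Applied to the macros above, a sketch of the portable form (the GNU-style
"args..."/"## args" named-parameter syntax is the extension MSVC rejects;
note that with plain __VA_ARGS__, strict C99 requires at least one argument
after fmt):

#define PRINTF(fmt, ...)  log_print(stdout, fmt, __VA_ARGS__)
#define EPRINTF(fmt, ...) log_print(stderr, fmt, __VA_ARGS__)

#else

#define PRINTF(fmt, ...)  fprintf(stdout, fmt, __VA_ARGS__)
#define EPRINTF(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)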

Made a few updates to the multinode test:

1. accept a -x flag to turn off setting the service/node/flags
  - this is needed to work with CXI
2. accept a -u flag to set a process manager: pmi or pmix
3. modify the code to get the rank from the appropriate environment
   variable if a process manager is specified.
4. Add a runmultinode.py script which enables users to run the test
   using a backing process manager. The python script takes a YAML
   configuration file which defines the environment and test. An example
   YAML configuration file:

multinode:
    environment:
        FI_MR_CACHE_MAX_SIZE: -1
        FI_MR_CACHE_MAX_COUNT: 524288
        FI_SHM_USE_XPMEM: 1
        FI_LOG_LEVEL: info
    bind-to: core
    map-by-count: 1
    map-by: l3cache
    pattern: full_mesh

Script Usage:
usage: runmultinode.py [-h] [--dry-run] [--ci CI] [-C CAPABILITY]
                       [-i ITERATIONS] [-l {internal,srun,mpirun}]
                       [-p PROVIDER] [-np NUM_PROCS] [-c CONFIG]
                       [-t PROCS_PER_NODE]

libfabric multinode test with slurm

optional arguments:
  -h, --help            show this help message and exit
  --dry-run             Perform a dry run without making any changes.
  --ci CI               Commands to prepend to test call. Only used with the
                        internal launcher option
  -C CAPABILITY, --capability CAPABILITY
                        libfabric capability
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations
  -l {internal,srun,mpirun}, --launcher {internal,srun,mpirun}
                        launcher to use for running job. If nothing is
                        specified, test manages processes internally.
                        Available options: internal, srun and mpirun

Required arguments:
  -p PROVIDER, --provider PROVIDER
                        libfabric provider
  -np NUM_PROCS, --num-procs NUM_PROCS
                        Map process by node, l3cache, etc
  -c CONFIG, --config CONFIG
                        Test configuration

Required if using srun:
  -t PROCS_PER_NODE, --procs-per-node PROCS_PER_NODE
                        Number of procs per node

Running the script:

runmultinode.py -p cxi -i 1 --procs-per-node 8 --num-procs 8 -l srun -c mn.yaml

Signed-off-by: Amir Shehata <[email protected]>
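
For illustration only: with the srun launcher, the script ends up running
the fi_multinode test under Slurm, conceptually something like the line
below. The exact flags are an assumption, not the script's verbatim output;
a --dry-run pass shows what the script would do without making changes.

srun -n 8 --ntasks-per-node=8 fi_multinode -p cxi -u pmix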
@amirshehataornl
Contributor Author

Is there a problem with the approach? Just wondering why there is a "do not merge" label on it.

@j-xiong
Contributor

j-xiong commented Aug 30, 2024

I am working on the 2.0.0alpha release, and that requires holding off on merging any PRs until the release is done.

@amirshehataornl
Contributor Author

I can't access the failed tests. Are they related to this PR?

@j-xiong
Contributor

j-xiong commented Sep 4, 2024

The failures are setopt/getopt errors with ucx, which have been fixed by #10341. Unrelated to this PR.

@j-xiong j-xiong merged commit ad3a40c into ofiwg:main Sep 4, 2024
12 of 14 checks passed