-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fabtests: multinode test updates #10344
Conversation
fabtests/multinode/include/core.h
Outdated
enum multi_scheduler_type { | ||
SCHED_PMIX, | ||
SCHED_PMI, | ||
SCHED_UNDEF, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if scheduler is the most suitable term here. Might just call it "pm". Also I would move the undefined item to the beginning and call it "PM_NONE" if it means no PM at all or "PM_UNSPEC" if it means "don't care".
fabtests/multinode/src/harness.c
Outdated
/* Rank 0 should be the server and the only one that should call | ||
* bind() and server_connect() */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is unnecessary since this explained by the code itself.
fabtests/multinode/src/harness.c
Outdated
ret = server_connect(); | ||
bound = true; | ||
} else if (sched) { | ||
ret = -FI_EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why overwrite the error code? The original code could be more helpful in diagnosing the connection error.
fabtests/multinode/src/harness.c
Outdated
case 'a': | ||
opts.options |= FT_OPT_ADDR_IS_OOB; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you just use the existing -E
option?
af3b436
to
6fd89bc
Compare
fabtests/multinode/include/core.h
Outdated
#define PRINTF(fmt, args...) log_print(stdout, fmt, ## args) | ||
#define EPRINTF(fmt, args...) log_print(stderr, fmt, ## args) | ||
|
||
#else | ||
|
||
#define PRINTF(fmt, args...) fprintf(stdout, fmt, ## args) | ||
#define EPRINTF(fmt, args...) fprintf(stderr, fmt, ## args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MSVC doesn't like this definition and that's why the appveyor tests are failing.
Standard variadic macro looks like this:
#define MACRO(s, ...) printf(s, __VA_ARGS__)
Made a few updates to the multinode test: 1. accept a -x flag to turn off setting the service/node/flags - this is needed to work with CXI 2. accept a -u flag to set a process manager: pmi or pmix 3. modify the code to get the rank from the appropriate environment variable if a process manager is specified. 4. Add a runmultinode.py script which enables users to run the test using a backing process manager. The python script takes a YAML configuration file which defines the environment and test. An example python configuration file: multinode: environment: FI_MR_CACHE_MAX_SIZE: -1 FI_MR_CACHE_MAX_COUNT: 524288 FI_SHM_USE_XPMEM: 1 FI_LOG_LEVEL: info bind-to: core map-by-count: 1 map-by: l3cache pattern: full_mesh Script Usage: usage: runmultinode.py [-h] [--dry-run] [--ci CI] [-C CAPABILITY] [-i ITERATIONS] [-l {internal,srun,mpirun}] [-p PROVIDER] [-np NUM_PROCS] [-c CONFIG] [-t PROCS_PER_NODE] libfabric multinode test with slurm optional arguments: -h, --help show this help message and exit --dry-run Perform a dry run without making any changes. --ci CI Commands to prepend to test call. Only used with the internal launcher option -C CAPABILITY, --capability CAPABILITY libfabric capability -i ITERATIONS, --iterations ITERATIONS Number of iterations -l {internal,srun,mpirun}, --launcher {internal,srun,mpirun} launcher to use for running job. If nothing is specified, test manages processes internally. Available options: internal, srun and mpirun Required arguments: -p PROVIDER, --provider PROVIDER libfabric provider -np NUM_PROCS, --num-procs NUM_PROCS Map process by node, l3cache, etc -c CONFIG, --config CONFIG Test configuration Required if using srun: -t PROCS_PER_NODE, --procs-per-node PROCS_PER_NODE Number of procs per node Running the script: runmultinode.py -p cxi -i 1 --procs-per-node 8 --num-procs 8 -l srun -c mn.yaml Signed-off-by: Amir Shehata <[email protected]>
Is there a problem with the approach. Just wondering why there is a "do not merge" label on it. |
I am working on the 2.0.0alpha release and that requires holding off merging any PR before the release is done. |
I can't access the failed tests. Are they related to this PR? |
The failures are setopt/getopt errors with ucx which have been fixed by #10341. Unrelated to this PR. |
Made a few updates to the multinode test:
multinode:
environment:
FI_MR_CACHE_MAX_SIZE: -1
FI_MR_CACHE_MAX_COUNT: 524288
FI_SHM_USE_XPMEM: 1
FI_LOG_LEVEL: info
bind-to: core
map-by-count: 1
map-by: l3cache
pattern: full_mesh
Script Usage:
usage: runmultinode.py [-h] [--dry-run] [--ci CI] [-C CAPABILITY]
[-i ITERATIONS] [-l {internal,srun,mpirun}]
[-p PROVIDER] [-np NUM_PROCS] [-c CONFIG]
[-t PROCS_PER_NODE]
libfabric multinode test with slurm
optional arguments:
-h, --help show this help message and exit
--dry-run Perform a dry run without making any changes.
--ci CI Commands to prepend to test call. Only used with the
internal launcher option
-C CAPABILITY, --capability CAPABILITY
libfabric capability
-i ITERATIONS, --iterations ITERATIONS
Number of iterations
-l {internal,srun,mpirun}, --launcher {internal,srun,mpirun}
launcher to use for running job. If nothing is
specified, test manages processes internally.
Available options: internal, srun and mpirun
Required arguments:
-p PROVIDER, --provider PROVIDER
libfabric provider
-np NUM_PROCS, --num-procs NUM_PROCS
Map process by node, l3cache, etc
-c CONFIG, --config CONFIG
Test configuration
Required if using srun:
-t PROCS_PER_NODE, --procs-per-node PROCS_PER_NODE
Number of procs per node
Running the script:
runmultinode.py -p cxi -i 1 --procs-per-node 8 --num-procs 8 -l srun -c mn.yaml