fabtests/multi_ep: add more stressing of resource binding to multi_ep test #10249

Open
wants to merge 8 commits into
base: main

Conversation

aingerson
Contributor

  • adds more MRs
  • adds RMA testing
  • adds closing MRs, reregistering, retesting
  • adds threading level toggling for all tests
  • adds more combinations of multi_ep testing to runfabtests.sh
  • adds multi_ep testing back in for tcp provider

cq_read_idx = 0;
else
cq_read_idx = i;
while (recv_bufs[i][0] == 'A')
Contributor

@shijin-aws shijin-aws Jul 30, 2024

For EFA, I think this wait strategy is not safe unless the NIC delivers all the data in the buffer once and in sequence. We have had problems with it before, and currently we only use this method when the NIC supports RDMA write with in-order and write-once delivery. Why not use fi_writedata here so you can rely on the CQ poll to know the data has been delivered?

Contributor Author

Yeah that makes sense, I can change it. Thanks!

@shijin-aws
Contributor

shijin-aws commented Aug 13, 2024

Is the Intel CI failure related?

@aingerson
Contributor Author

@shijin-aws Yes, it was. UCX doesn't support CQ data and also uses FI_MR_RAW so I had to add MR_RAW support and use an alternative solution to the CQ data (just used a regular sync). Let me know if efa would be ok with this solution.

@shijin-aws
Contributor

shijin-aws commented Aug 13, 2024

I don't think efa supports FI_MR_RAW, will look at your new solution

@shijin-aws
Contributor

shijin-aws commented Aug 13, 2024

AWS CI failure: It seems the remote key is wrong for fi_write

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.35.180 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison/install/fabtests/bin/fi_multi_ep -e rdm -p efa -E=9229'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.35.87 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison/install/fabtests/bin/fi_multi_ep -e rdm -p efa -E=9229 172.31.35.180'"'"''
client_stdout:
libfabric:265083:1723578812::efa:mr:ofi_mr_map_verify():124<warn> unknown key: 2359276
libfabric:265083:1723578812::efa:ep_ctrl:efa_rdm_rma_verified_copy_iov():35<warn> MR verification failed (Invalid argument), addr: 17777c0 key: 2359276
libfabric:265083:1723578812::efa:cq:efa_rdm_pke_proc_eager_rtw():149<warn> RMA address verify failed!
libfabric:265083:1723578812::efa:eq:efa_base_ep_write_eq_error():423<warn> Writing error to EQ: err: Invalid argument (-22) prov_errno: RMA address verification failed (4111)
libfabric:265083:1723578812::efa:eq:efa_base_ep_write_eq_error():437<warn> Unable to write to EQ
Libfabric EFA provider has encountered an internal error:

Libfabric error: (-22) Invalid argument
EFA internal error: (4111) RMA address verification failed

Your application will now abort().
timeout: the monitored command dumped core

client returncode: 255
server_stdout:

server returncode: 124
------------ generated xml file: /tmp/fabtests_pqpvr0.xml.parallel -------------
=========================== short test summary info ============================
FAILED efa/test_multi_ep.py::test_multi_ep[False] - Failed: returncode 124, t...
============ 1 failed, 165 passed, 289 skipped in 597.16s (0:09:57) ============

@shijin-aws
Contributor

Same error

--------------------------------- Captured Log ---------------------------------

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.58.193 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.56.121 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229 172.31.58.193'"'"''
client_stdout:
libfabric:57468:1723606810::efa:mr:ofi_mr_map_verify():124<warn> unknown key: 9443854
libfabric:57468:1723606810::efa:ep_ctrl:efa_rdm_rma_verified_copy_iov():35<warn> MR verification failed (Invalid argument), addr: fa366e0 key: 9443854
libfabric:57468:1723606810::efa:cq:efa_rdm_pke_proc_eager_rtw():149<warn> RMA address verify failed!
libfabric:57468:1723606810::efa:eq:efa_base_ep_write_eq_error():423<warn> Writing error to EQ: err: Invalid argument (-22) prov_errno: RMA address verification failed (4111)
libfabric:57468:1723606810::efa:eq:efa_base_ep_write_eq_error():437<warn> Unable to write to EQ
Libfabric EFA provider has encountered an internal error:

Libfabric error: (-22) Invalid argument
EFA internal error: (4111) RMA address verification failed

Your application will now abort().

client returncode: 255
server_stdout:

server returncode: 124

@aingerson
Contributor Author

@shijin-aws I pushed some changes to make sure we weren't corrupting the key. Is the AWS failure the same key verification error?

@shijin-aws
Contributor

shijin-aws commented Aug 19, 2024

Yes, same failure

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.19.252 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.19.64 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229 172.31.19.252'"'"''
client_stdout:

client returncode: 124
server_stdout:
libfabric:740051:1723844387::efa:mr:ofi_mr_map_verify():123<warn> unknown key: 6295859
libfabric:740051:1723844387::efa:ep_ctrl:efa_rdm_rma_verified_copy_iov():33<warn> MR verification failed (Invalid argument), addr: 2789bd90 key: 6295859
libfabric:740051:1723844387::efa:cq:efa_rdm_pke_proc_eager_rtw():149<warn> RMA address verify failed!
libfabric:740051:1723844387::efa:eq:efa_base_ep_write_eq_error():422<warn> Writing error to EQ: err: Invalid argument (-22) prov_errno: RMA address verification failed (4111)
libfabric:740051:1723844387::efa:eq:efa_base_ep_write_eq_error():437<warn> Unable to write to EQ
Libfabric EFA provider has encountered an internal error:

Libfabric error: (-22) Invalid argument
EFA internal error: (4111) RMA address verification failed

Your application will now abort().
timeout: the monitored command dumped core

server returncode: 255

Does your patch respect FI_MR_PROV_KEY?

@aingerson
Contributor Author

@shijin-aws It should, yes. I put all the existing key exchange into a common function that we call in this test. That includes getting the key for the MR and sending that one to the peer instead of the requested key. See here
I can repush with some prints to see what key is getting registered and which is getting used.

@aingerson
Contributor Author

@shijin-aws Can you send the AWS log whenever you get a chance? No rush

@shijin-aws
Contributor

@aingerson sure

--------------------------------- Captured Log ---------------------------------

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.72.165 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.78.210 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229 172.31.72.165'"'"''
client_stdout:
libfabric:37616:1724103998::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem
libfabric:37616:1724103999::efa:mr:ofi_mr_map_verify():124<warn> unknown key: 13632752
libfabric:37616:1724103999::efa:ep_ctrl:efa_rdm_rma_verified_copy_iov():35<warn> MR verification failed (Invalid argument), addr: db3730 key: 13632752
libfabric:37616:1724103999::efa:cq:efa_rdm_pke_proc_eager_rtw():149<warn> RMA address verify failed!
libfabric:37616:1724103999::efa:eq:efa_base_ep_write_eq_error():423<warn> Writing error to EQ: err: Invalid argument (-22) prov_errno: RMA address verification failed (4111)
libfabric:37616:1724103999::efa:eq:efa_base_ep_write_eq_error():437<warn> Unable to write to EQ
Libfabric EFA provider has encountered an internal error:

Libfabric error: (-22) Invalid argument
EFA internal error: (4111) RMA address verification failed

Your application will now abort().

client returncode: 255
server_stdout:
libfabric:41405:1724103997::core:core:ofi_xpmem_init():112<warn> Failed to export process virtual address space for use with xpmem

server returncode: 124

xpmem warning can be ignored.

@shijin-aws
Contributor

We run the same test on multiple OSes/VM types. Here is another failed run on different OS/VM

--------------------------------- Captured Log ---------------------------------

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.41.181 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.40.144 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229 172.31.41.181'"'"''
client_stdout:
libfabric:775035:1724106580::efa:mr:ofi_mr_map_verify():123<warn> unknown key: 0
libfabric:775035:1724106580::efa:ep_ctrl:efa_rdm_rma_verified_copy_iov():33<warn> MR verification failed (Invalid argument), addr: 0 key: 0
libfabric:775035:1724106580::efa:cq:efa_rdm_pke_proc_eager_rtw():149<warn> RMA address verify failed!
libfabric:775035:1724106580::efa:eq:efa_base_ep_write_eq_error():422<warn> Writing error to EQ: err: Invalid argument (-22) prov_errno: RMA address verification failed (4111)
libfabric:775035:1724106580::efa:eq:efa_base_ep_write_eq_error():437<warn> Unable to write to EQ
Libfabric EFA provider has encountered an internal error:

Libfabric error: (-22) Invalid argument
EFA internal error: (4111) RMA address verification failed

Your application will now abort().
timeout: the monitored command dumped core

client returncode: 255
server_stdout:

server returncode: 124

@shijin-aws
Contributor

It could be a bug in the efa provider, though. I have seen the same test pass on a single node, which ends up using shm, but not across two nodes.

@shijin-aws
Contributor

Can you remind me what changed in your latest version that makes the test fail? I remember the same test passed earlier when you used the CQ data mechanism.

@shijin-aws
Contributor

bot:aws:retest

@aingerson
Contributor Author

@shijin-aws Added some more fixes that were affecting other providers. Still same error with efa? Any luck with the logs?

@shijin-aws
Contributor

shijin-aws commented Aug 21, 2024

@aingerson This time it seems the fabtests option parser is broken ... How can that happen?

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.35.73 'timeout 720 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn ASAN_OPTIONS=detect_leaks=0:protect_shadow_gap=0 /home/ubuntu/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison-asan/install/fabtests/bin/fi_multi_ep -e rdm  --shared-cq -p efa -E=9229'"'"''
Client error: Server has terminated
server_stdout:
/home/ubuntu/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison-asan/install/fabtests/bin/fi_multi_ep: unrecognized option '--shared-cq'
Usage:
  /home/ubuntu/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison-asan/install/fabtests/bin/fi_multi_ep [OPTIONS]		start server
  /home/ubuntu/PortaFiducia/build/libraries/libfabric/pr10249-debug-mempoison-asan/install/fabtests/bin/fi_multi_ep [OPTIONS] <host>	connect to server

Multi endpoint test

Options:
-B <src_port>                  non default source port number
-P <dst_port>                  non default destination port number
-s <address>                   source address
-b[=<oob_port>]                enable out-of-band address exchange and synchronization over the, optional, port
-E[=<oob_port>]                enable out-of-band address exchange only over the, optional, port
-C <number>                    simultaneous connections to server
-O <addr>                      use the provided addr for out of band
-F <addr_format>               Address format (default:FI_FORMAT_UNSPEC)
-f <fabric>                    fabric name
-d <domain>                    domain name
-p <provider>                  specific provider name eg sockets, verbs
-e <ep_type>                   Endpoint type: msg|rdm|dgram (default:rdm)
                               Only the following tests support this option for now:
                               fi_rma_bw
                               fi_shared_ctx
                               fi_multi_mr
                               fi_multi_ep
                               fi_recv_cancel
                               fi_unexpected_msg
                               fi_resmgmt_test
                               fi_bw
-U                             run fabtests with FI_DELIVERY_COMPLETE set
                               Only the following tests support this option for now:

@aingerson
Contributor Author

@shijin-aws Agh. Sorry I made a dumb mistake. I had to change the multi_ep test to use -A and -Q for --shared-av and --shared-cq instead of the long opts because it interfered with the common long opts. Sorry I missed that because I didn't run all the combinations of tests. Made sure all of the arg parsing works this time!

@shijin-aws
Contributor

@aingerson AWS CI is not using runfabtests.sh, it's using runfabtests.py, so if you change --shared-cq to -Q you'd have to modify https://github.com/ofiwg/libfabric/blob/main/fabtests/pytest/efa/test_multi_ep.py#L9 as well.

I had to change the multi_ep test to use -A and -Q for --shared-av and --shared-cq instead of the long opts because it interfered with the common long opts.

Which common long opts do they interfere with?

@aingerson
Contributor Author

@shijin-aws multi_ep.c was overriding the common long opts with its own (shared-av and shared-cq), so it wasn't picking up the ones from the common code. So I had to remove the test-specific long opts and replace them with short opts

	int lopt_idx = 0;
	struct option long_opts[] = {
		{"shared-av", no_argument, NULL, LONG_OPT_SHARED_AV},
		{"shared-cq", no_argument, NULL, LONG_OPT_SHARED_CQ},
		{0, 0, 0, 0}
	};

	while ((op = getopt_long(argc, argv, "c:vh" ADDR_OPTS INFO_OPTS,
				 long_opts, &lopt_idx)) != -1) {

I repushed with the pytest fixes. Next week we can see if it still gets the same key error as before (I left the patch with the prints in there to debug).

@aingerson
Contributor Author

@shijin-aws Thank you so much!!

Yeah something is off here. Everything looks good except when the client sends the RMA keys the second time, one of them is not received correctly on the server side:
What the client sent:

[warn] fabtests:functional/multi_ep.c:295: Send RMA info[0]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39648624, len 0, key 3170113

[warn] fabtests:functional/multi_ep.c:295: Send RMA info[1]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39649168, len 0, key 11542619

[warn] fabtests:functional/multi_ep.c:295: Send RMA info[2]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39649712, len 0, key 2119778

What the server got:

[warn] fabtests:functional/multi_ep.c:328: Get RMA info[0]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39648624, len 0, key 15741169

[warn] fabtests:functional/multi_ep.c:328: Get RMA info[1]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39649168, len 0, key 11542619

[warn] fabtests:functional/multi_ep.c:328: Get RMA info[2]

[warn] fabtests:functional/multi_ep.c:205: RMA info: addr 39649712, len 0, key 2119778

The server then issues a write with key 15741169, but the client correctly rejects it because the key should be 3170113.
No idea how that single RMA info send got corrupted. It's not like it's even off by a bit or so. They are totally different.
Need to take a closer look at the test and see how that would be possible. Thank you for the info! So helpful!
I will probably repush with the extra prints you suggest and maybe more.

@aingerson
Contributor Author

Spotted an ordering issue with the key exchange. Refactored the key exchange to eliminate that issue and repushed, still with the prints just in case (plus a few more, as suggested).

@shijin-aws
Contributor

bot:aws:retest

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

@aingerson This time AWS CI passed except for the known issue I mentioned in #10362 (comment). It shouldn't block this PR

@aingerson
Contributor Author

@shijin-aws Sorry we had CI issues but I think everything is passing now on our end. CI still ok on your end or do you want to rerun? I had to update the raw key path but I don't think that should affect your path. Let me know if you think it's good to merge!

@shijin-aws
Contributor

bot:aws:retest

@shijin-aws
Contributor

@aingerson I thought you already merged it... Yeah feel free to go ahead

@aingerson
Contributor Author

@shijin-aws Ok awesome. I'll let this one run just in case and then merge. Thanks!

@shijin-aws
Contributor

@aingerson The tests pass, but it seems you need to rebase

Instead of using one allocation and MR, separate into separate regions
to test multiple MRs with multiple EPs

Use common hmem alloc interfaces to properly use device support

Signed-off-by: Alexia Ingerson <[email protected]>

Pull some of the mr key/addr exchange into separate functions
that fill in and convert rma info before being exchanged so that
separate tests can call this function and support FI_MR_RAW and
FI_MR_VIRT_ADDR more easily

Signed-off-by: Alexia Ingerson <[email protected]>

To test RMA in addition to FI_MSG, make the following changes:
- Register all MRs for RMA use
- Use existing message test to exchange RMA information (key, address)
- Add RMA test with data validation after the message exchange test

Also includes renaming remote_addr to remote_fiaddr to distinguish
between fiaddr and RMA addr

Signed-off-by: Alexia Ingerson <[email protected]>

Add extra stress testing on multiple EPs/MRs by closing all the
MRs, re-registering them, and re-running the whole test sequence

Signed-off-by: Alexia Ingerson <[email protected]>

Change the --shared-av and --shared-cq into short opts
-A and -Q, respectively. This allows the multi_ep test
to make use of the common long opts

Signed-off-by: Alexia Ingerson <[email protected]>

Add long option --threading for tests to use to allow caller to set threading level

This removes default setting of domain_attr->threading in various
tests since the default is now universally set to FI_THREAD_DOMAIN

Signed-off-by: Alexia Ingerson <[email protected]>
As the most OFI resource-intensive test, this test is well suited to
exercising different combinations of resource binding, especially
with FI_THREAD_COMPLETION turned on.

Applications that utilize threads will likely use a combination of
separate domains, EPs, CQs, and AVs. Even though this test does not
use threads and cannot test the protection against these resources,
it can test different optimization paths within providers that may
be triggered based on the threading level requested.

Signed-off-by: Alexia Ingerson <[email protected]>