-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fabtests/efa: Add fabtests for efa-direct #10800
Conversation
@@ -1,8 +1,8 @@ | |||
import pytest | |||
|
|||
@pytest.mark.functional | |||
def test_flood_peer(cmdline_args, fabric): | |||
def test_flood_peer(cmdline_args): | |||
from common import ClientServerTest | |||
test = ClientServerTest(cmdline_args, f"fi_flood -e rdm -W 6400 -S 512 -T 5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with
transmit(): common/shared.c:2243, ret=12 (Cannot allocate memory)
Batch memory registration:Fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks bad, need to look into
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a TODO
here so that we don't forget?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test uses a mode FT_OPT_ALLOC_MULT_MR , which allocates another array of buffers of size opts.transfer_size and register each of them of such size https://github.com/ofiwg/libfabric/blob/v1.22.0/fabtests/common/shared.c#L587-L690. Then
during fi_flood test it post recv from these separate mr buffers via ft_post_rx . However ft_post_rx will finally post a fi_recv with size as the MAX of opts.transfer_size and FT_MAX_CTRL_MSG (1024). When opts.transfer_size is smaller than FT_MAX_CTRL_MSG, it will finally post a recv with size larger than the recv buffer.
I am still discussing with Intel whether it is a bug of the test, but it won't work for efa-direct anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, should be fixed by #10819
@@ -3,14 +3,14 @@ | |||
|
|||
@pytest.mark.functional | |||
@pytest.mark.parametrize("environment_variable", ["FI_EFA_FORK_SAFE", "RDMAV_FORK_SAFE"]) | |||
def test_fork_support(cmdline_args, completion_semantic, environment_variable, fabric): | |||
def test_fork_support(cmdline_args, completion_semantic, environment_variable): | |||
from common import ClientServerTest | |||
import copy | |||
cmdline_args_copy = copy.copy(cmdline_args) | |||
|
|||
cmdline_args_copy.append_environ("{}=1".format(environment_variable)) | |||
test = ClientServerTest(cmdline_args_copy, "fi_rdm_tagged_bw -K", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
efa-direct does not support tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to run with tagged_bw, we can run with fi_rdm_bw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this and got timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the default message sizes are too big?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed by #10802
fabtests/pytest/efa/test_multi_ep.py
Outdated
from common import ClientServerTest | ||
cmd = "fi_multi_ep -e rdm" | ||
if shared_cq: | ||
cmd += " -Q" | ||
test = ClientServerTest(cmdline_args, cmd, fabric=fabric) | ||
test = ClientServerTest(cmdline_args, cmd, fabric="efa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout with efa-direct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds a bug too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figure it out, it is because the rx_size too large again. And this test post a too large rx buffer during the ep enable, and that rx buffer cannot get a send delivered at the final ft_finalize_ep
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to specify a small -S
after this PR is merged #10814
fabtests/pytest/efa/test_rdm.py
Outdated
command = "fi_rdm_pingpong" + " " + perf_progress_model_cli | ||
if iteration_type == "standard" and fabric == "efa-direct": | ||
pytest.skip("Skip standard iteration for efa-direct") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with inject(): common/shared.c:2321, ret=12 (Cannot allocate memory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ret=12 sounds like a bug that this PR will fix #10818
fabtests/pytest/efa/test_rnr.py
Outdated
@@ -15,7 +15,7 @@ def test_rnr_read_cq_error(cmdline_args, fabric): | |||
# in this case. | |||
cmdline_args_copy = copy.copy(cmdline_args) | |||
cmdline_args_copy.strict_fabtests_mode = False | |||
test = ClientServerTest(cmdline_args_copy, "fi_efa_rnr_read_cq_error", fabric=fabric) | |||
test = ClientServerTest(cmdline_args_copy, "fi_efa_rnr_read_cq_error", fabric="efa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libfabric:27104:1739913431::efa:cq:efa_cq_handle_error():100 err: 269, message: Destination resource not ready (no work queue entries posted on receive queue) My EFA addr: fi_addr_efa://[fe80::4f6:e9ff:fefe:adcd]:1:1128862596 My host id: i-059a4b7d0e872e94d Peer EFA addr: fi_addr_efa://[fe80::409:2dff:fe85:85d9]:1:1721598274 Peer host id: N/A (10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unexpected too. efa-direct should configure the rnr retry based on setopt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/*
* In order for the sender to get RNR error, we need to first consume
* all pre-posted receive buffer (in efa provider, fi->rx_attr->size
* receiving buffer are pre-posted) on the receiver side, the subsequent
* sends (expected_rnr_error) will then get RNR errors.
*/
This is only true for efa-rdm. Efa-direct doesn't prepost any buffer, the number of fi_send posted should be adjusted accordingly.
052ea96
to
352de40
Compare
@@ -1,8 +1,8 @@ | |||
import pytest | |||
|
|||
@pytest.mark.functional | |||
def test_flood_peer(cmdline_args, fabric): | |||
def test_flood_peer(cmdline_args): | |||
from common import ClientServerTest | |||
test = ClientServerTest(cmdline_args, f"fi_flood -e rdm -W 6400 -S 512 -T 5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks bad, need to look into
@@ -3,14 +3,14 @@ | |||
|
|||
@pytest.mark.functional | |||
@pytest.mark.parametrize("environment_variable", ["FI_EFA_FORK_SAFE", "RDMAV_FORK_SAFE"]) | |||
def test_fork_support(cmdline_args, completion_semantic, environment_variable, fabric): | |||
def test_fork_support(cmdline_args, completion_semantic, environment_variable): | |||
from common import ClientServerTest | |||
import copy | |||
cmdline_args_copy = copy.copy(cmdline_args) | |||
|
|||
cmdline_args_copy.append_environ("{}=1".format(environment_variable)) | |||
test = ClientServerTest(cmdline_args_copy, "fi_rdm_tagged_bw -K", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to run with tagged_bw, we can run with fi_rdm_bw
fabtests/pytest/efa/test_multi_ep.py
Outdated
from common import ClientServerTest | ||
cmd = "fi_multi_ep -e rdm" | ||
if shared_cq: | ||
cmd += " -Q" | ||
test = ClientServerTest(cmdline_args, cmd, fabric=fabric) | ||
test = ClientServerTest(cmdline_args, cmd, fabric="efa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds a bug too
fabtests/pytest/efa/test_rnr.py
Outdated
@@ -15,7 +15,7 @@ def test_rnr_read_cq_error(cmdline_args, fabric): | |||
# in this case. | |||
cmdline_args_copy = copy.copy(cmdline_args) | |||
cmdline_args_copy.strict_fabtests_mode = False | |||
test = ClientServerTest(cmdline_args_copy, "fi_efa_rnr_read_cq_error", fabric=fabric) | |||
test = ClientServerTest(cmdline_args_copy, "fi_efa_rnr_read_cq_error", fabric="efa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unexpected too. efa-direct should configure the rnr retry based on setopt
fabtests/scripts/runfabtests.py
Outdated
""" | ||
import subprocess | ||
binpath = os.getcwd() | ||
cmd = "timeout 360 " + os.path.join(binpath, "fi_efa_rdma_checker -o writedata") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you check all read, write, and writedata?
fabtests/scripts/runfabtests.py
Outdated
@@ -167,6 +183,8 @@ def fabtests_args_to_pytest_args(fabtests_args, shared_options, run_mode): | |||
pytest_args.append(markers) | |||
|
|||
if fabtests_args.expression: | |||
if not support_fi_rma(fabtests_args.server_id): | |||
fabtests_args.expression += " and not efa-direct" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you just add pytest.skip() in the rma test when fabric == 'efa-direct' and not support_fi_rma
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you prefer adding this in runfabtests.py. How about checking has_rdma
in the fixture rma_bw_completion_semantic
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't mix these, you can introduce another fixture as rma_fabric
like
def rma_fabric(fabric):
if fabric == 'efa-direct' and not_fi_rma(...):
pytest.skip()
return fabric
Then make all test_rma_bw
and test_rma_pingpong
use rma_fabric
instead of fabric
as the fixture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this skip efa-direct
on all instances that don't support writedata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my testing of this PR. support_fi_rma
is even False on p5, and all efa-direct tests are skipped...
@@ -1,8 +1,8 @@ | |||
import pytest | |||
|
|||
@pytest.mark.functional | |||
def test_flood_peer(cmdline_args, fabric): | |||
def test_flood_peer(cmdline_args): | |||
from common import ClientServerTest | |||
test = ClientServerTest(cmdline_args, f"fi_flood -e rdm -W 6400 -S 512 -T 5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a TODO
here so that we don't forget?
fabtests/pytest/efa/test_rdm.py
Outdated
command = "fi_rdm_pingpong" + " " + perf_progress_model_cli | ||
if iteration_type == "standard" and fabric == "efa-direct": | ||
pytest.skip("Skip standard iteration for efa-direct") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should pass?
fabtests/scripts/runfabtests.py
Outdated
@@ -167,6 +183,8 @@ def fabtests_args_to_pytest_args(fabtests_args, shared_options, run_mode): | |||
pytest_args.append(markers) | |||
|
|||
if fabtests_args.expression: | |||
if not support_fi_rma(fabtests_args.server_id): | |||
fabtests_args.expression += " and not efa-direct" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this skip efa-direct
on all instances that don't support writedata?
@@ -30,6 +30,9 @@ def test_transfer_with_read_protocol_cuda(cmdline_args, fabtest_name, cntrl_env_ | |||
if not has_cuda(cmdline_args.client_id) or not has_cuda(cmdline_args.server_id): | |||
pytest.skip("Client and server both need a Cuda device") | |||
|
|||
if fabtest_name == "fi_rdm_tagged_bw" and fabric == "efa-direct": | |||
pytest.skip("efa-direct does not support tagged") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need to run this test with efa-direct at all, it is testing protocol selection that efa-direct doesn't have
07000d4
to
6a1262e
Compare
6a1262e
to
8f18de1
Compare
Run fabtests for efa-direct up to device max msg size(8K) where FI_RMA is supported. RMA tests currently post a recv with the max transfer size. It could run up to max rdma size after fixing fabtests to post the recv within device max msg size. Also skip 0 byte for rma because fabtests use inject for messages smaller than inject_size, but efa-direct does not support it until firmware supports inline write. Signed-off-by: Jessie Yang <[email protected]>
8f18de1
to
35ba211
Compare
Run fabtests for efa-direct up to device max msg size(8K) where FI_RMA is supported.
RMA tests currently post a recv with the max transfer size. It could run up to max rdma size after fixing fabtests to post the recv within device max msg size.
Also skip 0 byte for rma because fabtests use inject for messages smaller than inject_size, but efa-direct does not support it until firmware supports inline write.