fabtests: Severe performance degradation when running two test tasks at the same time #9641

Open
Multhree opened this issue Dec 14, 2023 · 1 comment

Multhree commented Dec 14, 2023

Describe the bug
We ran two fi_rma_bw instances simultaneously (other bandwidth tests behave the same way) and found that each instance achieves only half of the bandwidth we expected. For comparison, we also ran one perftest (an RDMA device performance micro-benchmark), two perftest instances, one fi_rma_bw, and one perftest together with one fi_rma_bw. Of these, only the last combination gave results that were not as expected. A single fi_rma_bw run reaches the line rate of the RDMA NIC.

Can you provide some possible reasons? For example, does libfabric take up the resources of an entire NIC?

To Reproduce
Two servers, each with two Mellanox CX7 cards; the two NICs on each server are configured as a bond.
One host runs two fi_rma_bw instances as servers, and the other host runs two fi_rma_bw instances as clients.
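
For anyone trying to reproduce this, the setup above can be sketched roughly as follows. This is only an illustration, not the exact commands used in the report: the server address `192.0.2.10`, the ports `9228` and `9229`, and the optional domain name are hypothetical placeholders, and the `-p`, `-d`, `-B`, and `-P` options are assumed to behave as described in the fabtests common options (check `fi_rma_bw -h` on your build). The script only builds and prints the command lines; it does not launch anything.

```python
import shlex


def server_cmd(provider, port, domain=None):
    """Build the argv for one fi_rma_bw server listening on `port`."""
    cmd = ["fi_rma_bw", "-p", provider, "-B", str(port)]
    if domain:
        # Assumption: -d pins the instance to one domain (RDMA device).
        cmd += ["-d", domain]
    return cmd


def client_cmd(provider, port, server_addr, domain=None):
    """Build the argv for one fi_rma_bw client connecting to `server_addr:port`."""
    cmd = ["fi_rma_bw", "-p", provider, "-P", str(port)]
    if domain:
        cmd += ["-d", domain]
    cmd.append(server_addr)  # the server address is the positional argument
    return cmd


def concurrent_pair(provider, server_addr, ports=(9228, 9229)):
    """Return (server, client) argv pairs, one pair per port.

    Each pair uses its own port so the two instances do not collide.
    The port numbers are hypothetical placeholders.
    """
    return [
        (server_cmd(provider, p), client_cmd(provider, p, server_addr))
        for p in ports
    ]


# Print the commands to run on each host (placeholder address).
for srv, cli in concurrent_pair("verbs", "192.0.2.10"):
    print("server host:", shlex.join(srv))
    print("client host:", shlex.join(cli))
```

Start both server commands on one host first, then both client commands on the other host at the same time. Passing a `domain` to each pair is one way to check whether the two instances end up sharing a single device under the bond.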

Expected behavior
Each fi_rma_bw instance should reach ~200 Gb/s, because the line rate of a single CX7 is 200 Gb/s.

Environment:
provider: verbs, verbs;ofi_rxm, or verbs;ofi_rxd (same result with each)
version: libfabric-1.20

Additional context
We've ruled out the possibility of a bottleneck on the network.

@Juee14Desai
Contributor

Hello @Multhree

Could you please provide more detail on the setup that you tested on? How is the interface configured, and how are the nodes connected? Is it a back-to-back connection, or are they connected via a switch?
