Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topic/topo numa distance part2 #8853

Merged
merged 1 commit into from
Apr 18, 2023

Conversation

ofirfarjun7
Copy link
Contributor

@ofirfarjun7 ofirfarjun7 commented Feb 7, 2023

What

  • Move numa latency calculations to UCP layer
  • Remove libnuma from UCX

Why ?

  • Remove UCX libnuma dependency

How ?

  • Move device memory latency calculation to UCP
  • Remove/Replace libnuma dependent code.
  • Remove libnuma from UCX

@ofirfarjun7 ofirfarjun7 added the WIP-DNM Work in progress / Do not review label Feb 7, 2023
@ofirfarjun7 ofirfarjun7 force-pushed the topic/topo-numa-distance-part2 branch from b77fd85 to 42ad98c Compare February 21, 2023 18:33
@ofirfarjun7 ofirfarjun7 removed the WIP-DNM Work in progress / Do not review label Feb 21, 2023
@ofirfarjun7 ofirfarjun7 requested a review from gleon99 February 21, 2023 18:34
@ofirfarjun7 ofirfarjun7 force-pushed the topic/topo-numa-distance-part2 branch from 42ad98c to e118212 Compare February 21, 2023 18:37
@ofirfarjun7 ofirfarjun7 added the WIP-DNM Work in progress / Do not review label Mar 19, 2023
@ofirfarjun7
Copy link
Contributor Author

ofirfarjun7 commented Mar 19, 2023

pls don't review until first part is merged.

config/m4/sysdep.m4 Outdated Show resolved Hide resolved
docs/source/faq.md Outdated Show resolved Hide resolved
src/tools/info/sys_info.c Outdated Show resolved Hide resolved
src/tools/info/sys_info.c Outdated Show resolved Hide resolved
test/gtest/ucs/test_topo.cc Outdated Show resolved Hide resolved
test/gtest/ucs/test_topo.cc Outdated Show resolved Hide resolved
test/apps/uct_info/Makefile.in Outdated Show resolved Hide resolved
src/tools/info/sys_info.c Outdated Show resolved Hide resolved
@gleon99
Copy link
Contributor

gleon99 commented Mar 22, 2023

Please look for numa / NUMA across the repo and cleanup where possible, seems there are several places.

@ofirfarjun7 ofirfarjun7 force-pushed the topic/topo-numa-distance-part2 branch 2 times, most recently from 8e84160 to b60fb6d Compare April 8, 2023 10:32
@ofirfarjun7
Copy link
Contributor Author

ofirfarjun7 commented Apr 8, 2023

  • In the FAQ we have "Important note" saying that (if I understand correctly) numactl is a dependency of rdma-core, and as a result also dependency of ucx-ib. Is it still true?
    I've tried to run a docker without numactl installed to see if I can compile UCX but it complained that MLNX_OFED requires libnuma:
    "Error: One or more required packages for installing MLNX_OFED_LINUX are missing.
    Please install the missing packages using your Linux distribution Package Management tool.
    Run:
    yum install numactl-libs"
  • In numa.c and numa.h we have definitions of numa_policy, Do we still need it?
  • In ProtoV2 (proto_common.c) why we consider lane distance only for zcopy send?

@ofirfarjun7 ofirfarjun7 removed the WIP-DNM Work in progress / Do not review label Apr 8, 2023
@ofirfarjun7 ofirfarjun7 force-pushed the topic/topo-numa-distance-part2 branch from b60fb6d to 5a6feea Compare April 9, 2023 13:41
@ofirfarjun7 ofirfarjun7 requested a review from gleon99 April 9, 2023 13:43
@ofirfarjun7 ofirfarjun7 added WIP-DNM Work in progress / Do not review and removed WIP-DNM Work in progress / Do not review labels Apr 9, 2023
src/uct/api/uct.h Outdated Show resolved Hide resolved
src/tools/info/sys_info.c Outdated Show resolved Hide resolved
src/tools/info/sys_info.c Outdated Show resolved Hide resolved
Comment on lines +214 to +210
print_row_separator(distance_width, name_width, num_devices, ' ', '|');
print_row_separator(distance_width, name_width, num_devices, '-', '+');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give example of the output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls fix units/dev name?
see #8853 (comment)

src/tools/info/tl_info.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
unsigned num_devices = ucs_topo_num_devices();
static const int distance_width = 10;
const char *distance_unit = "MB/s";
unsigned num_devices = ucs_topo_num_devices();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
Comment on lines +214 to +210
print_row_separator(distance_width, name_width, num_devices, ' ', '|');
print_row_separator(distance_width, name_width, num_devices, '-', '+');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls fix units/dev name?
see #8853 (comment)

ucs_sys_device_t sys_dev = ucp_worker_get_sys_device(wiface);
ucs_status_t status;

status = ucs_topo_get_memory_distance(sys_dev, distance);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe ucs_topo_get_memory_distance should return void?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can take it even further, maybe we need to add a constrain to the topo providers API definition to have fallback behavior?
This way both get_distance and get_memory_distance will return void...
But maybe in another PR?

I can do it for the memory_distance for now and add it to the API description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do it for memory distance for now

ucp_worker_iface_add_distance(&wiface->attr, &distance);
}
ucp_worker_get_sys_device_memory_distance(wiface);
ucp_worker_iface_add_distance(&wiface->attr, &wiface->memory_distance);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to save memory_distance on the wiface?
seems it's used only during initialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will reduce the calls to ucs_topo_get_memory_distance in ucp_worker_iface_estimate_perf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ucp_worker_iface_estimate_perf still calls UCT estimate perf
and ucs_topo_get_memory_distance should be quite fast since the NUMA distances are saved in a hash in ucs/topo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true. Will revert.

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gleon99 can you pls take another look?

src/ucs/sys/topo/base/topo.h Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
yosefe
yosefe previously approved these changes Apr 18, 2023
gleon99
gleon99 previously approved these changes Apr 18, 2023
Copy link
Contributor

@gleon99 gleon99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ofirfarjun7 please squash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants