Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/WIREUP/IB: Fix error message when FLID is not available #10515

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

yosefe
Copy link
Contributor

@yosefe yosefe commented Feb 25, 2025

Before:

[1740477778.122527] [swx-ucx-ibr-03:1950035:0] select.c:641 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, rc_verbs/mlx5_0:1 - Destination is unreachable

After:

[1740487396.304113] [swx-ucx-ibr-03:1951997:0] select.c:642 UCX ERROR no active messages transport to swx-ucx-ibr-01:3618365: self/memory - unreachable, rc_verbs/mlx5_0:1 - different subnet prefix 0xfec0000000000002/0xfec0000000000001 and FLID is not available

@yosefe yosefe requested review from amastbaum and brminich February 25, 2025 12:52
return 0;
if (!iface->config.flid_enabled) {
flid_info_str = "disabled";
} else if ((uct_ib_iface_gid_extract_flid(&iface->gid_info.gid)) == 0 ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if ((uct_ib_iface_gid_extract_flid(&iface->gid_info.gid)) == 0 ||
} else if ((uct_ib_iface_gid_extract_flid(&iface->gid_info.gid) == 0) ||

@yosefe yosefe force-pushed the topic/ucp-wireup-ib-fix-error-message-when branch from 5873820 to 4b5f309 Compare February 25, 2025 14:41
@@ -592,7 +592,7 @@ static UCS_F_NOINLINE ucs_status_t ucp_wireup_select_transport(
}

is_reachable = 0;

ucs_snprintf_safe(info_str, info_str_size, "unreachable");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use snprintf here for consistency then? In this function only snprintf is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@amastbaum
Copy link
Contributor

@yosefe do you think we still need to be able to print an info string from ucp_wireup_select_lanes? What you added makes pretty much the same thing as intended in select.c:2513 but better.

  • we may consider removing the code that resets info_str_buf in uct_iface_is_reachable_v2 as the info buffer is filled with "unreachable" before this function is called.

@@ -592,7 +592,7 @@ static UCS_F_NOINLINE ucs_status_t ucp_wireup_select_transport(
}

is_reachable = 0;

ucs_snprintf_safe(info_str, info_str_size, "unreachable");
Copy link
Contributor

@amastbaum amastbaum Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR we currently enable info_str to be NULL and its value is checked inside uct_iface_fill_info_str_buf.

I actually don't remember why we didn't pass an info str buffer to all ucp_wireup_add_X_lanes functions, but maybe now it can be removed (as I mentioned in the general comment) and info_str can be defined inside ucp_wireup_select_transport.. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed info_str from parameters and moved it inside select_transport

@yosefe
Copy link
Contributor Author

yosefe commented Feb 25, 2025

@yosefe do you think we still need to be able to print an info string from ucp_wireup_select_lanes? What you added makes pretty much the same thing as intended in select.c:2513 but better.

Removed it from select.c:2513

  • we may consider removing the code that resets info_str_buf in uct_iface_is_reachable_v2 as the info buffer is filled with "unreachable" before this function is called.

We still need to reset uct_info string to "not available" in UCP before going over remote addresses, since ucp_wireup_is_reachable() can return 0 also if tl name is different - without calling uct_iface_is_reachable_v2.

Added a gtest that UCT is_reachable always fills the info string

Copy link
Contributor

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test errors seem to be relevant

brminich
brminich previously approved these changes Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants