v4.1.x: Using package_rank to select between NIC of equal distance from the process#8176

Merged
rajachan merged 2 commits into open-mpi:v4.1.x from dancejic:multi-v4.1.x
Nov 11, 2020

Conversation

dancejic (Contributor) commented Nov 3, 2020

commit a53e69d (HEAD -> multi-v4.1.x)
Author: Nikola Dancejic <dancejic@amazon.com>
Date: Tue Nov 3 11:35:47 2020 -0800

Adjusting multi-NIC commit to work with ompi v4.1.x

Some of the information in the master branch is not available for the multi-NIC
patch, such as myprocinfo.rank. This info is used to select between multiple
NICs of equal distance to the process. This adapts the previous commit to work
with the v4.1.x branch.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>

commit cee4592
Author: Nikola Dancejic <dancejic@amazon.com>
Date: Thu Oct 22 19:18:28 2020 -0700

Using package_rank to select between NIC of equal distance from the process.

If PMIX_PACKAGE_RANK is available, this value is used to select between multiple
NICs of equal distance from the current process. If that value is not available,
we try to calculate it by getting the locality string from each local process and
assigning a package_rank. If everything fails, we fall back to using
process_id.rank to select the NIC. This last case is not ideal, but it has a small
chance of occurring, and a message is displayed to notify the user when it
happens.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 8017f1280137dee4d5fc7dac0a5c627e72e48058)
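
The core of the selection boils down to spreading local processes across the equidistant NICs by their package_rank (or the fallback rank). Below is a minimal sketch of that idea; the helper and struct names are hypothetical, and this is not the actual mtl/ofi code.

    /* Hypothetical sketch, not the actual mtl/ofi code: once a package_rank
     * (or a fallback rank) is known, equidistant NICs can be spread across
     * the local processes with a simple modulo. */
    #include <stddef.h>
    #include <stdint.h>

    struct nic_info {
        const char *name;   /* provider/domain name, for illustration only */
    };

    /* Pick one NIC from a list of NICs that are all the same distance from
     * the calling process.  'rank' is the package_rank when available,
     * otherwise the fallback (e.g. local rank). */
    static const struct nic_info *
    select_equidistant_nic(const struct nic_info *nics, size_t num_nics, uint32_t rank)
    {
        if (NULL == nics || 0 == num_nics) {
            return NULL;
        }
        return &nics[rank % num_nics];
    }

With two equidistant NICs, package_ranks 0 and 1 then land on different NICs, which is the spreading behavior the commit message describes.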

dancejic (Contributor Author) commented Nov 3, 2020

I had a few issues with applying this patch to 4.1.x. Namely, the process_info_t is different between ompi_process_info and opal_process_info: opal_process_info doesn't contain the pid, which I was using as a fallback for the case where package_rank cannot be calculated. In this case, I ended up using my_local_rank (which was the previous fallback). I'm also not sure what the protocol is when a cherry-pick cannot be cleanly applied; I just did 2 commits, but I can change this if a single working commit is preferred.

rajachan (Member) commented Nov 3, 2020

Thanks for keeping the commits separate for review purposes. Now that you have at least one approval, can you squash them so we don't break the ability to bisect commits and do functional checks?

dancejic changed the title from "Using package_rank to select between NIC of equal distance from the process" to "v4.1.x: Using package_rank to select between NIC of equal distance from the process" on Nov 3, 2020
dancejic (Contributor Author) commented Nov 4, 2020

Oops, picked up some extra commits; will push another.

dancejic (Contributor Author) commented Nov 4, 2020

@rhc54 I noticed while testing that ompi_process_info.cpuset is not populated. I get the current process's locality string from that, and it's causing some odd behavior. Can I normally expect this to be populated, or is this something I should check for and get from a PMIx call if it's not there?

rhc54 (Contributor) commented Nov 5, 2020

@dancejic How are you testing it? Looking at the code, it all looks correct to me. The cpuset should be getting populated in orte_init when it asks HWLOC to fill in that field.

dancejic (Contributor Author) commented Nov 9, 2020

@rhc54 Sorry for the late response. I'm trying to print out the cpuset in the mtl/ofi component with printf("cpuset: %s\n", ompi_process_info.cpuset);, but it's been coming out as empty: cpuset: (null). I'm not sure why this isn't being populated on my instance.

dancejic (Contributor Author) commented Nov 9, 2020

Just checked, and this is also happening on the v4.1.x branch without this PR on my instance. If the cpuset is expected to be filled in the usual case and this is just an issue with the instance I'm running on, then this is good to merge, and I'll continue trying to figure out what's going on from my end.

rhc54 (Contributor) commented Nov 9, 2020

@dancejic I just checked on the head of the v4.1.x branch, and cpuset is indeed set when launching with mpirun:

$ mpirun --mca ess_base_verbose 5 -n 1 ./hello_c
[rhc-node01:36512] [[33327,1],0] setting up session dir with
	tmpdir: /tmp
	host rhc-node01
[rhc-node01:36512] MCW rank 0 bound to SK0:L30:L20:L10:CR0:HT0:NM0
Hello, world, I am 0 of 1, (Open MPI v4.1.0rc3, package: Open MPI mpiuser@rhc-node01 Distribution, ident: 4.1.0rc3, repo rev: v4.1.0rc3, Unreleased developer copy, 135)

The "bound to" message is created using the orte_process_info.cpuset field, which maps to the ompi_process_info one. Try setting that --mca ess_base_verbose 5 option in your system and see what it tells you. I'm wondering if the problem is that the field is getting set after you go thru the MTL init procedure.

dancejic (Contributor Author) commented Nov 9, 2020

$ ./install/bin/mpirun --mca ess_base_verbose 5 -n 1 ./ompi/examples/hello_c
MCW rank 0 bound to SK0:L30:L20:L10:CR0:HT0-1:NM0
Hello, world, I am 0 of 1, (Open MPI v4.0.4, package: Open MPI Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020, 133)

That looks like it's working for me, so I'm not sure why ompi_process_info.cpuset prints out null in the mtl/ofi component.

dancejic (Contributor Author) commented Nov 9, 2020

$ ./install/bin/mpirun --mca ess_base_verbose 5 -n 1 --mca btl ^openib ./ompi/examples/hello_c
MCW rank 0 bound to SK0:L30:L20:L10:CR0:HT0-1:NM0
cpuset: (null)
Hello, world, I am 0 of 1, (Open MPI v4.1.0rc3, package: Open MPI Distribution, ident: 4.1.0rc3, repo rev: v4.1.0rc3, Unreleased developer copy, 155)

This is with the same print statement as earlier: printf("cpuset: %s\n", ompi_process_info.cpuset);
I'm also using the internal PMIx; I checked in config.log: configure:14178: WARNING: using internal PMIx

rhc54 (Contributor) commented Nov 10, 2020

Okay, let me investigate. I'm guessing that we fail to transfer it across to the ompi_process_info struct.

rhc54 (Contributor) commented Nov 10, 2020

I'm at a loss - everything appears to be set just fine. I can find no problems with the code and no place where the cpuset gets eliminated. Let me check with your branch (may take me a bit to build).

dancejic (Contributor Author):

I'm having trouble figuring this out too. Do you have a hint where I should be looking? I checked in orte_init.c, but it looks like it's already empty at that point.

rhc54 (Contributor) commented Nov 10, 2020

It gets set in orte_init - around line 270:

    /* initialize the RTE for this environment */
    if (ORTE_SUCCESS != (ret = orte_ess.init())) {
        error = "orte_ess_init";
        goto error;
    }

The MPI layer defines ompi_process_info to be orte_process_info, so the two are identical. A few lines later, we transfer the fields down to the opal_process_info struct so the BTLs can also access them.

Like I said, I find that the fields are correctly filled out - I can't find any problem. Unfortunately, I can't build your MTL to see what might be going on there, but everything checks out fine outside of that component.
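
To make that layering concrete, here is a hedged sketch of the relationship described above; the structs are trimmed to the fields relevant to this thread and are not the real ORTE/OPAL headers.

    /* Hedged sketch of the layering described above; struct contents are
     * trimmed to the relevant fields and do not mirror the real headers. */
    #include <stdbool.h>

    typedef struct {
        char *cpuset;          /* filled in during orte_ess.init() via HWLOC */
        bool  proc_is_bound;
    } orte_proc_info_sketch_t;

    typedef struct {
        char *cpuset;          /* what the BTLs (and MTLs) actually read */
        bool  proc_is_bound;
    } opal_proc_info_sketch_t;

    static orte_proc_info_sketch_t orte_process_info;
    static opal_proc_info_sketch_t opal_process_info;

    /* The MPI layer aliases ompi_process_info to the ORTE struct... */
    #define ompi_process_info orte_process_info

    /* ...and shortly after orte_ess.init(), the fields are copied down so
     * the OPAL layer can see them without depending on ORTE. */
    static void transfer_proc_info_sketch(void)
    {
        opal_process_info.cpuset        = orte_process_info.cpuset;
        opal_process_info.proc_is_bound = orte_process_info.proc_is_bound;
    }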

rhc54 (Contributor) commented Nov 10, 2020

Okay, I am finally able to reproduce this:

$ mpirun -n 1 --mca ess_base_verbose 5 ./hello_c
[rhc-node01:20028] [[17035,1],0] setting up session dir with
	tmpdir: /tmp
	host rhc-node01
[rhc-node01:20028] MCW rank 0 bound to SK0:L30:L20:L10:CR0:HT0:NM0
cpuset: [runtime/ompi_mpi_init.c:527] (null)
cpuset: [runtime/ompi_mpi_init.c:639] (null)
cpuset: [runtime/ompi_mpi_init.c:679] (null)
cpuset: [runtime/ompi_mpi_init.c:728] (null)
cpuset: [runtime/ompi_mpi_init.c:834] (null)
cpuset: [runtime/ompi_mpi_init.c:876] (null)
Hello, world, I am 0 of 1, (Open MPI v4.1.0rc3, package: Open MPI mpiuser@rhc-node01 Distribution, ident: 4.1.0rc3, repo rev: v4.1.0rc3, Unreleased developer copy, 135)

Let me see what happened.

rhc54 (Contributor) commented Nov 10, 2020

@dancejic I pushed a couple of commits that I believe will resolve the problem. Please take a look at them (you can ignore the first commit; there was a file added in the PMIx v3.2.1 update that didn't get ignored as it should have). In brief, the problem was a combination of a couple of issues:

  • You were incorrectly passing the cpuset to the opal/hwloc function that computes relative locality. It wants the locality string, not the cpuset.

  • ORTE wasn't passing the cpuset, so I added it to the PMIx payload. However, direct launchers like Slurm are unlikely to provide it, and we don't necessarily want to grab the topology if it isn't present, as that would impact scalability. So I added some code to look for the cpuset if the locality string isn't available, and to fall back to using your PID value if neither is found (see the sketch below).

Hope that helps
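
For readers following along, here is a rough sketch of that fallback chain, modeled on the OPAL_MODEX_RECV_VALUE_OPTIONAL pattern quoted from ompi_rte.c later in this thread. It is illustrative only: the helper name and the use of the PMIX_CPUSET key are assumptions, not taken from the actual commits.

    /* Hedged, illustrative sketch of the fallback chain described above; it
     * is not the literal commit.  Assumes this sits inside the mtl/ofi
     * component, where the usual OMPI/OPAL/PMIx headers are already
     * included; PMIX_CPUSET as the key for the raw cpuset is an assumption. */
    static char *get_my_locality_or_cpuset(void)
    {
        char *locality = NULL, *cpuset = NULL;
        int rc;

        /* Preferred: the locality string published by the launcher. */
        OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCALITY_STRING,
                                       &opal_process_info.my_name, &locality, PMIX_STRING);
        if (PMIX_SUCCESS == rc && NULL != locality) {
            return locality;
        }

        /* Direct launchers (e.g. Slurm) may not provide it; try the raw
         * cpuset and derive the locality string from it before use. */
        OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_CPUSET,
                                       &opal_process_info.my_name, &cpuset, PMIX_STRING);
        if (PMIX_SUCCESS == rc && NULL != cpuset) {
            return cpuset;   /* caller converts cpuset -> locality string */
        }

        /* Last resort: caller falls back to a rank/PID-based NIC selection
         * and prints a notice, since this may not spread processes evenly. */
        return NULL;
    }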

rhc54 (Contributor) commented Nov 10, 2020

@dancejic Feel free to squash and modify as necessary/desired!

dancejic (Contributor Author):

Thank you, Ralph! I appreciate the help on this. I think I was using the cpuset because on the master branch it is actually the locality string. I'm not sure whether that is the intended value or whether it should also reflect the naming and be the actual cpuset.

from ompi_rte.c:

/* identify our location */
    val = NULL;
    OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCALITY_STRING,
                                   &opal_process_info.my_name, &val, PMIX_STRING);
    if (PMIX_SUCCESS == rc && NULL != val) {
        opal_process_info.cpuset = val;
        opal_process_info.proc_is_bound = true;
        val = NULL;  // protect the string
    } else {
        opal_process_info.cpuset = NULL;
        opal_process_info.proc_is_bound = false;
    }

rhc54 (Contributor) commented Nov 10, 2020

I'll have to check the master branch as that is clearly incorrect. I'll also take a gander in v4.1.x to ensure other users properly interpreted it as the cpuset and not the locality - I believe they do (based on my earlier scan), but I'll double-check.

rhc54 (Contributor) commented Nov 10, 2020

Okay, I checked both v4.1.x and v4.0.x and we are okay. The only places actually using cpuset are in fact looking for the cpuset and not the locality string. I'll fix master.

Thanks for bringing it to my attention!

dancejic and others added 2 commits November 10, 2020 13:05
Using package_rank to select between NIC of equal distance from the process.

If PMIX_PACKAGE_RANK is available, this value is used to select between multiple
NICs of equal distance from the current process. If that value is not available,
we try to calculate it by getting the locality string from each local process and
assigning a package_rank. If everything fails, we fall back to using
process_id.rank to select the NIC. This last case is not ideal, but it has a small
chance of occurring, and a message is displayed to notify the user when it
happens.

Some of the information in the master branch is not available for the multi-NIC
patch, such as myprocinfo.rank. This info is used to select between multiple
NICs of equal distance to the process. This adapts the previous commit to work
with the v4.1.x branch.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 8017f12)
Ensure we always pass the cpuset as well as the locality string for each
proc. Correct the mtl/ofi component's computation of relative locality
as the function being called expects to be given the locality string of
each proc, not the cpuset. If the locality string of the current proc
isn't available, then use the cpuset if available and compute the
locality before trying to compute relative localities of our peers.

Signed-off-by: Ralph Castain <rhc@pmix.org>
dancejic (Contributor Author):

Squashed it down to 2 commits: my original version with no locality check, and your changes with the fixes to the cpuset and getting the locality string.

rhc54 (Contributor) left a review comment:

Looks good to go.
