@awlauria awlauria commented Jan 7, 2021

  • Make a managed allocation filter a hostfile/hostlist.

If the user supplies a hostfile/hostlist inside of a managed allocation,
make sure that rmaps filters the allocation by it and maps processes accordingly.
Otherwise, the mappings can be inconsistent across root and compute nodes if the
user orders their hostfile differently than the resource manager does.
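
A minimal Python sketch of the filtering idea, not the actual rmaps code. The names `filter_allocation`, `allocation`, and `requested` are hypothetical, and treating the resource manager's order as the canonical one is an assumption made only for illustration:

```python
def filter_allocation(allocation, requested):
    """Return the allocated nodes the user asked for, in allocation order.

    allocation: node names in the order reported by the resource manager.
    requested:  node names from the user's hostfile/hostlist, in any order.
    """
    wanted = set(requested)
    # Iterating over the allocation (not the hostfile) yields one canonical
    # order, no matter how the user sorted the hostfile.
    return [node for node in allocation if node in wanted]


allocation = ["node1", "node2", "node3"]
# Two differently ordered hostfiles give the same filtered node list.
print(filter_allocation(allocation, ["node3", "node1"]))
print(filter_allocation(allocation, ["node1", "node3"]))
```

Both calls print `['node1', 'node3']`, so every daemon that applies the same filter starts from the same node list.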

  • Fix bug where orte under a managed allocation does not honor -host.

For example:

$ bsub -n 40 -m "node1 node2" mpirun -np 6 -host node1:2,node2:4 hostname

would not map two hostname processes to node1 and four to node2.
Instead, it would still assume that node1 and node2 each had
(for example) 20 CPU resources, and map accordingly.
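
The difference can be illustrated with a small Python sketch. It is a simplified model, not ORTE's mapper: `parse_host_spec` and `map_by_slot` are hypothetical helpers, the fill-by-slot policy is an assumption, and so is counting a host given without `:N` as one slot.

```python
def parse_host_spec(spec):
    """Parse a string like 'node1:2,node2:4' into {'node1': 2, 'node2': 4}."""
    slots = {}
    for entry in spec.split(","):
        name, _, count = entry.partition(":")
        # Assumption: a host listed without ':N' contributes one slot.
        slots[name] = slots.get(name, 0) + (int(count) if count else 1)
    return slots


def map_by_slot(nprocs, slots):
    """Place ranks 0..nprocs-1 by filling each node's slots in turn."""
    placement = []
    for node, count in slots.items():
        take = min(count, nprocs - len(placement))
        placement.extend([node] * take)
    if len(placement) < nprocs:
        raise ValueError("not enough slots for the requested processes")
    return placement


# Slot counts taken from -host: two ranks on node1, four on node2.
print(map_by_slot(6, parse_host_spec("node1:2,node2:4")))
# Slot counts taken from the allocation (20 per node): all six on node1.
print(map_by_slot(6, {"node1": 20, "node2": 20}))
```

Under these assumptions the first call places two ranks on node1 and four on node2, while the second places all six on node1.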

Signed-off-by: Austen Lauria <[email protected]>
@awlauria awlauria requested review from gpaulsen and jjhursey January 7, 2021 21:57
@awlauria awlauria changed the title from "v4.0.x Fix a couple managed allocation issues." to "v4.0.x: Fix a couple managed allocation issues." Jan 7, 2021
@gpaulsen gpaulsen requested a review from hppritcha January 8, 2021 15:29
@gpaulsen gpaulsen added this to the v4.0.6 milestone Jan 8, 2021
gpaulsen commented Jan 8, 2021

Is this fixing a customer-observed issue, or was it found with dev testing?

Does this code affect resource managers other than LSF? If so, which ones, and could we ask them to test this before merging to the release branch?

Also, is it possible that this could break existing job submission scripts if they were relying on the older (presumably broken) functionality?

Finally, could you also please make a v4.1 version of this, as it also relies on the orte launcher and presumably would have a similar bug?

awlauria commented Jan 8, 2021

Commit 1 fixes an issue reported by a customer who uses LSF. I only tested these fixes under LSF - I think we would need others from the community to test other RMs.

Commit 1 fixes a bug where the ranks are mapped inconsistently across orteds, resulting in PMIx failures and other weirdness (such as multiple procs being assigned the same rank, and hangs in MPI_Finalize()).
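
A toy Python model of how such a duplicate rank can arise. It is illustrative only: `rank_to_node` and `local_ranks` are hypothetical, the round-robin policy is an assumption, and so is which daemon sees which node order.

```python
def rank_to_node(nodes, nprocs):
    """Assign ranks round-robin over `nodes`, in the order given."""
    return {rank: nodes[rank % len(nodes)] for rank in range(nprocs)}


def local_ranks(view, me):
    """Ranks that a daemon on node `me` would launch, given its own view."""
    return sorted(rank for rank, node in view.items() if node == me)


rm_order = ["node1", "node2"]        # order reported by the resource manager
hostfile_order = ["node2", "node1"]  # order the user wrote in the hostfile

# Assumption: one daemon maps with the RM order, the other with the hostfile order.
view_on_node1 = rank_to_node(rm_order, 4)
view_on_node2 = rank_to_node(hostfile_order, 4)

print(local_ranks(view_on_node1, "node1"))  # ranks launched on node1
print(local_ranks(view_on_node2, "node2"))  # ranks launched on node2
```

In this model both daemons launch ranks 0 and 2, and nobody launches ranks 1 and 3, which is the kind of disagreement that a single, consistently ordered node list avoids.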

Commit 2 fixes the bug described in its commit message. This was not customer reported, but I found it in my testing of the fix.

I plan to make a 4.1 PR once this is merged. I figure if people have changes/suggestions, I'd rather just fix it once than twice.

awlauria commented Jan 8, 2021

From reviewing the code, this could impact Slurm as well - I don't see others, but I am not 100% sure on that. @gpaulsen do you know who to tag for Slurm/other testing?

gpaulsen commented Jan 8, 2021

We can ask on Tuesday. @rhc54 might know if it's worth hand testing?

@rhc54 rhc54 left a comment

This will, of course, be seen by other RMs and not just LSF. However, I don't see any harm in fixing the bug. It only has an impact if they specify -host or -hostfile, which isn't that common a case.

rhc54 commented Jan 8, 2021

@awlauria Please upstream to PRRTE as well - if you don't have time, let me know so we can ensure it gets there.

awlauria commented Jan 8, 2021

Thanks @rhc54. Ported over: openpmix/prrte#718

awlauria commented Jan 8, 2021

v4.1.x: #8355

@gpaulsen gpaulsen merged commit fc16f90 into open-mpi:v4.0.x Jan 11, 2021
@awlauria awlauria deleted the managed_allocation_v4.0.x branch January 11, 2021 20:38