Skip to content

Conversation

@jjhursey
Copy link
Member

@jjhursey jjhursey commented Jan 5, 2021

  • Remove build_daemon_nidmap function from regx framework - as it is not used.
  • Remove static_port optimization path that didn't work for multiple static ports.

@jjhursey jjhursey linked an issue Jan 5, 2021 that may be closed by this pull request
@jjhursey
Copy link
Member Author

jjhursey commented Jan 5, 2021

From the debug output this might be sufficient (it looks like it is picking up the correct port in the oob). However, my test box with the firewall is down right now. I'll have to test tomorrow. Posting this here for now.

@rhc54
Copy link
Contributor

rhc54 commented Jan 6, 2021

@jjhursey Let's chat about this on the phone, if you don't mind - I'd like to better understand what is going on so I can perhaps help advise on a solution. It seems odd that we'd need a different nidmap constructor as it shouldn't care what port is being used by each daemon.

@jjhursey
Copy link
Member Author

jjhursey commented Jan 6, 2021

Sure. I'll ping you on slack.

I agree - I'm not sure why we needed this function anymore, but this break has been there since the framework was created in 4cd7f3b (replaced orte_util_build_daemon_nidmap with orte_regx.build_daemon_nidmap, but never implemented the latter). Maybe it just got caught up in the sweep. I'm not sure.

@jjhursey jjhursey requested a review from rhc54 January 7, 2021 17:41
@jjhursey jjhursey added this to the v4.1.1 milestone Jan 7, 2021
@jjhursey
Copy link
Member Author

jjhursey commented Jan 7, 2021

I think this is ready for review. Ralph and I chatted about this and decided to remove the optimization path around static_ports - for reasons in the commit message. I think I got everything, and testing showed it seems to be working.

Once this PR is reviewed and finalized then I'll PR it to the v4.0.x and v3.1.x branches.

@jjhursey jjhursey marked this pull request as ready for review January 7, 2021 17:44
Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little more cleanup to do, I think.

Signed-off-by: Joshua Hursey <[email protected]>
 * After discussing this with Ralph we concluded that the
   original code has some deficiencies that are not worth
   preserving.
   - The optimization here was that if we have a single
     static port then we can calculate the the URI of all
     of the daemons (including the HNP). Thus we do not
     have to have the daemons phone home to the HNP for
     the contact information. Instead the first message
     they receive would be the launch message.
   - This optimization path really only worked for a
     single static port, not a set of them.
   - This optimization wasn't used. As evidence by how
     long this bug has been present.
   - Finally, in practice, it didn't really save much time
     during launch.
 * Remove the build_daemon_nidmap from the regx framework structure

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey
Copy link
Member Author

jjhursey commented Jan 8, 2021

@rhc54 I pushed a new version that should address your comments. Except for #8339 (comment)

Note that the diff for orte/mca/odls/base/odls_base_default_fns.c looks a lot bigger than it really is - I just removed the outer if block and re-indented all of that code.

@rhc54
Copy link
Contributor

rhc54 commented Jan 8, 2021

bot:aws:retest

CI server failure

@jjhursey jjhursey changed the title Fix segv when launching with static ports v4.1: Fix segv when launching with static ports Jan 11, 2021
@jsquyres jsquyres merged commit a9c7412 into open-mpi:v4.1.x Jan 11, 2021
@jjhursey jjhursey deleted the fix-static-port branch January 11, 2021 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ORTE SEGV when launching with static ports

3 participants