-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/shm: Fix progress error cases and init id paths #10313
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit messages are a bit confusing, but changes look good!
@shijin-aws can you share what the AWS CI failure is? |
All of the efa provider tests either failed or timed out, though I haven't had log yet, I can try to reproduce it manually and get back to you today
|
Ran the same test locally, I hit a hang with the following warning
|
@shijin-aws how do you run this test? I would like to try to reproduce it on my cluster. |
The test is run via efa provider, as open mpi cannot pick shm directly
|
@zachdworkin OK, if I run -n 48 -N 48, the test passed. Seems the problem is only triggered by larger process number? |
|
@shijin-aws what version of ompi are you using? |
@shijin-aws nevermind I see it is 4.1.6 |
@shijin-aws so the failing case is -n 96 -ppn 48? |
No. I ran on single node, it's 96 total process and 96 PPN |
The problem seems to be from |
And also the warning Remember EFA call shm's fi_av_insert with flags FI_AV_USER_ID. Can this path be impacted by the commit? |
@shijin-aws that is very helpful. I made a comment in that commit that the case where a peer disconnects and then reconnects with the same name and ID, it will have a valid id with no shm region. There is a TODO to create an smr_unmap_peer function to let it re initialize everything. Do you think this test is hitting this edge case? |
I don't think so. this is at the beginning of the test and I don't see open mpi even call |
@@ -554,8 +555,7 @@ int smr_map_add(const struct fi_provider *prov, struct smr_map *map, | |||
if (ret) { | |||
assert(ret == -FI_EALREADY); | |||
*id = (intptr_t) node->data; | |||
ofi_spin_unlock(&map->lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to change the behavior, you have smr_map_to_region
in the out
while before your change it seems to return without map, is it expected? @aingerson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what is going on. Its trying to map to region when it shouldnt because the region already exists... so either I need to revert and return 0 or check if map->peers[id].region is valid and if it isnt then go to the out to map to region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a check in smr_map_to_region
that skips initialization if it's already been set so I don't think it should do a double mapping. The addition of the mapping was added because if the mapping failed on av_insert, we need to trigger it again on the next map_add (in progress_connreq). Not saying it isn't related but it should be skipping the duplicate mapping in the case you're talking about
@shijin-aws I just pushed a patch with a bunch of prints. I cannot figure out what the error is and when I run ofi shm through ompi with osu 7.4 it doesnt have the hang. Can you please send over the log with all the prints when it finishes? |
|
@shijin-aws thank you! I had a print argument backwards... can you send this new output instead? I had addr and shm_id switched in smr_av.c's smr_av_insert call |
The test is still running, but this error message says something?
Is it possible that your code is trying to access an address of |
@shijin-aws does the segfault only occur with the prints I added? |
@zachdworkin Now I am trying to run your PR on my local VM with Open MPI 4.1.6 manually, I get such printing all the time
Is that expected? |
@shijin-aws yes that is expected. One of the peers gets in a verify peer loop and will keep printing that while the other side is stuck. @aingerson helped me find the issue with the patch you identified and I am working on cleaning it up. Thanks for providing all the logs! |
Move smr_map_create to an initialization type function. The map doesn't need to be calloc'd because there is a 1:1 relationship between av:map so there is no reason to allocate them separately. With this change these functions can move into smr_av.c because that is the only place they are called. peer_data.addr.name is not used anywhere and is already initialized to zeros on region creation so it doesn't need to be initialized. Signed-off-by: Zach Dworkin <[email protected]>
The id field was incorrectly being used as an indicator that a region was mapped. Instead a valid id shows that a peer is in the rbmap and not that region memory is valid. Map_to_region has moved out of map_add and a region can only be initialized in verify_peer or in progress_connreq. This was done to reduce initialization time and memory footprint because there are many cases where communication doesn't happen across every peer. We only need to initialize a region when we need to communicate with that peer. Previously map_to_endpoint would be able to map to an endpoint if the peer was in the map but the region wasn't mapped. This now protects against segmentation faults because the region must be valid. Since map_to_region was moved out of map_add there is no point in doing map_to_endpoint on av_insert because the region is not mapped. Map_to_endpoint now happens at the end of map_to_region because we guarentee that the peer is in the map and its region is valid. This endpoint mapping does not need to happen for every possible peer so we can use a for each call to do it for initialized peers only. With all of these changes we can silence coverity issues about accessing the fields in the map without a lock. This is because these fields are properly protected and region and id are guaranteed to be valid at the point of accessing them. Signed-off-by: Zach Dworkin <[email protected]>
df919ba
to
c0822f9
Compare
Err was passed into progress_iov remove err path iov to avoid doing a copy if there was a previous error. This err was discovered to always be 0 so these changes remove it. Signed-off-by: Zach Dworkin <[email protected]>
Make progress errors ints to match the fi_cq_err_entry struct's err type. This error needs to be positive so the paths have been checked and they all return negative errors. The err sign is flipped when calling write_err_comp to keep it positive like the API expects. Signed-off-by: Zach Dworkin <[email protected]>
@shijin-aws what is the latest failure? I have it passing all my tests now. |
@zachdworkin It's a socket provider failure
I am not sure even AWS CI should test this provider any longer as I heard it's not maintained. |
bot:aws:retest |
Fix progress error cases to always be positive ints when calling write_err_comp
Make id always be valid when looking up if pid_fd has been set for level-zero