Skip to content

Conversation

@mridulm
Copy link
Contributor

@mridulm mridulm commented Mar 18, 2024

What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@codecov
Copy link

codecov bot commented Mar 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.85%. Comparing base (91f6378) to head (cc964a1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2400      +/-   ##
==========================================
+ Coverage   48.85%   48.85%   +0.01%     
==========================================
  Files         209      209              
  Lines       13101    13102       +1     
  Branches     1134     1134              
==========================================
+ Hits         6399     6400       +1     
  Misses       6282     6282              
  Partials      420      420              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SteNicholas SteNicholas changed the title CELEBORN-1339: Mark connection as timedOut in TransportClient.close [CELEBORN-1339] Mark connection as timedOut in TransportClient.close Mar 19, 2024
Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

SteNicholas pushed a commit that referenced this pull request Mar 20, 2024
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <[email protected]>
@SteNicholas
Copy link
Member

Merging to main(v0.5.0) and branch-0.4(v0.4.1).

@mridulm mridulm deleted the add-SPARK-45375 branch March 22, 2024 00:54
cfmcgrady pushed a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 21, 2025
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes apache#2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <[email protected]>
cfmcgrady pushed a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 21, 2025
Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

No

Unit tests

Closes apache#2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants