
@rangadi rangadi commented Sep 21, 2023

What changes were proposed in this pull request?

The createSimpleWorker() method in PythonWorkerFactory waits forever if the worker fails to connect back to the server.

This is because it calls accept() without a timeout, so if the worker never connects back, accept() blocks indefinitely. A 10-second timeout was intended, but it was not implemented correctly.

This PR adds a 10-second timeout.
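
For readers unfamiliar with the technique, here is a minimal Java sketch of one common way to bound an accept() on a ServerSocketChannel: put the channel in non-blocking mode and wait on a Selector with a timeout. It is an illustration only and not necessarily the code merged in this PR; the class name `AcceptWithTimeout` and the helper `acceptWithTimeout` are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class AcceptWithTimeout {

    // Hypothetical helper: waits up to timeoutMs for one incoming connection.
    // Throws SocketTimeoutException if nothing connects in time.
    static SocketChannel acceptWithTimeout(ServerSocketChannel server, long timeoutMs)
            throws IOException {
        if (timeoutMs <= 0) {
            // select(0) would block indefinitely, which is what we want to avoid.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        // A channel must be non-blocking before it can be registered with a selector.
        server.configureBlocking(false);
        try (Selector selector = Selector.open()) {
            server.register(selector, SelectionKey.OP_ACCEPT);
            if (selector.select(timeoutMs) == 0) {
                throw new SocketTimeoutException(
                    "No connection within " + timeoutMs + " ms");
            }
            // May still return null if the pending connection disappeared.
            return server.accept();
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 lets the OS pick a free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try {
                acceptWithTimeout(server, 200);
            } catch (SocketTimeoutException e) {
                System.out.println("Timed out: " + e.getMessage());
            }
        }
    }
}
```

In this sketch the caller decides what to do on timeout, for example destroying the worker process and reporting an error.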

Why are the changes needed?

Otherwise, createSimpleWorker() could be stuck forever.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Unit test
  • Manual

Was this patch authored or co-authored using generative AI tooling?

Generated-by: ChatGPT 4.0
Asked ChatGPT to generate sample code to do non-blocking accept() on a socket channel in Java.

Author

rangadi commented Sep 21, 2023

cc: @HyukjinKwon, @WweiL

redirectStreamsToStderr(workerProcess.getInputStream, workerProcess.getErrorStream)

// Wait for it to connect to our socket, and validate the auth secret.
serverSocketChannel.socket().setSoTimeout(10000)
Author

There was supposed to be a 10-second timeout, but this call does not seem to affect serverSocketChannel.accept().
The setting might only take effect if we called serverSocketChannel.socket().accept(), but that returns a socket, not a channel.
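
The observation above can be checked with a small Java experiment. This is a sketch under the assumption that the JDK in use behaves like current OpenJDK builds, where the channel's own accept() does not consult SO_TIMEOUT; the class and method names are hypothetical. It sets a short SO_TIMEOUT on the server socket, calls accept() on the channel in a background thread, and reports whether that call is still blocked after a longer wait.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class SoTimeoutIgnored {

    // Returns true if channel.accept() is still blocked after waitMs,
    // even though SO_TIMEOUT was set to the (shorter) soTimeoutMs.
    static boolean acceptStillBlockedAfter(int soTimeoutMs, long waitMs) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            // Same style of call as in the diff context above.
            server.socket().setSoTimeout(soTimeoutMs);

            Thread acceptor = new Thread(() -> {
                try {
                    server.accept(); // blocking accept on the channel itself
                } catch (IOException e) {
                    // Expected once the channel is closed below.
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            acceptor.join(waitMs);
            boolean stillBlocked = acceptor.isAlive();

            server.close();      // unblocks the pending accept()
            acceptor.join(2000); // give the thread a moment to finish
            return stillBlocked;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("accept() still blocked past SO_TIMEOUT: "
            + acceptStillBlockedAfter(100, 500));
    }
}
```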

Raghu Angadi added 2 commits September 20, 2023 21:09
Contributor

@allisonwang-db allisonwang-db left a comment

cc @ueshin

// return (accept() used to be blocking), the test doesn't hang for a long time.
val createFuture = Future {
val ex = intercept[SparkException] {
workerFactory.createSimpleWorker(blockingMode = true)
Contributor

May I know if this blockingMode=true would affect serverSocketChannel.configureBlocking(false) above?

Author

It doesn't. It is for the client socket. Shall I remove it?
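
As background for this exchange: in java.nio, the blocking mode of a server channel and that of an accepted connection are independent. The ServerSocketChannel.accept() documentation states that the returned channel is in blocking mode regardless of the server channel's mode. A minimal Java sketch follows, with hypothetical class and method names; it does not reproduce Spark's code.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class AcceptedChannelMode {

    // Accepts one loopback connection on a NON-blocking server channel and
    // reports the blocking mode of the accepted channel.
    static boolean acceptedChannelIsBlocking() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            server.configureBlocking(false);

            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                // Non-blocking accept() returns null until a connection is pending,
                // so poll briefly with an overall deadline of two seconds.
                SocketChannel accepted = null;
                long deadline = System.nanoTime() + 2_000_000_000L;
                while (accepted == null && System.nanoTime() < deadline) {
                    accepted = server.accept();
                    if (accepted == null) {
                        Thread.sleep(10);
                    }
                }
                if (accepted == null) {
                    throw new IOException("No connection accepted before the deadline");
                }
                try (SocketChannel a = accepted) {
                    return a.isBlocking();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("accepted channel blocking: " + acceptedChannelIsBlocking());
    }
}
```

Any later configureBlocking call on the accepted channel changes only that channel, not the server channel.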

Contributor

WweiL commented Sep 21, 2023

It seems the tests are failing because of formatting. Maybe try:

./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back." to "[SPARK-45245][PYTHON][CONNECT] PythonWorkerFactory: Timeout if worker does not connect back." on Sep 22, 2023
Author

rangadi commented Oct 30, 2023

@HyukjinKwon the test failures seem unrelated. I tried multiple times, and different types of tests are failing. Do you think we can merge this?

@HyukjinKwon
Member

Yeah I am merging it to master but I think we should switch this to use the daemonized worker instead of simple workers soon.

Merged to master.

Author

rangadi commented Oct 30, 2023

> I think we should switch this to use the daemonized worker instead of simple workers soon.

I see. I have been looking for reasons for doing so. This seems to be one of them.

import org.apache.spark.util.ThreadUtils

// Tests for PythonWorkerFactory.
class PythonWorkerFactorySuite extends SparkFunSuite with Matchers with SharedSparkContext {
Contributor

Does this suite have to extend Matchers? @HyukjinKwon can you double check?

Member

Thanks. Made a followup PR: #45459

dongjoon-hyun pushed a commit that referenced this pull request Mar 11, 2024
…it in the test

### What changes were proposed in this pull request?

This PR is a followup of #43023 that addresses a post-review comment.

### Why are the changes needed?

It is unnecessary. It also matters for Scala compatibility, so it is better to remove it if unused.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45459 from HyukjinKwon/SPARK-45245-folllowup.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>