Conversation

@dongjoon-hyun dongjoon-hyun (Member) commented Jul 27, 2025

What changes were proposed in this pull request?

This PR aims to verify whether the executor pod's CPU request exceeds its CPU limit, in order to fail fast.

Why are the changes needed?

Since Spark creates many executor pods, we had better fail fast on invalid settings before submitting an invalid pod spec to the K8s cluster, which wastes a lot of K8s resources.

Note that the newly added validation check only happens when `spark.kubernetes.executor.limit.cores` is given explicitly.
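The fail-fast idea can be sketched in isolation. The snippet below is a minimal stand-in, not the actual Spark implementation: it assumes only the two common Kubernetes CPU notations (millicores such as `500m`, and fractional or whole cores such as `0.5` or `2`), and the object and method names are hypothetical.

```scala
// Hypothetical sketch of the fail-fast validation; the real patch compares
// fabric8 Quantity objects, while this stand-in normalizes CPU strings itself.
object ExecutorCpuValidation {
  // Normalize a CPU quantity string to millicores ("500m" -> 500, "2" -> 2000).
  def toMillicores(quantity: String): Long =
    if (quantity.endsWith("m")) quantity.dropRight(1).toLong
    else (quantity.toDouble * 1000).toLong

  // Throw before any pod spec is submitted if the limit is below the request.
  def validate(requestCores: String, limitCores: String): Unit =
    if (toMillicores(limitCores) < toMillicores(requestCores)) {
      throw new IllegalArgumentException(
        s"The executor cpu request ($requestCores) should be less than or " +
          s"equal to cpu limit ($limitCores)")
    }
}
```

With this sketch, `validate("500m", "1")` passes, while `validate("2", "500m")` throws before any executor pod would be created.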

Does this PR introduce any user-facing change?

No. There is no behavior change in the end, because an already-misconfigured `spark.kubernetes.executor.limit.cores` means the Spark driver cannot get any executor pods, and the job will hang or fail eventually.

How was this patch tested?

Pass the CIs with the newly added test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member, Author):

Could you review this when you have some time, @peter-toth?

@dongjoon-hyun (Member, Author):

Could you review this PR when you have some time, @HyukjinKwon?

```scala
val executorCpuLimitQuantity = new Quantity(limitCores)
if (executorCpuLimitQuantity.compareTo(executorCpuQuantity) < 0) {
  throw new SparkException(
    "The executor cpu request should be less than or equal to cpu limit")
}
```
A Contributor commented on this diff:

nit: Should the request value and limit value be included in the error message?
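One hedged way to act on this nit is to fold both configured values into the message; the object, method, and parameter names below are hypothetical illustrations, not the merged code.

```scala
// Sketch: build the failure message with both configured values so a
// misconfiguration is immediately diagnosable from the exception alone.
object CpuErrorMessages {
  def cpuMismatchMessage(requestCores: String, limitCores: String): String =
    s"The executor cpu request ($requestCores) should be less than or " +
      s"equal to cpu limit ($limitCores)"
}
```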

@dongjoon-hyun (Member, Author):

Thank you, @LuciferYang and @HyukjinKwon.

@LuciferYang LuciferYang (Contributor) left a comment


+1, LGTM
Thank you @dongjoon-hyun

@dongjoon-hyun (Member, Author):

All comments are addressed and I verified the test result manually because the last commit changes only the exception message string.

```
[info] BasicExecutorFeatureStepSuite:
[info] - test spark resource missing vendor (6 milliseconds)
[info] - test spark resource missing amount (1 millisecond)
[info] - SPARK-52933: Verify if the executor cpu request exceeds limit (5 milliseconds)
```

Merged to master for Apache Spark 4.1.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-52933 branch July 28, 2025 03:16
@peter-toth peter-toth (Contributor) left a comment


Late LGTM.

@zemin-piao

Thanks a lot for pushing this PR! @dongjoon-hyun

dongjoon-hyun added a commit to apache/spark-kubernetes-operator that referenced this pull request Oct 2, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade Spark from `4.0.1` to `4.1.0-preview2`.

### Why are the changes needed?

Since Apache Spark 4.1.0 is planned for next month, we had better prepare to use new features by adopting `4.1.0-preview1` (September) and `4.1.0-preview2` (October) gradually.
- apache/spark#51678
- apache/spark#51522
- apache/spark#50925

### Does this PR introduce _any_ user-facing change?

No behavior change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #364 from dongjoon-hyun/SPARK-53787.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
shubhluck pushed a commit to acceldata-io/spark3 that referenced this pull request Dec 11, 2025
### What changes were proposed in this pull request?

This PR aims to verify whether the executor pod's CPU request exceeds its CPU limit, in order to fail fast.

### Why are the changes needed?

Since Spark creates many executor pods, we had better fail fast on invalid settings before submitting an invalid pod spec to the K8s cluster, which wastes a lot of K8s resources.

Note that the newly added validation check only happens when `spark.kubernetes.executor.limit.cores` is given explicitly.

### Does this PR introduce _any_ user-facing change?

No. There is no behavior change in the end, because an already-misconfigured `spark.kubernetes.executor.limit.cores` means the Spark driver cannot get any executor pods, and the job will hang or fail eventually.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51678 from dongjoon-hyun/SPARK-52933.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4dc3f0f)