Description
When submitting Spark jobs with YuniKorn 1.6.0, I'm seeing pods get stuck in an indefinite Pending state.
I've seen a couple of cases but haven't been able to find the root cause or gather helpful details.
YuniKorn seems to miscalculate the requested resources against the limits defined on the queues.
YuniKorn doesn't seem to re-evaluate the pod if it fails to schedule.
✋ I have searched the open/closed issues and my issue is not listed.
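For context, this is roughly the shape of queue limit I mean. The sketch below assumes queues are defined through the yunikorn-configs ConfigMap (queues.yaml key); the queue layout and values are illustrative, not our actual config:
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: spark-team-a   # queue the driver/executor pods are submitted to
            resources:
              max:               # illustrative limits; requests over these keep the driver pending
                vcore: "100"
                memory: "800G"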
Reproduction Code [Required]
We've run into this behavior in a couple of environments as we've been working to update the Spark blueprint: https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/spark-k8s-operator
Steps to reproduce the behavior:
Roll out the solution and ensure that YuniKorn 1.6.0 is enabled.
Deploy one of the examples (I saw this with the taxi-trip and tpcds benchmarks on Spark 3.5.1).
Expected behavior
YuniKorn should place the pods on nodes, since we have resources available.
Actual behavior
YuniKorn either fails to schedule the driver pod because the requested resources exceed the queue limits, or it appears to make one scheduling attempt and, if that attempt fails, never retries:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduling 2m45s yunikorn spark-team-a/oss-data-gen-exec-31 is queued and waiting for allocation
Normal PodUnschedulable 2m44s yunikorn Task spark-team-a/oss-data-gen-exec-31 is pending for the requested resources become available
Normal Informational 2m44s yunikorn Unschedulable request '300b3324-758d-4ac1-839d-82b0703e60e1': failed plugin: 'TaintToleration'
node(s) had untolerated taint {node.kubernetes.io/not-ready: } (2x);
Warning FailedScheduling 2m44s karpenter Failed to schedule pod, incompatible with nodepool "spark-vertical-ebs-scale", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [spark-vertical-ebs-scale]; incompatible with nodepool "spark-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkMemoryOptimized]; incompatible with nodepool "spark-compute-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkComputeOptimized]; incompatible with nodepool "spark-graviton-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkGravitonMemoryOptimized]
Normal Nominated 39s (x2 over 2m41s) karpenter Pod should schedule on: node/ip-100-64-185-111.ec2.internal
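For anyone digging further, these are the kinds of checks I've been using to compare the queue limits with what YuniKorn thinks is allocated. The service name, port, and REST path assume a default Helm install of YuniKorn and may differ in your setup:
# Describe the stuck task to see YuniKorn's events for it
kubectl -n spark-team-a describe pod oss-data-gen-exec-31
# Dump the queue definitions the scheduler is actually running with
kubectl -n yunikorn get configmap yunikorn-configs -o yaml
# Query the scheduler's view of per-queue usage vs limits (assumes default service name/port)
kubectl -n yunikorn port-forward svc/yunikorn-service 9080:9080 &
curl -s http://localhost:9080/ws/v1/partition/default/queues | jq .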
Additional context
More research is needed into the interaction between spark-operator v2 and the YuniKorn updates. For now I've set the enable_yunikorn value to false and have fallen back to the kube-scheduler.
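For reference, the fallback was just a redeploy with the scheduler disabled; this assumes the blueprint exposes enable_yunikorn as a Terraform variable, as referenced above:
# Disable YuniKorn and fall back to the default kube-scheduler
terraform apply -var="enable_yunikorn=false"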
This issue has been automatically marked as stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.