Description
When submitting Spark jobs with YuniKorn 1.6.0, I'm seeing pods get stuck in an indefinite Pending state.
I've seen a couple of cases but haven't been able to find the root cause or gather helpful details.
YuniKorn seems to miscalculate the requested resources against the limits defined on the queues.
YuniKorn doesn't seem to re-evaluate the pod if it fails to schedule.
✋ I have searched the open/closed issues and my issue is not listed.
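For context, this is roughly the shape of queue limit I mean. The sketch below assumes queues are defined through the yunikorn-configs ConfigMap (queues.yaml key); the queue layout and values are illustrative, not our actual config:
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: spark-team-a   # queue the driver/executor pods are submitted to
            resources:
              max:               # illustrative limits; requests over these keep the driver pending
                vcore: "100"
                memory: "800G"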
Reproduction Code [Required]
We've run into this behavior in a couple of environments as we've been working to update the Spark blueprint: https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/spark-k8s-operator
Steps to reproduce the behavior:
Roll out the solution and ensure that YuniKorn 1.6.0 is enabled.
Deploy one of the examples (I saw this with the taxi-trip and tpcds benchmarks on Spark 3.5.1).
Expected behavior
YuniKorn should place the pods on nodes, since we have resources available.
Actual behavior
YuniKorn either fails to schedule the driver pod because the requested resources exceed the queue limits, or it appears to make one scheduling attempt and, if that attempt fails, never retries:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduling 2m45s yunikorn spark-team-a/oss-data-gen-exec-31 is queued and waiting for allocation
Normal PodUnschedulable 2m44s yunikorn Task spark-team-a/oss-data-gen-exec-31 is pending for the requested resources become available
Normal Informational 2m44s yunikorn Unschedulable request '300b3324-758d-4ac1-839d-82b0703e60e1': failed plugin: 'TaintToleration'
node(s) had untolerated taint {node.kubernetes.io/not-ready: } (2x);
Warning FailedScheduling 2m44s karpenter Failed to schedule pod, incompatible with nodepool "spark-vertical-ebs-scale", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [spark-vertical-ebs-scale]; incompatible with nodepool "spark-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkMemoryOptimized]; incompatible with nodepool "spark-compute-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkComputeOptimized]; incompatible with nodepool "spark-graviton-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkGravitonMemoryOptimized]
Normal Nominated 39s (x2 over 2m41s) karpenter Pod should schedule on: node/ip-100-64-185-111.ec2.internal
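For anyone digging further, these are the kinds of checks I've been using to compare the queue limits with what YuniKorn thinks is allocated. The service name, port, and REST path assume a default Helm install of YuniKorn and may differ in your setup:
# Describe the stuck task to see YuniKorn's events for it
kubectl -n spark-team-a describe pod oss-data-gen-exec-31
# Dump the queue definitions the scheduler is actually running with
kubectl -n yunikorn get configmap yunikorn-configs -o yaml
# Query the scheduler's view of per-queue usage vs limits (assumes default service name/port)
kubectl -n yunikorn port-forward svc/yunikorn-service 9080:9080 &
curl -s http://localhost:9080/ws/v1/partition/default/queues | jq .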
Additional context
More research is needed into the interaction between spark-operator v2 and the YuniKorn updates. For now I've set the enable_yunikorn value to false and have fallen back to the kube-scheduler.
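For reference, the fallback was just a redeploy with the scheduler disabled; this assumes the blueprint exposes enable_yunikorn as a Terraform variable, as referenced above:
# Disable YuniKorn and fall back to the default kube-scheduler
terraform apply -var="enable_yunikorn=false"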
This issue has been automatically marked as stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.