
Yunikorn fails to schedule spark pods consistently. #690

Closed
alanty opened this issue Oct 30, 2024 · 2 comments

alanty (Contributor) commented Oct 30, 2024

Description

When submitting Spark jobs with Yunikorn 1.6.0 I'm seeing pods get stuck in an indefinite Pending state.
I've seen a couple of cases but haven't been able to find the root cause or gather helpful details:

  1. Yunikorn seems to miscalculate the requested resources against the queue limits that are defined (a sketch of the relevant queue config is below).
  2. Yunikorn doesn't seem to re-evaluate the Pod if the first scheduling attempt fails.

  • ✋ I have searched the open/closed issues and my issue is not listed.
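
For reference, the queue limits in question come from the blueprint's YuniKorn queue configuration. A minimal sketch of that shape is below; the queue name is assumed to match the spark-team-a namespace and the resource values are illustrative, not the actual limits used in the blueprint:

# Sketch of a YuniKorn queues.yaml entry (illustrative values, assumed queue name)
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: spark-team-a        # assumed: queue named after the namespace
            resources:
              guaranteed:
                memory: 100Gi
                vcore: "50"
              max:
                memory: 200Gi         # symptom 1: requests appear to be weighed
                vcore: "100"          # incorrectly against these max values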

Reproduction Code [Required]

We've run into this behavior in a couple of environments as we've been working to update the Spark blueprint: https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/spark-k8s-operator

Steps to reproduce the behavior:
  1. Roll out the solution and ensure that Yunikorn 1.6.0 is enabled.
  2. Deploy one of the examples (I saw this with the taxi-trip and tpcds benchmarks on Spark 3.5.1); a sketch of how the examples hand pods to Yunikorn follows below.
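
The examples submit jobs through the Spark operator, roughly as sketched below. The image, main application file, queue name, and resource values are placeholders assumed for illustration, not copied from the blueprint:

# Minimal sketch of a SparkApplication wired to YuniKorn (placeholder values)
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: taxi-trip                      # hypothetical name
  namespace: spark-team-a
spec:
  type: Python
  mode: cluster
  image: apache/spark:3.5.1            # placeholder image
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py   # placeholder
  sparkVersion: "3.5.1"
  batchScheduler: yunikorn             # assumed: operator-level YuniKorn integration
  batchSchedulerOptions:
    queue: root.spark-team-a           # assumed queue name
  driver:
    cores: 1
    memory: 4g
    serviceAccount: spark-team-a       # assumed service account
  executor:
    instances: 4
    cores: 2
    memory: 8g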

Expected behavior

Yunikorn should place the pods on nodes as resources become available.

Actual behavior

Yunikorn either refuses to schedule the driver pod because it calculates the request as exceeding the queue limits, or it appears to make a single scheduling attempt and never retries when that attempt fails:

Events:
  Type    Reason            Age    From      Message
  ----    ------            ----   ----      -------
  Normal  Scheduling        2m45s  yunikorn  spark-team-a/oss-data-gen-exec-31 is queued and waiting for allocation
  Normal  PodUnschedulable  2m44s  yunikorn  Task spark-team-a/oss-data-gen-exec-31 is pending for the requested resources become available
  Normal  Informational     2m44s  yunikorn  Unschedulable request '300b3324-758d-4ac1-839d-82b0703e60e1': failed plugin: 'TaintToleration'
node(s) had untolerated taint {node.kubernetes.io/not-ready: } (2x);
  Warning  FailedScheduling  2m44s                karpenter  Failed to schedule pod, incompatible with nodepool "spark-vertical-ebs-scale", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [spark-vertical-ebs-scale]; incompatible with nodepool "spark-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkMemoryOptimized]; incompatible with nodepool "spark-compute-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkComputeOptimized]; incompatible with nodepool "spark-graviton-memory-optimized", daemonset overhead={"cpu":"460m","memory":"1314Mi","pods":"7"}, incompatible requirements, key NodeGroupType, NodeGroupType In [spark_benchmark_ssd] not in NodeGroupType In [SparkGravitonMemoryOptimized]
  Normal   Nominated         39s (x2 over 2m41s)  karpenter  Pod should schedule on: node/ip-100-64-185-111.ec2.internal
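
The Karpenter event above looks like a separate selector mismatch rather than a YuniKorn issue: the pod pins NodeGroupType: spark_benchmark_ssd, which none of the listed NodePools advertise. A minimal sketch of the mismatch; the selector and requirement values come from the event message, while the surrounding structure is assumed from the Karpenter NodePool schema:

# Pod side (from the event): the benchmark pods pin this selector
nodeSelector:
  NodeGroupType: spark_benchmark_ssd

# NodePool side (sketch; only the requirement values come from the event)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spark-compute-optimized
spec:
  template:
    spec:
      requirements:
        - key: NodeGroupType
          operator: In
          values: ["SparkComputeOptimized"]   # does not include spark_benchmark_ssd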

Additional context

More research is needed on the interaction between spark-operator v2 and the Yunikorn updates. For now I've set the enable_yunikorn value to false and fallen back to the kube-scheduler.

alanty self-assigned this Nov 7, 2024
github-actions bot commented Dec 8, 2024

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.

github-actions bot added the stale label Dec 8, 2024
github-actions bot commented Dec 19, 2024

Issue closed due to inactivity.

github-actions bot closed this as not planned (stale) Dec 19, 2024