[SPARK-47323][K8S] Support custom executor log urls #14

EnricoMi · 2024-03-07T16:54:59Z

What changes were proposed in this pull request?

Make the Kubernetes resource manager support the existing config spark.ui.custom.executor.log.url.

This allows a setting such as:

spark.ui.custom.executor.log.url="https://my.custom.url/logs?app={{APP_ID}}&executor={{EXECUTOR_ID}}"

The following variables are supported:

APP_ID: The unique application id
EXECUTOR_ID: The executor id (a positive integer)
HOSTNAME: The name of the host where the executor runs
KUBERNETES_NAMESPACE: The namespace where the executor pods run
KUBERNETES_POD_NAME: The name of the pod that contains the executor
FILE_NAME: The name of the log, which is always "log"
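As a rough illustration of how such a URL pattern could be expanded, here is a minimal Python sketch. It is not the code from this patch: the helper name `render_log_url` and the sample attribute values are hypothetical, and leaving unknown placeholders untouched is an assumption made for the example.

```python
import re
from typing import Mapping

# Matches placeholders of the form {{NAME}}, e.g. {{APP_ID}}.
_PLACEHOLDER = re.compile(r"\{\{([A-Za-z0-9_]+)\}\}")


def render_log_url(template: str, attributes: Mapping[str, str]) -> str:
    """Replace each {{NAME}} in the template with attributes[NAME].

    Hypothetical helper: placeholders without a matching attribute are
    left as-is (an assumption of this sketch, not documented behavior).
    """
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return attributes.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


# Sample values invented for illustration only.
example_attributes = {
    "APP_ID": "spark-app-1",
    "EXECUTOR_ID": "1",
    "HOSTNAME": "10.0.0.5",
    "KUBERNETES_NAMESPACE": "default",
    "KUBERNETES_POD_NAME": "my-app-exec-1",
    "FILE_NAME": "log",
}

url = render_log_url(
    "https://my.custom.url/logs?app={{APP_ID}}&executor={{EXECUTOR_ID}}",
    example_attributes,
)
print(url)
```

With the sample values above, the pattern from the description expands to a URL carrying the application id and the executor id as query parameters.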

Why are the changes needed?

When running Spark on Kubernetes, executor logs have to be persisted elsewhere. Having the Spark UI link to those persisted logs is very useful. This is currently only supported on YARN.

Does this PR introduce any user-facing change?

Yes. The Spark UI provides links to executor logs when running on Kubernetes.

How was this patch tested?

Tested with a unit test and manually on a minikube K8S cluster.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions · 2024-07-30T00:22:10Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

…onicalized expressions

### What changes were proposed in this pull request?

Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example:

```
from pyspark.sql.functions import col, avg, udf

pythonUDF = udf(lambda x: x).asNondeterministic()

spark.range(10)\
    .selectExpr("id", "id % 3 as value")\
    .groupBy(pythonUDF(col("value")))\
    .agg(avg("id"), pythonUDF(col("value")))\
    .explain(extended=True)
```

Currently results in a plan like this:

```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```

and then it throws:

```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```

- how canonicalized fixes this:
  - nondeterministic PythonUDF expressions always have distinct resultIds per udf
  - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions.
  - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected

### Why are the changes needed?

- the output of the query with the fix applied still makes sense
- the nondeterministic UDF is invoked only once, in the project.

### Does this PR introduce _any_ user-facing change?

Yes, it's additive, it enables queries to run that previously threw errors.

### How was this patch tested?

- added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.

Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added the KUBERNETES label Mar 7, 2024

EnricoMi changed the title ~~Support custom executor log urls~~ [SPARK-47323][K8S] Support custom executor log urls Mar 8, 2024

EnricoMi force-pushed the k8s-custom-executor-log-url branch from 27af189 to 239fb2f Compare March 11, 2024 06:50

github-actions bot added DOCS WEB UI labels Mar 11, 2024

EnricoMi force-pushed the k8s-custom-executor-log-url branch from a465e9b to b9ea61b Compare March 11, 2024 09:07

EnricoMi force-pushed the k8s-custom-executor-log-url branch from 80070ef to 2f896c8 Compare April 20, 2024 16:23

github-actions bot added CORE SQL PYTHON INFRA BUILD MLLIB STRUCTURED STREAMING EXAMPLES ML PANDAS API ON SPARK YARN AVRO DSTREAM R CONNECT PROTOBUF GRAPHX labels Apr 20, 2024

github-actions bot added the Stale label Jul 30, 2024

EnricoMi added 4 commits July 30, 2024 10:06

Support custom executor log urls

4a92079

Add option to kubernetes docs

e5d73ca

Add KUBERNETES_NAMESPACE, rename pod name to KUBERNETES_POD_NAME

45c3837

Prefer env vars over kubernetes backend attributes

315a0bb

EnricoMi force-pushed the k8s-custom-executor-log-url branch from 2f896c8 to 315a0bb Compare July 30, 2024 08:06

github-actions bot removed CORE SQL PYTHON INFRA BUILD MLLIB STRUCTURED STREAMING EXAMPLES ML PANDAS API ON SPARK YARN AVRO DSTREAM R CONNECT PROTOBUF GRAPHX labels Jul 30, 2024

github-actions bot closed this Jul 31, 2024
