feat(glue-alpha): include extra jars parameter in pyspark jobs #33238
base: main
Conversation
Exemption Request: no changes in README or integration tests needed.
The pull request linter fails with the following errors:
❌ Features must contain a change to a README file.
❌ Features must contain a change to an integration test file and the resulting snapshot.
If you believe this pull request should receive an exemption, please comment and provide a justification. A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.
✅ An exemption request has been submitted. Please wait for a maintainer's review.
One of the authors of the new L2 here - we talked about this during the RFC and implementation phases as a potential anti-pattern. Can you share why you need extra jars for a Python job?
Hi Natalie, we need to use the spark-xml package in order to read XML files in Spark v3 (as you probably know, this package will be included in Spark v4). This package must be provided via the --extra-jars job parameter.
Thanks for the extra clarification. Let me get with the Glue service team; it sounds like this may be more of a Glue feature request than something we should work around in the L2 construct. Stay tuned.
+1 to this. Some libraries that provide additional Spark capabilities require a JAR, even if one is actually using Spark via Python (PySpark). Here's a chatgpt-generated list of examples: https://chatgpt.com/share/67a8e12d-ccd8-800e-a641-75e58db91d7b
We had some internal discussions and (in addition to the data here) decided this is a valid use case. But we should add them to all 3 PySpark job types.
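For readers following along, here is a minimal sketch of how the proposed prop could be used, assuming it is named extraJars (per this PR's title) and accepts S3 URLs; the bucket, script path, and spark-xml version are placeholders, and the merged signature may differ. The same prop would apply to the other two PySpark job types mentioned above (streaming and flex ETL).

```ts
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as glue from '@aws-cdk/aws-glue-alpha';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'XmlJobStack');

// An execution role for the Glue job.
const role = new iam.Role(stack, 'GlueJobRole', {
  assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
  ],
});

// A Python job that still needs a JAR on the Java classpath:
// spark-xml lets PySpark read XML files under Spark 3.
new glue.PySparkEtlJob(stack, 'XmlEtlJob', {
  role,
  script: glue.Code.fromAsset('jobs/process_xml.py'),
  // Proposed by this PR; the S3 URL below is a placeholder.
  extraJars: ['s3://my-deps-bucket/jars/spark-xml_2.12-0.17.0.jar'],
});

app.synth();
```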
Issue # (if applicable)
Closes #33225.
Reason for this change
PySpark jobs with extra JAR dependencies cannot be defined with the new L2 constructs introduced in v2.177.0.
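To make the scope concrete, here is a hypothetical sketch (not the actual diff) of what the change involves: an optional prop on the PySpark job props, wired to Glue's --extra-jars special job argument. The interface and helper names below are illustrative only.

```ts
// Hypothetical excerpt, following glue-alpha's documentation style.
export interface ExtraJarsProps {
  /**
   * S3 URLs of additional JAR files placed on the job's Java
   * classpath, even though the job script itself is Python.
   * @default - no extra JAR files
   */
  readonly extraJars?: string[];
}

// Glue's --extra-jars special parameter expects a comma-separated
// list of S3 paths, so the construct would join the values.
export function renderExtraJarsArgument(props: ExtraJarsProps): Record<string, string> {
  return props.extraJars && props.extraJars.length > 0
    ? { '--extra-jars': props.extraJars.join(',') }
    : {};
}
```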
Description of changes
Add the extraJars parameter in the PySpark job L2 constructs.

Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license