
[vpj] Fix DATASET_CHANGED tracking for Spark jobs #1513

Draft
wants to merge 1 commit into main

Conversation

@nisargthakkar (Contributor) commented Feb 10, 2025

Fix DATASET_CHANGED tracking for Spark jobs

In the Spark ComputeJob, some DATASET_CHANGED failures were not being tracked correctly and were instead being reported as START_DATA_WRITER_JOB.

This happened because of an incorrect assumption that Spark jobs would only execute once the final action stage was encountered. In practice, Spark stages began executing as soon as the repartitionAndSortWithinPartitions call was invoked. That call was previously made in the configure phase, and exceptions thrown there are not handled as errors during the execution of the compute job.

This commit moves all Spark execution logic inside the runComputeJob method instead.
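The restructuring can be sketched as follows. This is a minimal, self-contained illustration of the pattern described above, not Venice's actual code: the class, enum values, and helper names are hypothetical, and the Spark shuffle is simulated with a plain method so the example runs without a cluster.

```java
// Hypothetical sketch of the fix: defer all Spark work into runComputeJob
// so that exceptions thrown once stages start executing (e.g. when a
// shuffle such as repartitionAndSortWithinPartitions kicks off) are
// attributed to the compute job and tracked as DATASET_CHANGED, rather
// than surfacing during the configure phase.
public class ComputeJobSketch {
  enum Checkpoint { START_DATA_WRITER_JOB, DATASET_CHANGED }

  static Checkpoint lastCheckpoint = Checkpoint.START_DATA_WRITER_JOB;

  // Simulates a dataset-changed failure that only surfaces once Spark
  // stages actually run (stand-in for the real shuffle call).
  static void runShuffle(boolean datasetChanged) {
    if (datasetChanged) {
      throw new IllegalStateException("Input dataset changed during push");
    }
  }

  // Before the fix, the shuffle was triggered from configure(), so this
  // failure was reported as a START_DATA_WRITER_JOB error. With all
  // execution logic inside runComputeJob, it can be tracked correctly.
  static void runComputeJob(boolean datasetChanged) {
    try {
      runShuffle(datasetChanged);
    } catch (IllegalStateException e) {
      lastCheckpoint = Checkpoint.DATASET_CHANGED;
    }
  }

  public static void main(String[] args) {
    runComputeJob(true);
    System.out.println(lastCheckpoint); // prints DATASET_CHANGED
  }
}
```

The key design point is that a Spark shuffle is not guaranteed to stay lazy until the final action, so any step that can trigger stage execution must live inside the method whose exceptions are mapped to compute-job error checkpoints.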

How was this PR tested?

GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.
