
[vpj] Fix DATASET_CHANGED tracking for Spark jobs #1513

Draft
wants to merge 1 commit into main

Conversation

@nisargthakkar (Contributor) commented Feb 10, 2025

Fix DATASET_CHANGED tracking for Spark jobs

In the Spark ComputeJob, some DATASET_CHANGED failures were not being tracked correctly and were instead being reported as START_DATA_WRITER_JOB.

This happened because of an incorrect assumption that Spark jobs would only execute once the final action stage was encountered. In practice, Spark stages began executing as soon as the repartitionAndSortWithinPartitions call was invoked. That call was previously made in the configure phase, and exceptions thrown there are not handled as errors during the execution of the compute job.

This commit moves all Spark execution logic inside the runComputeJob method instead.
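The restructuring can be sketched as follows. This is a minimal, self-contained illustration of the pattern described above, not Venice's actual code: the class, enum values, and helper names are hypothetical, and the Spark shuffle is simulated with a plain method so the example runs without a cluster.

```java
// Hypothetical sketch of the fix: defer all Spark work into runComputeJob
// so that exceptions thrown once stages start executing (e.g. when a
// shuffle such as repartitionAndSortWithinPartitions kicks off) are
// attributed to the compute job and tracked as DATASET_CHANGED, rather
// than surfacing during the configure phase.
public class ComputeJobSketch {
  enum Checkpoint { START_DATA_WRITER_JOB, DATASET_CHANGED }

  static Checkpoint lastCheckpoint = Checkpoint.START_DATA_WRITER_JOB;

  // Simulates a dataset-changed failure that only surfaces once Spark
  // stages actually run (stand-in for the real shuffle call).
  static void runShuffle(boolean datasetChanged) {
    if (datasetChanged) {
      throw new IllegalStateException("Input dataset changed during push");
    }
  }

  // Before the fix, the shuffle was triggered from configure(), so this
  // failure was reported as a START_DATA_WRITER_JOB error. With all
  // execution logic inside runComputeJob, it can be tracked correctly.
  static void runComputeJob(boolean datasetChanged) {
    try {
      runShuffle(datasetChanged);
    } catch (IllegalStateException e) {
      lastCheckpoint = Checkpoint.DATASET_CHANGED;
    }
  }

  public static void main(String[] args) {
    runComputeJob(true);
    System.out.println(lastCheckpoint); // prints DATASET_CHANGED
  }
}
```

The key design point is that a Spark shuffle is not guaranteed to stay lazy until the final action, so any step that can trigger stage execution must live inside the method whose exceptions are mapped to compute-job error checkpoints.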

How was this PR tested?

GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.
