Conversation

@SreeramGarlapati commented May 19, 2021

This work is an extension of the idea in issue #179 and of the Spark2 work done in PR #2272; the difference is that this one targets Spark3.

In the current implementation:

  • An Iceberg Snapshot is the upper bound for a MicroBatch: a given MicroBatch only spans files within a single Snapshot and is never composed of multiple Snapshots. BatchSize limits the number of files read within a given Snapshot. (A usage sketch follows this list.)
  • By default, the streaming reader ignores Snapshots of type DELETE or REPLACE. Rationale: DELETEs are common for GDPR compliance, and REPLACEs are common during table maintenance / compaction rewrites.
  • OVERWRITEs are not handled; left for future work.
  • Columnar reads are not enabled; left for future work.
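
For illustration, a minimal sketch of how a consumer might start such a micro-batch read, assuming the usual format("iceberg") streaming source; the option name streaming-max-files-per-micro-batch stands in for the BatchSize knob and is an assumption, not necessarily what this PR exposes:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergStreamReadSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-micro-batch-read")
        .getOrCreate();

    // Each micro-batch is bounded by a single snapshot; the (hypothetical)
    // option below caps the number of files read from that snapshot.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("streaming-max-files-per-micro-batch", "100")
        .load("db.table");

    stream.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/db.table")
        .start()
        .awaitTermination();
  }
}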

cc: @aokolnychyi & @RussellSpitzer & @holdenk @rdblue @rdsr

github-actions bot added the spark label May 19, 2021
return new SparkBatchQueryScan(
    spark, table, caseSensitive, schemaWithMetadataColumns(), filterExpressions, options);
// TODO: understand how to differentiate that this is a spark streaming microbatch scan.
if (false) {

@SreeramGarlapati commented May 19, 2021


@aokolnychyi / @RussellSpitzer / @holdenk
Spark3 provides the ScanBuilder abstraction to define all types of Scans (Batch, MicroBatch, and Continuous), but the current implementation / class modelling has SparkBatchScan as the sole Scan implementation.
Looking at some of the concerns of BatchScan (state maintenance of the single SnapshotId to read from, asOfTimestamp, and features like vectorized reads), none of these seem relevant to streaming Scans.
So I feel we need to split streaming Scans out into a different class.
Does this thought process make sense?
If we go this route, do you folks know how to pass different Scan objects to Spark based on Batch vs Streaming? (One possibility is sketched below.)
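
On the last question: in Spark 3's DataSource V2 API, ScanBuilder.build() returns a single Scan, and Spark itself picks the entry point (toBatch() for batch queries, toMicroBatchStream() for streaming), so batch and streaming logic can live behind separate implementations. A minimal sketch against the real Spark interfaces; everything besides those interfaces is hypothetical:

import org.apache.spark.sql.connector.read.Batch;
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream;
import org.apache.spark.sql.types.StructType;

class SparkScanSketch implements Scan {
  private final StructType schema;

  SparkScanSketch(StructType schema) {
    this.schema = schema;
  }

  @Override
  public StructType readSchema() {
    return schema;
  }

  @Override
  public Batch toBatch() {
    // Batch-only concerns (single snapshot-id state, as-of-timestamp,
    // vectorized reads) would live in the implementation returned here.
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public MicroBatchStream toMicroBatchStream(String checkpointLocation) {
    // Streaming-only concerns (offset tracking, snapshot iteration)
    // would live in the implementation returned here.
    throw new UnsupportedOperationException("sketch only");
  }
}

Spark calls toMicroBatchStream() only for readStream() queries, so no extra plumbing is needed to route the two cases.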

@SreeramGarlapati changed the title from "[WorkInProgress] Spark3 structured streaming micro_batch read support" to "Spark3 structured streaming micro_batch read support" on May 29, 2021
public void stop() {
}

private String getOffsetLogLocation(String checkpointLocation) {

@SreeramGarlapati commented May 29, 2021


@aokolnychyi @RussellSpitzer @rdblue @holdenk - do you folks know of a better way to read the last checkpointed offset?
I need this hack for cases where the Spark cluster goes down and the stream has to restart from an old checkpoint. I definitely do NOT plan to keep this hack; I am trying to understand a better way to do this. I would truly appreciate any help here.
I started a conversation in the Iceberg Slack channel.
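
For context, Spark's structured streaming checkpoint stores each micro-batch's offsets as a file named by batch id under <checkpointLocation>/offsets, so one (hacky) way to locate the last checkpointed offset is to scan that directory. A minimal sketch of that approach; the class and method names are mine, and parsing the offset file's JSON body is left out:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class OffsetLogPeek {
  // Returns the path of the highest-numbered offset file, or null if none exist.
  static Path lastOffsetFile(String checkpointLocation, Configuration conf) throws IOException {
    Path offsetsDir = new Path(checkpointLocation, "offsets");
    FileSystem fs = offsetsDir.getFileSystem(conf);
    Path latest = null;
    long maxBatchId = -1L;
    for (FileStatus status : fs.listStatus(offsetsDir)) {
      try {
        long batchId = Long.parseLong(status.getPath().getName());
        if (batchId > maxBatchId) {
          maxBatchId = batchId;
          latest = status.getPath();
        }
      } catch (NumberFormatException ignored) {
        // skip non-offset entries such as .crc checksum files
      }
    }
    return latest;
  }
}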

// Walk the snapshot lineage backwards from the table's current snapshot until
// reaching the direct child of the snapshot the micro-batch start offset points
// at, verifying that every snapshot along the way is an APPEND.
Snapshot previousSnapshot = table.snapshot(microBatchStartOffset.snapshotId());
Snapshot pointer = table.currentSnapshot();
while (pointer != null && previousSnapshot.snapshotId() != pointer.parentId()) {
  Preconditions.checkState(pointer.operation().equals(DataOperations.APPEND),

TODO: Add unit test coverage for the overwrite operation.


@holdenk commented Jun 1, 2021

Ok, this is great work; I'm going to have to get back up to speed on the streaming stuff. Perhaps @viirya has some thoughts here as well.
From the PR description, though, I'm not sure I understand the logic behind skipping deletes, since GDPR compliance is rather important.

@SreeramGarlapati

@holdenk - thanks for your first take on the PR. I would be happy to hear @viirya's thoughts as well. I am unable to tag @viirya; please see if you can...

The PR description's note about GDPR is there to decide between
a) ignoring deletes by default, or
b) taking a special flag in order to ignore deletes.

The reasoning is as follows:

  • In Spark structured streaming, we stream the Iceberg table row by row, so there is no way to stream deletes out of an Iceberg table.
  • This implies that when we encounter deletes, we are left with two choices:
    1. fail with an UnsupportedOperation error for the DELETE snapshot operation, or
    2. ignore deletes and move on.
  • Almost all users of Iceberg tables want to be GDPR compliant.
  • Which implies they will delete some rows out of their Iceberg tables and still want to stream reads out of those tables.
  • So, if we throw UnsupportedOperation whenever we encounter a Snapshot of type DELETE while reading off of an Iceberg table, potentially all tables out there will need to handle this!
  • My proposal is therefore to accept that Iceberg tables will have GDPR deletes; i.e., if the Iceberg table has Snapshots marked as DELETE, we ignore those Snapshots for streaming-read purposes. In a later PR I will expose a Spark option that gives the ability to fail the streaming read if a DELETE is encountered (sketched after this list).
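
For concreteness, a minimal sketch of the proposed skip-on-DELETE policy; the flag and class are hypothetical, not code from this PR:

import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;

class DeleteSnapshotPolicy {
  private final boolean ignoreDeleteSnapshots;  // assumed to default to true

  DeleteSnapshotPolicy(boolean ignoreDeleteSnapshots) {
    this.ignoreDeleteSnapshots = ignoreDeleteSnapshots;
  }

  // Returns true if the snapshot should be streamed, false if it should be skipped.
  boolean shouldStream(Snapshot snapshot) {
    if (DataOperations.DELETE.equals(snapshot.operation())) {
      if (ignoreDeleteSnapshots) {
        return false;  // GDPR deletes happen; skip the snapshot and move on
      }
      throw new UnsupportedOperationException(
          "Encountered DELETE snapshot " + snapshot.snapshotId() + " during streaming read");
    }
    return true;
  }
}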

Did this make sense!? Happy to discuss.

@SreeramGarlapati deleted the spark3.stream.read branch June 2, 2021 06:31
@SreeramGarlapati

From the PR description though, I'm not sure I understand the logic behind skipping deletes since GDPR compliance is rather important.

Hi @holdenk - after giving it a bit more thought, I actually think otherwise now; overall, I agree with you. Ignoring deletes from tables might spook some of the consumers of this data. For now, I removed the IgnoreDeletes part from the PR and updated the description to reflect the same here: #2660

PS: I was playing around with this branch to squash-merge my 28 commits into one (the history was very chatty), and in the process GitHub closed this PR and is not letting me reopen it. So I had to create a brand new PR.

