Spark streaming (merge into) iceberg table concurrent write with compaction job #12187

Open · 2MD opened this issue Feb 6, 2025 · 0 comments
Labels: question (Further information is requested)

2MD (Contributor) commented Feb 6, 2025
Query engine

Iceberg version: 1.7.1
Spark version: 3.3.2

Question

We have a Spark Structured Streaming application, A1, which executes the following in every microbatch:

```scala
s"""
   |MERGE INTO table AS t
   |USING (SELECT * FROM $tempViewName) AS s
   |ON $joinCondition
   |WHEN MATCHED AND s.$versionColumnName > t.$versionColumnName THEN UPDATE SET *
   |WHEN NOT MATCHED THEN INSERT *
   |""".stripMargin
```
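For context, a minimal sketch of how such a merge is typically wired into the streaming query via `foreachBatch`. The names `inputStream`, `tempViewName`, `joinCondition`, and `versionColumnName` are placeholders, not taken from the issue, and this assumes a running Spark session:

```scala
// Hypothetical wiring of the per-microbatch MERGE; identifiers are illustrative.
inputStream.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Register the microbatch as a temp view so the MERGE can read from it.
    val tempViewName = s"updates_$batchId"
    batch.createOrReplaceTempView(tempViewName)
    batch.sparkSession.sql(
      s"""
         |MERGE INTO table AS t
         |USING (SELECT * FROM $tempViewName) AS s
         |ON $joinCondition
         |WHEN MATCHED AND s.$versionColumnName > t.$versionColumnName THEN UPDATE SET *
         |WHEN NOT MATCHED THEN INSERT *
         |""".stripMargin)
  }
  .start()
```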

A microbatch may update rows in any data file, and the table is unpartitioned.

We also have a Spark batch application, A2, which runs maintenance on the same table: expiring old snapshots, rewriting manifests, compaction (binpack, and sometimes z-order), rewriting position delete files, and deleting orphan files.
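As a point of reference, the maintenance steps listed above map onto Iceberg's built-in Spark procedures roughly as follows; the catalog name, table name, and thresholds here are illustrative only:

```sql
-- Illustrative A2 maintenance run; 'catalog' and 'db.table' are placeholders.
CALL catalog.system.expire_snapshots(table => 'db.table',
  older_than => TIMESTAMP '2025-02-01 00:00:00');
CALL catalog.system.rewrite_manifests(table => 'db.table');
-- binpack compaction:
CALL catalog.system.rewrite_data_files(table => 'db.table', strategy => 'binpack');
-- or z-order compaction (column list is illustrative):
CALL catalog.system.rewrite_data_files(table => 'db.table',
  strategy => 'sort', sort_order => 'zorder(col1, col2)');
CALL catalog.system.rewrite_position_delete_files(table => 'db.table');
CALL catalog.system.remove_orphan_files(table => 'db.table');
```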

How can we avoid concurrency conflicts between the two applications?

(We are still considering launching A2 inside an A1 microbatch, but that is not the best solution.)
