
Conversation

@aihuaxu (Contributor) commented Oct 11, 2025

This change adds support for writing shredded variants in the iceberg-spark module, enabling Spark to write shredded variant data into Iceberg tables.

Ideally, this should follow the approach described in the reader/writer API proposal for Iceberg V4, where the execution engine provides the shredded writer schema before creating the Iceberg writer. This design is cleaner, as it delegates schema generation responsibility to the engine.

As an interim solution, this PR implements a writer with lazy initialization for the actual Parquet writer. It buffers a portion of the data first, derives the shredded schema from the buffered records, then initializes the Parquet writer and flushes the buffered data to the file.
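In outline, the interim writer follows the buffer-then-initialize pattern sketched below. This is only a rough illustration; the class name, the `RecordWriter` interface, and the schema-derivation hook are stand-ins, not the PR's actual code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hedged sketch of the buffer-then-initialize pattern: buffer records, derive the
// shredded schema from the buffered records, create the real writer, replay the buffer.
class LazyShreddedWriter<T, S> {
  private final int bufferLimit;
  private final Function<List<T>, S> deriveShreddedSchema;   // inspects buffered rows
  private final Function<S, RecordWriter<T>> writerFactory;  // builds the real (e.g. Parquet) writer
  private final List<T> buffer = new ArrayList<>();
  private RecordWriter<T> delegate = null;

  LazyShreddedWriter(int bufferLimit,
                     Function<List<T>, S> deriveShreddedSchema,
                     Function<S, RecordWriter<T>> writerFactory) {
    this.bufferLimit = bufferLimit;
    this.deriveShreddedSchema = deriveShreddedSchema;
    this.writerFactory = writerFactory;
  }

  void write(T record) throws IOException {
    if (delegate != null) {
      delegate.write(record);
      return;
    }
    buffer.add(record);
    if (buffer.size() >= bufferLimit) {
      initializeAndFlush();
    }
  }

  void close() throws IOException {
    if (delegate == null) {
      // Fewer records than the buffer limit: initialize from whatever was buffered.
      initializeAndFlush();
    }
    delegate.close();
  }

  private void initializeAndFlush() throws IOException {
    S shreddedSchema = deriveShreddedSchema.apply(buffer);
    delegate = writerFactory.apply(shreddedSchema);
    for (T buffered : buffer) {
      delegate.write(buffered);
    }
    buffer.clear();
  }

  // Minimal stand-in for the underlying file-format writer.
  interface RecordWriter<T> {
    void write(T record) throws IOException;
    void close() throws IOException;
  }
}
```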

@aihuaxu force-pushed the spark-write-iceberg-variant branch from 16b7a09 to dc4f72e on October 11, 2025 21:03
@aihuaxu marked this pull request as ready for review on October 11, 2025 21:15
@aihuaxu force-pushed the spark-write-iceberg-variant branch 2 times, most recently from 3a7d704 to 97851f0 on October 13, 2025 04:47
@aihuaxu force-pushed the spark-write-iceberg-variant branch from 97851f0 to b87e999 on October 13, 2025 16:47
@aihuaxu (Contributor, Author) commented Oct 15, 2025

@amogh-jahagirdar @Fokko @huaxingao Could you take a look at this PR and let me know whether there is a better approach?

@aihuaxu (Contributor, Author) commented Oct 21, 2025

cc @RussellSpitzer, @pvary, and @rdblue: it seems better to implement this with the new File Format proposal, but I want to check whether this is acceptable as an interim solution, or whether you see a better alternative.

lazy.initialize(props, compressor, rowGroupOrdinal);
this.parquetSchema = result.getSchema();
this.pageStore = result.getPageStore();
this.writeStore = result.getWriteStore();
Contributor (review comment on the snippet above):
It seems the initial writeStore/pageStore from startRowGroup() aren't closed before being replaced here. Could this cause a memory leak?
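For illustration, a minimal sketch of the close-before-replace pattern this question refers to. The store types are generic stand-ins, since the concrete Parquet-internal types are not visible in this snippet.

```java
// Hedged sketch, not the PR's code: close the stores created by the earlier
// startRowGroup() call before dropping the references to them.
class StoreHolder {
  private AutoCloseable pageStore;
  private AutoCloseable writeStore;

  void replaceStores(AutoCloseable newPageStore, AutoCloseable newWriteStore) throws Exception {
    if (writeStore != null) {
      writeStore.close(); // release whatever the old write store still buffers
    }
    if (pageStore != null) {
      pageStore.close(); // release pages buffered by the old page store
    }
    this.pageStore = newPageStore;
    this.writeStore = newWriteStore;
  }
}
```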

@pvary (Contributor) commented Oct 21, 2025

@aihuaxu: Couldn't we do the same thing, but wrap the DataWriter instead of the ParquetWriter? The schema would be created near SparkWrite.WriterFactory, which would make it easier to move to the new API when it is ready. An added benefit is that when other formats implement Variant, we could reuse the code.

Would this be prohibitively complex?

@huaxingao (Contributor) commented:

In Spark DSv2, planning/validation happens on the driver. BatchWrite#createBatchWriterFactory runs on the driver and returns a DataWriterFactory that is serialized to executors. That factory must already carry the write schema the executors will use when they create DataWriters.

For shredded variant, we don’t know the shredded schema at planning time. We have to inspect some records to derive it. Doing a read on the driver during createBatchWriterFactory would mean starting a second job inside planning, which is not how DSv2 is intended to work.

Because of that, the currently proposed Spark approach is: put the logical variant type in the writer factory; on the executor, buffer the first N rows, infer the shredded schema from the data, then initialize the concrete writer and flush the buffer. I believe this PR follows the same approach, which seems like a practical solution given DSv2's constraints.
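As a rough illustration of that constraint, a factory along these lines could carry the logical schema and sampling configuration from the driver. Only DataWriterFactory, DataWriter, InternalRow, and StructType are Spark's DSv2/catalyst types here; the class itself and the ExecutorWriterSupplier hook are hypothetical.

```java
import java.io.Serializable;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.types.StructType;

// Hypothetical factory: built on the driver during planning, serialized to executors.
// It can only carry the logical write schema (variant still unshredded) plus
// configuration; the shredded schema is derived later, inside the executor-side writer.
class ShreddedVariantWriterFactory implements DataWriterFactory {

  // Spark serializes the factory, so the hook that builds the executor-side
  // buffering writer must itself be Serializable.
  interface ExecutorWriterSupplier extends Serializable {
    DataWriter<InternalRow> create(StructType logicalSchema, int sampleSize, int partitionId, long taskId);
  }

  private final StructType logicalWriteSchema;          // known on the driver at planning time
  private final int sampleSize;                         // rows to buffer before inferring the shredded schema
  private final ExecutorWriterSupplier writerSupplier;  // hypothetical hook building the buffering writer

  ShreddedVariantWriterFactory(StructType logicalWriteSchema, int sampleSize,
                               ExecutorWriterSupplier writerSupplier) {
    this.logicalWriteSchema = logicalWriteSchema;
    this.sampleSize = sampleSize;
    this.writerSupplier = writerSupplier;
  }

  @Override
  public DataWriter<InternalRow> createWriter(int partitionId, long taskId) {
    // Runs on the executor: the earliest point where rows can be sampled, so the
    // shredded schema is derived inside the writer this returns, not here.
    return writerSupplier.create(logicalWriteSchema, sampleSize, partitionId, taskId);
  }
}
```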

@pvary (Contributor) commented Oct 22, 2025

Thanks for the explanation, @huaxingao! I see several possible workarounds for the DataWriterFactory serialization issue, but I have some more fundamental concerns about the overall approach.
I believe shredding should be driven by future reader requirements rather than by the actual data being written. Ideally, it should remain relatively stable across data files within the same table and originate from a writer job configuration—or even better, from a table-level configuration.

Even if we accept that the written data should dictate the shredding logic, Spark’s implementation—while dependent on input order—is at least somewhat stable. It drops rarely used fields, handles inconsistent types, and limits the number of columns.
I understand this is only a PoC implementation for shredding, but I’m concerned that the current simplifications make it very unstable. If I’m interpreting correctly, the logic infers the type from the first occurrence of each field and creates a column for every field. This could lead to highly inconsistent column layouts within a table, especially in IoT scenarios where multiple sensors produce vastly different data.
Did I miss anything?

@aihuaxu (Contributor, Author) commented Oct 24, 2025

Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for explaining how the writer works in Spark.

Regarding the concern about unstable schemas, Spark's approach makes sense:

  • If a field appears consistently with a consistent type, create both value and typed_value
  • If a field appears with inconsistent types, create only value
  • Drop fields that occur in less than 10% of sampled rows
  • Cap the total at 300 fields (counting value and typed_value separately)

We could implement similar heuristics. Additionally, making the shredded schema configurable would allow users to choose which fields to shred at write time based on their read patterns.
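A rough sketch of such heuristics is below. It is illustrative only, not Spark's implementation; it assumes each sampled variant has already been reduced to a field-name-to-type map, and the 10% / 300 thresholds are simply the numbers quoted above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical schema-shape inference over sampled variant records.
class ShreddingSchemaInference {

  enum FieldShape { VALUE_AND_TYPED_VALUE, VALUE_ONLY }

  static Map<String, FieldShape> inferShape(List<Map<String, String>> sampledFieldTypes) {
    Map<String, Integer> occurrences = new LinkedHashMap<>();
    Map<String, String> firstType = new HashMap<>();
    Map<String, Boolean> consistent = new HashMap<>();

    for (Map<String, String> record : sampledFieldTypes) {
      for (Map.Entry<String, String> field : record.entrySet()) {
        String name = field.getKey();
        occurrences.merge(name, 1, Integer::sum);
        String seen = firstType.putIfAbsent(name, field.getValue());
        if (seen != null && !seen.equals(field.getValue())) {
          consistent.put(name, false); // same field observed with different types
        } else {
          consistent.putIfAbsent(name, true);
        }
      }
    }

    int minOccurrences = (int) Math.ceil(sampledFieldTypes.size() * 0.10); // drop rare fields
    int shreddedColumnBudget = 300; // value and typed_value each count against the cap
    Map<String, FieldShape> shape = new LinkedHashMap<>();

    for (Map.Entry<String, Integer> entry : occurrences.entrySet()) {
      if (entry.getValue() < minOccurrences) {
        continue; // appears in fewer than 10% of sampled rows
      }
      boolean typed = consistent.getOrDefault(entry.getKey(), false);
      int cost = typed ? 2 : 1;
      if (shreddedColumnBudget - cost < 0) {
        break; // cap reached; remaining fields stay in the un-shredded value
      }
      shreddedColumnBudget -= cost;
      shape.put(entry.getKey(), typed ? FieldShape.VALUE_AND_TYPED_VALUE : FieldShape.VALUE_ONLY);
    }
    return shape;
  }
}
```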

For this POC, I'd first like feedback on whether there are significant high-level design options to consider and whether this approach is acceptable. It does feel hacky; I may be missing the big picture of how the writers work across Spark + Iceberg + Parquet, and there may be a better way.
