
feat: Support metadata table "Entries" #863

Open
rshkv wants to merge 1 commit into main from wr/metadata-entries

Conversation

@rshkv (Contributor) commented Jan 1, 2025

Re #823. This adds support for the Manifest Entries table (docs), which lists entries in the current snapshot's manifest files.

The code is structured with nested builders of StructArray. The hierarchy is roughly as follows:

  • EntriesTable
    • status, snapshot_id, sequence_number, file_sequence_number
    • data_file built by the DataFileStructBuilder, which has:
      • file_path, file_format, record_count, etc.
      • partition is a struct of partition values built by PartitionValuesStructBuilder
        • has, for each partition column, an AnyArrayBuilder, which I had to introduce to do dynamic ArrayBuilder casting based on the column type
    • readable_metrics is built by the ReadableMetricsStructBuilder
      • for each column, has a PerColumnReadableMetricsBuilder
        • contains column_size, value_count
        • has upper_bound and lower_bound structs which use AnyArrayBuilder to preserve the type

This PR ended up being quite verbose because arrow-rs is strict about declaring the generic types of array builders at compile time. Unlike Python, which supports the entries table in ~100 lines, we can't shove a dict into a StructBuilder. Ideally, we could build a StructArray row by row and write logic to convert manifest entries to rows.
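To make the nested-builder pattern concrete, here is a minimal, self-contained sketch (hypothetical field names, not code from this PR) of what arrow-rs requires: every child builder's concrete type is declared up front, and appending a row means appending to each typed child builder individually.

use arrow_array::builder::{ArrayBuilder, Int64Builder, StringBuilder, StructBuilder};
use arrow_array::StructArray;
use arrow_schema::{DataType, Field, Fields};

fn build_rows() -> StructArray {
    let fields = Fields::from(vec![
        Field::new("file_path", DataType::Utf8, false),
        Field::new("record_count", DataType::Int64, false),
    ]);
    let mut builder = StructBuilder::new(fields, vec![
        Box::new(StringBuilder::new()) as Box<dyn ArrayBuilder>,
        Box::new(Int64Builder::new()),
    ]);
    // One row: append to every typed child builder, then mark the
    // struct slot itself as valid.
    builder
        .field_builder::<StringBuilder>(0)
        .unwrap()
        .append_value("s3://bucket/data/file-a.parquet");
    builder
        .field_builder::<Int64Builder>(1)
        .unwrap()
        .append_value(42);
    builder.append(true);
    builder.finish()
}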

Reference implementations:

@Xuanwo (Member) left a comment:

Hi, @rshkv, sorry for making you rebase the PRs again. There are multiple open PRs here, and I spent some time figuring out the best API. I believe we are now heading in the right direction.

We can focus on merging one PR first and then updating the other PRs to avoid additional work.

}

/// Snapshots table.
pub struct SnapshotsTable<'a> {
@Xuanwo (Member):

Hi, I think we can simply hold a Table here, allowing us to remove the duplicate APIs exposed at the MetadataTable level and make it a straightforward wrapper instead.
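A minimal sketch of that shape (hypothetical code, assuming the crate's Table::metadata() accessor):

use iceberg::spec::TableMetadata;
use iceberg::table::Table;

/// Snapshots table backed directly by the table it describes.
pub struct SnapshotsTable<'a> {
    table: &'a Table,
}

impl<'a> SnapshotsTable<'a> {
    /// Delegate to the wrapped table instead of duplicating accessors.
    fn metadata(&self) -> &TableMetadata {
        self.table.metadata()
    }
}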

Comment on lines +829 to +843
/// A helper wrapping [ArrayBuilder] for building arrays without declaring the inner type at
/// compile-time when types are determined dynamically (e.g. based on some column type).
/// A [DataType] is given at construction time which is used to later downcast the inner array
/// and provided values.
pub(crate) struct AnyArrayBuilder {
data_type: DataType,
inner: Box<dyn ArrayBuilder>,
}
@rshkv (author):

I appreciate this is quite verbose, and I wish we didn't have to do all the pattern matching below. If you can think of another way to do this, let me know.
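For context, a minimal sketch (simplified, not the PR's exact code) of the runtime downcast this relies on: ArrayBuilder exposes as_any_mut(), so the concrete builder can be recovered from the Box<dyn ArrayBuilder> based on the DataType recorded at construction.

use arrow_array::builder::{ArrayBuilder, Int64Builder};
use arrow_schema::DataType;

struct AnyArrayBuilder {
    data_type: DataType,
    inner: Box<dyn ArrayBuilder>,
}

impl AnyArrayBuilder {
    fn new(data_type: DataType) -> Self {
        // Map each supported DataType to a concrete builder (only Int64
        // shown here; the real code matches every supported type).
        let inner: Box<dyn ArrayBuilder> = match &data_type {
            DataType::Int64 => Box::new(Int64Builder::new()),
            other => unimplemented!("no builder for {other:?}"),
        };
        Self { data_type, inner }
    }

    /// Recover the concrete builder that matches `data_type`.
    fn builder<T: ArrayBuilder>(&mut self) -> Option<&mut T> {
        self.inner.as_any_mut().downcast_mut::<T>()
    }
}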

Comment on lines +969 to +973
/// File sequence number.
#[inline]
pub fn file_sequence_number(&self) -> Option<i64> {
    self.file_sequence_number
}
@rshkv (author):

The file_sequence_number column relies on this value, but there was no way to get at ManifestEntry::file_sequence_number before.

@rshkv rshkv force-pushed the wr/metadata-entries branch 2 times, most recently from eecf1f8 to f82b7ff Compare January 2, 2025 22:08
let (array, is_scalar) = value.get();
assert!(is_scalar, "Can only append scalar datum");

match array.data_type() {
@rshkv (author):

This list is exhaustive based on the ArrowSchemaVisitor::primitive function above; i.e., every type produced there is covered here.

Comment on lines +880 to +892
DataType::Timestamp(TimeUnit::Microsecond, _) => self
    .builder::<TimestampMicrosecondBuilder>()?
    .append_value(array.as_primitive::<TimestampMicrosecondType>().value(0)),
DataType::Timestamp(TimeUnit::Nanosecond, _) => self
    .builder::<TimestampNanosecondBuilder>()?
    .append_value(array.as_primitive::<TimestampNanosecondType>().value(0)),
@rshkv (author):

I understand it's correct to ignore the timezone here because the timezone isn't captured in the builder.
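As an illustration (a sketch, not PR code): a timestamp builder stores raw i64 values, and the timezone lives only in the column's DataType, so there is nothing timezone-specific to append.

use arrow_array::builder::TimestampMicrosecondBuilder;
use arrow_array::Array;
use arrow_schema::{DataType, TimeUnit};

fn timezone_lives_in_the_type() {
    let mut builder = TimestampMicrosecondBuilder::new()
        .with_data_type(DataType::Timestamp(TimeUnit::Microsecond, Some("+00:00".into())));
    // Raw microseconds since epoch; the timezone only affects interpretation.
    builder.append_value(0);
    let array = builder.finish();
    assert_eq!(
        array.data_type(),
        &DataType::Timestamp(TimeUnit::Microsecond, Some("+00:00".into()))
    );
}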

Comment on lines +1130 to +1133
.column_sizes(HashMap::from([(1, 1u64), (2, 1u64)]))
.value_counts(HashMap::from([(1, 2u64), (2, 2u64)]))
.null_value_counts(HashMap::from([(1, 3u64), (2, 3u64)]))
.nan_value_counts(HashMap::from([(1, 4u64), (2, 4u64)]))
@rshkv (author):

These values aren't based on the test data, but I wanted to have them reflected in the tests.

Comment on lines +1094 to +1129
.lower_bounds(HashMap::from([
(1, Datum::long(1)),
(2, Datum::long(2)),
(3, Datum::long(3)),
(4, Datum::string("Apache")),
(5, Datum::double(100)),
(6, Datum::int(100)),
(7, Datum::long(100)),
(8, Datum::bool(false)),
(9, Datum::float(100.0)),
// decimal values are not supported by schema::get_arrow_datum
// (10, Datum::decimal(Decimal(123, 2))),
(11, Datum::date(0)),
(12, Datum::timestamp_micros(0)),
(13, Datum::timestamptz_micros(0)),
// ns timestamps, uuid, fixed, binary are currently not
// supported in schema::get_arrow_datum
]))
.upper_bounds(HashMap::from([
(1, Datum::long(1)),
(2, Datum::long(5)),
(3, Datum::long(4)),
(4, Datum::string("Iceberg")),
(5, Datum::double(200)),
(6, Datum::int(200)),
(7, Datum::long(200)),
(8, Datum::bool(true)),
(9, Datum::float(200.0)),
// decimal values are not supported by schema::get_arrow_datum
// (10, Datum::decimal(Decimal(123, 2))),
(11, Datum::date(0)),
(12, Datum::timestamp_micros(0)),
(13, Datum::timestamptz_micros(0)),
// ns timestamps, uuid, fixed, binary are currently not
// supported in schema::get_arrow_datum
]))
@rshkv (author):

Adding these so we cover those types in the lower and upper bounds.

I'm trying to limit the changes I'm making because this PR is already large. My preference would be to cover all types as partition columns as well.

Comment on lines +1127 to +1128
// ns timestamps, uuid, fixed, binary are currently not
// supported in schema::get_arrow_datum
@rshkv (author):

I could add support, but I thought that might be for another PR.

- pub fn metadata_table(self) -> MetadataTable {
+ pub fn metadata_table(&self) -> MetadataTable<'_> {
@rshkv (author):

Addressing this comment: #822 (comment). I prefer this signature, but it doesn't need to happen here.

/// Get the schema for the manifest entries table.
pub fn schema(&self) -> Schema {
    Schema::new(vec![
        Field::new("status", DataType::Int32, false),
@rshkv (author):

This is populated with ManifestStatus enum values.

In Java, the status column is i32 (here), but in Python it's u8.

My preference would be u8, but I'm treating the Java implementation as authoritative.
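For illustration, a sketch (with a hypothetical local enum mirroring the spec encoding of manifest entry status: 0 = existing, 1 = added, 2 = deleted) of appending the status column as i32:

use arrow_array::builder::Int32Builder;

// Hypothetical stand-in for the crate's ManifestStatus enum.
enum ManifestStatus {
    Existing = 0,
    Added = 1,
    Deleted = 2,
}

fn append_status(builder: &mut Int32Builder, status: ManifestStatus) {
    // i32 to match Java's schema, even though u8 would cover the range.
    builder.append_value(status as i32);
}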

Comment on lines 490 to 489
self.file_size_in_bytes
    .append_value(data_file.file_size_in_bytes() as i64);
@rshkv (author):

The casting is slightly annoying given we're dealing with non-negative values, but the Python and Java implementations use i64.

file_path: StringBuilder::new(),
file_format: StringBuilder::new(),
partition: PartitionValuesStructBuilder::new(table_metadata),
record_count: Int64Builder::new(),
@rshkv (author):

The manifests table merged in #861 prefers PrimitiveBuilder::new(), which works because Int64Builder is just PrimitiveBuilder<Int64Type>.

I prefer saying Int64Builder here to be explicit about the type, but I'm happy to change.
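Both spellings construct the same builder, since arrow-rs defines the alias type Int64Builder = PrimitiveBuilder<Int64Type>:

use arrow_array::builder::{Int64Builder, PrimitiveBuilder};
use arrow_array::types::Int64Type;

fn equivalent_builders() {
    let _explicit: Int64Builder = Int64Builder::new();
    let _generic: PrimitiveBuilder<Int64Type> = PrimitiveBuilder::new();
}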

@rshkv rshkv marked this pull request as ready for review January 2, 2025 22:13
@rshkv (author) commented Jan 2, 2025

Thank you, @Xuanwo. Rebased and ready for review.

@liurenjie1024 (Contributor) left a comment:

Thanks @rshkv for this contribution. I have finished a first-round review and left some concerns about the current API design.


impl<'a> EntriesTable<'a> {
    /// Get the schema for the manifest entries table.
    pub fn schema(&self) -> Schema {
@liurenjie1024 (Contributor):

Why return an Arrow schema rather than an Iceberg schema here?

@rshkv (author):

Following the existing API, but happy to update. I understand the idea in #822 (comment) was for engines to fetch the schema before having to fetch data.

@rshkv (author):

@liurenjie1024, would you mind saying more? I'm happy to go with either, but I'm not sure why.

Even if there's no consumer of schema() currently, I follow @xxchan's argument that the reader likely wants an Arrow schema. Another benefit is that we can use the schema ourselves when constructing scans. I'm not sure what a consumer would do with an Iceberg schema (except maybe convert it to Arrow).

As an alternative to having an Arrow or Iceberg schema, we could also not have a public schema()?

/// For reference, see the Java implementation of [`DataFile`][1].
///
/// [1]: https://github.com/apache/iceberg/blob/apache-iceberg-1.7.1/api/src/main/java/org/apache/iceberg/DataFile.java
struct DataFileStructBuilder<'a> {
@liurenjie1024 (Contributor):

This would not be required if we used the Iceberg schema.

@rshkv (author):

Can you say more? I suppose we'd still need to construct those StructArray instances?

@rshkv rshkv force-pushed the wr/metadata-entries branch from ec127ec to 28d92ad Compare January 8, 2025 13:25
Comment on lines +133 to +138
+--------------------------+---------------------+---------------------+-----------+---------+
| committed_at | snapshot_id | parent_id | operation | summary |
+--------------------------+---------------------+---------------------+-----------+---------+
| 2018-01-04T21:22:35.770Z | 3051729675574597004 | | append | {} |
| 2019-04-12T20:29:15.770Z | 3055729675574597004 | 3051729675574597004 | append | {} |
+--------------------------+---------------------+---------------------+-----------+---------+"#]],
@rshkv (author):

I think checking the rendered table is easier to read and confirm with eyeballs. We lose type information, but we assert the types separately.
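For reference, a minimal sketch of the pattern (hypothetical data; it assumes the expect-test crate and arrow's pretty printer, consistent with the expect![[...]] snapshot above):

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_cast::pretty::pretty_format_batches;
use expect_test::expect;
use std::sync::Arc;

fn check_rendered_table() {
    let batch = RecordBatch::try_from_iter([(
        "snapshot_id",
        Arc::new(Int64Array::from(vec![3051729675574597004_i64])) as ArrayRef,
    )])
    .unwrap();
    let rendered = pretty_format_batches(&[batch]).unwrap().to_string();
    // Run once with UPDATE_EXPECT=1 to fill in the inline snapshot.
    expect![[r#""#]].assert_eq(&rendered);
}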

Contributor:

👍 This looks great.

Field { name: "sequence_number", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "file_sequence_number", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "data_file", data_type: Struct([Field { name: "content", data_type: Int8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "file_path", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "file_format", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "partition", data_type: Struct([Field { name: "x", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "record_count", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "file_size_in_bytes", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "column_sizes", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_counts", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_counts", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_counts", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bounds", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Binary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bounds", data_type: Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Binary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }, Field { name: "key_metadata", data_type: Binary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "split_offsets", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "equality_ids", data_type: List(Field { name: "item", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "sort_order_id", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "readable_metrics", data_type: Struct([Field { name: "x", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "y", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "z", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "a", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "dbl", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "i32", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "i64", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "bool", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "float", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Float32, nullable: true, 
dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "decimal", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Decimal128(3, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Decimal128(3, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "date", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Date32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Date32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "timestamp", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "timestamptz", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Timestamp(Microsecond, Some("+00:00")), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Timestamp(Microsecond, Some("+00:00")), nullable: true, dict_id: 0, dict_is_ordered: false, 
metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "timestampns", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "timestamptzns", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: Timestamp(Nanosecond, Some("+00:00")), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: Timestamp(Nanosecond, Some("+00:00")), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "binary", data_type: Struct([Field { name: "column_size", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "null_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "nan_value_count", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "lower_bound", data_type: LargeBinary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "upper_bound", data_type: LargeBinary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }"#]],
@rshkv (author):

This looks much worse here than it does locally:

[screenshot of the snapshot as rendered locally]

If there's a way to pretty-print schemas, let me know; I couldn't find one.

Contributor:

I don't think one exists. We can write a separate function (or newtype) to pretty-print the schema. I once did similar things in RisingWave:

The code:

Basically, we ignore noise (e.g., unnecessary fields, field names) and make it more concise and readable.
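A minimal sketch of what such a newtype could look like (hypothetical, not code from this thread): print only name, type, and nullability, and drop fields like dict_id and dict_is_ordered.

use arrow_schema::Schema;
use std::fmt;

/// Newtype that renders a schema as one concise line per field.
struct PrettySchema<'a>(&'a Schema);

impl fmt::Display for PrettySchema<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for field in self.0.fields() {
            writeln!(
                f,
                "{}: {}{}",
                field.name(),
                field.data_type(),
                if field.is_nullable() { "" } else { " not null" }
            )?;
        }
        Ok(())
    }
}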

@rshkv (author):

Yeah, that'd be nice. E.g., we don't need to render dict_id and dict_is_ordered for every field.

I won't introduce that in an already large PR (unless you prefer I do). If you think I should, maybe we can do the updates to check_record_batches (pretty-printing batches, ignoring struct fields, and the new schema pretty-print) in a separate PR?

@rshkv (author):

Basically, I don't want to block this PR.

Contributor:

> maybe we can do the updates to check_record_batches (pretty-printing batches, ignoring struct fields, and the new schema pretty-print) in a separate PR?

+1

@rshkv (author) commented Jan 8, 2025

I've rebased on #870 and #872 to address the following:

  • The entries table now lives in a separate entries.rs file.
  • Batches for manifest files are now computed asynchronously.

I haven't addressed @liurenjie1024's point about schema() returning an Iceberg schema instead of an Arrow one. We have issue #868 and PR #871, but I'm not sure that's what we want generally, and whether you want this to happen before this PR.

Otherwise this is ready for another review.

@liurenjie1024 (Contributor) commented:

> I've rebased on #870 and #872 to address the following:
>
>   • The entries table now lives in a separate entries.rs file.
>   • Batches for manifest files are now computed asynchronously.
>
> I haven't addressed @liurenjie1024's point about schema() returning an Iceberg schema instead of an Arrow one. We have issue #868 and PR #871, but I'm not sure that's what we want generally, and whether you want this to happen before this PR.
>
> Otherwise this is ready for another review.

Thanks @rshkv for the contribution. Let's continue the discussion about the schema in #868.
