@kevinjqliu kevinjqliu commented Jan 2, 2025

cd bindings/python
hatch run dev:develop
hatch run dev:test

Error:


@kevinjqliu kevinjqliu force-pushed the kevinjqliu/datafusion-iceberg-table-provider branch from 4981675 to 7277201 Compare January 2, 2025 19:06
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/datafusion-iceberg-table-provider branch 2 times, most recently from 9b8913b to 2342c54 Compare February 18, 2025 03:58
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/datafusion-iceberg-table-provider branch from 2342c54 to c86219a Compare February 25, 2025 17:00
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/datafusion-iceberg-table-provider branch 2 times, most recently from c3cd944 to 341e64a Compare May 10, 2025 20:14
dependabot bot and others added 18 commits May 12, 2025 09:47
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.2 to 1.45.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/tokio-rs/tokio/releases">tokio's
releases</a>.</em></p>
<blockquote>
<h2>Tokio v1.45.0</h2>
<h3>Added</h3>
<ul>
<li>metrics: stabilize <code>worker_total_busy_duration</code>,
<code>worker_park_count</code>, and <code>worker_unpark_count</code> (<a
href="https://github.com/tokio-rs/tokio/issues/6899">#6899</a>,
<a
href="https://github.com/tokio-rs/tokio/issues/7276">#7276</a>)</li>
<li>process: add <code>Command::spawn_with</code> (<a
href="https://github.com/tokio-rs/tokio/issues/7249">#7249</a>)</li>
</ul>
<h3>Changed</h3>
<ul>
<li>io: do not require <code>Unpin</code> for some trait impls (<a
href="https://github.com/tokio-rs/tokio/issues/7204">#7204</a>)</li>
<li>rt: mark <code>runtime::Handle</code> as unwind safe (<a
href="https://github.com/tokio-rs/tokio/issues/7230">#7230</a>)</li>
<li>time: revert internal sharding implementation (<a
href="https://github.com/tokio-rs/tokio/issues/7226">#7226</a>)</li>
</ul>
<h3>Unstable</h3>
<ul>
<li>rt: remove alt multi-threaded runtime (<a
href="https://github.com/tokio-rs/tokio/issues/7275">#7275</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/tokio-rs/tokio/commit/00754c8f9c8cd0c10fd54e5304cb9cb95a759d53"><code>00754c8</code></a>
chore: prepare Tokio v1.45.0 (<a
href="https://github.com/tokio-rs/tokio/issues/7308">#7308</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/1ae9434e8e4a419ce25644e6c8d2b2e2e8c34750"><code>1ae9434</code></a>
time: revert &quot;use sharding for timer implementation&quot; related
changes (<a
href="https://github.com/tokio-rs/tokio/issues/7226">#7226</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/8895bba448534a4eb159f18e57fd845c740e1d38"><code>8895bba</code></a>
ci: Test AArch64 Windows (<a
href="https://github.com/tokio-rs/tokio/issues/7288">#7288</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/48ca254d92d4408accd7b1c1beab188288fadb00"><code>48ca254</code></a>
time: update <code>sleep</code> documentation to reflect maximum allowed
duration (<a
href="https://github.com/tokio-rs/tokio/issues/7302">#7302</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/a0af02a396274b30ec1d0a27e18ac9ae6eaa2186"><code>a0af02a</code></a>
compat: add more documentation to <code>tokio_util::compat</code> (<a
href="https://github.com/tokio-rs/tokio/issues/7279">#7279</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/0ce3a1188a56c4c133d5b789eb366c0752e9b22c"><code>0ce3a11</code></a>
metrics: stabilize <code>worker_park_count</code> and
<code>worker_unpark_count</code> (<a
href="https://github.com/tokio-rs/tokio/issues/7276">#7276</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/1ea9ce11d4317d767136d489041548408348be77"><code>1ea9ce1</code></a>
ci: fix cfg!(miri) declarations in tests (<a
href="https://github.com/tokio-rs/tokio/issues/7286">#7286</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/4d4d12613bb30f6b550421d6ce2c2c54eb5d341d"><code>4d4d126</code></a>
chore: prepare tokio-util v0.7.15 (<a
href="https://github.com/tokio-rs/tokio/issues/7283">#7283</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/5490267a79a894c22cc014367e0fcd43f4ad2bb6"><code>5490267</code></a>
fs: update the mockall dev dependency to 0.13.0 (<a
href="https://github.com/tokio-rs/tokio/issues/7234">#7234</a>)</li>
<li><a
href="https://github.com/tokio-rs/tokio/commit/1434b32b5a0df3b38a0d588485cd9a20a8e92a89"><code>1434b32</code></a>
examples: improve echo example consistency (<a
href="https://github.com/tokio-rs/tokio/issues/7256">#7256</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/tokio-rs/tokio/compare/tokio-1.44.2...tokio-1.45.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…uet files (apache#1308)

## Which issue does this PR close?

- Closes apache#1307

## What changes are included in this PR?

The type of the literal scalar is checked against the type of the value
read from the Parquet file, and the literal is converted to match the
Parquet Arrow data type.
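
As an illustration of the conversion described above, here is a minimal sketch. It is not the PR's code: `Literal`, `ColumnType`, and `coerce_literal` are hypothetical stand-ins for the predicate literal and the Arrow data type read from the Parquet file, and only two integer widths are modeled.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a predicate literal.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i32),
    Long(i64),
}

/// Hypothetical stand-in for the Arrow data type of the Parquet column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int32,
    Int64,
}

/// Convert `lit` so that its type matches `column`.
/// Returns `None` when a narrowing conversion would lose information.
fn coerce_literal(lit: &Literal, column: ColumnType) -> Option<Literal> {
    match (lit, column) {
        // Types already agree: keep the literal unchanged.
        (Literal::Int(_), ColumnType::Int32) => Some(lit.clone()),
        (Literal::Long(_), ColumnType::Int64) => Some(lit.clone()),
        // Widening is always safe.
        (Literal::Int(v), ColumnType::Int64) => Some(Literal::Long(i64::from(*v))),
        // Narrowing succeeds only if the value fits.
        (Literal::Long(v), ColumnType::Int32) => i32::try_from(*v).ok().map(Literal::Int),
    }
}
```

In this sketch a caller would coerce the literal once per file, before comparing it against column values, and would treat `None` as "the predicate cannot be evaluated with this column type".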

## Are these changes tested?

Tested with a new unit test to cover the different cases.
…e#1304)

## Which issue does this PR close?

- Closes apache#1303

## What changes are included in this PR?

This PR makes the return type `Return<>` explicit.

## Are these changes tested?

I included a reproduction environment in the issue and verified that it
compiles without errors after my fix.
## What changes are included in this PR?

This is the error message I get on my end:
```sh
Error: DataInvalid => Content type of entry PositionDeletes should have DataContentType::Data
```
At first sight, I had no idea what this meant until I read the sanity
check code.

In this PR, I updated the error message for the manifest write sanity
check to provide more context:
- The manifest content
- The file path, which lets developers see directly which data file has
the issue
- The expected and actual data content types
## Are these changes tested?

No-op change

Co-authored-by: Scott Donnelly <[email protected]>
## Which issue does this PR close?

Continues apache#1286.

- Closes #.

## What changes are included in this PR?

Bumps arrow/parquet/datafusion.

## Are these changes tested?

Yes

---------

Signed-off-by: Xuanwo <[email protected]>
Co-authored-by: Xuanwo <[email protected]>
## Which issue does this PR close?

- N/A

## What changes are included in this PR?

This PR fixes a typo in the docstring of the function
`Schema.name_by_field_id()`.

## Are these changes tested?

Yes
## Which issue does this PR close?

- Closes apache#1318

## What changes are included in this PR?

Declare `FileRead` trait `Sync`-safe.
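
To show what a `Sync` bound buys, here is a simplified sketch. It assumes a synchronous `FileRead` trait with a single hypothetical `read` method, which is not the crate's actual trait definition; `InMemoryFile` and `read_halves_in_parallel` are likewise invented for illustration.

```rust
use std::ops::Range;
use std::sync::Arc;
use std::thread;

/// Simplified, hypothetical reader trait. The `Sync` supertrait is what
/// allows one reader to be shared by reference across threads.
trait FileRead: Send + Sync {
    fn read(&self, range: Range<u64>) -> Vec<u8>;
}

/// Hypothetical in-memory implementation used only for this sketch.
struct InMemoryFile {
    bytes: Vec<u8>,
}

impl FileRead for InMemoryFile {
    fn read(&self, range: Range<u64>) -> Vec<u8> {
        self.bytes[range.start as usize..range.end as usize].to_vec()
    }
}

/// Read both halves of a file concurrently through one shared reader.
/// This compiles only because `dyn FileRead` is `Send + Sync`.
fn read_halves_in_parallel(reader: Arc<dyn FileRead>, len: u64) -> Vec<u8> {
    let mid = len / 2;
    let first = {
        let r = Arc::clone(&reader);
        thread::spawn(move || r.read(0..mid))
    };
    let second = {
        let r = Arc::clone(&reader);
        thread::spawn(move || r.read(mid..len))
    };
    let mut out = first.join().expect("reader thread panicked");
    out.extend(second.join().expect("reader thread panicked"));
    out
}
```

Without the `Sync` bound on the trait, `Arc<dyn FileRead>` would not be `Send`, and the `thread::spawn` calls would be rejected by the compiler.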

## Are these changes tested?

Not a feature change, so this is covered by existing unit tests.
## Which issue does this PR close?


- Closes apache#1316 

## What changes are included in this PR?

- Added a new `set_location` API to `transaction`

## Are these changes tested?

- Added a unit test to cover the change
@kevinjqliu kevinjqliu force-pushed the kevinjqliu/datafusion-iceberg-table-provider branch from 341e64a to 4d7a2ed Compare May 13, 2025 21:51
@kevinjqliu kevinjqliu closed this May 13, 2025
kevinjqliu pushed a commit that referenced this pull request Nov 13, 2025
…chTransformer (apache#1821)

## Which issue does this PR close?

Partially addresses apache#1749.

## What changes are included in this PR?

This PR adds partition spec handling to `FileScanTask` and
`RecordBatchTransformer` to correctly implement the Iceberg spec's
"Column Projection" rules for fields "not present" in data files.

### Problem Statement

Prior to this PR, `iceberg-rust`'s `FileScanTask` had no mechanism to
pass partition information to `RecordBatchTransformer`, causing two
issues:

1. **Incorrect handling of bucket partitioning**: The reader could not
distinguish identity transforms (which should use partition metadata
constants) from non-identity transforms such as
bucket/truncate/year/month (which must read from the data file). For
example, `bucket(4, id)` stores `id_bucket = 2` (the bucket number) in
partition metadata, but the actual `id` values (100, 200, 300) exist
only in the data file. iceberg-rust was incorrectly treating
bucket-partitioned source columns as constants, breaking runtime
filtering and returning incorrect query results.

2. **Field ID conflicts in add_files scenarios**: When importing Hive
tables via `add_files`, partition columns could have field IDs that
conflict with Parquet data columns. Example: Parquet has
field_id=1→"name", but Iceberg expects field_id=1→"id" (partition). Per
the spec, the correct field is "not present" and requires the name
mapping fallback.
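
The conflict in the second issue can be sketched as follows. This is a deliberately simplified illustration with hypothetical names (`Field`, `find_matching_column`): it treats a file column as a match only when both the field ID and the name agree, and it does not attempt to keep legitimate column renames working, which a real reader also has to do.

```rust
/// Hypothetical, minimal description of a column: field ID plus name.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

/// Return the index of the file column that matches `expected` on BOTH
/// field ID and name. `None` means the field is "not present" in the
/// file, so the caller must fall back (for example to name mapping).
fn find_matching_column(expected: &Field, file_fields: &[Field]) -> Option<usize> {
    file_fields
        .iter()
        .position(|f| f.id == expected.id && f.name == expected.name)
}
```

With the example from the text, a Parquet file containing `field_id=1→"name"` does not satisfy an expected `field_id=1→"id"`, so the lookup yields `None` instead of silently reading the wrong column.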

### Iceberg Specification Requirements

Per the Iceberg spec
(https://iceberg.apache.org/spec/#column-projection), when a field ID is
"not present" in a data file, it must be resolved using these rules:

1. Return the value from partition metadata if an **Identity Transform**
exists
2. Use `schema.name-mapping.default` metadata to map field id to columns
without field id
3. Return the default value if it has a defined `initial-default`
4. Return null in all other cases
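
The fallback order above can be sketched as a single lookup function. This is a minimal sketch under stated assumptions, not the PR's implementation: `ColumnSource`, `ResolutionContext`, and `resolve` are hypothetical, values are modeled as strings, and name mapping is reduced to a field ID to column name table.

```rust
use std::collections::HashMap;

/// Where the value of a projected field comes from (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum ColumnSource {
    FromFile { column_index: usize },
    PartitionConstant(String),
    NameMapped { column_index: usize },
    InitialDefault(String),
    Null,
}

/// Simplified, hypothetical view of what the reader knows about one file.
#[derive(Debug, Default)]
struct ResolutionContext {
    file_field_ids: HashMap<i32, usize>,
    identity_constants: HashMap<i32, String>,
    name_mapping: HashMap<i32, String>,
    file_columns_by_name: HashMap<String, usize>,
    initial_defaults: HashMap<i32, String>,
}

fn resolve(ctx: &ResolutionContext, field_id: i32) -> ColumnSource {
    // The field ID is present in the file: no fallback is needed.
    if let Some(&column_index) = ctx.file_field_ids.get(&field_id) {
        return ColumnSource::FromFile { column_index };
    }
    // Rule 1: constant from an identity-transformed partition field.
    if let Some(value) = ctx.identity_constants.get(&field_id) {
        return ColumnSource::PartitionConstant(value.clone());
    }
    // Rule 2: name mapping onto a file column that has no field ID.
    let mapped = ctx
        .name_mapping
        .get(&field_id)
        .and_then(|name| ctx.file_columns_by_name.get(name));
    if let Some(&column_index) = mapped {
        return ColumnSource::NameMapped { column_index };
    }
    // Rule 3: the field's `initial-default`, if defined.
    if let Some(value) = ctx.initial_defaults.get(&field_id) {
        return ColumnSource::InitialDefault(value.clone());
    }
    // Rule 4: null in all other cases.
    ColumnSource::Null
}
```

The early returns make the precedence explicit: a later rule is consulted only when every earlier rule has failed to produce a value.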

**Why this matters:**
- **Identity transforms** (e.g., `identity(dept)`) store actual column
values in partition metadata that can be used as constants without
reading the data file
- **Non-identity transforms** (e.g., `bucket(4, id)`, `day(timestamp)`)
store transformed values in partition metadata (e.g., bucket number 2,
not the actual `id` values 100, 200, 300) and must read source columns
from the data file

### Changes Made

1. **Added partition fields to `FileScanTask`** (`scan/task.rs`):
   - `partition: Option<Struct>` - Partition data from manifest entry
   - `partition_spec: Option<Arc<PartitionSpec>>` - For transform-aware
     constant detection
   - `name_mapping: Option<Arc<NameMapping>>` - Name mapping from table
     metadata

2. **Implemented `constants_map()` function**
(`arrow/record_batch_transformer.rs`):
   - Replicates Java's `PartitionUtil.constantsMap()` behavior
   - Only includes fields where transform is `Transform::Identity`
   - Used to determine which fields use partition metadata constants vs.
     reading from data files

3. **Enhanced `RecordBatchTransformer`**
(`arrow/record_batch_transformer.rs`):
   - Added `build_with_partition_data()` method to accept partition spec,
     partition data, and name mapping
   - Implements all 4 spec rules for column resolution with
     identity-transform awareness
   - Detects field ID conflicts by verifying both field ID AND name match
   - Falls back to name mapping when field IDs are missing/conflicting
     (spec rule #2)

4. **Updated `ArrowReader`** (`arrow/reader.rs`):
   - Uses `build_with_partition_data()` when partition information is
     available
   - Falls back to `build()` when not available

5. **Updated manifest entry processing** (`scan/context.rs`):
   - Populates partition fields in `FileScanTask` from manifest entry data
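
The identity-only filtering described for `constants_map()` can be sketched as follows. The types here are hypothetical simplifications (a three-variant `Transform`, string-valued partition data, and a `PartitionField` with only two members), so this shows the idea and is not the function's actual signature.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical subset of partition transforms.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transform {
    Identity,
    Bucket(u32),
    Day,
}

/// Hypothetical partition field: the source column's field ID and its transform.
#[derive(Debug, Clone, PartialEq)]
struct PartitionField {
    source_id: i32,
    transform: Transform,
}

/// Map source field ID to partition value, keeping identity fields only.
/// Values of non-identity fields (bucket numbers, day ordinals) are not the
/// column's real values, so they are left out and read from the data file.
fn constants_map(fields: &[PartitionField], partition_values: &[String]) -> HashMap<i32, String> {
    fields
        .iter()
        .zip(partition_values.iter())
        .filter(|pair| pair.0.transform == Transform::Identity)
        .map(|(field, value)| (field.source_id, value.clone()))
        .collect()
}
```

For a spec with `identity(dept)` and `bucket(4, id)`, only the `dept` entry ends up in the map; the `id` column is absent from it and therefore still has to be read from the data file.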

### Tests Added

1. **`bucket_partitioning_reads_source_column_from_file`** - Verifies
that bucket-partitioned source columns are read from data files (not
treated as constants from partition metadata)

2. **`identity_partition_uses_constant_from_metadata`** - Verifies that
identity-transformed fields correctly use partition metadata constants

3. **`test_bucket_partitioning_with_renamed_source_column`** - Verifies
field-ID-based mapping works despite column rename

4. **`add_files_partition_columns_without_field_ids`** - Verifies name
mapping resolution for Hive table imports without field IDs (spec rule
#2)

5. **`add_files_with_true_field_id_conflict`** - Verifies correct field
ID conflict detection with name mapping fallback (spec rule #2)

6. **`test_all_four_spec_rules`** - Integration test verifying all 4
spec rules work together

## Are these changes tested?

Yes, there are 6 new unit tests covering all 4 Iceberg spec rules. This
change also resolved approximately 50 failing Iceberg Java tests when
running with the experimental DataFusion Comet PR
apache/datafusion-comet#2528.

---------

Co-authored-by: Renjie Liu <[email protected]>
7 participants