
Conversation


@parmesant parmesant commented Oct 30, 2025

Upgrade df version to 50.3.0 and arrow/parquet to 57.0.0

Fixes #XXXX.

Description


This PR has:

  • ☐ been tested to ensure log ingestion and log query work.
  • ☐ added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • ☐ added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • Chores

    • Upgraded core libraries (Arrow/Parquet/DataFusion, gRPC/HTTP, object-store, reqwest, regex) and enhanced CI/toolchain for cross/native builds.
  • Refactor

    • Modernized file-scan/source construction to a builder-based model enabling richer scan options, projection, limits, and predicate handling.
  • Bug Fixes / Behavior

    • Fixed legacy timestamp conversion and normalized time literal representations; tests updated.
  • Breaking Changes

    • Object metadata size type and several object-store-related method signatures changed (may require client updates).



coderabbitai bot commented Oct 30, 2025

Walkthrough

Dependency and CI/tooling upgrades; Arrow/DataFusion/Parquet and related crates bumped; Int96 timestamp conversion switched to nanoseconds via a new helper; Parquet scanning refactored to FileScanConfigBuilder + ParquetSource (projection/predicate/limit); object-store API types/signatures updated; CompressionContext added to IPC tests.

Changes

  • Dependency upgrades (Cargo.toml): Bumped Arrow crates to 57.1.0, DataFusion to 51.0.0, Parquet to 57.1.0; upgraded tonic/tonic-web and added tonic-prost; object_store, reqwest, regex, and related features adjusted.
  • Parquet stats & imports (src/catalog/column.rs, src/catalog/manifest.rs): Int96 min/max now converted via the new private int96_to_i64_nanos helper returning nanoseconds; parquet imports consolidated and the RowGroupMetaData alias used.
  • Query execution / builder refactor (src/query/mod.rs, src/query/stream_schema_provider.rs): Replaced DiskManagerConfig with DiskManager::builder(); switched to FileScanConfigBuilder and ParquetSource using file_groups, predicate, projection, and limit; updated ScalarValue timestamp literal shapes and tests; adopted UnionExec::try_new.
  • Object store API alignment (src/storage/localfs.rs, src/storage/metrics_layer.rs, src/storage/object_storage.rs): ObjectMeta.size handled natively (usize → u64); method signatures use PutMultipartOptions; ranges changed to Range<u64>; list/list_with_offset return BoxStream<'static, ...>; removed manual casts in size comparisons.
  • IPC / compression & Flight tweaks (src/parseable/staging/reader.rs, src/utils/arrow/flight.rs): Tests updated to use CompressionContext in encoding calls; timestamp literal construction extended with Spans and an extra None field.
  • Parquet import minor move (src/parseable/streams.rs): SortingColumn import moved from parquet::format to parquet::file::metadata.
  • CI/tooling / cross builds (.github/workflows/build.yaml, Cross.toml): Split CI into cross and native Kafka builds, added cross/buildx/qemu and GCC 11 steps, added aarch64 Cross.toml pre-build/env passthrough and volumes.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Provider as StandardTableProvider
    participant Builder as FileScanConfigBuilder
    participant Source as ParquetSource
    participant Format as ParquetFormat
    participant Planner as ExecutionPlanner

    Client->>Provider: request scan_execution_plan(...)
    Provider->>Builder: build config from partitions (stats, batch_size, constraints)
    alt predicate present
        Builder->>Source: attach predicate/filters
    end
    alt projection present
        Builder->>Builder: set projection_indices
    end
    alt limit present
        Builder->>Builder: apply limit
    end
    Builder->>Format: create_physical_plan(config)
    Format->>Planner: generate per-file execution plans
    Planner->>Planner: UnionExec::try_new(plans)
    Planner-->>Client: return physical execution plan

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus areas:
    • src/query/stream_schema_provider.rs — FileScanConfigBuilder, ParquetSource, projection/limit/predicate logic, updated literal shapes and tests.
    • src/storage/metrics_layer.rs & src/storage/localfs.rs — object_store API/type changes (size type, ranges, multipart options, stream lifetimes).
    • src/catalog/column.rs — correctness of Int96 → nanoseconds conversion (Julian day / nanos-of-day math).
    • Cargo.toml & CI files — cross-crate feature interactions and cross/native build logic.

Possibly related PRs

Suggested reviewers

  • parmesant

🐇 I hopped through crates and changed the clocks,
Nanoseconds now sparkle where old timestamps walked,
Builders stack files, predicates softly bind,
Streams stretch their ranges, uploads redesigned,
A joyful rabbit cheers — upgrades neatly timed.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 2 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 32.26%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title 'Upgrade df version' is vague and generic; the abbreviation 'df' is not clearly defined and the target versions are not specified. Resolution: clarify the title to state the main change, such as 'Upgrade DataFusion to 51.0.0 and Arrow/Parquet to 57.1.0'.
  • Description check (❓ Inconclusive): The description mentions upgrading the DataFusion and Arrow/Parquet versions but leaves the template mostly unchecked and incomplete, without context on rationale, testing status, or implementation details. Resolution: complete the checklist items and provide details on testing conducted, any breaking changes handled, and the rationale for the upgrade.


@parmesant parmesant closed this Oct 30, 2025
@parmesant parmesant reopened this Oct 30, 2025
@nikhilsinhaparseable nikhilsinhaparseable marked this pull request as ready for review November 25, 2025 19:38

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Cargo.toml (1)

159-162: Version mismatch in dev-dependencies: arrow 54.0.0 vs 57.1.0.

The dev-dependencies section pins arrow = "54.0.0" while the main dependencies use arrow = "57.1.0". This version mismatch can cause type incompatibilities in tests when Arrow types cross the boundary between test code and library code.

Consider aligning the versions:

 [dev-dependencies]
 rstest = "0.23.0"
-arrow = "54.0.0"
+arrow = "57.1.0"
 temp-dir = "0.1.14"

If this mismatch is intentional for specific testing purposes, please document the rationale.

🧹 Nitpick comments (2)
src/catalog/manifest.rs (1)

176-201: column_statistics signature simplification looks good; optional consistency tweak

Switching column_statistics to take &[RowGroupMetaData] (using the imported alias) is a pure type-path cleanup; the aggregation over row_group.columns(), sizes, and stats.try_into() is unchanged and still correct with current parquet metadata types.

If you want to keep things fully consistent, you could also update sort_order’s parameter type to use the same RowGroupMetaData alias instead of the fully qualified path, but that’s purely cosmetic.

After the parquet bump, please verify that RowGroupMetaData::columns, compressed_size, uncompressed_size, and statistics().try_into() still behave as expected by running a manifest-generation test against a few real parquet files.
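
For orientation, a small sketch of the per-row-group traversal described above, using parquet's RowGroupMetaData and ColumnChunkMetaData accessors (the function and its return shape are illustrative, not code from this PR):

use parquet::file::metadata::RowGroupMetaData;

// Illustrative only: sum compressed and uncompressed byte counts across all
// columns of all row groups, similar to the traversal described above.
fn total_sizes(row_groups: &[RowGroupMetaData]) -> (i64, i64) {
    let (mut compressed, mut uncompressed) = (0i64, 0i64);
    for row_group in row_groups {
        for column in row_group.columns() {
            compressed += column.compressed_size();
            uncompressed += column.uncompressed_size();
        }
    }
    (compressed, uncompressed)
}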

src/query/stream_schema_provider.rs (1)

145-189: LGTM! Correctly migrated to FileScanConfigBuilder pattern.

The refactor properly uses the new DataFusion API:

  • ParquetSource with optional predicate for filter pushdown
  • FileScanConfigBuilder for assembling scan configuration
  • Projection and limit conditionally applied via builder methods
  • No longer requires downcasting SessionState

One consideration: the batch size is hardcoded to 20000. This matches the session config in create_session_state (line 144), but consider using the session's configured batch size for consistency.

Consider using the session's batch size instead of hardcoding:

-            .with_batch_size(Some(20000))
+            .with_batch_size(Some(state.config().batch_size()))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 350ab9c and 010988c.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • Cargo.toml (3 hunks)
  • src/catalog/column.rs (2 hunks)
  • src/catalog/manifest.rs (2 hunks)
  • src/parseable/staging/reader.rs (3 hunks)
  • src/parseable/streams.rs (1 hunks)
  • src/query/mod.rs (3 hunks)
  • src/query/stream_schema_provider.rs (13 hunks)
  • src/storage/localfs.rs (1 hunks)
  • src/storage/metrics_layer.rs (6 hunks)
  • src/storage/object_storage.rs (1 hunks)
  • src/utils/arrow/flight.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
📚 Learning: 2025-09-14T15:17:59.234Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1432
File: src/storage/object_storage.rs:124-132
Timestamp: 2025-09-14T15:17:59.234Z
Learning: In Parseable's upload validation system (src/storage/object_storage.rs), the validate_uploaded_parquet_file function should not include bounded retries for metadata consistency issues. Instead, failed validations rely on the 30-second sync cycle for natural retries, with staging files preserved when manifest_file is set to None.

Applied to files:

  • src/storage/object_storage.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.

Applied to files:

  • src/storage/object_storage.rs
  • src/storage/metrics_layer.rs
  • src/storage/localfs.rs
  • src/catalog/manifest.rs
📚 Learning: 2025-08-18T14:56:18.463Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/storage/object_storage.rs:997-1040
Timestamp: 2025-08-18T14:56:18.463Z
Learning: In Parseable's staging upload system (src/storage/object_storage.rs), failed parquet file uploads should remain in the staging directory for retry in the next sync cycle, while successful uploads remove their staged files immediately. Early return on first error in collect_upload_results is correct behavior as concurrent tasks handle their own cleanup and failed files need to stay for retry.

Applied to files:

  • src/storage/object_storage.rs
  • src/storage/metrics_layer.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.

Applied to files:

  • src/storage/object_storage.rs
  • src/query/stream_schema_provider.rs
📚 Learning: 2025-10-24T11:54:20.269Z
Learnt from: parmesant
Repo: parseablehq/parseable PR: 1449
File: src/metastore/metastores/object_store_metastore.rs:83-98
Timestamp: 2025-10-24T11:54:20.269Z
Learning: In the `get_overviews` method in `src/metastore/metastores/object_store_metastore.rs`, using `.ok()` to convert all storage errors to `None` when fetching overview objects is the intended behavior. This intentionally treats missing files and other errors (network, permissions, etc.) the same way.

Applied to files:

  • src/storage/object_storage.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.

Applied to files:

  • src/parseable/streams.rs
  • src/storage/metrics_layer.rs
  • src/query/stream_schema_provider.rs
  • src/catalog/manifest.rs
📚 Learning: 2025-09-05T09:18:44.813Z
Learnt from: parmesant
Repo: parseablehq/parseable PR: 1425
File: src/query/mod.rs:484-495
Timestamp: 2025-09-05T09:18:44.813Z
Learning: In the Parseable system, stream names and column names cannot contain quotes, which eliminates SQL injection concerns when interpolating these identifiers directly into SQL queries in src/query/mod.rs.

Applied to files:

  • src/parseable/streams.rs
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.

Applied to files:

  • src/utils/arrow/flight.rs
  • src/catalog/column.rs
📚 Learning: 2025-09-18T09:59:20.177Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:700-756
Timestamp: 2025-09-18T09:59:20.177Z
Learning: In src/event/mod.rs, the parsed_timestamp used in increment_events_ingested_by_date() is correctly UTC-normalized: for dynamic streams it remains Utc::now(), and for streams with time partition enabled it uses the time partition value. Both cases result in proper UTC date strings for metrics labeling, preventing double-counting issues.

Applied to files:

  • src/utils/arrow/flight.rs
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-09-23T10:08:36.368Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/localfs.rs:41-43
Timestamp: 2025-09-23T10:08:36.368Z
Learning: In src/storage/localfs.rs, the metrics use "localfs" as the provider label (not "drive" from FSConfig.name()) because it aligns with the object storage implementation naming for the file system. This is intentional and should not be flagged as an inconsistency.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-08-21T11:47:01.279Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T11:47:01.279Z
Learning: In Parseable's object storage implementation (src/storage/object_storage.rs), the hour and minute directory prefixes (hour=XX, minute=YY) are generated from arrow file timestamps following proper datetime conventions, so they are guaranteed to be within valid ranges (0-23 for hours, 0-59 for minutes) and don't require additional range validation.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.

Applied to files:

  • src/query/mod.rs
  • src/query/stream_schema_provider.rs
📚 Learning: 2025-10-28T02:10:41.140Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1453
File: src/parseable/mod.rs:397-400
Timestamp: 2025-10-28T02:10:41.140Z
Learning: In Parseable enterprise deployments with multiple query nodes, hot tier configuration must be persisted in object storage so that newly started query nodes can fetch and synchronize the hot tier settings at startup (file: src/parseable/mod.rs, function: create_stream_and_schema_from_storage).

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.

Applied to files:

  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (25)
src/parseable/streams.rs (1)

34-42: SortingColumn import now matches Parquet 57 metadata API

Importing metadata::SortingColumn from parquet::file aligns with the current WriterProperties / RowGroupMetaData APIs and avoids the deprecated parquet::format path. The existing use in parquet_writer_props remains correct and type‑safe.

Please re-run the parquet-related tests (ingestion + readback) with the new parquet version to confirm there are no behavioral changes around sort columns.

src/catalog/manifest.rs (1)

21-25: Parquet metadata imports are consistent with upgraded parquet API

Consolidating imports to file::metadata::{RowGroupMetaData, SortingColumn} matches how parquet 57 exposes metadata structures and keeps call sites (row_groups, sorting_columns) clear and type‑correct.

Please ensure the crate versions used in Cargo.toml match the parquet API you targeted here and that manifest-building tests (or an end-to-end manifest generation run) pass with these new imports.

src/utils/arrow/flight.rs (2)

31-31: LGTM - Correct adaptation to DataFusion 51.0.0 Column API.

The Spans import and initialization align with the updated datafusion::common::Column struct requirements.

Also applies to: 120-124


142-144: LGTM - Updated Expr::Literal signature.

The addition of the second None parameter correctly matches the new two-field Expr::Literal(ScalarValue, Option<FieldMetadata>) variant in DataFusion 51.0.0.

src/catalog/column.rs (2)

158-161: Verify that Int96 statistics are only used for timestamp columns.

The conversion assumes Int96 always represents timestamps. If non-timestamp Int96 data exists, this interpretation would be incorrect. The comment on lines 200-201 documents this assumption well.


199-216: Implementation correctly follows Parquet INT96 specification.

The Int96 conversion is accurate. The web search confirms INT96 is indeed 12 bytes little-endian with 8 bytes of nanoseconds-since-midnight + 4 bytes of Julian day, and the formula (julian_day - 2440588) * 86400 * 1e9 + nanoseconds_of_day is the standard conversion to Unix epoch nanoseconds. The Julian day constant 2,440,588 for 1970-01-01 is correct, and negative timestamps pre-1970 will correctly produce negative i64 values.
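
As a reference for the math, a minimal self-contained sketch of that conversion, assuming the 12-byte little-endian layout described above (the function name and byte-array input are illustrative, not the PR's private helper):

/// Illustrative only: convert a Parquet INT96 value (8 LE bytes of
/// nanoseconds-of-day followed by 4 LE bytes of Julian day) into
/// nanoseconds since the Unix epoch.
fn int96_to_unix_nanos(raw: [u8; 12]) -> i64 {
    const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
    const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;

    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // 1970-01-02T00:00:00Z is Julian day 2_440_589 with zero nanos of day.
    let mut raw = [0u8; 12];
    raw[8..12].copy_from_slice(&2_440_589i32.to_le_bytes());
    assert_eq!(int96_to_unix_nanos(raw), 86_400 * 1_000_000_000);
}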

src/storage/localfs.rs (1)

140-149: LGTM - Correct adaptation to object_store 0.12.x API.

The ObjectMeta.size field is now u64 in object_store 0.12.x, matching std::fs::Metadata::len() return type. Removing the cast is correct.
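
As a hedged illustration of the shape of this change (field names follow object_store 0.12's ObjectMeta; the helper itself is hypothetical, not code from this PR):

use chrono::{DateTime, Utc};
use object_store::{path::Path, ObjectMeta};

// Hypothetical helper: ObjectMeta.size is u64 in object_store 0.12, so the
// u64 from std::fs::Metadata::len() is stored directly, with no cast.
fn object_meta_from_fs(location: Path, metadata: &std::fs::Metadata) -> ObjectMeta {
    let last_modified: DateTime<Utc> = metadata
        .modified()
        .map(DateTime::<Utc>::from)
        .unwrap_or_else(|_| Utc::now());
    ObjectMeta {
        location,
        last_modified,
        size: metadata.len(), // previously needed a usize cast under object_store 0.11
        e_tag: None,
        version: None,
    }
}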

src/parseable/staging/reader.rs (2)

344-347: LGTM - Updated imports for Arrow IPC API changes.

The CompressionContext import is required for the new IpcDataGenerator::encode() signature in Arrow 57.x.


437-458: LGTM - Correct adaptation to IpcDataGenerator::encode() API.

The new encode() method requires a mutable CompressionContext parameter. Using CompressionContext::default() is appropriate for this test scenario.

Cargo.toml (3)

12-26: LGTM - Arrow/DataFusion ecosystem upgrade.

The coordinated upgrade of Arrow (57.1.0), DataFusion (51.0.0), Parquet (57.1.0), and object_store (0.12.4) maintains version compatibility across the ecosystem.


130-137: LGTM - Minor dependency updates.

The regex update (1.12.2) and reqwest "stream" feature addition are appropriate.


36-38: Verify tonic TLS configuration across the dependency graph.

Verification results:

  1. tls-aws-lc and tls-native-roots are compatible: both are optional Cargo features in tonic 0.14.x and can be enabled together without direct conflict. The first selects the aws-lc crypto provider for rustls; the second adds system trust roots via rustls-native-certs.

  2. The broader TLS ecosystem concern is valid: rustls v0.23+ removed hardcoded crypto providers, so different dependencies may default to different providers (aws-lc vs ring), which can cause build conflicts. This is a dependency-graph-wide issue, however, not specific to tonic's feature selection.

  3. The full Cargo.toml could not be inspected, so whether other TLS dependencies (like the mentioned "rustls 0.22.4") create actual mismatches with the tonic configuration remains unverified.

Recommendation: the features can coexist in tonic 0.14.1, but ensure that all TLS-related dependencies (e.g., hyper-rustls, aws-sdk) consistently use the same crypto provider (aws-lc vs ring). Mismatched providers across the dependency graph can cause build failures. Run cargo tree -e features to confirm provider unification.

src/storage/object_storage.rs (1)

226-226: LGTM! Correct adaptation to object_store API change.

The cast removal is correct since ObjectMeta.size is now u64 in the updated object_store crate, making the explicit cast unnecessary.

src/query/mod.rs (2)

28-28: LGTM! Import paths updated for DataFusion 51.

The import changes correctly reflect DataFusion's API reorganization:

  • DiskManager replaces DiskManagerConfig for the builder pattern
  • resolve_table_references moved to datafusion::sql::resolve module

Also applies to: 36-36


233-244: LGTM! Correctly preserves explain_format in transformed plan.

The addition of explain_format: plan.explain_format ensures the explain output format (e.g., TEXT, JSON) is preserved when transforming the logical plan. This was a necessary fix for the DataFusion upgrade where the Explain struct now requires this field.

src/storage/metrics_layer.rs (4)

29-31: LGTM! Import updates for object_store API.

Import changes correctly reflect the renamed types and reorganized module structure in the updated object_store crate.


146-164: LGTM! Correct signature update for put_multipart_opts.

The parameter type correctly updated from PutMultipartOpts to PutMultipartOptions to match the new object_store API.


216-250: LGTM! Range type updated from usize to u64.

The change from Range<usize> to Range<u64> for get_range and get_ranges methods correctly follows the object_store API update. This enables proper handling of files larger than 4GB on 32-bit systems.
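
A small usage sketch of the u64-based range API described here (the helper and the 8-byte tail read are illustrative; get_range is the object_store 0.12 trait method):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Illustrative: read the last 8 bytes of an object. The range is Range<u64>,
// so offsets past 4 GiB are representable even on 32-bit targets.
async fn read_tail(
    store: &dyn ObjectStore,
    path: &Path,
    object_size: u64,
) -> object_store::Result<Bytes> {
    store
        .get_range(path, object_size.saturating_sub(8)..object_size)
        .await
}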


291-316: LGTM! Stream lifetime updated to 'static.

The list and list_with_offset methods now return BoxStream<'static, ...> instead of borrowing from self. This matches the updated object_store trait requirements.

src/query/stream_schema_provider.rs (6)

35-38: LGTM! New DataFusion imports for file scanning.

The imports correctly bring in the new builder-pattern types (FileScanConfigBuilder, FileGroup, ParquetSource) and moved ObjectStoreUrl from the updated module path.


340-343: LGTM! Correct migration to fallible UnionExec construction.

UnionExec::try_new replaces UnionExec::new and returns a Result, which is properly propagated with ?. This aligns with DataFusion's move toward fallible constructors.


725-729: LGTM! ScalarValue literal format updated.

The Expr::Literal now takes a tuple of (ScalarValue, Option<_>) where the second element represents additional metadata. Using None for the second parameter is correct for timestamp literals.
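
For illustration, a minimal sketch of the two-field literal shape (the import paths and the optional-metadata second field are assumptions about the DataFusion 51 API, not code lifted from this PR):

use datafusion::prelude::Expr;
use datafusion::scalar::ScalarValue;

// Illustrative: a millisecond-precision timestamp literal; the second field of
// Expr::Literal (optional literal metadata) is simply None for plain bounds.
fn timestamp_literal(millis: i64) -> Expr {
    Expr::Literal(
        ScalarValue::TimestampMillisecond(Some(millis), None),
        None,
    )
}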


870-895: LGTM! Pattern matching updated for new Expr::Literal format.

The destructuring pattern Expr::Literal(value, None) correctly matches the new literal representation and ensures only simple literals without additional metadata are processed.


942-946: LGTM! can_be_pruned updated for new literal format.

The pattern matching in extract_op_scalar correctly handles the new Expr::Literal(value, None) format for scalar extraction.


1129-1132: LGTM! Test cases updated for new ScalarValue format.

All test cases correctly use the new Expr::Literal(ScalarValue::..., None) format, ensuring tests remain valid after the DataFusion upgrade.

Also applies to: 1151-1154, 1174-1177, 1197-1200, 1214-1214, 1242-1245, 1260-1263


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
Cargo.toml (2)

1-182: Verification of the upgraded dependency set confirms several findings:

  1. Dependency Compatibility: Arrow 57.1.0 is backward-compatible with DataFusion 51.0.0, Parquet 57.1.0 has no breaking changes, and Tonic 0.14.1's TLS feature combination (tls-aws-lc + tls-native-roots) is valid.

  2. Testing Justification: Upstream maintainers explicitly recommend running integration tests for parquet-read/write logic when upgrading to Arrow 57.1.0, as it contains bug fixes that need verification.

  3. Rust Edition 2024: Properly supported by rust-version 1.88.0.

However, the core concerns remain valid: the PR checklist items are still unchecked, and the dependency upgrade does require verification testing per upstream guidance.


Testing and documentation still required for dependency upgrade.

The dependency versions (Arrow 57.1.0, DataFusion 51.0.0, Parquet 57.1.0, Tonic 0.14.1) are compatible with no breaking changes. However, upstream maintainers explicitly recommend running parquet-read/write integration tests when upgrading to Arrow 57.1.0 due to bug fixes. All PR checklist items remain unchecked:

  • ☐ Execute integration tests for log ingestion and query functionality
  • ☐ Add inline comments explaining version selection rationale
  • ☐ Document any changes for operators/users

161-161: The Arrow dev-dependency version mismatch is confirmed; update it to match the main dependency version.

The concern in the original review is valid:

  • DataFusion 51.0.0 officially requires Arrow 57.x (confirmed by DataFusion upgrade docs and related PRs)
  • Breaking changes exist between Arrow 54.0.0 and 57.1.0:
    • Struct::try_new length handling (55.x)
    • usize → u64 conversions for file offsets and ranges (55.x)
  • Tests using Arrow 54.0.0 will not catch these breaking changes, creating false confidence
  • Best practices recommend pinning dev-dependency versions precisely and documenting any intentional divergence

The version mismatch (54.0.0 vs 57.1.0) is almost certainly unintentional and should be corrected.


Critical: Arrow dev-dependency version mismatch.

Line 161 pins arrow to 54.0.0 in dev-dependencies while DataFusion 51.0.0 (a main dependency) requires 57.1.0. This 3-version gap includes breaking changes (e.g., Struct::try_new behavior, usize → u64 conversions in 55.x+) that will not be caught by tests but will affect production code paths, creating false test confidence.

Update dev-dependencies to:

 [dev-dependencies]
 rstest = "0.23.0"
-arrow = "54.0.0"
+arrow = "57.1.0"
 temp-dir = "0.1.14"
🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)

328-345: Switch to UnionExec::try_new is a good robustness improvement

Using UnionExec::try_new(execution_plans)? instead of constructing the union directly makes this path respect any validation errors DataFusion adds (e.g., mismatched schemas across inputs). Error propagation via Result<Arc<dyn ExecutionPlan>, DataFusionError> is still correct.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 010988c and 8f3743e.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • Cargo.toml (3 hunks)
  • src/catalog/column.rs (2 hunks)
  • src/catalog/manifest.rs (2 hunks)
  • src/parseable/staging/reader.rs (3 hunks)
  • src/parseable/streams.rs (1 hunks)
  • src/query/mod.rs (3 hunks)
  • src/query/stream_schema_provider.rs (13 hunks)
  • src/storage/localfs.rs (1 hunks)
  • src/storage/metrics_layer.rs (6 hunks)
  • src/storage/object_storage.rs (1 hunks)
  • src/utils/arrow/flight.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/storage/object_storage.rs
  • src/parseable/streams.rs
  • src/query/mod.rs
  • src/catalog/column.rs
  • src/utils/arrow/flight.rs
  • src/catalog/manifest.rs
  • src/storage/localfs.rs
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-08-18T14:56:18.463Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/storage/object_storage.rs:997-1040
Timestamp: 2025-08-18T14:56:18.463Z
Learning: In Parseable's staging upload system (src/storage/object_storage.rs), failed parquet file uploads should remain in the staging directory for retry in the next sync cycle, while successful uploads remove their staged files immediately. Early return on first error in collect_upload_results is correct behavior as concurrent tasks handle their own cleanup and failed files need to stay for retry.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-08-21T11:47:01.279Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T11:47:01.279Z
Learning: In Parseable's object storage implementation (src/storage/object_storage.rs), the hour and minute directory prefixes (hour=XX, minute=YY) are generated from arrow file timestamps following proper datetime conventions, so they are guaranteed to be within valid ranges (0-23 for hours, 0-59 for minutes) and don't require additional range validation.

Applied to files:

  • src/storage/metrics_layer.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.

Applied to files:

  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
🔇 Additional comments (15)
src/parseable/staging/reader.rs (3)

345-347: Arrow IPC writer imports align with new usage

Adding CompressionContext and write_message to the arrow_ipc::writer imports is consistent with how they’re used below in tests; no issues from a Rust perspective.

Please double‑check against the Arrow 57 arrow_ipc::writer docs/API that these symbols and their module path are still correct.


437-437: Reusing a single CompressionContext across encodes is correct

Initializing a single mutable CompressionContext and reusing it across all encode calls matches how a stateful compression context is typically intended to be used in Arrow; the scoping looks good.

Please confirm in the Arrow 57 docs that CompressionContext::default() is the recommended initializer with IpcWriteOptions::default().


451-456: Updated encode call signature matches the new API expectations

Passing &mut dictionary_tracker, &options, and &mut compression_context to encode in that order looks consistent and keeps the context shared across the loop, which is what you want for a single stream.

Since this depends on the exact Arrow 57 IpcDataGenerator::encode signature, please verify the parameter order and types against the current crate docs.

src/storage/metrics_layer.rs (6)

29-32: Imports align with updated object_store API

Using PutMultipartOptions and the current object_store aliases looks consistent with the newer trait surface; no issues seen.


145-164: put_multipart_opts wrapper correctly updated to PutMultipartOptions

Signature now matches the newer multipart options type and still delegates directly while recording metrics; behavior is unchanged aside from the API type.


216-230: get_range Range<u64> passthrough looks correct

Switching to Range<u64> and returning Bytes while delegating to inner.get_range preserves existing behavior and matches the expected byte-offset semantics; no extra conversion logic is needed here. Please just confirm all call sites now construct Range<u64>.


232-250: get_ranges updated to &[Range<u64>] with no semantic change

Forwarding &[Range<u64>] to inner.get_ranges and returning Vec<Bytes> keeps this as a pure metrics wrapper. Only the index type changed, which should be safe as long as callers were updated alongside the trait.


291-300: list now returning BoxStream<'static> remains type-safe

Wrapping inner.list(prefix) in StreamMetricWrapper and boxing to BoxStream<'static, ObjectStoreResult<ObjectMeta>> looks compatible with the updated trait, since the wrapper owns the inner stream and doesn’t capture non‑'static borrows. No functional changes beyond the lifetime.


302-316: list_with_offset 'static stream lifetime matches list change

Same pattern as list: you wrap inner.list_with_offset and return a 'static boxed stream, with metrics emitted when the stream completes. This looks correct and consistent with the new ObjectStore signature.

Cargo.toml (4)

30-30: The earlier concern about duplicate TLS provider symbols is unfounded; tls-aws-lc and tls-native-roots are complementary features designed to be enabled together.

In tonic 0.14.1 and arrow-flight 57.1.0:

  • tls-aws-lc: enables the AWS-LC crypto provider for rustls (cryptographic backend)
  • tls-native-roots: enables OS root certificate store integration (certificate loading)

These are not conflicting providers but work together. Enabling both simultaneously is the documented and intended usage pattern.

However, there remains a valid underlying concern: ensure rustls version consistency across all transitive dependencies. If any transitive dependency uses rustls 0.21 or mismatched companion crates (e.g., rustls-native-certs targeting a different rustls version), this will cause type incompatibilities. Verify via cargo tree that all dependencies compile against rustls 0.22.x specifically, particularly any crates providing TLS integrations (tokio-rustls, rustls-pki-types, aws-lc-rs adapters).


36-38: Verify that the tonic 0.14.1 upgrade is complete, with an updated build configuration and passing integration tests.

The upgrade from tonic 0.12.3 → 0.14.1 involves significant structural changes. Tonic 0.14.x extracted prost-related functionality into separate crates; the addition of tonic-prost = "0.14.1" is correct and necessary. The enabled features (transport, gzip, zstd) are appropriate for this upgrade path.

However, this upgrade requires code changes beyond Cargo.toml updates:

  • build.rs must be updated to use tonic-prost-build if applicable
  • Proto files must be re-generated with the new tooling
  • Any imports referencing prost types previously re-exported from tonic must be updated to use prost or tonic-prost directly
  • Integration tests must confirm log ingestion and query functionality work

The opentelemetry ecosystem (including opentelemetry-otlp and opentelemetry-proto with gen-tonic feature) is compatible with tonic 0.14.1, so gRPC endpoint communication should work if migration steps are complete.

Verification needed:

  1. Confirm build.rs has been updated to use tonic-prost-build for proto code generation
  2. Verify proto files have been regenerated and import errors resolved
  3. Run and pass integration tests for log ingestion and query workflows before merging (PR checklist currently shows testing unchecked)

20-25: The breaking changes in object_store 0.12.4 are confirmed, although the repository itself could not be inspected directly.

Verify that the object_store 0.12.4 API migration to u64-based ranges and sizes is complete.

object_store 0.12.4 introduces a breaking change: byte-range and size types changed from usize to u64. This affects:

  • get_range() calls now expect Range<u64> instead of Range<usize>
  • ObjectMeta size fields are now u64 instead of usize
  • Any intermediate range or size handling in the parquet file management flow

Verify that:

  1. All get_range() calls pass Range<u64> with proper type conversions
  2. ObjectMeta field accesses (.size, .size_bytes) are handled as u64 values
  3. The minute-based atomic parquet conversion flow (staging to inprocess_ directories) properly handles u64 sizes
  4. File metadata expectations throughout the upload and validation flow account for the new type signatures

12-26: Verify DataFusion 51.0.0 FileScanConfig refactoring and Parquet 57.1.0 FileEncryptionProperties Arc migration.

The upgrade spans Arrow 57.1.0, DataFusion 51.0.0, and Parquet 57.1.0. Based on upstream release notes:

Correction: Arrow 57.1.0 contains only a robustness fix for INT96 statistics (avoid panic when size ≠ 12)—no breaking changes to Int96 types or view types. The concern about "Int96 timestamp conversion refactoring" appears to be inaccurate.

Confirmed breaking changes requiring verification:

  1. DataFusion 51.0.0 – FileScanConfig refactoring (critical):

    • Schema now required at source construction time; ParquetSource::new(table_schema) pattern required
    • ParquetExec replaced by DataSourceExec with downcasting pattern: DataSourceExec → FileScanConfig → ParquetSource
    • with_table_partition_cols() method removed
    • Verify all parquet scan construction and inspection code uses the new patterns
  2. Parquet 57.1.0 – FileEncryptionProperties Arc migration:

    • APIs now require Arc<FileEncryptionProperties> instead of owned values
    • Nested/list/struct schema decoding regressions reported; verify nested schema reads/writes work correctly
  3. View types concern (from learnings): Ensure StringViewArray/BinaryViewArray downcasting is properly handled where DataFusion's default schema_force_view_types=true applies.

The PR checklist shows "tested to ensure log ingestion and log query works" is unchecked. Verify parquet conversion and file metadata operations (which use expect() per system design) handle these breaking changes correctly, especially around nested data handling.

src/query/stream_schema_provider.rs (2)

704-731: [The body of this comment leaked as the reviewer bot's internal verification scripts (checks against the Expr::Literal(value, None) patterns, extract_timestamp_bound, can_be_pruned, and from_timestamp_millis().unwrap() usage in src/query/stream_schema_provider.rs, plus the updated tests) and was truncated before a finalized comment was posted; no actionable finding survives.]

rg -n "and_utc" src/query/stream_schema_provider.rs -B 2 -A 2


</function_calls>

<function_calls>

#!/bin/bash

Check for the specific code snippet from the review comment

sed -n '704,731p' src/query/stream_schema_provider.rs | grep -n "and_utc|timestamp_millis"


</function_calls>

<function_calls>

#!/bin/bash

Full content of lines 704-731

echo "=== Full content lines 704-731 ==="
sed -n '704,731p' src/query/stream_schema_provider.rs
echo ""
echo "=== Search for BinaryExpr usage ==="
rg "BinaryExpr::new" src/query/stream_schema_provider.rs -B 5 -A 2


</function_calls>

<function_calls>

#!/bin/bash

Check DataFusion dependency version

rg "datafusion" Cargo.toml -B 1 -A 1


</function_calls>

<function_calls>

#!/bin/bash

Verify test coverage existence for the scenarios mentioned

echo "=== Test coverage for PartialTimeFilter ==="
rg -n "PartialTimeFilter|binary_expr|extract_timestamp" src/ -A 5 --type rs | grep -i "test|assert|#[" | head -30


</function_calls>

<function_calls>

#!/bin/bash

List all functions/implementations in stream_schema_provider.rs to understand scope

rg "^(pub )?fn |^impl " src/query/stream_schema_provider.rs -n


</function_calls>

<function_calls>

#!/bin/bash

Get a precise view of the PartialTimeFilter enum and binary_expr impl

python3 << 'EOF'
with open('src/query/stream_schema_provider.rs', 'r') as f:
lines = f.readlines()

Find PartialTimeFilter enum

in_partial = False
for i, line in enumerate(lines):
if 'enum PartialTimeFilter' in line:
in_partial = True
print(f"Found PartialTimeFilter at line {i+1}")
# Print next 50 lines
for j in range(i, min(i+100, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break
EOF


</function_calls>

<function_calls>

#!/bin/bash

Find binary_expr more precisely

rg -n "binary_expr" src/query/stream_schema_provider.rs -B 10 -A 40 | head -100


</function_calls>

<function_calls>

#!/bin/bash

Let's check if the comment's code snippet matches the actual file

by looking for the exact pattern mentioned in the snippet

echo "=== Checking for exact code patterns from comment ==="
echo "Pattern: 'and_utc().timestamp_millis()'"
rg "and_utc().timestamp_millis" src/query/stream_schema_provider.rs -B 2 -A 2
echo ""
echo "Pattern: 'Operator::Gt, Operator::GtEq, etc'"
rg "Operator::(Gt|Lt|Eq)" src/query/stream_schema_provider.rs -B 2 -A 2 | head -30


</function_calls>

<function_calls>

#!/bin/bash

Comprehensive check of the entire stream_schema_provider.rs structure

wc -l src/query/stream_schema_provider.rs
echo "---"

Show structure (functions, impl blocks, enums)

ast-grep --pattern 'enum $_ { $$$ }' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Try a more targeted approach - search for impl blocks

ast-grep --pattern 'impl $_ { $$$ }' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Search for function definitions

ast-grep --pattern 'fn $_($$$) { $$$ }' src/query/stream_schema_provider.rs 2>&1 | head -50


</function_calls>

<function_calls>

#!/bin/bash

Let me just read the entire file in chunks to understand it better

echo "=== File structure overview ==="
cat -n src/query/stream_schema_provider.rs | sed -n '1,100p'


</function_calls>

<function_calls>

#!/bin/bash

Look at the critical sections mentioned in the comment

for section in "700:750" "860:900" "930:960" "970:1010"; do
IFS=':' read start end <<< "$section"
echo "=== Lines $start-$end ==="
sed -n "${start},${end}p" src/query/stream_schema_provider.rs | head -30
echo ""
done


</function_calls>

<function_calls>

#!/bin/bash

Let's examine the key code in manageable chunks starting from line 700

sed -n '700,750p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Continue with 750-800

sed -n '750,810p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Get sections around extract_timestamp_bound

sed -n '800,900p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Continue to check for can_be_pruned and other functions

sed -n '900,1000p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

Check the range 1125-1268 mentioned in the comment

sed -n '1000,1100p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

View the complete implementation sections

sed -n '1100,1200p' src/query/stream_schema_provider.rs


</function_calls>

<function_calls>

#!/bin/bash

View the end of the 1125-1268 range

sed -n '1200,1270p' src/query/stream_schema_provider.rs


</function_calls>

Now let me verify the key technical claims in the comment with focused searches:
<function_calls>

#!/bin/bash

Verify the exact claims made in the comment

python3 << 'EOF'
with open('src/query/stream_schema_provider.rs', 'r') as f:
content = f.read()
lines = content.split('\n')

Claim 1: Check if Expr::Literal is constructed with ScalarValue::TimestampMillisecond(Some(v), None)

claim1_found = 'ScalarValue::TimestampMillisecond(Some' in content and 'None)' in content
print(f"Claim 1 (Expr::Literal with ScalarValue::Timestamp*): {claim1_found}")

Claim 2: Check for pattern matching with Expr::Literal(_, None)

claim2_found = 'Expr::Literal(' in content and 'None)' in content
print(f"Claim 2 (Pattern matching Expr::Literal(_, None)): {claim2_found}")

Claim 3: Check for extract_timestamp_bound function

claim3_found = 'fn extract_timestamp_bound' in content
print(f"Claim 3 (extract_timestamp_bound function exists): {claim3_found}")

Claim 4: Check for can_be_pruned function

claim4_found = 'fn can_be_pruned' in content
print(f"Claim 4 (can_be_pruned function exists): {claim4_found}")

Claim 5: Check for ManifestExt trait

claim5_found = 'trait ManifestExt' in content
print(f"Claim 5 (ManifestExt trait exists): {claim5_found}")

Claim 6: Check for from_timestamp_millis().unwrap()

claim6_found = 'from_timestamp_millis' in content and 'unwrap' in content
print(f"Claim 6 (from_timestamp_millis().unwrap() exists): {claim6_found}")

Claim 7: Check for binary_expr function

claim7_found = 'fn binary_expr' in content
print(f"Claim 7 (binary_expr function exists): {claim7_found}")

Check test existence for PartialTimeFilter

import re
test_pattern = r'#[test]|#[cfg(test)]'
tests_found = bool(re.search(test_pattern, content))
print(f"Tests found in file: {tests_found}")
EOF


</function_calls>

<function_calls>

#!/bin/bash

Get full file content of all key functions mentioned in the comment

echo "=== Searching for all key functions ==="
for func in "extract_timestamp_bound" "can_be_pruned" "binary_expr"; do
echo ""
echo "=== Function: $func ==="
rg -n "fn $func" src/query/stream_schema_provider.rs -A 30 | head -50
done


</function_calls>

Let me take a more direct approach and examine the file systematically:
<function_calls>

#!/bin/bash

Read the entire file and search systematically

cat src/query/stream_schema_provider.rs > /tmp/schema.rs

Extract everything related to PartialTimeFilter, extract_timestamp_bound, and can_be_pruned

echo "=== Searching in complete file ==="
grep -n "enum PartialTimeFilter|impl PartialTimeFilter|fn binary_expr|fn extract_timestamp_bound|fn can_be_pruned|trait ManifestExt" /tmp/schema.rs


</function_calls>

<function_calls>

#!/bin/bash

Now get the actual content at those lines

python3 << 'EOF'
import re

with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()

Find key functions/structs

search_terms = [
'PartialTimeFilter',
'extract_timestamp_bound',
'can_be_pruned',
'ManifestExt',
'binary_expr'
]

for term in search_terms:
print(f"\n{'='*60}")
print(f"Searching for: {term}")
print('='*60)
for i, line in enumerate(lines):
if term in line:
# Print context
start = max(0, i-3)
end = min(len(lines), i+30)
for j in range(start, end):
prefix = ">>> " if j == i else " "
print(f"{prefix}{j+1}: {lines[j]}", end='')
print()
break
EOF


</function_calls>

<function_calls>

#!/bin/bash

Let's verify the exact line numbers mentioned in the comment

wc -l /tmp/schema.rs
echo ""
echo "Does file have at least 1268 lines? $([ $(wc -l < /tmp/schema.rs) -ge 1268 ] && echo 'YES' || echo 'NO')"
echo ""

Show content at exact lines from comment

echo "=== Lines 704-731 (binary_expr snippet) ==="
sed -n '704,731p' /tmp/schema.rs


</function_calls>

<function_calls>

#!/bin/bash

Verify all the mentioned line ranges exist

python3 << 'EOF'
import os

file_path = '/tmp/schema.rs'
total_lines = sum(1 for _ in open(file_path))

ranges = [(704, 731), (866, 896), (937, 947), (978, 996), (1125, 1268)]

print(f"Total lines in file: {total_lines}\n")

for start, end in ranges:
valid = total_lines >= end
status = "✓ VALID" if valid else "✗ INVALID"
print(f"Lines {start}-{end}: {status}")
EOF


</function_calls>

<function_calls>

#!/bin/bash

Now show the actual code at those ranges

python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()

ranges = [(704, 731), (866, 896), (937, 947), (978, 996)]

for start, end in ranges:
print(f"\n{'='*60}")
print(f"Lines {start}-{end}")
print('='*60)
for i in range(start-1, min(end, len(lines))):
print(f"{i+1}: {lines[i]}", end='')
EOF


</function_calls>

<function_calls>

#!/bin/bash

Get a more complete picture of the code structure

python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()

Look for the PartialTimeFilter enum definition

for i, line in enumerate(lines):
if 'enum PartialTimeFilter' in line:
print("Found PartialTimeFilter enum:")
for j in range(i, min(i+20, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break

Look for impl PartialTimeFilter with binary_expr

print("\n\nLooking for binary_expr implementation:")
for i, line in enumerate(lines):
if 'fn binary_expr' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break

Look for extract_timestamp_bound

print("\n\nLooking for extract_timestamp_bound:")
for i, line in enumerate(lines):
if 'fn extract_timestamp_bound' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break
EOF


</function_calls>

<function_calls>

#!/bin/bash

Check for trait ManifestExt and can_be_pruned

python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()

Look for trait ManifestExt

print("Looking for ManifestExt trait:")
for i, line in enumerate(lines):
if 'trait ManifestExt' in line:
for j in range(i, min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break

print("\n\nLooking for can_be_pruned function:")
for i, line in enumerate(lines):
if 'fn can_be_pruned' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break
EOF


</function_calls>

<function_calls>

#!/bin/bash

Look for test modules and test coverage

grep -n "#[test]|#[cfg(test)]|mod tests" /tmp/schema.rs | head -20


</function_calls>

<function_calls>

#!/bin/bash

Check imports to verify DataFusion API usage

head -100 /tmp/schema.rs | grep -n "use|import"


</function_calls>

<function_calls>

#!/bin/bash

Verify specific patterns mentioned in the comment are present

echo "=== Checking for specific code patterns ==="
echo ""
echo "1. Pattern: Expr::Literal with ScalarValue::TimestampMillisecond"
grep -n "ScalarValue::TimestampMillisecond|ScalarValue::TimestampNanosecond" /tmp/schema.rs | head -5
echo ""
echo "2. Pattern: and_utc().timestamp_millis()"
grep -n "and_utc().timestamp_millis" /tmp/schema.rs | head -5
echo ""
echo "3. Pattern: Operator::"
grep -n "Operator::" /tmp/schema.rs | head -10
echo ""
echo "4. Pattern: BinaryExpr::new"
grep -n "BinaryExpr::new" /tmp/schema.rs | head -5


</function_calls>

Based on my comprehensive examination of the codebase, I can now provide the rewritten review comment:


Timestamp literal handling is correctly updated for new Expr::Literal/ScalarValue shapes; watch for metadata-bearing literals

The changes to:

  • construct literals as Expr::Literal(ScalarValue::Timestamp*(_, None)) in PartialTimeFilter::binary_expr,
  • pattern match them as Expr::Literal(value, None) in extract_timestamp_bound and ManifestExt::can_be_pruned, and
  • handle ScalarValue::TimestampMillisecond(Some(v), _) / TimestampNanosecond(Some(v), _) and Utf8 (when is_time_partition)

are consistent with the current DataFusion scalar/expr APIs.

Two subtle points to keep in mind:

  • Because you match Expr::Literal(_, None), any literal that carries metadata (e.g., aliased expressions that happen to end up on the RHS) will be ignored for both time-range detection and manifest pruning. That's probably fine given how you construct filters today, but if upstream ever starts attaching metadata to user literals, these helpers will silently stop recognizing them.
  • DateTime::from_timestamp_millis(*value).unwrap() will still panic if someone manages to push a completely out-of-range timestamp into a filter, though that's unchanged behavior (see the sketch just below).

Overall, the refactor aligns with the current DataFusion literal representations.
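To make the second point concrete, here is a minimal sketch (not code from this PR) of where such a panic originates: chrono's DateTime::from_timestamp_millis returns None for values outside the range chrono can represent, and an .unwrap() on that Option is what panics.

use chrono::{DateTime, Utc};

fn main() {
    // A realistic millisecond timestamp converts fine.
    let ok = DateTime::<Utc>::from_timestamp_millis(1_700_000_000_000);
    assert!(ok.is_some());

    // A wildly out-of-range value exceeds chrono's supported date range and
    // yields None; this is the case where an `.unwrap()` would panic.
    let out_of_range = DateTime::<Utc>::from_timestamp_millis(i64::MAX);
    assert!(out_of_range.is_none());
}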


35-36: Unable to access the codebase for verification; manual review needed for FileScanConfigBuilder and LexOrdering API compliance

The review's concerns about DataFusion 50.3.0 API compatibility are valid based on official documentation:

  • FileScanConfigBuilder constructor: In DataFusion 50.3.0, the signature changed from new(object_store_url, schema, source) to new(object_store_url, source). Schema must be embedded in the ParquetSource at construction instead.
  • LexOrdering::new(): Returns LexOrdering directly (infallible), not a Result, so .unwrap() should not be present.

However, I cannot access the repository to confirm whether the current code contains these patterns or has already been corrected. Please verify:

  1. Check lines 38–46 for FileScanConfigBuilder::new() calls — confirm the constructor takes only 2 arguments (object_store_url, source).
  2. Check lines 146–180 for LexOrdering::new([sort_expr]) calls — confirm .unwrap() is not present (it's infallible).
  3. Verify ParquetSource construction includes the schema parameter.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f3743e and e1cc3a0.

📒 Files selected for processing (1)
  • .github/workflows/build.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.

Applied to files:

  • .github/workflows/build.yaml
🪛 YAMLlint (1.37.1)
.github/workflows/build.yaml

[error] 37-37: syntax error: expected <block end>, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)

139-139: Inconsistent conditional syntax—consider normalizing expression wrapping.

Lines 139 and 178 use complex boolean expressions without explicit ${{ }} wrapping:

  • Line 139: if: matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu'
  • Line 178: if: runner.os == 'Linux' && !matrix.use_cross

Other conditionals in the file use ${{ }} (e.g., lines 68, 229). While GitHub Actions may parse both forms, normalizing to if: ${{ ... }} for complex expressions improves clarity and consistency.

      - name: Create Cross.toml for aarch64 Kafka build
-       if: matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu'
+       if: ${{ matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu' }}
      - name: Find and fix librdkafka CMakeLists.txt for Linux
-       if: runner.os == 'Linux' && !matrix.use_cross
+       if: ${{ runner.os == 'Linux' && !matrix.use_cross }}

Also applies to: 178-178

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7acfd2e and ad7f099.

📒 Files selected for processing (1)
  • .github/workflows/build.yaml (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.

Applied to files:

  • .github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (5)
.github/workflows/build.yaml (5)

44-72: LGTM—Build-default job refactoring is sound.

The split between cross and native build paths is well-structured. GCC 11 installation is appropriately gated to Linux (lines 44–48), and the conditional build steps (cross vs. native) are mutually exclusive. Environment variables for CC and CXX are correctly set only for native Linux builds using the ternary syntax.


98-127: Good—Platform-specific dependency blocks are well-organized.

The Linux dependencies for x86_64 native (lines 98–114) and macOS dependencies (lines 117–127) are appropriately conditioned. The separation of native x86_64 from the aarch64 cross-compile path (which uses Cross.toml instead) avoids redundant installations.


138-166: LGTM—Cross.toml configuration for aarch64 cross-compile is correct.

The pre-build steps handle multiarch setup, and the [build.env] passthrough for LIBRDKAFKA_SSL_VENDORED and PKG_CONFIG_ALLOW_CROSS are appropriate for the cross-rs toolchain.


177-219: LGTM—CMakeLists.txt sed modifications handle platform differences correctly.

Line 191 uses sed -i for Linux, and line 213 correctly uses sed -i '' for macOS (required for BSD sed compatibility). The conditions are appropriate: Linux modifies only for native builds (line 178), while macOS modifies for all macOS jobs (line 200).


221-234: LGTM—Kafka build steps properly split by build strategy.

The cross build (lines 221–226) and native build (lines 228–234) are mutually exclusive. Environment variables (LIBRDKAFKA_SSL_VENDORED, CC/CXX) are correctly configured for each path.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/build.yaml (1)

44-72: Fix environment variable assignment to prevent empty CC/CXX on non-Linux platforms.

GCC 11 is available in ubuntu-latest default repositories. However, setting CC and CXX to empty strings on macOS/Windows (via ${{ runner.os == 'Linux' && 'gcc-11' || '' }}) can cause unexpected behavior: Cargo/cc-rs treats an exported empty variable differently from an unset variable and may fail or use an incorrect compiler.

Instead, conditionally export these variables only on Linux:

env:
  CC: gcc-11
  CXX: g++-11

wrapped within the step's if: condition, or use env -u CC -u CXX cargo build ... on non-Linux to explicitly unset them rather than leaving them as empty strings.

🧹 Nitpick comments (3)
src/catalog/column.rs (1)

159-161: Int96 → i64 nanos conversion looks correct; consider extending pruning support

The int96_to_i64_nanos helper matches the standard Impala-style Int96 layout (nanos-of-day from the first two u32 words and Julian day in the third, offset by 2_440_588) and safely fits typical log timestamps into i64 nanoseconds.

One follow-up you may want (not a blocker) is to wire ScalarValue::TimestampNanosecond through cast_or_none so that manifest-based pruning can actually use these new ns-based Int96 stats; right now TimestampNanosecond literals will just skip pruning and fall back to scanning.

Also applies to: 200-216
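For readers unfamiliar with that layout, here is a minimal sketch of the Impala-style conversion described above (illustrative only; the actual helper in src/catalog/column.rs operates on parquet's Int96 values and its signature may differ):

// Convert an Int96 timestamp, given as its three little-endian u32 words
// (low and high halves of nanos-of-day, then the Julian day), into i64
// nanoseconds since the Unix epoch.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

fn int96_to_i64_nanos(words: [u32; 3]) -> i64 {
    let nanos_of_day = ((words[1] as i64) << 32) | words[0] as i64;
    let julian_day = words[2] as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // Midnight of the day after the Unix epoch is exactly one day in nanoseconds.
    assert_eq!(int96_to_i64_nanos([0, 0, 2_440_589]), NANOS_PER_DAY);
}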

src/query/stream_schema_provider.rs (2)

722-729: Timestamp literal handling and tests correctly track new Expr::Literal/ScalarValue shapes

The updated pattern matches on Expr::Literal(value, None) and the handling of ScalarValue::TimestampMillisecond / TimestampNanosecond / Utf8 in extract_timestamp_bound, together with the revised tests, are consistent with the new DataFusion Expr and ScalarValue layouts and cover both ms and ns paths plus string partitions.

One enhancement you might want later is to extend cast_or_none to handle ScalarValue::TimestampNanosecond so stats-based pruning can also leverage ns timestamp literals; currently those fall through and just disable pruning for that filter, which is safe but leaves some performance on the table.

Please verify against the exact datafusion version in this Cargo.toml that Expr::Literal is indeed (ScalarValue, Option<FieldMetadata>) and that DataFusion’s own planner produces Expr::Literal(_, None) for the timestamp literals you care about (so these matches actually fire).

Also applies to: 866-896, 942-944, 1124-1264
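As a reference point for that verification, here is a hedged sketch of matching the two-field literal shape (the helper name and the millisecond normalization are illustrative, not the exact code in extract_timestamp_bound):

use datafusion::logical_expr::Expr;
use datafusion::scalar::ScalarValue;

// Returns the bound in milliseconds only for metadata-free timestamp literals,
// mirroring the Expr::Literal(_, None) matches discussed above; literals that
// carry field metadata are deliberately skipped.
fn timestamp_literal_millis(expr: &Expr) -> Option<i64> {
    match expr {
        Expr::Literal(ScalarValue::TimestampMillisecond(Some(ms), _), None) => Some(*ms),
        Expr::Literal(ScalarValue::TimestampNanosecond(Some(ns), _), None) => Some(*ns / 1_000_000),
        _ => None,
    }
}

fn main() {
    let lit = Expr::Literal(ScalarValue::TimestampMillisecond(Some(5), None), None);
    assert_eq!(timestamp_literal_millis(&lit), Some(5));
}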


347-431: Confirm Statistics type alignment between partitioned_files and with_statistics

partitioned_files returns a datafusion::common::Statistics, while create_parquet_physical_plan takes a Statistics imported from datafusion::physical_plan and passes it into FileScanConfigBuilder::with_statistics(statistics).

If datafusion::common::Statistics and datafusion::physical_plan::Statistics are not the same re‑exported type in datafusion = "51.0.0", this will fail to compile or require an explicit conversion. If they are the same, it’s worth standardizing on one import (either datafusion::common or datafusion::physical_plan) throughout the file to avoid confusion.

Also applies to: 626-632
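A minimal sketch of the alignment being asked for, assuming both call sites settle on datafusion::common::Statistics (the helper names here are illustrative, not the ones in this file):

use datafusion::arrow::datatypes::Schema;
use datafusion::common::Statistics;

// Stand-in for the producer side (what partitioned_files returns).
fn unknown_stats(schema: &Schema) -> Statistics {
    Statistics::new_unknown(schema)
}

// Stand-in for the consumer side (what is handed to FileScanConfigBuilder's
// with_statistics); taking the same type by value keeps the two in sync.
fn consume_stats(stats: Statistics) -> usize {
    stats.column_statistics.len()
}

fn main() {
    let schema = Schema::empty();
    assert_eq!(consume_stats(unknown_stats(&schema)), 0);
}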

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad7f099 and b0ecba0.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/build.yaml (8 hunks)
  • Cargo.toml (3 hunks)
  • src/catalog/column.rs (2 hunks)
  • src/catalog/manifest.rs (2 hunks)
  • src/parseable/staging/reader.rs (3 hunks)
  • src/parseable/streams.rs (1 hunks)
  • src/query/mod.rs (3 hunks)
  • src/query/stream_schema_provider.rs (13 hunks)
  • src/storage/localfs.rs (1 hunks)
  • src/storage/metrics_layer.rs (6 hunks)
  • src/storage/object_storage.rs (1 hunks)
  • src/utils/arrow/flight.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/utils/arrow/flight.rs
  • src/parseable/streams.rs
  • src/query/mod.rs
  • src/storage/object_storage.rs
  • src/storage/metrics_layer.rs
  • src/parseable/staging/reader.rs
  • src/storage/localfs.rs
  • src/catalog/manifest.rs
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.

Applied to files:

  • src/catalog/column.rs
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.

Applied to files:

  • .github/workflows/build.yaml
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-10-28T02:10:41.140Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1453
File: src/parseable/mod.rs:397-400
Timestamp: 2025-10-28T02:10:41.140Z
Learning: In Parseable enterprise deployments with multiple query nodes, hot tier configuration must be persisted in object storage so that newly started query nodes can fetch and synchronize the hot tier settings at startup (file: src/parseable/mod.rs, function: create_stream_and_schema_from_storage).

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.

Applied to files:

  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (3)
.github/workflows/build.yaml (1)

142-170: Cross.toml environment configuration is incomplete for rdkafka cross-compilation.

The apt packages and architecture suffixes (:arm64) are valid for Ubuntu focal, and /usr/lib/aarch64-linux-gnu is the correct library path. However, the environment passthrough is insufficient. Beyond PKG_CONFIG_ALLOW_CROSS, rdkafka cross-compilation requires:

  • PKG_CONFIG_SYSROOT_DIR pointing to the target sysroot
  • PKG_CONFIG_LIBDIR configured to find target .pc files
  • OPENSSL_DIR, OPENSSL_LIB_DIR, OPENSSL_INCLUDE_DIR for OpenSSL discovery
  • CC and CXX set to the target toolchain (aarch64-linux-gnu-gcc, aarch64-linux-gnu-g++)
  • LIBRARY_PATH and LD_LIBRARY_PATH for linking against target libraries

Add these variables to the [build.env] passthrough section or configure them in the build invocation to ensure librdkafka resolves dependencies correctly during cross-compilation.

⛔ Skipped due to learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
src/query/stream_schema_provider.rs (2)

135-145: Parquet scan refactor via FileScanConfigBuilder/ParquetSource looks aligned with DF 50+/51

The new flow (build file_groups, wrap the optional physical predicate in ParquetSource, then drive FileScanConfigBuilder with statistics, batch size, constraints, ordering, projection, and limit) is the right way to integrate with the newer DataFusion parquet APIs. The only thing to watch is future DF versions where with_projection_indices becomes fallible (Result<Self>); for 51.0.0 your current usage is fine.

Please double‑check against the exact datafusion = "51.0.0" docs that with_projection_indices still returns Self (not Result<Self>) and that with_statistics takes the same Statistics type you are passing in from partitioned_files.

Also applies to: 146-180


328-343: Using UnionExec::try_new is the correct upgrade here

Switching from infallible construction to UnionExec::try_new(execution_plans)? correctly propagates planning errors instead of panicking; no further changes needed.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)

25-31: Update misleading macOS comment to clarify native vs. cross-compiled builds.

Line 25 states "both native on macos-latest (M1)", but only aarch64-apple-darwin is native on M1. The x86_64-apple-darwin target is cross-compiled and runs under Rosetta 2 emulation, despite use_cross: false (which means Rust's toolchain handles it without requiring the cross tool).

-          # macOS builds - both native on macos-latest (M1)
+          # macOS builds - aarch64 native on M1, x86_64 cross-compiled (Rosetta)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffdf597 and 8442969.

📒 Files selected for processing (2)
  • .github/workflows/build.yaml (8 hunks)
  • Cross.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.

Applied to files:

  • .github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
Cross.toml (1)

3-24: Cross-compilation configuration looks good.

The pre-build steps appropriately install ARM64-specific dependencies with tolerant failures, and the environment passthrough and read-only volume mounts align with typical cross-compilation requirements for rdkafka and OpenSSL.

.github/workflows/build.yaml (3)

69-71: Verify environment variable conditional syntax handles all cases correctly.

Lines 69–71 and 215–217 use conditional expressions to set CC and CXX:

CC: ${{ runner.os == 'Linux' && 'gcc-11' || '' }}

On non-Linux platforms this pattern resolves CC to an empty string (via the || '' fallback), which cargo/cc-rs may not handle the same way as an unset variable. The safer pattern is to use an explicit if: condition on the step or ensure the fallback branch produces a usable value:

CC: ${{ runner.os == 'Linux' && 'gcc-11' || 'default' }}

or

steps:
  - name: Build native
    if: ${{ !matrix.use_cross && runner.os == 'Linux' }}
    env:
      CC: gcc-11
      CXX: g++-11
    run: ...

Please verify the current syntax works as expected on macOS and Windows runners, or adjust to the safer pattern.

Also applies to: 215-217


155-197: Verify necessity and stability of librdkafka CMakeLists.txt workaround.

Lines 155–197 modify librdkafka/CMakeLists.txt at build time to bump the CMake version requirement from 3.2 to 3.5. This is a fragile workaround because:

  1. It modifies vendored dependencies dynamically, which is ephemeral (lost when cargo cache is cleared).
  2. It assumes librdkafka exists in the rdkafka-sys registry source, which could change.
  3. It's version-specific and may require updates if rdkafka-sys/librdkafka versions change.
  4. If the find or sed fails, the error may not be caught early.

Questions:

  • Is this modification necessary due to the Arrow/Parquet 57.0.0 upgrade affecting rdkafka-sys?
  • Can this be handled upstream (e.g., via a patch in Cargo.toml or rdkafka-sys build script) instead?
  • Is there a feature flag or configuration that avoids the CMake version bump?

Consider whether this workaround can be replaced with a more durable solution.


50-52: Clarify why cross is installed from git instead of crates.io release.

Lines 52 and 144 install cross from the git repository:

run: cargo install cross --git https://github.com/cross-rs/cross

Installing from git (HEAD) is less stable than pinning a specific version from crates.io. Please clarify:

  1. Does DataFusion 50.3.0 / Arrow 57.0.0 require an unreleased version of cross?
  2. If git is necessary, consider pinning to a specific commit hash for reproducibility:
    run: cargo install cross --git https://github.com/cross-rs/cross --rev <commit-hash>

Alternatively, if a crates.io release works, prefer the stable version:

run: cargo install cross --version <version>

Also applies to: 142-144

…tion

- Add Docker Buildx and QEMU setup for ARM64 emulation
- Switch to nightly Rust toolchain (required for Rust 1.88.0)
- Configure Cross.toml with ARM64 package installation and library paths
- Set CROSS_NO_WARNINGS=0 to ignore post-install script warnings
- Add CMake environment variables for Kafka builds to find ARM64 libs
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
.github/workflows/build.yaml (2)

40-40: Add documentation explaining why nightly Rust is required.

Nightly Rust is pinned at lines 40 and 140, but the workflow lacks explanation. Based on the DataFusion 50.3.0 upgrade, nightly is required because the parquet reader defaults to schema_force_view_types=true, which uses view types (Utf8View, BinaryView) that are behind nightly-only feature flags. Add a comment above each uses: dtolnay/rust-toolchain@nightly step to document this requirement for future maintainers.

+      # Nightly Rust is required for DataFusion 50.3.0: the parquet reader
+      # defaults to schema_force_view_types=true, and view types (Utf8View,
+      # BinaryView) are behind nightly-only feature flags.
       - name: Setup Rust toolchain
         uses: dtolnay/rust-toolchain@nightly

Also applies to: 140-140


25-25: Clarify macOS comment: only aarch64 is native, x86_64 is cross-compiled.

Line 25 states "both native on macos-latest (M1)", but x86_64-apple-darwin on M1 is cross-compiled (runs under Rosetta), not a native build. Only aarch64-apple-darwin is native. Update the comment for accuracy:

-          # macOS builds - both native on macos-latest (M1)
+          # macOS builds - aarch64 native on M1, x86_64 cross-compiled (runs under Rosetta)
🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)

50-52: Verify cross tool installation source.

Lines 50–52 and 144–146 install the cross tool from the Git repository: --git https://github.com/cross-rs/cross. Consider pinning a specific revision or tag to ensure reproducible builds and avoid unexpected breakage if the repository changes. For example:

run: cargo install cross --git https://github.com/cross-rs/cross --tag v0.2.5

Also applies to: 144-146

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8442969 and e043b19.

📒 Files selected for processing (2)
  • .github/workflows/build.yaml (8 hunks)
  • Cross.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Cross.toml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.

Applied to files:

  • .github/workflows/build.yaml
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.

Applied to files:

  • .github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (3)
.github/workflows/build.yaml (3)

157-199: librdkafka CMakeLists.txt minimum version bump looks reasonable.

The sed commands bump the minimum CMake version from 3.2 to 3.5, which is standard for modern build environments and necessary for the newer librdkafka versions compatible with DataFusion 50.3.0. The platform-specific sed syntax (with '' for macOS, no flag for Linux) is correct.


99-105: Docker Buildx and QEMU setup for cross-compilation looks correct.

The conditional setup of Docker Buildx and QEMU (lines 99–105) is only applied when matrix.use_cross == true, which is appropriate for aarch64 cross-compilation. The tool versions use latest stable tags, which is reasonable for CI.


201-215: Verify and replace hardcoded cross-compile paths with dynamic discovery.

Lines 208–214 hardcode specific aarch64-linux-gnu toolchain paths. These paths are fragile and may not exist across different Ubuntu versions or runner configurations. Consider:

  • Use pkg-config (via PKG_CONFIG_PATH) to dynamically discover library paths instead of hardcoding absolute paths
  • Document which packages must be pre-installed (e.g., crossbuild-essential-arm64, libssl-dev:arm64, zlib1g-dev:arm64)
  • Verify paths exist before build or use conditional logic to handle missing dependencies

The cross tool's Docker-based approach typically handles this automatically; if paths need explicit configuration, document why and make them resilient to environment changes.
