Upgrade df version #1454
Conversation
Walkthrough

Dependency and CI/tooling upgrades; Arrow/DataFusion/Parquet and related crates bumped; Int96 timestamp conversion switched to nanoseconds via a new helper; Parquet scanning refactored to FileScanConfigBuilder + ParquetSource (projection/predicate/limit); object-store API types/signatures updated; CompressionContext added to IPC tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Provider as StandardTableProvider
    participant Builder as FileScanConfigBuilder
    participant Source as ParquetSource
    participant Format as ParquetFormat
    participant Planner as ExecutionPlanner
    Client->>Provider: request scan_execution_plan(...)
    Provider->>Builder: build config from partitions (stats, batch_size, constraints)
    alt predicate present
        Builder->>Source: attach predicate/filters
    end
    alt projection present
        Builder->>Builder: set projection_indices
    end
    alt limit present
        Builder->>Builder: apply limit
    end
    Builder->>Format: create_physical_plan(config)
    Format->>Planner: generate per-file execution plans
    Planner->>Planner: UnionExec::try_new(plans)
    Planner-->>Client: return physical execution plan
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 2 inconclusive)

✨ Finishing touches

🧪 Generate unit tests (beta)
4513119 to afa9f2e (Compare)
48ffcf5 to 010988c (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Cargo.toml (1)
159-162: Version mismatch in dev-dependencies: arrow 54.0.0 vs 57.1.0.

The `[dev-dependencies]` section pins `arrow = "54.0.0"` while the main dependencies use `arrow = "57.1.0"`. This version mismatch can cause type incompatibilities in tests when Arrow types cross the boundary between test code and library code. Consider aligning the versions:

```diff
 [dev-dependencies]
 rstest = "0.23.0"
-arrow = "54.0.0"
+arrow = "57.1.0"
 temp-dir = "0.1.14"
```

If this mismatch is intentional for specific testing purposes, please document the rationale.
🧹 Nitpick comments (2)
src/catalog/manifest.rs (1)
176-201: column_statistics signature simplification looks good; optional consistency tweak

Switching `column_statistics` to take `&[RowGroupMetaData]` (using the imported alias) is a pure type-path cleanup; the aggregation over `row_group.columns()`, sizes, and `stats.try_into()` is unchanged and still correct with current parquet metadata types. If you want to keep things fully consistent, you could also update `sort_order`'s parameter type to use the same `RowGroupMetaData` alias instead of the fully qualified path, but that's purely cosmetic.

After the parquet bump, please verify that `RowGroupMetaData::columns`, `compressed_size`, `uncompressed_size`, and `statistics().try_into()` still behave as expected by running a manifest-generation test against a few real parquet files.
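For reference, a minimal sketch of that aggregation shape (a hypothetical helper, not the actual `column_statistics` code):

```rust
use parquet::file::metadata::RowGroupMetaData;

// Hypothetical helper mirroring the aggregation described above: walk each
// row group's column chunks and sum their on-disk vs. in-memory sizes.
fn total_sizes(row_groups: &[RowGroupMetaData]) -> (i64, i64) {
    let mut compressed = 0i64;
    let mut uncompressed = 0i64;
    for row_group in row_groups {
        for column in row_group.columns() {
            compressed += column.compressed_size();
            uncompressed += column.uncompressed_size();
        }
    }
    (compressed, uncompressed)
}
```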
src/query/stream_schema_provider.rs (1)

145-189: LGTM! Correctly migrated to FileScanConfigBuilder pattern.

The refactor properly uses the new DataFusion API:

- `ParquetSource` with optional predicate for filter pushdown
- `FileScanConfigBuilder` for assembling scan configuration
- Projection and limit conditionally applied via builder methods
- No longer requires downcasting `SessionState`

One consideration: the batch size is hardcoded to 20000. This matches the session config in `create_session_state` (line 144), but consider using the session's configured batch size instead of hardcoding it:

```diff
-.with_batch_size(Some(20000))
+.with_batch_size(Some(state.config().batch_size()))
```
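A tiny sketch of that suggestion (the helper name is hypothetical); it just reads the batch size already configured on the session instead of repeating the constant:

```rust
use datafusion::execution::session_state::SessionState;

// Hypothetical helper: pull the batch size from the session's config so the
// scan stays in sync with whatever create_session_state configured.
fn configured_batch_size(state: &SessionState) -> usize {
    state.config().batch_size()
}
```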
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- `Cargo.toml` (3 hunks)
- `src/catalog/column.rs` (2 hunks)
- `src/catalog/manifest.rs` (2 hunks)
- `src/parseable/staging/reader.rs` (3 hunks)
- `src/parseable/streams.rs` (1 hunks)
- `src/query/mod.rs` (3 hunks)
- `src/query/stream_schema_provider.rs` (13 hunks)
- `src/storage/localfs.rs` (1 hunks)
- `src/storage/metrics_layer.rs` (6 hunks)
- `src/storage/object_storage.rs` (1 hunks)
- `src/utils/arrow/flight.rs` (3 hunks)
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
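For illustration, a minimal, self-contained sketch of that downcast (the helper and sample data here are hypothetical, not code from the PR):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringViewArray};

// Columns read back through DataFusion's parquet reader may use the Utf8View
// layout, so downcast to StringViewArray rather than StringArray.
fn first_string(col: &ArrayRef) -> Option<String> {
    let strings = col.as_any().downcast_ref::<StringViewArray>()?;
    (!strings.is_empty() && strings.is_valid(0)).then(|| strings.value(0).to_string())
}

fn main() {
    let col: ArrayRef = Arc::new(StringViewArray::from(vec!["parseable"]));
    assert_eq!(first_string(&col), Some("parseable".to_string()));
}
```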
📚 Learning: 2025-09-14T15:17:59.234Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1432
File: src/storage/object_storage.rs:124-132
Timestamp: 2025-09-14T15:17:59.234Z
Learning: In Parseable's upload validation system (src/storage/object_storage.rs), the validate_uploaded_parquet_file function should not include bounded retries for metadata consistency issues. Instead, failed validations rely on the 30-second sync cycle for natural retries, with staging files preserved when manifest_file is set to None.
Applied to files:
src/storage/object_storage.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.
Applied to files:
src/storage/object_storage.rs
src/storage/metrics_layer.rs
src/storage/localfs.rs
src/catalog/manifest.rs
📚 Learning: 2025-08-18T14:56:18.463Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/storage/object_storage.rs:997-1040
Timestamp: 2025-08-18T14:56:18.463Z
Learning: In Parseable's staging upload system (src/storage/object_storage.rs), failed parquet file uploads should remain in the staging directory for retry in the next sync cycle, while successful uploads remove their staged files immediately. Early return on first error in collect_upload_results is correct behavior as concurrent tasks handle their own cleanup and failed files need to stay for retry.
Applied to files:
src/storage/object_storage.rs
src/storage/metrics_layer.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.
Applied to files:
src/storage/object_storage.rs
src/query/stream_schema_provider.rs
📚 Learning: 2025-10-24T11:54:20.269Z
Learnt from: parmesant
Repo: parseablehq/parseable PR: 1449
File: src/metastore/metastores/object_store_metastore.rs:83-98
Timestamp: 2025-10-24T11:54:20.269Z
Learning: In the `get_overviews` method in `src/metastore/metastores/object_store_metastore.rs`, using `.ok()` to convert all storage errors to `None` when fetching overview objects is the intended behavior. This intentionally treats missing files and other errors (network, permissions, etc.) the same way.
Applied to files:
src/storage/object_storage.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.
Applied to files:
src/parseable/streams.rs
src/storage/metrics_layer.rs
src/query/stream_schema_provider.rs
src/catalog/manifest.rs
📚 Learning: 2025-09-05T09:18:44.813Z
Learnt from: parmesant
Repo: parseablehq/parseable PR: 1425
File: src/query/mod.rs:484-495
Timestamp: 2025-09-05T09:18:44.813Z
Learning: In the Parseable system, stream names and column names cannot contain quotes, which eliminates SQL injection concerns when interpolating these identifiers directly into SQL queries in src/query/mod.rs.
Applied to files:
src/parseable/streams.rs
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.
Applied to files:
src/utils/arrow/flight.rs
src/catalog/column.rs
📚 Learning: 2025-09-18T09:59:20.177Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:700-756
Timestamp: 2025-09-18T09:59:20.177Z
Learning: In src/event/mod.rs, the parsed_timestamp used in increment_events_ingested_by_date() is correctly UTC-normalized: for dynamic streams it remains Utc::now(), and for streams with time partition enabled it uses the time partition value. Both cases result in proper UTC date strings for metrics labeling, preventing double-counting issues.
Applied to files:
src/utils/arrow/flight.rs
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-09-23T10:08:36.368Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/localfs.rs:41-43
Timestamp: 2025-09-23T10:08:36.368Z
Learning: In src/storage/localfs.rs, the metrics use "localfs" as the provider label (not "drive" from FSConfig.name()) because it aligns with the object storage implementation naming for the file system. This is intentional and should not be flagged as an inconsistency.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-08-21T11:47:01.279Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T11:47:01.279Z
Learning: In Parseable's object storage implementation (src/storage/object_storage.rs), the hour and minute directory prefixes (hour=XX, minute=YY) are generated from arrow file timestamps following proper datetime conventions, so they are guaranteed to be within valid ranges (0-23 for hours, 0-59 for minutes) and don't require additional range validation.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Applied to files:
src/query/mod.rs
src/query/stream_schema_provider.rs
📚 Learning: 2025-10-28T02:10:41.140Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1453
File: src/parseable/mod.rs:397-400
Timestamp: 2025-10-28T02:10:41.140Z
Learning: In Parseable enterprise deployments with multiple query nodes, hot tier configuration must be persisted in object storage so that newly started query nodes can fetch and synchronize the hot tier settings at startup (file: src/parseable/mod.rs, function: create_stream_and_schema_from_storage).
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Applied to files:
Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
🔇 Additional comments (25)
src/parseable/streams.rs (1)
34-42: SortingColumn import now matches Parquet 57 metadata API

Importing `metadata::SortingColumn` from `parquet::file` aligns with the current `WriterProperties`/`RowGroupMetaData` APIs and avoids the deprecated `parquet::format` path. The existing use in `parquet_writer_props` remains correct and type-safe.

Please re-run the parquet-related tests (ingestion + readback) with the new parquet version to confirm there are no behavioral changes around sort columns.
src/catalog/manifest.rs (1)
21-25: Parquet metadata imports are consistent with upgraded parquet API

Consolidating imports to `file::metadata::{RowGroupMetaData, SortingColumn}` matches how parquet 57 exposes metadata structures and keeps call sites (`row_groups`, `sorting_columns`) clear and type-correct.

Please ensure the crate versions used in `Cargo.toml` match the parquet API you targeted here and that manifest-building tests (or an end-to-end manifest generation run) pass with these new imports.

src/utils/arrow/flight.rs (2)
31-31: LGTM - Correct adaptation to DataFusion 51.0.0 Column API.

The `Spans` import and initialization align with the updated `datafusion::common::Column` struct requirements.

Also applies to: 120-124

142-144: LGTM - Updated `Expr::Literal` signature.

The addition of the second `None` parameter correctly matches the new `Expr::Literal(ScalarValue, Option<Spans>)` variant in DataFusion 51.0.0.
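For illustration, a minimal sketch of constructing a literal under the new two-field variant (the helper name is hypothetical):

```rust
use datafusion::logical_expr::Expr;
use datafusion::scalar::ScalarValue;

// Hypothetical example: the second field of Expr::Literal carries optional
// metadata and is simply None for a plain (timezone-less) timestamp literal.
fn timestamp_literal(millis: i64) -> Expr {
    Expr::Literal(
        ScalarValue::TimestampMillisecond(Some(millis), None), // value, timezone
        None,                                                  // no attached metadata
    )
}
```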
src/catalog/column.rs (2)

158-161: Verify that Int96 statistics are only used for timestamp columns.

The conversion assumes Int96 always represents timestamps. If non-timestamp Int96 data exists, this interpretation would be incorrect. The comment on line 200-201 documents this assumption well.

199-216: Implementation correctly follows Parquet INT96 specification.

The Int96 conversion is accurate. The web search confirms INT96 is indeed 12 bytes little-endian with 8 bytes of nanoseconds-since-midnight + 4 bytes of Julian day, and the formula `(julian_day - 2440588) * 86400 * 1e9 + nanoseconds_of_day` is the standard conversion to Unix epoch nanoseconds. The Julian day constant 2,440,588 for 1970-01-01 is correct, and negative timestamps pre-1970 will correctly produce negative i64 values.
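A self-contained sketch of that formula (the helper name is hypothetical and mirrors the conversion quoted above, not necessarily the exact code in src/catalog/column.rs):

```rust
/// Convert a Parquet INT96 value (12 bytes, little-endian: 8 bytes of
/// nanoseconds since midnight followed by 4 bytes of Julian day) into
/// nanoseconds since the Unix epoch. 2_440_588 is the Julian day of 1970-01-01.
fn int96_to_epoch_nanos(bytes: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().unwrap()) as i64;
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // 1970-01-02T00:00:00 → exactly one day of nanoseconds past the epoch.
    let mut raw = [0u8; 12];
    raw[8..12].copy_from_slice(&(2_440_589i32).to_le_bytes());
    assert_eq!(int96_to_epoch_nanos(raw), 86_400 * 1_000_000_000);
}
```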
src/storage/localfs.rs (1)

140-149: LGTM - Correct adaptation to object_store 0.12.x API.

The `ObjectMeta.size` field is now `u64` in object_store 0.12.x, matching `std::fs::Metadata::len()`'s return type. Removing the cast is correct.
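As a small illustration of why the cast disappears (hypothetical helper; both `ObjectMeta.size` and `std::fs::Metadata::len()` are `u64` now):

```rust
use chrono::{DateTime, Utc};
use object_store::{path::Path, ObjectMeta};

// Hypothetical helper: build an ObjectMeta from local file metadata. With
// object_store 0.12 the `size` field is u64, so `Metadata::len()` (also u64)
// is assigned directly with no `as u64` cast.
fn object_meta_from_fs(location: Path, meta: &std::fs::Metadata) -> ObjectMeta {
    ObjectMeta {
        location,
        last_modified: meta
            .modified()
            .map(DateTime::<Utc>::from)
            .unwrap_or_else(|_| Utc::now()),
        size: meta.len(),
        e_tag: None,
        version: None,
    }
}
```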
src/parseable/staging/reader.rs (2)

344-347: LGTM - Updated imports for Arrow IPC API changes.

The `CompressionContext` import is required for the new `IpcDataGenerator::encode()` signature in Arrow 57.x.

437-458: LGTM - Correct adaptation to `IpcDataGenerator::encode()` API.

The new `encode()` method requires a mutable `CompressionContext` parameter. Using `CompressionContext::default()` is appropriate for this test scenario.

Cargo.toml (3)
12-26: LGTM - Arrow/DataFusion ecosystem upgrade.

The coordinated upgrade of Arrow (57.1.0), DataFusion (51.0.0), Parquet (57.1.0), and object_store (0.12.4) maintains version compatibility across the ecosystem.

130-137: LGTM - Minor dependency updates.

The regex update (1.12.2) and reqwest "stream" feature addition are appropriate.

36-38: Based on the web search results, I can now provide verification findings:

Verification Results:

- tls-aws-lc and tls-native-roots are compatible — Both are optional Cargo features in tonic 0.14.x and can be enabled together without direct conflict. The first selects the aws-lc crypto provider for rustls; the second adds system trust roots via rustls-native-certs.
- Broader TLS ecosystem concern is valid — rustls v0.23+ removed hardcoded crypto providers, so different dependencies may default to different providers (aws-lc vs ring), which can cause build conflicts. However, this is a dependency-graph-wide issue, not specific to tonic's feature selection.
- Repository access limitation — I cannot access the full Cargo.toml to verify if other TLS dependencies (like the mentioned "rustls 0.22.4") create actual mismatches with the tonic configuration.

Rewritten Review Comment:

Verify tonic TLS configuration across the dependency graph.

The `tls-aws-lc` and `tls-native-roots` features are compatible and can coexist in tonic 0.14.1. However, ensure that all TLS-related dependencies (e.g., hyper-rustls, aws-sdk) consistently use the same crypto provider (aws-lc vs ring). Mismatched providers across the dependency graph can cause build failures. Run `cargo tree -e features` to confirm provider unification.
src/storage/object_storage.rs (1)

226-226: LGTM! Correct adaptation to object_store API change.

The cast removal is correct since `ObjectMeta.size` is now `u64` in the updated `object_store` crate, making the explicit cast unnecessary.

src/query/mod.rs (2)
28-28: LGTM! Import paths updated for DataFusion 51.

The import changes correctly reflect DataFusion's API reorganization:

- `DiskManager` replaces `DiskManagerConfig` for the builder pattern
- `resolve_table_references` moved to the `datafusion::sql::resolve` module

Also applies to: 36-36

233-244: LGTM! Correctly preserves explain_format in transformed plan.

The addition of `explain_format: plan.explain_format` ensures the explain output format (e.g., TEXT, JSON) is preserved when transforming the logical plan. This was a necessary fix for the DataFusion upgrade where the `Explain` struct now requires this field.

src/storage/metrics_layer.rs (4)
29-31: LGTM! Import updates for object_store API.

Import changes correctly reflect the renamed types and reorganized module structure in the updated `object_store` crate.

146-164: LGTM! Correct signature update for put_multipart_opts.

The parameter type correctly updated from `PutMultipartOpts` to `PutMultipartOptions` to match the new object_store API.

216-250: LGTM! Range type updated from usize to u64.

The change from `Range<usize>` to `Range<u64>` for the `get_range` and `get_ranges` methods correctly follows the object_store API update. This enables proper handling of files larger than 4GB on 32-bit systems.
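A minimal sketch (hypothetical helper) of what call sites need after this change: offsets that used to be `usize` are widened to `u64` before building the range:

```rust
use std::ops::Range;

// Hypothetical call-site helper: widen usize offsets (e.g. computed from an
// in-memory buffer) into the Range<u64> that object_store 0.12's get_range /
// get_ranges now expect.
fn byte_range(start: usize, len: usize) -> Range<u64> {
    let start = start as u64;
    start..start + len as u64
}

fn main() {
    assert_eq!(byte_range(10, 4), 10u64..14u64);
}
```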
291-316: LGTM! Stream lifetime updated to 'static.

The `list` and `list_with_offset` methods now return `BoxStream<'static, ...>` instead of borrowing from `self`. This matches the updated object_store trait requirements.

src/query/stream_schema_provider.rs (6)
35-38: LGTM! New DataFusion imports for file scanning.

The imports correctly bring in the new builder-pattern types (`FileScanConfigBuilder`, `FileGroup`, `ParquetSource`) and the moved `ObjectStoreUrl` from the updated module path.

340-343: LGTM! Correct migration to fallible UnionExec construction.

`UnionExec::try_new` replaces `UnionExec::new` and returns a `Result`, which is properly propagated with `?`. This aligns with DataFusion's move toward fallible constructors.

725-729: LGTM! ScalarValue literal format updated.

The `Expr::Literal` now takes a tuple of `(ScalarValue, Option<_>)` where the second element represents additional metadata. Using `None` for the second parameter is correct for timestamp literals.

870-895: LGTM! Pattern matching updated for new Expr::Literal format.

The destructuring pattern `Expr::Literal(value, None)` correctly matches the new literal representation and ensures only simple literals without additional metadata are processed.
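A short illustrative sketch of that destructuring (the function name is hypothetical): only literals with no attached metadata yield a scalar.

```rust
use datafusion::logical_expr::Expr;
use datafusion::scalar::ScalarValue;

// Hypothetical extractor mirroring the pattern described above: the two-field
// Expr::Literal is matched with `None` metadata so only plain literals are
// considered during pruning/scalar extraction.
fn literal_scalar(expr: &Expr) -> Option<&ScalarValue> {
    match expr {
        Expr::Literal(value, None) => Some(value),
        _ => None,
    }
}
```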
942-946: LGTM! can_be_pruned updated for new literal format.

The pattern matching in `extract_op_scalar` correctly handles the new `Expr::Literal(value, None)` format for scalar extraction.

1129-1132: LGTM! Test cases updated for new ScalarValue format.

All test cases correctly use the new `Expr::Literal(ScalarValue::..., None)` format, ensuring tests remain valid after the DataFusion upgrade.

Also applies to: 1151-1154, 1174-1177, 1197-1200, 1214-1214, 1242-1245, 1260-1263
010988c to 8f3743e (Compare)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
Cargo.toml (2)
1-182: Based on my verification through web search, I can now provide a rewritten review comment. The web search results confirm several important findings:
Dependency Compatibility: Arrow 57.1.0 is backward-compatible with DataFusion 51.0.0, Parquet 57.1.0 has no breaking changes, and Tonic 0.14.1's TLS feature combination (tls-aws-lc + tls-native-roots) is valid.
Testing Justification: Upstream maintainers explicitly recommend running integration tests for parquet-read/write logic when upgrading to Arrow 57.1.0, as it contains bug fixes that need verification.
Rust Edition 2024: Properly supported by rust-version 1.88.0.
However, the original review comment's core concerns remain valid—the PR checklist items are still unchecked and the dependency upgrade does require verification testing per upstream guidance.
Testing and documentation still required for dependency upgrade.
The dependency versions (Arrow 57.1.0, DataFusion 51.0.0, Parquet 57.1.0, Tonic 0.14.1) are compatible with no breaking changes. However, upstream maintainers explicitly recommend running parquet-read/write integration tests when upgrading to Arrow 57.1.0 due to bug fixes. All PR checklist items remain unchecked:
- ☐ Execute integration tests for log ingestion and query functionality
- ☐ Add inline comments explaining version selection rationale
- ☐ Document any changes for operators/users
161-161: Based on my verification: Arrow dev-dependency version mismatch is confirmed—update to match main dependency version.
The concern in the original review is valid:
- DataFusion 51.0.0 officially requires Arrow 57.x (confirmed by DataFusion upgrade docs and related PRs)
- Breaking changes exist between Arrow 54.0.0 and 57.1.0:
  - `Struct::try_new` length handling (55.x)
  - `usize → u64` conversions for file offsets and ranges (55.x)
- Tests using Arrow 54.0.0 will not catch these breaking changes, creating false confidence
- Best practices recommend pinning dev-dependency versions precisely and documenting any intentional divergence
The version mismatch (54.0.0 vs 57.1.0) is almost certainly unintentional and should be corrected.
Critical: Arrow dev-dependency version mismatch.
Line 161 pins arrow to `54.0.0` in dev-dependencies while DataFusion 51.0.0 (a main dependency) requires `57.1.0`. This 3-version gap includes breaking changes (e.g., `Struct::try_new` behavior, `usize`→`u64` conversions in 55.x+) that will not be caught by tests but will affect production code paths, creating false test confidence.

Update dev-dependencies to:

```diff
 [dev-dependencies]
 rstest = "0.23.0"
-arrow = "54.0.0"
+arrow = "57.1.0"
 temp-dir = "0.1.14"
```
🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)
328-345: Switch to `UnionExec::try_new` is a good robustness improvement

Using `UnionExec::try_new(execution_plans)?` instead of constructing the union directly makes this path respect any validation errors DataFusion adds (e.g., mismatched schemas across inputs). Error propagation via `Result<Arc<dyn ExecutionPlan>, DataFusionError>` is still correct.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- `Cargo.toml` (3 hunks)
- `src/catalog/column.rs` (2 hunks)
- `src/catalog/manifest.rs` (2 hunks)
- `src/parseable/staging/reader.rs` (3 hunks)
- `src/parseable/streams.rs` (1 hunks)
- `src/query/mod.rs` (3 hunks)
- `src/query/stream_schema_provider.rs` (13 hunks)
- `src/storage/localfs.rs` (1 hunks)
- `src/storage/metrics_layer.rs` (6 hunks)
- `src/storage/object_storage.rs` (1 hunks)
- `src/utils/arrow/flight.rs` (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- src/storage/object_storage.rs
- src/parseable/streams.rs
- src/query/mod.rs
- src/catalog/column.rs
- src/utils/arrow/flight.rs
- src/catalog/manifest.rs
- src/storage/localfs.rs
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-08-18T14:56:18.463Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/storage/object_storage.rs:997-1040
Timestamp: 2025-08-18T14:56:18.463Z
Learning: In Parseable's staging upload system (src/storage/object_storage.rs), failed parquet file uploads should remain in the staging directory for retry in the next sync cycle, while successful uploads remove their staged files immediately. Early return on first error in collect_upload_results is correct behavior as concurrent tasks handle their own cleanup and failed files need to stay for retry.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-08-21T11:47:01.279Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T11:47:01.279Z
Learning: In Parseable's object storage implementation (src/storage/object_storage.rs), the hour and minute directory prefixes (hour=XX, minute=YY) are generated from arrow file timestamps following proper datetime conventions, so they are guaranteed to be within valid ranges (0-23 for hours, 0-59 for minutes) and don't require additional range validation.
Applied to files:
src/storage/metrics_layer.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Applied to files:
Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
🔇 Additional comments (15)
src/parseable/staging/reader.rs (3)
345-347: Arrow IPC writer imports align with new usage

Adding `CompressionContext` and `write_message` to the `arrow_ipc::writer` imports is consistent with how they're used below in tests; no issues from a Rust perspective.

Please double-check against the Arrow 57 `arrow_ipc::writer` docs/API that these symbols and their module path are still correct.

437-437: Reusing a single `CompressionContext` across encodes is correct

Initializing a single mutable `CompressionContext` and reusing it across all `encode` calls matches how a stateful compression context is typically intended to be used in Arrow; the scoping looks good.

Please confirm in the Arrow 57 docs that `CompressionContext::default()` is the recommended initializer with `IpcWriteOptions::default()`.

451-456: Updated `encode` call signature matches the new API expectations

Passing `&mut dictionary_tracker`, `&options`, and `&mut compression_context` to `encode` in that order looks consistent and keeps the context shared across the loop, which is what you want for a single stream.

Since this depends on the exact Arrow 57 `IpcDataGenerator::encode` signature, please verify the parameter order and types against the current crate docs.

src/storage/metrics_layer.rs (6)
29-32: Imports align with updated object_store API

Using `PutMultipartOptions` and the current object_store aliases looks consistent with the newer trait surface; no issues seen.

145-164: put_multipart_opts wrapper correctly updated to PutMultipartOptions

Signature now matches the newer multipart options type and still delegates directly while recording metrics; behavior is unchanged aside from the API type.

216-230: get_range `Range<u64>` passthrough looks correct

Switching to `Range<u64>` and returning `Bytes` while delegating to `inner.get_range` preserves existing behavior and matches the expected byte-offset semantics; no extra conversion logic is needed here. Please just confirm all call sites now construct `Range<u64>`.

232-250: get_ranges updated to `&[Range<u64>]` with no semantic change

Forwarding `&[Range<u64>]` to `inner.get_ranges` and returning `Vec<Bytes>` keeps this as a pure metrics wrapper. Only the index type changed, which should be safe as long as callers were updated alongside the trait.

291-300: list now returning BoxStream<'static> remains type-safe

Wrapping `inner.list(prefix)` in `StreamMetricWrapper` and boxing to `BoxStream<'static, ObjectStoreResult<ObjectMeta>>` looks compatible with the updated trait, since the wrapper owns the inner stream and doesn't capture non-'static borrows. No functional changes beyond the lifetime.

302-316: list_with_offset 'static stream lifetime matches list change

Same pattern as `list`: you wrap `inner.list_with_offset` and return a `'static` boxed stream, with metrics emitted when the stream completes. This looks correct and consistent with the new ObjectStore signature.

Cargo.toml (4)
30-30: Based on my research of the TLS ecosystem across these crates, I can now provide a corrected review comment:
Original review concern about duplicate TLS provider symbols is unfounded; tls-aws-lc and tls-native-roots are complementary features designed to be enabled together.
In tonic 0.14.1 and arrow-flight 57.1.0:
- `tls-aws-lc`: enables the AWS-LC crypto provider for rustls (cryptographic backend)
- `tls-native-roots`: enables OS root certificate store integration (certificate loading)

These are not conflicting providers but work together. Enabling both simultaneously is the documented and intended usage pattern.

However, there remains a valid underlying concern: ensure rustls version consistency across all transitive dependencies. If any transitive dependency uses rustls 0.21 or mismatched companion crates (e.g., rustls-native-certs targeting a different rustls version), this will cause type incompatibilities. Verify via `cargo tree` that all dependencies compile against rustls 0.22.x specifically, particularly any crates providing TLS integrations (tokio-rustls, rustls-pki-types, aws-lc-rs adapters).
36-38: Based on my research, I can now provide a rewritten review comment with verified information about the tonic upgrade:
Verify tonic 0.14.1 upgrade integration is complete with updated build configuration and passing integration tests.
The upgrade from tonic 0.12.3 → 0.14.1 involves significant structural changes. Tonic 0.14.x extracted prost-related functionality into separate crates; the addition of `tonic-prost = "0.14.1"` is correct and necessary. The enabled features (`transport`, `gzip`, `zstd`) are appropriate for this upgrade path.

However, this upgrade requires code changes beyond Cargo.toml updates:

- `build.rs` must be updated to use `tonic-prost-build` if applicable
- Proto files must be re-generated with the new tooling
- Any imports referencing prost types previously re-exported from tonic must be updated to use `prost` or `tonic-prost` directly
- Integration tests must confirm log ingestion and query functionality work

The opentelemetry ecosystem (including `opentelemetry-otlp` and `opentelemetry-proto` with the `gen-tonic` feature) is compatible with tonic 0.14.1, so gRPC endpoint communication should work if migration steps are complete.

Verification needed:

- Confirm `build.rs` has been updated to use `tonic-prost-build` for proto code generation
- Verify proto files have been regenerated and import errors resolved
- Run and pass integration tests for log ingestion and query workflows before merging (PR checklist currently shows testing unchecked)
20-25: I can confirm the breaking changes in object_store 0.12.4 through web search, but the repository is currently inaccessible for direct codebase verification. However, I have enough information to provide a rewritten review comment based on the confirmed breaking changes.
Verify object_store 0.12.4 API migration to u64-based ranges and sizes is complete.
object_store 0.12.4 introduces a breaking change: byte-range and size types changed from `usize` to `u64`. This affects:

- `get_range()` calls now expect `Range<u64>` instead of `Range<usize>`
- `ObjectMeta` size fields are now `u64` instead of `usize`
- Any intermediate range or size handling in the parquet file management flow

Verify that:

- All `get_range()` calls pass `Range<u64>` with proper type conversions
- `ObjectMeta` field accesses (`.size`, `.size_bytes`) are handled as `u64` values
- The minute-based atomic parquet conversion flow (staging to `inprocess_` directories) properly handles `u64` sizes
- File metadata expectations throughout the upload and validation flow account for the new type signatures
12-26: Verify DataFusion 51.0.0 FileScanConfig refactoring and Parquet 57.1.0 FileEncryptionProperties Arc migration.

The upgrade spans Arrow 57.1.0, DataFusion 51.0.0, and Parquet 57.1.0. Based on upstream release notes:

Correction: Arrow 57.1.0 contains only a robustness fix for INT96 statistics (avoid panic when size ≠ 12)—no breaking changes to Int96 types or view types. The concern about "Int96 timestamp conversion refactoring" appears to be inaccurate.

Confirmed breaking changes requiring verification:

DataFusion 51.0.0 – FileScanConfig refactoring (critical):

- Schema now required at source construction time; the `ParquetSource::new(table_schema)` pattern is required
- `ParquetExec` replaced by `DataSourceExec` with the downcasting pattern `DataSourceExec → FileScanConfig → ParquetSource`
- `with_table_partition_cols()` method removed
- Verify all parquet scan construction and inspection code uses the new patterns

Parquet 57.1.0 – FileEncryptionProperties Arc migration:

- APIs now require `Arc<FileEncryptionProperties>` instead of owned values
- Nested/list/struct schema decoding regressions reported; verify nested schema reads/writes work correctly

View types concern (from learnings): Ensure `StringViewArray`/`BinaryViewArray` downcasting is properly handled where DataFusion's default `schema_force_view_types=true` applies.

The PR checklist shows "tested to ensure log ingestion and log query works" is unchecked. Verify parquet conversion and file metadata operations (which use `expect()` per system design) handle these breaking changes correctly, especially around nested data handling.

src/query/stream_schema_provider.rs (2)
704-731: I'll help you verify and rewrite this review comment. Let me start by examining the codebase to confirm the claims made in the review.
EOF
</function_calls><function_calls>
#!/bin/bashGet full file content of all key functions mentioned in the comment
echo "=== Searching for all key functions ==="
for func in "extract_timestamp_bound" "can_be_pruned" "binary_expr"; do
echo ""
echo "=== Function: $func ==="
rg -n "fn $func" src/query/stream_schema_provider.rs -A 30 | head -50
done
</function_calls>Let me take a more direct approach and examine the file systematically:
<function_calls>
#!/bin/bashRead the entire file and search systematically
cat src/query/stream_schema_provider.rs > /tmp/schema.rs
Extract everything related to PartialTimeFilter, extract_timestamp_bound, and can_be_pruned
echo "=== Searching in complete file ==="
grep -n "enum PartialTimeFilter|impl PartialTimeFilter|fn binary_expr|fn extract_timestamp_bound|fn can_be_pruned|trait ManifestExt" /tmp/schema.rs
</function_calls><function_calls>
#!/bin/bashNow get the actual content at those lines
python3 << 'EOF'
import rewith open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()Find key functions/structs
search_terms = [
'PartialTimeFilter',
'extract_timestamp_bound',
'can_be_pruned',
'ManifestExt',
'binary_expr'
]for term in search_terms:
print(f"\n{'='*60}")
print(f"Searching for: {term}")
print('='*60)
for i, line in enumerate(lines):
if term in line:
# Print context
start = max(0, i-3)
end = min(len(lines), i+30)
for j in range(start, end):
prefix = ">>> " if j == i else " "
print(f"{prefix}{j+1}: {lines[j]}", end='')
print()
break
EOF
</function_calls><function_calls>
#!/bin/bashLet's verify the exact line numbers mentioned in the comment
wc -l /tmp/schema.rs
echo ""
echo "Does file have at least 1268 lines?$([ $ (wc -l < /tmp/schema.rs) -ge 1268 ] && echo 'YES' || echo 'NO')"
echo ""Show content at exact lines from comment
echo "=== Lines 704-731 (binary_expr snippet) ==="
sed -n '704,731p' /tmp/schema.rs
</function_calls><function_calls>
#!/bin/bashVerify all the mentioned line ranges exist
python3 << 'EOF'
import osfile_path = '/tmp/schema.rs'
total_lines = sum(1 for _ in open(file_path))ranges = [(704, 731), (866, 896), (937, 947), (978, 996), (1125, 1268)]
print(f"Total lines in file: {total_lines}\n")
for start, end in ranges:
valid = total_lines >= end
status = "✓ VALID" if valid else "✗ INVALID"
print(f"Lines {start}-{end}: {status}")
EOF
</function_calls><function_calls>
#!/bin/bashNow show the actual code at those ranges
python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()ranges = [(704, 731), (866, 896), (937, 947), (978, 996)]
for start, end in ranges:
print(f"\n{'='*60}")
print(f"Lines {start}-{end}")
print('='*60)
for i in range(start-1, min(end, len(lines))):
print(f"{i+1}: {lines[i]}", end='')
EOF
</function_calls><function_calls>
#!/bin/bashGet a more complete picture of the code structure
python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()Look for the PartialTimeFilter enum definition
for i, line in enumerate(lines):
if 'enum PartialTimeFilter' in line:
print("Found PartialTimeFilter enum:")
for j in range(i, min(i+20, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
breakLook for impl PartialTimeFilter with binary_expr
print("\n\nLooking for binary_expr implementation:")
for i, line in enumerate(lines):
if 'fn binary_expr' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
breakLook for extract_timestamp_bound
print("\n\nLooking for extract_timestamp_bound:")
for i, line in enumerate(lines):
if 'fn extract_timestamp_bound' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break
EOF
</function_calls><function_calls>
#!/bin/bashCheck for trait ManifestExt and can_be_pruned
python3 << 'EOF'
with open('/tmp/schema.rs', 'r') as f:
lines = f.readlines()Look for trait ManifestExt
print("Looking for ManifestExt trait:")
for i, line in enumerate(lines):
if 'trait ManifestExt' in line:
for j in range(i, min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
breakprint("\n\nLooking for can_be_pruned function:")
for i, line in enumerate(lines):
if 'fn can_be_pruned' in line:
for j in range(max(0, i-5), min(i+50, len(lines))):
print(f"{j+1}: {lines[j]}", end='')
break
EOF
</function_calls><function_calls>
#!/bin/bashLook for test modules and test coverage
grep -n "#[test]|#[cfg(test)]|mod tests" /tmp/schema.rs | head -20
</function_calls><function_calls>
#!/bin/bashCheck imports to verify DataFusion API usage
head -100 /tmp/schema.rs | grep -n "use|import"
</function_calls><function_calls>
#!/bin/bashVerify specific patterns mentioned in the comment are present
echo "=== Checking for specific code patterns ==="
echo ""
echo "1. Pattern: Expr::Literal with ScalarValue::TimestampMillisecond"
grep -n "ScalarValue::TimestampMillisecond|ScalarValue::TimestampNanosecond" /tmp/schema.rs | head -5
echo ""
echo "2. Pattern: and_utc().timestamp_millis()"
grep -n "and_utc().timestamp_millis" /tmp/schema.rs | head -5
echo ""
echo "3. Pattern: Operator::"
grep -n "Operator::" /tmp/schema.rs | head -10
echo ""
echo "4. Pattern: BinaryExpr::new"
grep -n "BinaryExpr::new" /tmp/schema.rs | head -5
</function_calls>Based on my comprehensive examination of the codebase, I can now provide the rewritten review comment:
Timestamp literal handling is correctly updated for the new `Expr::Literal`/`ScalarValue` shapes; watch for metadata-bearing literals

The changes to:

- construct literals as `Expr::Literal(ScalarValue::Timestamp*(_, None))` in `PartialTimeFilter::binary_expr`,
- pattern match them as `Expr::Literal(value, None)` in `extract_timestamp_bound` and `ManifestExt::can_be_pruned`, and
- handle `ScalarValue::TimestampMillisecond(Some(v), _)` / `TimestampNanosecond(Some(v), _)` and `Utf8` (when `is_time_partition`)

are consistent with the current DataFusion scalar/expr APIs and with the updated approach to literal representation.

Two subtle points to keep in mind:

- Because you match `Expr::Literal(_, None)`, any literal that carries metadata (e.g., aliased expressions that happen to end up on the RHS) will be ignored for both time-range detection and manifest pruning. That is probably fine given how you construct filters today, but if upstream ever starts attaching metadata to user literals, these helpers will silently stop recognizing them.
- `DateTime::from_timestamp_millis(*value).unwrap()` will still panic if someone manages to push a completely out-of-range timestamp into a filter, though that is unchanged behavior.

Overall, the refactor aligns with the current DataFusion literal representations.
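A minimal sketch of that bound-extraction pattern, assuming the two-field `Expr::Literal(ScalarValue, Option<FieldMetadata>)` shape discussed in this thread; the function name and the `NaiveDateTime` return type are illustrative, not the actual Parseable helper:

```rust
use chrono::{DateTime, NaiveDateTime};
use datafusion::logical_expr::Expr;
use datafusion::scalar::ScalarValue;

/// Illustrative only: recognize metadata-free timestamp literals and turn them
/// into a naive UTC bound without the panic-prone `unwrap()` noted above.
fn timestamp_bound(expr: &Expr) -> Option<NaiveDateTime> {
    // Literals carrying field metadata are deliberately ignored, mirroring the
    // `Expr::Literal(_, None)` arms in the reviewed code.
    let Expr::Literal(value, None) = expr else {
        return None;
    };
    match value {
        ScalarValue::TimestampMillisecond(Some(ms), _) => {
            DateTime::from_timestamp_millis(*ms).map(|dt| dt.naive_utc())
        }
        ScalarValue::TimestampNanosecond(Some(ns), _) => {
            Some(DateTime::from_timestamp_nanos(*ns).naive_utc())
        }
        _ => None,
    }
}
```

Returning `None` instead of unwrapping keeps an out-of-range millisecond value from panicking the planner, which addresses the second bullet above.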
35-36: Unable to verify codebase access; manual review needed for FileScanConfigBuilder and LexOrdering API compliance

The review's concerns about DataFusion 50.3.0 API compatibility are valid based on official documentation:

- `FileScanConfigBuilder` constructor: in DataFusion 50.3.0 the signature changed from `new(object_store_url, schema, source)` to `new(object_store_url, source)`; the schema must instead be embedded in the `ParquetSource` at construction.
- `LexOrdering::new()`: returns `LexOrdering` directly (infallible), not a `Result`, so `.unwrap()` should not be present.

However, I cannot access the repository to confirm whether the current code contains these patterns or has already been corrected. Please verify:

- Check lines 38–46 for `FileScanConfigBuilder::new()` calls; confirm the constructor takes only two arguments (object_store_url, source).
- Check lines 146–180 for `LexOrdering::new([sort_expr])` calls; confirm `.unwrap()` is not present (it is infallible).
- Verify that `ParquetSource` construction includes the schema parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/build.yaml(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
Applied to files:
.github/workflows/build.yaml
🪛 YAMLlint (1.37.1)
.github/workflows/build.yaml
[error] 37-37: syntax error: expected &lt;block end&gt;, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)
139-139: Inconsistent conditional syntax: consider normalizing expression wrapping.

Lines 139 and 178 use complex boolean expressions without explicit `${{ }}` wrapping:

- Line 139: `if: matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu'`
- Line 178: `if: runner.os == 'Linux' && !matrix.use_cross`

Other conditionals in the file use `${{ }}` (e.g., lines 68, 229). While GitHub Actions may parse both forms, normalizing to `if: ${{ ... }}` for complex expressions improves clarity and consistency.

```diff
  - name: Create Cross.toml for aarch64 Kafka build
-   if: matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu'
+   if: ${{ matrix.use_cross && matrix.target == 'aarch64-unknown-linux-gnu' }}
```

```diff
  - name: Find and fix librdkafka CMakeLists.txt for Linux
-   if: runner.os == 'Linux' && !matrix.use_cross
+   if: ${{ runner.os == 'Linux' && !matrix.use_cross }}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/build.yaml(8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
Applied to files:
.github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (5)
.github/workflows/build.yaml (5)
44-72: LGTM: Build-default job refactoring is sound.
The split between cross and native build paths is well-structured. GCC 11 installation is appropriately gated to Linux (lines 44–48), and the conditional build steps (cross vs. native) are mutually exclusive. Environment variables for `CC` and `CXX` are correctly set only for native Linux builds using the ternary syntax.

98-127: Good: Platform-specific dependency blocks are well-organized.
The Linux dependencies for native x86_64 (lines 98–114) and the macOS dependencies (lines 117–127) are appropriately conditioned. Separating native x86_64 from the aarch64 cross-compile path (which uses Cross.toml instead) avoids redundant installations.

138-166: LGTM: Cross.toml configuration for the aarch64 cross-compile is correct.
The pre-build steps handle multiarch setup, and the `[build.env]` passthrough for `LIBRDKAFKA_SSL_VENDORED` and `PKG_CONFIG_ALLOW_CROSS` is appropriate for the cross-rs toolchain.

177-219: LGTM: CMakeLists.txt sed modifications handle platform differences correctly.
Line 191 uses `sed -i` for Linux, and line 213 correctly uses `sed -i ''` for macOS (required for BSD sed compatibility). The conditions are appropriate: Linux modifies only for native builds (line 178), while macOS modifies for all macOS jobs (line 200).

221-234: LGTM: Kafka build steps properly split by build strategy.
The cross build (lines 221–226) and native build (lines 228–234) are mutually exclusive. Environment variables (`LIBRDKAFKA_SSL_VENDORED`, `CC`/`CXX`) are correctly configured for each path.
Upgrade df version to 50.3.0 and arrow/parquet to 57.0.0
ad7f099 to
b0ecba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/build.yaml (1)
44-72: Fix environment variable assignment to prevent empty CC/CXX on non-Linux platforms.
GCC 11 is available in the ubuntu-latest default repositories. However, setting `CC` and `CXX` to empty strings on macOS/Windows (via `${{ runner.os == 'Linux' && 'gcc-11' || '' }}`) can cause unexpected behavior: Cargo/cc-rs treats an exported empty variable differently from an unset variable and may fail or use an incorrect compiler.

Instead, conditionally export these variables only on Linux:

```yaml
env:
  CC: gcc-11
  CXX: g++-11
```

wrapped within the step's `if:` condition, or use `env -u CC -u CXX cargo build ...` on non-Linux to explicitly unset them rather than leaving them as empty strings.
🧹 Nitpick comments (3)
src/catalog/column.rs (1)
159-161: Int96 → i64 nanos conversion looks correct; consider extending pruning support.
The `int96_to_i64_nanos` helper matches the standard Impala-style Int96 layout (nanos-of-day from the first two u32 words and the Julian day in the third, offset by 2_440_588) and safely fits typical log timestamps into `i64` nanoseconds; a sketch of that layout follows below.
One follow-up you may want (not a blocker) is to wire `ScalarValue::TimestampNanosecond` through `cast_or_none` so that manifest-based pruning can actually use these new ns-based Int96 stats; right now TimestampNanosecond literals will just skip pruning and fall back to scanning.
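A minimal sketch of that Impala-style layout, offered as a cross-check rather than the actual helper in src/catalog/column.rs (which may differ in naming and overflow handling):

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Illustrative Int96 -> i64 nanoseconds-since-epoch conversion: the first two
/// little-endian u32 words hold nanoseconds within the day, the third holds
/// the Julian day number.
fn int96_to_i64_nanos(words: [u32; 3]) -> i64 {
    let nanos_of_day = ((words[1] as i64) << 32) | (words[0] as i64);
    let julian_day = words[2] as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // 2_440_588 is the Julian day of the Unix epoch, so this prints 0.
    println!("{}", int96_to_i64_nanos([0, 0, 2_440_588]));
}
```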
src/query/stream_schema_provider.rs (2)
722-729: Timestamp literal handling and tests correctly track the new `Expr::Literal`/`ScalarValue` shapes.
The updated pattern match on `Expr::Literal(value, None)` and the handling of `ScalarValue::TimestampMillisecond`/`TimestampNanosecond`/`Utf8` in `extract_timestamp_bound`, together with the revised tests, are consistent with the new DataFusion `Expr` and `ScalarValue` layouts and cover both ms and ns paths plus string partitions.
One enhancement you might want later is to extend `cast_or_none` to handle `ScalarValue::TimestampNanosecond` so stats-based pruning can also leverage ns timestamp literals; currently those fall through and just disable pruning for that filter, which is safe but leaves some performance on the table (see the sketch below).
Please verify against the exact `datafusion` version in this Cargo.toml that `Expr::Literal` is indeed `(ScalarValue, Option<FieldMetadata>)` and that DataFusion's own planner produces `Expr::Literal(_, None)` for the timestamp literals you care about (so these matches actually fire).
Also applies to: 866-896, 942-944, 1124-1264
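One possible shape for that extension, purely illustrative; this is not the existing `cast_or_none` signature, just a standalone helper showing how a nanosecond literal could be normalized for comparison against the ns-based column stats:

```rust
use datafusion::scalar::ScalarValue;

/// Hypothetical helper: pull a nanosecond bound out of a timestamp literal so
/// it can be compared against nanosecond-based Int96 column statistics.
fn literal_as_nanos(value: &ScalarValue) -> Option<i64> {
    match value {
        ScalarValue::TimestampNanosecond(Some(ns), _) => Some(*ns),
        // Millisecond literals are widened, guarding against overflow.
        ScalarValue::TimestampMillisecond(Some(ms), _) => ms.checked_mul(1_000_000),
        _ => None,
    }
}
```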
347-431: Confirm `Statistics` type alignment between `partitioned_files` and `with_statistics`.
`partitioned_files` returns a `datafusion::common::Statistics`, while `create_parquet_physical_plan` takes a `Statistics` imported from `datafusion::physical_plan` and passes it into `FileScanConfigBuilder::with_statistics(statistics)`.
If `datafusion::common::Statistics` and `datafusion::physical_plan::Statistics` are not the same re-exported type in `datafusion = "51.0.0"`, this will fail to compile or require an explicit conversion. If they are the same, it is worth standardizing on one import (either `datafusion::common` or `datafusion::physical_plan`) throughout the file to avoid confusion.
Also applies to: 626-632
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
.github/workflows/build.yaml(8 hunks)Cargo.toml(3 hunks)src/catalog/column.rs(2 hunks)src/catalog/manifest.rs(2 hunks)src/parseable/staging/reader.rs(3 hunks)src/parseable/streams.rs(1 hunks)src/query/mod.rs(3 hunks)src/query/stream_schema_provider.rs(13 hunks)src/storage/localfs.rs(1 hunks)src/storage/metrics_layer.rs(6 hunks)src/storage/object_storage.rs(1 hunks)src/utils/arrow/flight.rs(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- src/utils/arrow/flight.rs
- src/parseable/streams.rs
- src/query/mod.rs
- src/storage/object_storage.rs
- src/storage/metrics_layer.rs
- src/parseable/staging/reader.rs
- src/storage/localfs.rs
- src/catalog/manifest.rs
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.
Applied to files:
src/catalog/column.rs
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
Applied to files:
.github/workflows/build.yaml
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-07-28T17:10:39.448Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1392
File: src/migration/stream_metadata_migration.rs:303-322
Timestamp: 2025-07-28T17:10:39.448Z
Learning: In Parseable's migration system (src/migration/stream_metadata_migration.rs), each migration function updates the metadata to the current latest format using CURRENT_OBJECT_STORE_VERSION and CURRENT_SCHEMA_VERSION constants, rather than producing incremental versions. For example, v5_v6 function produces v7 format output when these constants are set to "v7", not v6 format.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-10-28T02:10:41.140Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1453
File: src/parseable/mod.rs:397-400
Timestamp: 2025-10-28T02:10:41.140Z
Learning: In Parseable enterprise deployments with multiple query nodes, hot tier configuration must be persisted in object storage so that newly started query nodes can fetch and synchronize the hot tier settings at startup (file: src/parseable/mod.rs, function: create_stream_and_schema_from_storage).
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-20T17:01:25.791Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1409
File: src/storage/field_stats.rs:429-456
Timestamp: 2025-08-20T17:01:25.791Z
Learning: In Parseable's field stats calculation (src/storage/field_stats.rs), the extract_datetime_from_parquet_path_regex function correctly works with filename-only parsing because Parseable's server-side filename generation guarantees the dot-separated format date=YYYY-MM-DD.hour=HH.minute=MM pattern in parquet filenames.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-21T12:04:38.398Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1410
File: src/storage/object_storage.rs:0-0
Timestamp: 2025-08-21T12:04:38.398Z
Learning: In Parseable's timestamp extraction (extract_timestamp_for_date in src/storage/object_storage.rs), minute-level precision is sufficient for both first and latest event timestamps. Seconds are always set to 0, meaning both first and latest events within the same minute will have identical timestamps, which is the intended behavior.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Applied to files:
Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (3)
.github/workflows/build.yaml (1)
142-170: Cross.toml environment configuration is incomplete for rdkafka cross-compilation.
The apt packages and architecture suffixes (`:arm64`) are valid for Ubuntu focal, and `/usr/lib/aarch64-linux-gnu` is the correct library path. However, the environment passthrough is insufficient. Beyond `PKG_CONFIG_ALLOW_CROSS`, rdkafka cross-compilation requires:

- `PKG_CONFIG_SYSROOT_DIR` pointing to the target sysroot
- `PKG_CONFIG_LIBDIR` configured to find target `.pc` files
- `OPENSSL_DIR`, `OPENSSL_LIB_DIR`, `OPENSSL_INCLUDE_DIR` for OpenSSL discovery
- `CC` and `CXX` set to the target toolchain (`aarch64-linux-gnu-gcc`, `aarch64-linux-gnu-g++`)
- `LIBRARY_PATH` and `LD_LIBRARY_PATH` for linking against target libraries

Add these variables to the `[build.env]` passthrough section or configure them in the build invocation to ensure librdkafka resolves dependencies correctly during cross-compilation.
⛔ Skipped due to learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
src/query/stream_schema_provider.rs (2)
135-145: Parquet scan refactor via FileScanConfigBuilder/ParquetSource looks aligned with DF 50+/51.
The new flow (build `file_groups`, wrap the optional physical predicate in `ParquetSource`, then drive `FileScanConfigBuilder` with statistics, batch size, constraints, ordering, projection, and limit) is the right way to integrate with the newer DataFusion parquet APIs. The only thing to watch is future DF versions where `with_projection_indices` becomes fallible (`Result<Self>`); for 51.0.0 your current usage is fine.
Please double-check against the exact `datafusion = "51.0.0"` docs that `with_projection_indices` still returns `Self` (not `Result<Self>`) and that `with_statistics` takes the same `Statistics` type you are passing in from `partitioned_files`.
Also applies to: 146-180
328-343: Using `UnionExec::try_new` is the correct upgrade here.
Switching from infallible construction to `UnionExec::try_new(execution_plans)?` correctly propagates planning errors instead of panicking; no further changes needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)
25-31: Update misleading macOS comment to clarify native vs. cross-compiled builds.
Line 25 states "both native on macos-latest (M1)", but only aarch64-apple-darwin is native on M1. The x86_64-apple-darwin target is cross-compiled and runs under Rosetta 2 emulation, despite `use_cross: false` (which means Rust's toolchain handles it without requiring the `cross` tool).

```diff
- # macOS builds - both native on macos-latest (M1)
+ # macOS builds - aarch64 native on M1, x86_64 cross-compiled (Rosetta)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/build.yaml (8 hunks)
- Cross.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
Applied to files:
.github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
Cross.toml (1)
3-24: Cross-compilation configuration looks good.
The pre-build steps appropriately install ARM64-specific dependencies with tolerant failures, and the environment passthrough and read-only volume mounts align with typical cross-compilation requirements for rdkafka and OpenSSL.
.github/workflows/build.yaml (3)
69-71: Verify environment variable conditional syntax handles all cases correctly.
Lines 69–71 and 215–217 use conditional expressions to set `CC` and `CXX`:

```yaml
CC: ${{ runner.os == 'Linux' && 'gcc-11' || '' }}
```

This pattern can set `CC` to a boolean `false` on non-Linux platforms (from the `&&` operator), which may not be handled correctly by cargo. The safer pattern is to use an explicit `if:` condition on the step or ensure both branches produce strings, e.g. `CC: ${{ runner.os == 'Linux' && 'gcc-11' || 'default' }}`, or

```yaml
steps:
  - name: Build native
    if: ${{ !matrix.use_cross && runner.os == 'Linux' }}
    env:
      CC: gcc-11
      CXX: g++-11
    run: ...
```

Please verify the current syntax works as expected on macOS and Windows runners, or adjust to the safer pattern.
Also applies to: 215-217
155-197: Verify necessity and stability of librdkafka CMakeLists.txt workaround.
Lines 155–197 modify `librdkafka/CMakeLists.txt` at build time to bump the CMake version requirement from 3.2 to 3.5. This is a fragile workaround because:
- It modifies vendored dependencies dynamically, which is ephemeral (lost when cargo cache is cleared).
- It assumes `librdkafka` exists in the rdkafka-sys registry source, which could change.
- If the find or sed fails, the error may not be caught early.
Questions:
- Is this modification necessary due to the Arrow/Parquet 57.0.0 upgrade affecting rdkafka-sys?
- Can this be handled upstream (e.g., via a patch in
Cargo.tomlorrdkafka-sysbuild script) instead?- Is there a feature flag or configuration that avoids the CMake version bump?
Consider whether this workaround can be replaced with a more durable solution.
50-52: Clarify whycrossis installed from git instead of crates.io release.Lines 52 and 144 install
crossfrom the git repository:run: cargo install cross --git https://github.com/cross-rs/crossInstalling from git (HEAD) is less stable than pinning a specific version from crates.io. Please clarify:
- Does DataFusion 50.3.0 / Arrow 57.0.0 require an unreleased version of
cross?- If git is necessary, consider pinning to a specific commit hash for reproducibility:
run: cargo install cross --git https://github.com/cross-rs/cross --rev <commit-hash>Alternatively, if a crates.io release works, prefer the stable version:
run: cargo install cross --version <version>Also applies to: 142-144
…tion - Add Docker Buildx and QEMU setup for ARM64 emulation - Switch to nightly Rust toolchain (required for Rust 1.88.0) - Configure Cross.toml with ARM64 package installation and library paths - Set CROSS_NO_WARNINGS=0 to ignore post-install script warnings - Add CMake environment variables for Kafka builds to find ARM64 libs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
.github/workflows/build.yaml (2)
40-40: Add documentation explaining why nightly Rust is required.Nightly Rust is pinned at lines 40 and 140, but the workflow lacks explanation. Based on the DataFusion 50.3.0 upgrade, nightly is required because the parquet reader defaults to
schema_force_view_types=true, which uses view types (Utf8View, BinaryView) that are behind nightly-only feature flags. Add a comment above eachuses: dtolnay/rust-toolchain@nightlystep to document this requirement for future maintainers.+ # Nightly Rust is required for DataFusion 50.3.0: the parquet reader + # defaults to schema_force_view_types=true, and view types (Utf8View, + # BinaryView) are behind nightly-only feature flags. - name: Setup Rust toolchain uses: dtolnay/rust-toolchain@nightlyAlso applies to: 140-140
25-25: Clarify macOS comment: only aarch64 is native, x86_64 is cross-compiled.Line 25 states "both native on macos-latest (M1)", but x86_64-apple-darwin on M1 is cross-compiled (runs under Rosetta), not a native build. Only aarch64-apple-darwin is native. Update the comment for accuracy:
- # macOS builds - both native on macos-latest (M1) + # macOS builds - aarch64 native on M1, x86_64 cross-compiled (runs under Rosetta)
🧹 Nitpick comments (1)
.github/workflows/build.yaml (1)
50-52: Verify cross tool installation source.Lines 50–52 and 144–146 install the
crosstool from the Git repository:--git https://github.com/cross-rs/cross. Consider pinning a specific revision or tag to ensure reproducible builds and avoid unexpected breakage if the repository changes. For example:run: cargo install cross --git https://github.com/cross-rs/cross --tag v0.2.5Also applies to: 144-146
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/workflows/build.yaml(8 hunks)Cross.toml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- Cross.toml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
📚 Learning: 2025-04-26T03:58:02.341Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
Applied to files:
.github/workflows/build.yaml
📚 Learning: 2025-06-18T11:15:10.836Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Applied to files:
.github/workflows/build.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: coverage
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (3)
.github/workflows/build.yaml (3)
157-199: librdkafka CMakeLists.txt minimum version bump looks reasonable.The sed commands bump the minimum CMake version from 3.2 to 3.5, which is standard for modern build environments and necessary for the newer librdkafka versions compatible with DataFusion 50.3.0. The platform-specific sed syntax (with
''for macOS, no flag for Linux) is correct.
99-105: Docker Buildx and QEMU setup for cross-compilation looks correct.The conditional setup of Docker Buildx and QEMU (lines 99–105) is only applied when
matrix.use_cross == true, which is appropriate for aarch64 cross-compilation. The tool versions use latest stable tags, which is reasonable for CI.
201-215: Verify and replace hardcoded cross-compile paths with dynamic discovery.Lines 208–214 hardcode specific aarch64-linux-gnu toolchain paths. These paths are fragile and may not exist across different Ubuntu versions or runner configurations. Consider:
- Use
pkg-config(viaPKG_CONFIG_PATH) to dynamically discover library paths instead of hardcoding absolute paths- Document which packages must be pre-installed (e.g.,
crossbuild-essential-arm64,libssl-dev:arm64,zlib1g-dev:arm64)- Verify paths exist before build or use conditional logic to handle missing dependencies
The
crosstool's Docker-based approach typically handles this automatically; if paths need explicit configuration, document why and make them resilient to environment changes.
Upgrade df version to 50.3.0 and arrow/parquet to 57.0.0
Fixes #XXXX.
Description
This PR has:
Summary by CodeRabbit
Chores
Refactor
Bug Fixes / Behavior
Breaking Changes
✏️ Tip: You can customize this high-level summary in your review settings.