Skip to content

Conversation

@alamb
Copy link

@alamb alamb commented Jan 7, 2026

These are some code suggestions I had while reviewing the following PR in DataFusion

Merging this PR will update the PR in datafusion

I think it ok to merge apache#19616 without this PR and then merge it as a follow on PR in DataFusion

feniljain and others added 7 commits January 6, 2026 20:55
…e#19622)

## Which issue does this PR close?

- Closes apache#19591

## Rationale for this change

Explained in issue itself

## What changes are included in this PR?

- Removed coalesce batches rule
- Deprecate `CoalesceBatchesExec`

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes, added a deprecation tag on `CoalesceBatchesExec`
… indexing (apache#19590)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

This PR improves the performance of the `substring_index` function by
optimizing
delimiter search and substring extraction: 

- Single-byte fast path: introduces a specialized byte-based search for
single-byte delimiters (e.g. `.`, `,`), avoiding UTF-8 pattern matching
overhead.
- Efficient index discovery: replaces the split-and-sum-length approach
with direct index location using `match_indices` / `rmatch_indices`.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Added a fast path for `delimiter.len() == 1` using byte-based search.
- Refactored the general path to use `match_indices` and
`rmatch_indices` for more efficient positioning.

### Benchmarks
- Single-byte delimiter benchmarks show ~2–3× speedup across batch
sizes.
- Multi-byte delimiters see a consistent ~10–15% improvement.
```
group                                               main_substrindex                       perf_substrindex
-----                                               ----------------                       ----------------
substr_index/substr_index_10000_long_delimiter      1.12   548.2±18.39µs        ? ?/sec    1.00   488.4±15.62µs        ? ?/sec
substr_index/substr_index_10000_single_delimiter    2.14   543.4±15.12µs        ? ?/sec    1.00    254.0±7.80µs        ? ?/sec
substr_index/substr_index_1000_long_delimiter       1.12     43.1±1.63µs        ? ?/sec    1.00     38.6±2.03µs        ? ?/sec
substr_index/substr_index_1000_single_delimiter     3.51     46.4±2.21µs        ? ?/sec    1.00     13.2±0.99µs        ? ?/sec
substr_index/substr_index_100_long_delimiter        1.01      3.7±0.18µs        ? ?/sec    1.00      3.7±0.20µs        ? ?/sec
substr_index/substr_index_100_single_delimiter      2.15      3.6±0.14µs        ? ?/sec    1.00  1675.9±79.15ns        ? ?/sec
```

## Are these changes tested?

- Yes, Existing unit tests pass.
- New benchmarks added to verify performance improvement.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
## Which issue does this PR close?

- Closes apache#19623

## Rationale for this change

Explained in issue

## What changes are included in this PR?

Implement baseline metrics for `AsyncFuncExec` operator

## Are these changes tested?

Have added a test with PR

## Are there any user-facing changes?

There are user facing changes, now `explain analyze` will show metrics
of `AsyncFuncExec` too, do we need to update any documentation regarding
this? Asking cause I couldn't find anything on a quick search.

---------

Co-authored-by: Yongting You <[email protected]>
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Part of apache#14763.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

Replace `TypeSignature::Exact` patterns with cleaner APIs:
- `isnan/iszero`: `Signature::coercible` with
`TypeSignatureClass::Float`
- `nanvl`: `Signature::uniform(2, [Float32, Float64])`

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Jeffrey Vo <[email protected]>
…ructors (apache#19668)

### Which issue does this PR close?

Closes apache#19664

---

### Rationale for this change

After apache#18880, `BatchPartitioner::try_new` gained additional parameters
that are
only relevant for round-robin repartitioning. This made the constructor
API
confusing, as hash repartitioning received parameters it does not use.

Splitting the constructor improves clarity and avoids passing
round-robin–specific parameters to hash partitioning.

---

### What changes are included in this PR?

- Introduce `BatchPartitioner::try_new_hash`
- Introduce `BatchPartitioner::try_new_round_robin`
- Refactor callers to use the specialized constructors
- Retain `BatchPartitioner::try_new` as a delegator for backward
compatibility

This is a pure refactor; behavior is unchanged.

---

### Are these changes tested?

Yes. Existing tests cover this code path.
All builds pass locally.

---

### Are there any user-facing changes?

No. This change is internal only and does not affect user-facing
behavior or APIs.

---------

Co-authored-by: Your Name <[email protected]>
…ng decoding (apache#19545)

## Which issue does this PR close?

* Closes apache#18560.

## Rationale for this change

DataFusion’s Parquet row-level filter pushdown previously rejected all
nested Arrow types (lists/structs), which prevented common and
performance-sensitive filters on list columns (for example `array_has`,
`array_has_all`, `array_has_any`) from being evaluated during Parquet
decoding.

Enabling safe pushdown for a small, well-defined set of list-aware
predicates allows Parquet decoding to apply these filters earlier,
reducing materialization work and improving scan performance, while
still keeping unsupported nested projections (notably structs) evaluated
after batches are materialized.

## What changes are included in this PR?

* Allow a registry of list-aware predicates to be considered
pushdown-compatible:

  * `array_has`, `array_has_all`, `array_has_any`
  * `IS NULL` / `IS NOT NULL`
* Introduce `supported_predicates` module to detect whether an
expression tree contains supported list predicates.
* Update Parquet filter candidate selection to:

  * Accept list columns only when the predicate semantics are supported.
* Continue rejecting struct columns (and other unsupported nested
types).
* Switch Parquet projection mask construction from root indices to
**leaf indices** (`ProjectionMask::leaves`) so nested list filters
project the correct leaf columns for decoding-time evaluation.
* Expand root column indices to leaf indices for nested columns using
the Parquet `SchemaDescriptor`.
* Add unit tests verifying:

* List columns are accepted for pushdown when used by supported
predicates.
* Struct columns (and mixed struct+primitive predicates) prevent
pushdown.
* `array_has`, `array_has_all`, `array_has_any` actually filter rows
during decoding using a temp Parquet file.
* Add sqllogictest coverage proving both correctness and plan behavior:

  * Queries return expected results.
  * `EXPLAIN` shows predicates pushed into `DataSourceExec` for Parquet.

## Are these changes tested?

Yes.

* Rust unit tests in `datafusion/datasource-parquet/src/row_filter.rs`:

  * Validate pushdown eligibility for list vs struct predicates.
* Create a temp Parquet file and confirm list predicates prune/match the
expected rows via Parquet decoding row filtering.
* SQL logic tests in
`datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt`:

* Add end-to-end coverage for `array_has`, `array_has_all`,
`array_has_any` and combinations (OR / AND with other predicates).
* Confirm pushdown appears in the physical plan (`DataSourceExec ...
predicate=...`).

## Are there any user-facing changes?

Yes.

* Parquet filter pushdown now supports list columns for the following
predicates:

  * `array_has`, `array_has_all`, `array_has_any`
  * `IS NULL`, `IS NOT NULL`

This can improve query performance for workloads that filter on
array/list columns.

No breaking changes are introduced; unsupported nested types (for
example structs) continue to be evaluated after decoding.

## LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Qi Zhu <[email protected]>
…001` (apache#19666)

## Which issue does this PR close?

- Part of apache#19656


## Rationale for this change
rust_decimal is a one person crate and is released somewhat infrequently
* https://github.com/paupino/rust-decimal
* https://crates.io/crates/rust_decimal

It also uses a non trivial number of dependencies, including
https://crates.io/crates/rkyv, some sort of zero copy deserialization
framework that was recently subject to a RUSTSEC security advisory, see
apache#19656 /
apache#19657

Since `rust_decimal` is only used for sqllogictests to parse the results
from postgres, we can pretty easily remove the dependency on
`rust_decimal` and inline the very small amount functionality we need
for sqllogictests

This will both decrease the build time and our dependency trail.


## What changes are included in this PR?

Removes the `rust_decimal` dependency from DataFusion
and inlines the smallest required subset of decimal functionality we
need for
sqllogictests (which turns out to be pretty small)

## Are these changes tested?

Yes by CI

## Are there any user-facing changes?
No, this is all internal testing infrastructure
cj-zhukov and others added 7 commits January 8, 2026 00:23
) (apache#19319)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #apache#19141.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
… SimplifyInfo (apache#19505)

In trying to fix apache#19418 I kept getting turned around about what was
needed where. The `SimplifyInfo` trait made it extra hard to understand.
I ended up realizing that the main reason for the trait to exist was
tests. Removing the trait and adding a builder style API to
`SimplifyContext` made it IMO more ergonomic for tests and other call
sites, easier to track the code (no trait opaqueness) and clearer what
simplification capabilities are available in each site. This got rid of
e.g. some places where we were calling `ExecutionProps::new()` just to
pass that into `SimplifyContext` which in turn would hand out references
to the default query time, a default `ContigOptions`, etc; or in
`datafusion/core/src/execution/session_state.rs` where we did `let
dummy_schema = DFSchema::empty()`.

This let me solve several problems:
- Can store optimized logical plans for prepared statements
- Users can optionally run an optimizer pass on logical plans without
evaluating time functions

Compared to apache#19426 this avoids adding a config option and is actually
less lines of code (negative diff).

Fixes apache#19418, closes apache#19426 (replaces it).

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Part of apache#19025.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Added Time64/Time32 signatures to date_trunc
- Added time truncation logic (hour, minute, second, millisecond,
microsecond)
- Error for invalid granularities (day, week, month, quarter, year)

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Jeffrey Vo <[email protected]>
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
`ParquetOpener::open()` is a critical function for parquet planning,
it's the entry point for many major steps like row-group/file pruning.

It has almost 400 lines of code now, this PR adds some markers to the
code blocks/important steps, to make this function easier to navigate.
(though I may have overlooked some critical steps)

Ideally, we should break these blocks into utilities. I tried extracting
some of them with AI, but the resulting utilities still have unclear
semantics, with many input arguments and output items. Overall, the
complexity doesn’t seem reduced after the change. I think it’s possible
to factor them into helper functions with clear semantics, but that
likely requires someone who understands the implementation details very
well.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
## Which issue does this PR close?
Closes apache#19347 

## Rationale for this change
Native decimal log() (added in apache#17023) only supports integer bases.
Non-integer bases like log(2.5, x) error out, which is a regression from
the previous float-based implementation.

## What changes are included in this PR?

Changes :
- Fallback to f64 computation when base is non-integer
- Integer bases (2, 10, etc.) still use efficient ilog() algorithm

Refactoring:
- Unified log_decimal32, log_decimal64, log_decimal128 into single
generic log_decimal<T> using num_traits::ToPrimitive.
- Used ToPrimitive::to_f64() and to_u128()
- Invalid bases (≤1, NaN, Inf) now return NaN instead of erroring -
matches f64::log behavior
Large Decimal256 values that don't fit in i128 now work via f64
fallback.

## Are these changes tested?
Yes:

- All existing log_decimal* unit tests pass
- Updated test_log_decimal128_invalid_base - expects NaN instead of
error
- Updated test_log_decimal256_large - now succeeds via fallback

## Are there any user-facing changes?
Yes:
```sql
-- Previously errored, now works
SELECT log(2.5, 100::decimal(38,0));

-- Invalid base now returns NaN instead of error (consistent with float behavior)
SELECT log(-2, 64::decimal(38,0));  -- Returns NaN

---------

Co-authored-by: Jeffrey Vo <[email protected]>
…che#19369)

## Which issue does this PR close?
Closes apache#19348

## Rationale for this change
Previously, pow() on decimal types would error for negative exponents
and non-integer exponents with messages like:

- Arrow error: Arithmetic overflow: Unsupported exp value: -5
- Compute error: Cannot use non-integer exp
- This was a regression from when decimals were cast to float before
pow(). The efficient integer-based algorithm for computing power on
scaled integers cannot handle these cases.

## What changes are included in this PR?
- Modified pow_decimal_int to fallback to pow_decimal_float for negative
exponents
- Modified pow_decimal_float to use an efficient integer path for
non-negative integer exponents, otherwise fallback to f64 computation

Added pow_decimal_float_fallback function that:
- Converts the decimal to f64
- Computes powf(exp)
- Converts back to the original decimal type with proper scaling
- Added decimal_from_i128 helper to convert i128 results back to generic
decimal types (needed for Decimal256 support)
- Updated sqllogictests to expect success for negative/non-integer
exponents

## Are these changes tested?
Yes:

Unit tests for pow_decimal_float_fallback covering negative exponents,
fractional exponents, cube roots
Updated SQL logic tests in decimal.slt

## Are there any user-facing changes?
Yes. The following queries now work instead of returning errors:

```sql
-- Negative exponent
SELECT power(4::decimal(38, 5), -1);  -- Returns 0.25

-- Non-integer exponent  
SELECT power(2.5, 4.2);  -- Returns 46.9

-- Square root via power
SELECT power(4::decimal, 0.5);  -- Returns 2
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->
- Builds on apache#19388
- Closes apache#19573

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

This PR explores one way to make `ListFilesCache` table scoped. A
session level cache is still used, but the cache key is made a
"table-scoped" path, for which a new struct
```
pub struct TableScopedPath(pub Option<TableReference>, pub Path);
```
is defined. `TableReference` comes from `CreateExternalTable` passed to
`ListingTableFactory::create`.

Additionally, when a table is dropped, all entries related to a table is
dropped by modifying `SessionContext::find_and_deregister` method.

Testing (change on adding `list_files_cache()` for cli is included for
easier testing).
- Testing cache reuse on a single table.
```
> \object_store_profiling summary
ObjectStore Profile mode set to Summary
> create external table test
stored as parquet
location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/';
0 row(s) fetched. 
Elapsed 14.878 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Summary, inner: AmazonS3(overturemaps-us-west-2)
Summaries:
+-----------+----------+-----------+-----------+-------------+-------------+-------+
| Operation | Metric   | min       | max       | avg         | sum         | count |
+-----------+----------+-----------+-----------+-------------+-------------+-------+
| Get       | duration | 0.030597s | 0.209235s | 0.082396s   | 36.254189s  | 440   |
| Get       | size     | 204782 B  | 857230 B  | 497304.88 B | 218814144 B | 440   |
| List      | duration | 0.192037s | 0.192037s | 0.192037s   | 0.192037s   | 1     |
| List      | size     |           |           |             |             | 1     |
+-----------+----------+-----------+-----------+-------------+-------------+-------+
> select table, path, unnest(metadata_list) from list_files_cache() limit 1;
+-------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| table | path                            | UNNEST(list_files_cache().metadata_list)                                                                                                                                                                                                                  |
+-------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| test  | release/2025-12-17.0/theme=base | {file_path: release/2025-12-17.0/theme=base/type=bathymetry/part-00000-dd0f2f50-b436-4710-996f-f1b06181a3a1-c000.zstd.parquet, file_modified: 2025-12-17T22:32:50, file_size_bytes: 40280159, e_tag: "15090401f8f936c3f83bb498cb99a41d-3", version: NULL} |
+-------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.058 seconds.

Object Store Profiling
> select count(*) from test where type = 'infrastructure';
+-----------+
| count(*)  |
+-----------+
| 142969564 |
+-----------+
1 row(s) fetched. 
Elapsed 0.028 seconds.

Object Store Profiling
```
- Test separate cache entries are created for two tables with same path
```
> create external table test2
stored as parquet
location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/';
0 row(s) fetched. 
Elapsed 14.798 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Summary, inner: AmazonS3(overturemaps-us-west-2)
Summaries:
+-----------+----------+-----------+-----------+-------------+-------------+-------+
| Operation | Metric   | min       | max       | avg         | sum         | count |
+-----------+----------+-----------+-----------+-------------+-------------+-------+
| Get       | duration | 0.030238s | 0.350465s | 0.073670s   | 32.414654s  | 440   |
| Get       | size     | 204782 B  | 857230 B  | 497304.88 B | 218814144 B | 440   |
| List      | duration | 0.133334s | 0.133334s | 0.133334s   | 0.133334s   | 1     |
| List      | size     |           |           |             |             | 1     |
+-----------+----------+-----------+-----------+-------------+-------------+-------+
> select count(*) from test2 where type = 'bathymetry';
+----------+
| count(*) |
+----------+
| 59963    |
+----------+
1 row(s) fetched. 
Elapsed 0.009 seconds.

Object Store Profiling
> select table, path from list_files_cache();
+-------+---------------------------------+
| table | path                            |
+-------+---------------------------------+
| test  | release/2025-12-17.0/theme=base |
| test2 | release/2025-12-17.0/theme=base |
+-------+---------------------------------+
2 row(s) fetched. 
Elapsed 0.004 seconds.
```
- Test cache associated with a table is dropped when table is dropped,
and the other table with same path is unaffected.
```
> drop table test;
0 row(s) fetched. 
Elapsed 0.015 seconds.

Object Store Profiling
> select table, path from list_files_cache();
+-------+---------------------------------+
| table | path                            |
+-------+---------------------------------+
| test2 | release/2025-12-17.0/theme=base |
+-------+---------------------------------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

Object Store Profiling
> select count(*) from list_files_cache() where table = 'test';
+----------+
| count(*) |
+----------+
| 0        |
+----------+
1 row(s) fetched. 
Elapsed 0.014 seconds.
> select count(*) from test2 where type = 'infrastructure';
+-----------+
| count(*)  |
+-----------+
| 142969564 |
+-----------+
1 row(s) fetched. 
Elapsed 0.013 seconds.

Object Store Profiling
```
- Test that dropping a view does not remove cache
```
> create view test2_view as (select * from test2 where type = 'infrastructure');
0 row(s) fetched. 
Elapsed 0.103 seconds.

Object Store Profiling
> select count(*) from test2_view;
+-----------+
| count(*)  |
+-----------+
| 142969564 |
+-----------+
1 row(s) fetched. 
Elapsed 0.094 seconds.

Object Store Profiling
> drop view test2_view;
0 row(s) fetched. 
Elapsed 0.002 seconds.

Object Store Profiling
> select table, path from list_files_cache();
+-------+---------------------------------+
| table | path                            |
+-------+---------------------------------+
| test2 | release/2025-12-17.0/theme=base |
+-------+---------------------------------+
1 row(s) fetched. 
Elapsed 0.007 seconds.
```
## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

for (path, entry) in list_files_cache.list_entries() {
table_arr.push(path.table.map_or("NULL".to_string(), |t| t.to_string()));
table_arr.push(path.table.map(|t| t.to_string()));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be nicer to us actual nulls rather than a "NULL" string here

table_path = table_path
.with_glob(glob.as_ref())?
.with_table_ref(cmd.name.clone());
table_path = table_path.with_glob(glob.as_ref())?;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated with_glob to preserve the other fields too

Self::try_new(self.url, Some(glob))
pub fn with_glob(mut self, glob: &str) -> Result<Self> {
self.glob =
Some(Pattern::new(glob).map_err(|e| DataFusionError::External(Box::new(e)))?);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self::try_new doesn't do anything to the url, so there is no need to reparse the url

@alamb
Copy link
Author

alamb commented Jan 8, 2026

Superceded by apache#19703

@alamb alamb closed this Jan 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.