Merged
13 changes: 13 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions Cargo.toml
@@ -21,6 +21,7 @@ members = [
"rust/compression/fsst",
"rust/compression/bitpacking",
"rust/arrow-scalar",
"rust/arrow-stats",
]
exclude = ["python", "java/lance-jni"]
# Python package needs to be built by maturin.
@@ -74,6 +75,7 @@ approx = "0.5.1"
# Note that this one does not include pyarrow
arrow = { version = "57.0.0", optional = false, features = ["prettyprint"] }
lance-arrow-scalar = { version = "=57.0.0", path = "./rust/arrow-scalar" }
lance-arrow-stats = { version = "=57.0.0", path = "./rust/arrow-stats" }
arrow-arith = "57.0.0"
arrow-array = "57.0.0"
arrow-buffer = "57.0.0"
26 changes: 26 additions & 0 deletions rust/arrow-stats/Cargo.toml
@@ -0,0 +1,26 @@
[package]
name = "lance-arrow-stats"
version = "57.0.0"
edition.workspace = true
authors.workspace = true
license.workspace = true
repository.workspace = true
description = "Statistics accumulator for Arrow arrays (min, max, null_count, nan_count)"
keywords.workspace = true
categories.workspace = true
rust-version.workspace = true
readme = "README.md"

[dependencies]
arrow-array = { workspace = true }
arrow-schema = { workspace = true }
lance-arrow-scalar = { workspace = true }
half = { workspace = true }

[dev-dependencies]
arrow-select = { workspace = true }
proptest = { workspace = true }
rstest = { workspace = true }

[lints]
workspace = true
62 changes: 62 additions & 0 deletions rust/arrow-stats/README.md
@@ -0,0 +1,62 @@
# lance-arrow-stats

Statistics accumulator for [Apache Arrow](https://arrow.apache.org/) arrays.

Computes min, max, null count, NaN count, and buffer memory usage over one or
more batches of Arrow data. Designed for use in Lance's columnar storage layer
where page-level statistics drive predicate pushdown and query planning.

## Usage

```rust
use arrow_array::{Int32Array, ArrayRef};
use lance_arrow_stats::StatisticsAccumulator;
use arrow_schema::DataType;
use std::sync::Arc;

let mut acc = StatisticsAccumulator::new(&DataType::Int32);

let batch: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), None, Some(1), Some(4)]));
acc.update(&batch).unwrap();

let stats = acc.finish();
assert_eq!(stats.null_count, 1);
```

## Tracked Statistics

| Statistic | Description |
| --------------- | -------------------------------------------------------- |
| `min` | Minimum non-null, non-NaN value (`ArrowScalar`) |
| `max` | Maximum non-null, non-NaN value (`ArrowScalar`) |
| `null_count` | Total number of null values |
| `nan_count` | Total NaN values (float and float-list types only) |
| `item_nulls` | Null items inside list entries (list types only) |
| `buffer_memory` | Total Arrow buffer memory in bytes |
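
As a rough illustration of how the counters in the table relate to each other, here is a minimal std-only sketch for a single `f32` column. `FloatStats` and `FloatAccumulator` are hypothetical names invented for this example and are not types from this crate; the real accumulator works on Arrow arrays and returns `ArrowScalar` values for `min` and `max`.

```rust
// Std-only sketch. `FloatStats` and `FloatAccumulator` are hypothetical
// illustration types, not part of lance-arrow-stats.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FloatStats {
    pub min: Option<f32>,
    pub max: Option<f32>,
    pub null_count: u64,
    pub nan_count: u64,
}

#[derive(Debug, Default)]
pub struct FloatAccumulator {
    stats: FloatStats,
}

impl FloatAccumulator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one batch into the running statistics.
    /// `None` models a null slot; NaN is counted but excluded from min/max.
    pub fn update(&mut self, batch: &[Option<f32>]) {
        for value in batch {
            match value {
                None => self.stats.null_count += 1,
                Some(v) if v.is_nan() => self.stats.nan_count += 1,
                Some(v) => {
                    self.stats.min = Some(self.stats.min.map_or(*v, |m| m.min(*v)));
                    self.stats.max = Some(self.stats.max.map_or(*v, |m| m.max(*v)));
                }
            }
        }
    }

    /// Consume the accumulator and return the collected statistics.
    pub fn finish(self) -> FloatStats {
        self.stats
    }
}
```

Under these assumptions, a column containing only nulls and NaNs ends with `min` and `max` both `None`, while `null_count` and `nan_count` still reflect what was seen.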

## Supported Types

- **Numeric** — Int8–Int64, UInt8–UInt64, Float16/32/64
- **Temporal** — Date32/64, Time32/64, Timestamp, Duration
- **Boolean**
- **String** — Utf8, LargeUtf8
- **Binary** — Binary, LargeBinary
- **List** — List, LargeList, FixedSizeList (computes stats over items)

Dictionary, run-end encoded, and view types are accepted, but `min` and `max`
will be `None` for them.

## Merging

Accumulators of the same data type can be merged, which is useful for combining
statistics computed in parallel across different pages or files:

```rust
use lance_arrow_stats::StatisticsAccumulator;
use arrow_schema::DataType;

let mut a = StatisticsAccumulator::new(&DataType::Float32);
let mut b = StatisticsAccumulator::new(&DataType::Float32);
// ... update each with different batches ...
a.merge(&b).unwrap();
```
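
The reason merging works is that each tracked statistic combines associatively: the merged min is the min of the two mins, and counts add. The sketch below shows that idea with std-only Rust on an `i32` column. `PartialStats`, `from_batch`, and this `merge` signature are hypothetical and only illustrate the principle; they are not the crate's API.

```rust
// Std-only sketch. `PartialStats` is a hypothetical illustration type,
// not part of lance-arrow-stats.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PartialStats {
    pub min: Option<i32>,
    pub max: Option<i32>,
    pub null_count: u64,
}

impl PartialStats {
    /// Compute statistics for one batch; `None` models a null slot.
    pub fn from_batch(batch: &[Option<i32>]) -> Self {
        let mut stats = Self::default();
        for value in batch {
            match value {
                None => stats.null_count += 1,
                Some(x) => {
                    stats.min = Some(stats.min.map_or(*x, |m| m.min(*x)));
                    stats.max = Some(stats.max.map_or(*x, |m| m.max(*x)));
                }
            }
        }
        stats
    }

    /// Fold another partial result into this one.
    /// A side that saw no values (`None`) leaves the other side unchanged.
    pub fn merge(&mut self, other: &Self) {
        self.min = match (self.min, other.min) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        };
        self.max = match (self.max, other.max) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
        self.null_count += other.null_count;
    }
}
```

With this shape, merging the statistics of two pages gives the same result as computing statistics over the concatenation of both pages, which is what makes parallel computation safe.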
8 changes: 8 additions & 0 deletions rust/arrow-stats/proptest-regressions/lib.txt
@@ -0,0 +1,8 @@
# Seeds for failure cases proptest has generated in the past. It is
# automatically read and these particular cases re-run before any
# novel cases are generated.
#
# It is recommended to check this file in to source control so that
# everyone who runs the test benefits from these saved cases.
cc 81b0445f36fa8f491c1fb3162f51b61c8be140d5b2a1e792c42b4bdb7f1b6a62 # shrinks to values = [0.0, -0.0]
cc 8651fce939497f33c6dafd842937d95965af97833bfbbd10df30d5ea00dbd07d # shrinks to values = [Some(0.0), Some(-0.0)]
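
Both saved seeds shrink to inputs that mix `0.0` and `-0.0`. The std-only sketch below shows why that pair is an awkward case for a float minimum: the two values compare equal under IEEE 754, so a comparison-based minimum returns whichever came first, while `f64::total_cmp` orders `-0.0` before `0.0`. The helper names `min_partial` and `min_total` are hypothetical, and the source does not say which ordering the crate itself uses.

```rust
// Std-only sketch. `min_partial` and `min_total` are hypothetical helpers,
// not functions from lance-arrow-stats.

/// Minimum using the ordinary `<=` comparison, skipping NaN.
/// Because 0.0 == -0.0, the sign of the result depends on input order.
pub fn min_partial(values: &[f64]) -> Option<f64> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .fold(None, |acc, v| match acc {
            Some(m) if m <= v => Some(m),
            _ => Some(v),
        })
}

/// Minimum using IEEE 754 total ordering, skipping NaN.
/// `total_cmp` places -0.0 strictly before 0.0, so the result is stable.
pub fn min_total(values: &[f64]) -> Option<f64> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .min_by(|a, b| a.total_cmp(b))
}
```

A property test that compares an accumulator against a reference implementation with `assert_eq!` on the numeric value will pass for either sign of zero; one that compares bit patterns will only pass if both sides pick the same ordering rule.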