feat: support array indices in VariantPath dot notation #9012
alamb merged 2 commits into apache:main from
Conversation
```diff
-        } else {
-            VariantPath::new(path.split('.').map(Into::into).collect())
-        }
+        VariantPath::new(path.split(".").flat_map(parse_path).collect())
```
Does the `flat_map` mean this will parse paths like `foo..bar` as `[foo] [bar]`?
I am not sure exactly what should happen when parsing a path like that 🤔
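A quick standalone sketch of why the question comes up: `split('.')` yields an empty segment for the doubled dot, and if the per-segment parser maps an empty segment to zero elements, `flat_map` silently drops it:

```rust
fn main() {
    // split keeps the empty segment between the two dots
    let segments: Vec<&str> = "foo..bar".split('.').collect();
    assert_eq!(segments, ["foo", "", "bar"]);

    // If parsing "" yields no elements, flat_map turns "foo..bar" into
    // [foo][bar], hiding the malformed path. filter stands in here for
    // what a per-segment flat_map would effectively do:
    let non_empty: Vec<&str> = "foo..bar".split('.').filter(|s| !s.is_empty()).collect();
    assert_eq!(non_empty, ["foo", "bar"]);
}
```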
Now that you mention it, it seems like the existing code already had that bug?
In theory, that should be a malformed path that causes an error. Ditto for mismatched `[` or `]`, or e.g. `[a.b]` (which this PR would incorrectly attempt to treat as `[a` and `b]`).
It seems like we'll need a more sophisticated (and integrated) parsing approach here?
A bit of quality time with an LLM produced the following (along with a bunch of unit tests to validate it works):
````rust
/// Create from &str with support for dot notation and array indices.
///
/// # Example
/// ```
/// # use parquet_variant::VariantPath;
/// let path: VariantPath = "foo.bar[0]".try_into().unwrap();
/// ```
impl<'a> TryFrom<&'a str> for VariantPath<'a> {
    type Error = ArrowError;

    fn try_from(path: &'a str) -> Result<Self, Self::Error> {
        parse_path(path).map(VariantPath::new)
    }
}

/// Parse a path string into a vector of [`VariantPathElement`].
///
/// Supports the following syntax:
/// - `""` - empty path
/// - `"foo"` - single field
/// - `"foo.bar"` - nested fields
/// - `"[1]"` - single array index
/// - `"[1][2]"` - multiple array indices
/// - `"foo[1]"` - field with array index
/// - `"foo[1][2]"` - field with multiple array indices
/// - `"foo[1].bar"` - mixed field and array access
/// - etc.
///
/// Field names can contain any characters except `.`, `[`, and `]`.
fn parse_path(s: &str) -> Result<Vec<VariantPathElement<'_>>, ArrowError> {
    let scan_field = |start: usize| {
        s[start..]
            .find(['.', '[', ']'])
            .map_or_else(|| s.len(), |p| start + p)
    };
    let bytes = s.as_bytes();
    if let Some(b'.') = bytes.first() {
        return Err(ArrowError::ParseError("unexpected leading '.'".into()));
    }
    let mut elements = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let (elem, end) = match bytes[i] {
            b'.' => {
                i += 1; // skip the dot; a field must follow
                let end = scan_field(i);
                if end == i {
                    return Err(ArrowError::ParseError(match bytes.get(i) {
                        None => "unexpected trailing '.'".into(),
                        Some(&c) => format!("unexpected '{}' at byte {i}", c as char),
                    }));
                }
                (VariantPathElement::field(&s[i..end]), end)
            }
            b'[' => {
                let (idx, end) = parse_index(s, i)?;
                (VariantPathElement::index(idx), end)
            }
            b']' => {
                return Err(ArrowError::ParseError(format!(
                    "unexpected ']' at byte {i}"
                )));
            }
            _ => {
                let end = scan_field(i);
                (VariantPathElement::field(&s[i..end]), end)
            }
        };
        elements.push(elem);
        i = end;
    }
    Ok(elements)
}

/// Parse `[digits]` starting at `i` (which points to `[`).
/// Returns `(index_value, position after ']')`.
fn parse_index(s: &str, i: usize) -> Result<(usize, usize), ArrowError> {
    let start = i + 1; // skip '['
    // Find the closing ']'
    let end = match s[start..].find(']') {
        Some(p) => start + p,
        None => {
            return Err(ArrowError::ParseError(format!(
                "unterminated '[' at byte {i}"
            )))
        }
    };
    let idx = s[start..end]
        .parse()
        .map_err(|_| ArrowError::ParseError(format!("invalid index at byte {start}")))?;
    Ok((idx, end + 1))
}
````
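For anyone who wants to try the approach without the arrow-rs types, here is a self-contained toy adaptation of the same scanner (a hypothetical `PathElement` enum in place of `VariantPathElement`, plain `String` errors in place of `ArrowError`) that exercises the malformed inputs discussed above:

```rust
// Toy stand-in for VariantPathElement, for illustration only.
#[derive(Debug, PartialEq)]
enum PathElement<'a> {
    Field(&'a str),
    Index(usize),
}

fn parse_path(s: &str) -> Result<Vec<PathElement<'_>>, String> {
    let bytes = s.as_bytes();
    if bytes.first() == Some(&b'.') {
        return Err("unexpected leading '.'".into());
    }
    // A field ends at the next '.', '[', or ']' (or end of string).
    let scan_field = |start: usize| {
        s[start..]
            .find(['.', '[', ']'])
            .map_or(s.len(), |p| start + p)
    };
    let mut elements = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'.' => {
                i += 1; // a non-empty field name must follow the dot
                let end = scan_field(i);
                if end == i {
                    return Err(format!("empty field at byte {i}"));
                }
                elements.push(PathElement::Field(&s[i..end]));
                i = end;
            }
            b'[' => {
                let end = s[i + 1..]
                    .find(']')
                    .map(|p| i + 1 + p)
                    .ok_or_else(|| format!("unterminated '[' at byte {i}"))?;
                let idx = s[i + 1..end]
                    .parse()
                    .map_err(|_| format!("invalid index at byte {i}"))?;
                elements.push(PathElement::Index(idx));
                i = end + 1;
            }
            b']' => return Err(format!("unexpected ']' at byte {i}")),
            _ => {
                let end = scan_field(i);
                elements.push(PathElement::Field(&s[i..end]));
                i = end;
            }
        }
    }
    Ok(elements)
}

fn main() {
    use PathElement::*;
    assert_eq!(
        parse_path("foo.bar[0]").unwrap(),
        vec![Field("foo"), Field("bar"), Index(0)]
    );
    // The malformed inputs discussed above error instead of silently
    // parsing as fields:
    assert!(parse_path("foo..bar").is_err());
    assert!(parse_path("foo[0").is_err());
    assert!(parse_path("foo0]").is_err());
    assert!(parse_path("[a.b]").is_err());
}
```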
I initially considered using TryFrom with explicit error boundaries, but doing so would require changing an existing public API, which would be a breaking change:
arrow-rs/parquet-variant-compute/src/shred_variant.rs (lines 709 to 717 in 240cbf4)
and, as @scovich mentioned, the current approach can also parse invalid inputs as fields, making debugging hard. So I'm a bit unsure about how to proceed from here.
AFAIK, variant is still experimental, so breaking changes are ok?
Meanwhile, even the parser I cooked up above is technically incomplete, because JSON allows field names to contain any character (including `.`, `[`, and `]`, which my suggestion would reject as ambiguous). The only way to avoid that is with some kind of escaping syntax. I believe Iceberg uses canonical JSONPath, for example.
There's an open issue regarding the escaping syntax #8954
Interesting. It looks like the mini-parser I posted above is a strict subset of the Databricks syntax linked above. Would it be a reasonable starting point to merge this PR and add the missing `['fieldName']` syntax as a follow-up? Or just expand the above suggestion into a full parser right away?
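For the follow-up, the `['fieldName']` form could bolt onto the same scanner at the `b'['` branch. A minimal sketch of pulling out a bracket-quoted name so dots inside field names stay unambiguous (this assumes the quoted-bracket syntax from the linked notation, not an actual arrow-rs API, and skips escape handling for quotes inside the name):

```rust
// Hypothetical helper: extract a single-quoted, bracketed field name like
// ['a.b']. Returns the name and the number of bytes consumed, or None if
// the input does not start with the quoted form.
fn parse_quoted_field(s: &str) -> Option<(&str, usize)> {
    let rest = s.strip_prefix("['")?;
    let end = rest.find("']")?;
    Some((&rest[..end], 2 + end + 2)) // "['" + name + "']"
}

fn main() {
    let (name, consumed) = parse_quoted_field("['a.b'].c").unwrap();
    assert_eq!(name, "a.b"); // the dot is part of the field name, not a separator
    assert_eq!(consumed, 7); // "['a.b']".len()
    assert!(parse_quoted_field("[0]").is_none()); // plain index, not a quoted field
}
```

The caller would try `parse_quoted_field` first on a `[` and fall back to the numeric-index path when it returns `None`.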
Which issue does this PR close?
VariantPath #8946
What changes are included in this PR?
The PR adds support for parsing array indices (e.g. `foo.bar[3]`) with the help of a `parse_path` fn. Currently the parser silently parses invalid segments as `Field` (e.g. `foo[0`, `[0]` (parsed as index), `foo0]`, `foo[0][`).
Feedback requested
Should stricter validation be added to reject such segments with an error, or should the current behavior be kept?
Are these changes tested?
Yes, but only for valid inputs.
Are there any user-facing changes?
No.