feat: support array indices in VariantPath dot notation #9012
alamb merged 2 commits into apache:main from
Conversation
```diff
-        } else {
-            VariantPath::new(path.split('.').map(Into::into).collect())
-        }
+        VariantPath::new(path.split(".").flat_map(parse_path).collect())
```
Does the `flat_map` mean this will parse paths like `foo..bar` as `[foo] [bar]`?
I am not sure exactly what should happen when parsing a path like that 🤔
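A quick standalone sketch of why the question comes up: `split('.')` yields an empty segment for the doubled dot, and if the per-segment parser maps an empty segment to zero elements, `flat_map` silently drops it:

```rust
fn main() {
    // split keeps the empty segment between the two dots
    let segments: Vec<&str> = "foo..bar".split('.').collect();
    assert_eq!(segments, ["foo", "", "bar"]);

    // If parsing "" yields no elements, flat_map turns "foo..bar" into
    // [foo][bar], hiding the malformed path. filter stands in here for
    // what a per-segment flat_map would effectively do:
    let non_empty: Vec<&str> = "foo..bar".split('.').filter(|s| !s.is_empty()).collect();
    assert_eq!(non_empty, ["foo", "bar"]);
}
```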
Now that you mention it, it seems like the existing code already had that bug?
In theory, that should be a malformed path that causes an error. Ditto for mismatched `[` or `]`, or e.g. `[a.b]` (which this PR would incorrectly attempt to treat as `[a` and `b]`).
It seems like we'll need a more sophisticated (and integrated) parsing approach here?
A bit of quality time with an LLM produced the following (along with a bunch of unit tests to validate it works):
````rust
/// Create from &str with support for dot notation and array indices.
///
/// # Example
/// ```
/// # use parquet_variant::VariantPath;
/// let path: VariantPath = "foo.bar[0]".try_into().unwrap();
/// ```
impl<'a> TryFrom<&'a str> for VariantPath<'a> {
    type Error = ArrowError;

    fn try_from(path: &'a str) -> Result<Self, Self::Error> {
        parse_path(path).map(VariantPath::new)
    }
}

/// Parse a path string into a vector of [`VariantPathElement`].
///
/// Supports the following syntax:
/// - `""` - empty path
/// - `"foo"` - single field
/// - `"foo.bar"` - nested fields
/// - `"[1]"` - single array index
/// - `"[1][2]"` - multiple array indices
/// - `"foo[1]"` - field with array index
/// - `"foo[1][2]"` - field with multiple array indices
/// - `"foo[1].bar"` - mixed field and array access
/// - etc.
///
/// Field names can contain any characters except `.`, `[`, and `]`.
fn parse_path(s: &str) -> Result<Vec<VariantPathElement<'_>>, ArrowError> {
    let scan_field = |start: usize| {
        s[start..]
            .find(['.', '[', ']'])
            .map_or_else(|| s.len(), |p| start + p)
    };
    let bytes = s.as_bytes();
    if let Some(b'.') = bytes.first() {
        return Err(ArrowError::ParseError("unexpected leading '.'".into()));
    }
    let mut elements = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let (elem, end) = match bytes[i] {
            b'.' => {
                i += 1; // skip the dot; a field must follow
                let end = scan_field(i);
                if end == i {
                    return Err(ArrowError::ParseError(match bytes.get(i) {
                        None => "unexpected trailing '.'".into(),
                        Some(&c) => format!("unexpected '{}' at byte {i}", c as char),
                    }));
                }
                (VariantPathElement::field(&s[i..end]), end)
            }
            b'[' => {
                let (idx, end) = parse_index(s, i)?;
                (VariantPathElement::index(idx), end)
            }
            b']' => {
                return Err(ArrowError::ParseError(format!(
                    "unexpected ']' at byte {i}"
                )));
            }
            _ => {
                let end = scan_field(i);
                (VariantPathElement::field(&s[i..end]), end)
            }
        };
        elements.push(elem);
        i = end;
    }
    Ok(elements)
}

/// Parse `[digits]` starting at `i` (which points to `[`).
/// Returns `(index_value, position after ']')`.
fn parse_index(s: &str, i: usize) -> Result<(usize, usize), ArrowError> {
    let start = i + 1; // skip '['
    // Find the closing ']'
    let end = match s[start..].find(']') {
        Some(p) => start + p,
        None => {
            return Err(ArrowError::ParseError(format!(
                "unterminated '[' at byte {i}"
            )))
        }
    };
    let idx = s[start..end]
        .parse()
        .map_err(|_| ArrowError::ParseError(format!("invalid index at byte {start}")))?;
    Ok((idx, end + 1))
}
````
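For anyone who wants to try the approach without the arrow-rs types, here is a self-contained toy adaptation of the same scanner (a hypothetical `PathElement` enum in place of `VariantPathElement`, plain `String` errors in place of `ArrowError`) that exercises the malformed inputs discussed above:

```rust
// Toy stand-in for VariantPathElement, for illustration only.
#[derive(Debug, PartialEq)]
enum PathElement<'a> {
    Field(&'a str),
    Index(usize),
}

fn parse_path(s: &str) -> Result<Vec<PathElement<'_>>, String> {
    let bytes = s.as_bytes();
    if bytes.first() == Some(&b'.') {
        return Err("unexpected leading '.'".into());
    }
    // A field ends at the next '.', '[', or ']' (or end of string).
    let scan_field = |start: usize| {
        s[start..]
            .find(['.', '[', ']'])
            .map_or(s.len(), |p| start + p)
    };
    let mut elements = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'.' => {
                i += 1; // a non-empty field name must follow the dot
                let end = scan_field(i);
                if end == i {
                    return Err(format!("empty field at byte {i}"));
                }
                elements.push(PathElement::Field(&s[i..end]));
                i = end;
            }
            b'[' => {
                let end = s[i + 1..]
                    .find(']')
                    .map(|p| i + 1 + p)
                    .ok_or_else(|| format!("unterminated '[' at byte {i}"))?;
                let idx = s[i + 1..end]
                    .parse()
                    .map_err(|_| format!("invalid index at byte {i}"))?;
                elements.push(PathElement::Index(idx));
                i = end + 1;
            }
            b']' => return Err(format!("unexpected ']' at byte {i}")),
            _ => {
                let end = scan_field(i);
                elements.push(PathElement::Field(&s[i..end]));
                i = end;
            }
        }
    }
    Ok(elements)
}

fn main() {
    use PathElement::*;
    assert_eq!(
        parse_path("foo.bar[0]").unwrap(),
        vec![Field("foo"), Field("bar"), Index(0)]
    );
    // The malformed inputs discussed above error instead of silently
    // parsing as fields:
    assert!(parse_path("foo..bar").is_err());
    assert!(parse_path("foo[0").is_err());
    assert!(parse_path("foo0]").is_err());
    assert!(parse_path("[a.b]").is_err());
}
```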
I initially considered using TryFrom with explicit error boundaries, but doing so would require changing an existing public API, which would be a breaking change:
arrow-rs/parquet-variant-compute/src/shred_variant.rs (lines 709 to 717 in 240cbf4)
and, as @scovich mentioned, the current approach can also parse invalid inputs as fields, making debugging hard. So I'm a bit unsure about how to proceed from here.
AFAIK, variant is still experimental, so breaking changes are ok?
Meanwhile, even the parser I cooked up above is technically incomplete, because JSON allows field names to contain any character (including `.`, `[`, and `]`, which my suggestion would reject as ambiguous). The only way to avoid that is with some kind of escaping syntax. I believe Iceberg uses canonical JSONPath, for example.
There's an open issue regarding the escaping syntax #8954
Interesting. It looks like the mini-parser I posted above is a strict subset of the Databricks syntax linked above. Would it be a reasonable starting point to merge this PR and add the missing `['fieldName']` syntax as a follow-up? Or just expand the above suggestion into a full parser right away?
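For the follow-up, the `['fieldName']` form could bolt onto the same scanner at the `b'['` branch. A minimal sketch of pulling out a bracket-quoted name so dots inside field names stay unambiguous (this assumes the quoted-bracket syntax from the linked notation, not an actual arrow-rs API, and skips escape handling for quotes inside the name):

```rust
// Hypothetical helper: extract a single-quoted, bracketed field name like
// ['a.b']. Returns the name and the number of bytes consumed, or None if
// the input does not start with the quoted form.
fn parse_quoted_field(s: &str) -> Option<(&str, usize)> {
    let rest = s.strip_prefix("['")?;
    let end = rest.find("']")?;
    Some((&rest[..end], 2 + end + 2)) // "['" + name + "']"
}

fn main() {
    let (name, consumed) = parse_quoted_field("['a.b'].c").unwrap();
    assert_eq!(name, "a.b"); // the dot is part of the field name, not a separator
    assert_eq!(consumed, 7); // "['a.b']".len()
    assert!(parse_quoted_field("[0]").is_none()); // plain index, not a quoted field
}
```

The caller would try `parse_quoted_field` first on a `[` and fall back to the numeric-index path when it returns `None`.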
Which issue does this PR close?
VariantPath #8946
What changes are included in this PR?
The PR adds support for parsing array indices (e.g. `foo.bar[3]`) with the help of a `parse_path` fn. Currently the parser silently parses invalid segments as `Field` (e.g. `foo[0`, `[0]` (parsed as index), `foo0]`, `foo[0][`).
Feedback requested
Should stricter validation be added to reject such segments with an error, or should the current behavior be kept?
Are these changes tested?
Yes, but only for valid inputs.
Are there any user-facing changes?
No.