-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF]: swap out json_deserializer for simd_json #2228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a preliminary review, I'm really liking the cleanup that the simd_json
port enables!
Biggest remaining thing to figure out is how to preserve key order in Object
values, which is supported in json_deserializer
but is not supported in simd_json
: simd-lite/simd-json#270
BorrowedValue::Static(StaticNode::I64(v)) => T::from(*v), | ||
BorrowedValue::Static(StaticNode::U64(v)) => T::from(*v), | ||
BorrowedValue::Static(StaticNode::F64(v)) => T::from(*v), | ||
BorrowedValue::Static(StaticNode::Bool(v)) => T::from(*v as u8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I like this a good bit more than the int/float-specific closures!
write!(scratch, "{node}").unwrap(); | ||
target.push(Some(scratch.as_str())); | ||
scratch.clear(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For integers and floats that contain exponents, I'm assuming that the string representation of StaticNode::I64
and StaticNode::F64
values won't preserve the exponent format, right? E.g. 1.5e9
, when decoded into a string field, will be decoded as "1500000000"
. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, without manually parsing from simd_json's Tape
, there isn't a way to preserve the exponent format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that change in behavior is fine, we can always special-case that if it ends up being an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, just a few questions around the unsafe code!
Object(Object<'value>), | ||
} | ||
|
||
struct OrderPreservingDeserializer<'de>(Deserializer<'de>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I'm happy that this order-preserving extension ended up being pretty thin.
.map(|unparsed_record| { | ||
json_deserializer::parse(unparsed_record.as_bytes()).map_err(|e| { | ||
super::Error::JsonDeserializationError { | ||
crate::deserializer::to_value(unsafe { unparsed_record.as_bytes_mut() }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that the safety contract here for as_bytes_mut()
, i.e. that the post-deserialization content of the slice is valid UTF8 before the parsed record is used, is trivially satisfied by the deserializer + records
never being used after parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah exactly. simd_json parser is "destructive" in nature, so as long as you don't attempt to reuse the original data after parsing, this is generally safe
} | ||
|
||
pub fn parse(&mut self) -> Value<'de> { | ||
match unsafe { self.0.next_() } { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safety condition here is that a top-level .parse()
will never be called on an empty byte slice - do we know that this can't happen since Deserializer::from_slice()
errors on an empty byte slice? The only condition check I found when skimming the Deserializer::from_slice()
implementation to suggest that it errors when given an empty slice is this: https://docs.rs/simd-json/latest/src/simd_json/lib.rs.html#1081-1083
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we know that this can't happen since Deserializer::from_slice() errors on an empty byte slice?
I would assume the Deserializer
catches any errors before it makes it into this wrapper, but I can't say with certainty. They do have a pretty extensive test suite that I would hope catches this. So I think unfortunately we just assume that simd_json
does what it's supposed to do.
Anecdotally, polars has been using simd_json::Deserializer
for a couple years and has never encountered an out of bounds here.
write!(scratch, "{node}").unwrap(); | ||
target.push(Some(scratch.as_str())); | ||
scratch.clear(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that change in behavior is fine, we can always special-case that if it ends up being an issue.
This swaps out json_deserializer for simd_json. This does show some pretty noticeable performance gains across the board (~10-20%). This is nice as not only does the local reader show improvements, the object store readers should also benefit from this.
some perf tests I ran locally
Edit:
Added support for preserving order
It isnt' quite as performant as the unordered version, but it is still noticeably faster than using json_deserializer