-
Notifications
You must be signed in to change notification settings - Fork 146
More reproducibility fixes / switch to canon-json #1347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to improve reproducibility in JSON serialization by introducing a custom to_canonical_json_string function. The intent is good, and the changes replacing HashMap with BTreeMap in ostree-ext/src/container/store.rs are a positive step towards this goal as they ensure sorted map keys.
However, the core of this PR, the use of olpc_cjson::CanonicalFormatter in to_canonical_json_string, raises a critical concern. As you've noted in the PR description, and as confirmed by the olpc-cjson documentation, this formatter can produce JSON that is not strictly valid according to standard JSON specifications (e.g., unquoted keys). This could lead to significant interoperability problems if the output is consumed by standard JSON parsers.
Additionally, there are a few places where BufWriter has been removed, which might have minor performance implications for very large JSON outputs. These are secondary to the main concern about JSON validity.
Overall, the goal of reproducible JSON is important, but the current approach for canonicalization needs careful reconsideration to ensure standard JSON validity.
Summary of Findings
- Potential for Non-Standard JSON Output: The core change relies on
olpc_cjson::CanonicalFormatter, which is documented to produce JSON that may not be strictly valid (e.g., unquoted keys). This is a critical issue if the output needs to be consumed by standard JSON parsers, as it could lead to interoperability problems. - Removal of Buffered Writers:
BufWriterhas been removed in several places where JSON is written to files or stdout. This is likely due to the newto_canonical_json_stringfunction returning aString. For very large JSON objects, serializing to a full string in memory first and then writing might be less efficient in terms of memory and I/O patterns compared to streaming to a buffered writer. This should be re-evaluated if the canonicalization strategy changes. - Improved Map Serialization with BTreeMap: The changes from
HashMaptoBTreeMapin various parts ofostree-ext/src/container/store.rsare a positive step towards reproducible JSON, asBTreeMapensures keys are serialized in a consistent, sorted order.
Merge Readiness
This pull request addresses an important aspect of reproducibility. The move to BTreeMap for map serialization is a good improvement. However, the use of olpc_cjson::CanonicalFormatter introduces a critical risk of producing non-standard JSON, which could break consumers expecting valid JSON.
I strongly recommend that this critical issue regarding JSON validity be resolved before merging. The PR author has already noted this concern in the description, and further discussion on the best path forward (potentially using standard serde_json with BTreeMap for canonical and valid output) is needed.
As a reviewer, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers after addressing the highlighted concerns.
837b3ff to
fd68fdf
Compare
|
Should I move |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a robust solution for ensuring reproducible JSON serialization by leveraging serde_json::Value which internally uses BTreeMap for objects, thus guaranteeing key order. The new JsonOrderedSerialize trait provides a clean abstraction for this, and its application throughout the codebase, along with the switch from HashMap to BTreeMap for relevant data structures, effectively addresses the non-deterministic output issues. The changes are well-targeted and improve the predictability of serialized outputs. Great work!
Summary of Findings
- Documentation for
JsonOrderedSerializetrait: Inostree-ext/src/json.rs, the newJsonOrderedSerializetrait could benefit from a doc comment explaining its purpose: to ensure reproducible JSON output by ordering map keys, typically by first converting toserde_json::Value. This would enhance maintainability. (Severity: low, not commented due to review settings) - Clarity of inline comment: In
ostree-ext/src/container/store.rs(line 1007), the comment// Use serde_json::Value to make output reproducibleis accurate. For slightly improved clarity, it could be rephrased to focus on the outcome, e.g.,// Serialize to JSON with ordered keys for reproducibility viaJsonOrderedSerializetrait.(Severity: low, not commented due to review settings)
Merge Readiness
The changes in this pull request are well-implemented and effectively address the issue of non-reproducible JSON output. The introduction of the JsonOrderedSerialize trait is a clean solution. There are a couple of minor suggestions for documentation and comment clarity (mentioned in the findings summary) that the author might consider, but they are not blockers. The code appears to be in good shape for merging. As an AI, I am not authorized to approve pull requests; please ensure further human review and approval before merging.
Replace all serde_json::to_{string,vec,writer} with
equivalent canon_json::CanonJsonSerialize to make the
output stable / reproducible.
Signed-off-by: Etienne Champetier <[email protected]>
| ); | ||
| let cached_manifest = serde_json::to_string(manifest).context("Serializing manifest")?; | ||
| let cached_manifest = manifest | ||
| .to_canon_json_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: this isn't a new problem, but we probably should have always been storing the original manifest instead of re-serializing.
| // serde is guaranteed not to output newlines here | ||
| let buf = serde_json::to_vec(&v)?; | ||
| // canon_json is guaranteed not to output newlines here | ||
| let buf = v.to_canon_json_vec()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the progress stuff needs reproducibility, but I guess it doesn't hurt either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but ideally i would like to just ban serde_json::to_{vec,string,writer} with some linter (haven't looked how to do it yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cgwalters ok to add those 3 functions to clippy disallowed-methods ? (in a follow up PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to it but not really excited by it either...it doesn't seem to me we will have that many places where we really need the reproducibility that we need to explicitly ban it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will leave it as is then
Edit: we now have canon-json, use it