Skip to content

Conversation

@champtar
Copy link
Contributor

@champtar champtar commented Jun 2, 2025

Edit: we now have canon-json, use it

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve reproducibility in JSON serialization by introducing a custom to_canonical_json_string function. The intent is good, and the changes replacing HashMap with BTreeMap in ostree-ext/src/container/store.rs are a positive step towards this goal as they ensure sorted map keys.

However, the core of this PR, the use of olpc_cjson::CanonicalFormatter in to_canonical_json_string, raises a critical concern. As you've noted in the PR description, and as confirmed by the olpc-cjson documentation, this formatter can produce JSON that is not strictly valid according to standard JSON specifications (e.g., unquoted keys). This could lead to significant interoperability problems if the output is consumed by standard JSON parsers.

Additionally, there are a few places where BufWriter has been removed, which might have minor performance implications for very large JSON outputs. These are secondary to the main concern about JSON validity.

Overall, the goal of reproducible JSON is important, but the current approach for canonicalization needs careful reconsideration to ensure standard JSON validity.

Summary of Findings

  • Potential for Non-Standard JSON Output: The core change relies on olpc_cjson::CanonicalFormatter, which is documented to produce JSON that may not be strictly valid (e.g., unquoted keys). This is a critical issue if the output needs to be consumed by standard JSON parsers, as it could lead to interoperability problems.
  • Removal of Buffered Writers: BufWriter has been removed in several places where JSON is written to files or stdout. This is likely due to the new to_canonical_json_string function returning a String. For very large JSON objects, serializing to a full string in memory first and then writing might be less efficient in terms of memory and I/O patterns compared to streaming to a buffered writer. This should be re-evaluated if the canonicalization strategy changes.
  • Improved Map Serialization with BTreeMap: The changes from HashMap to BTreeMap in various parts of ostree-ext/src/container/store.rs are a positive step towards reproducible JSON, as BTreeMap ensures keys are serialized in a consistent, sorted order.

Merge Readiness

This pull request addresses an important aspect of reproducibility. The move to BTreeMap for map serialization is a good improvement. However, the use of olpc_cjson::CanonicalFormatter introduces a critical risk of producing non-standard JSON, which could break consumers expecting valid JSON.

I strongly recommend that this critical issue regarding JSON validity be resolved before merging. The PR author has already noted this concern in the description, and further discussion on the best path forward (potentially using standard serde_json with BTreeMap for canonical and valid output) is needed.

As a reviewer, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers after addressing the highlighted concerns.

@champtar champtar force-pushed the repro2 branch 3 times, most recently from 837b3ff to fd68fdf Compare June 3, 2025 09:14
@champtar
Copy link
Contributor Author

champtar commented Jun 3, 2025

Should I move JsonOrderedSerialize to bootc-utils and cleanup the whole codebase of serde_json::to_* ?

@champtar

This comment was marked as outdated.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust solution for ensuring reproducible JSON serialization by leveraging serde_json::Value which internally uses BTreeMap for objects, thus guaranteeing key order. The new JsonOrderedSerialize trait provides a clean abstraction for this, and its application throughout the codebase, along with the switch from HashMap to BTreeMap for relevant data structures, effectively addresses the non-deterministic output issues. The changes are well-targeted and improve the predictability of serialized outputs. Great work!

Summary of Findings

  • Documentation for JsonOrderedSerialize trait: In ostree-ext/src/json.rs, the new JsonOrderedSerialize trait could benefit from a doc comment explaining its purpose: to ensure reproducible JSON output by ordering map keys, typically by first converting to serde_json::Value. This would enhance maintainability. (Severity: low, not commented due to review settings)
  • Clarity of inline comment: In ostree-ext/src/container/store.rs (line 1007), the comment // Use serde_json::Value to make output reproducible is accurate. For slightly improved clarity, it could be rephrased to focus on the outcome, e.g., // Serialize to JSON with ordered keys for reproducibility via JsonOrderedSerialize trait. (Severity: low, not commented due to review settings)

Merge Readiness

The changes in this pull request are well-implemented and effectively address the issue of non-reproducible JSON output. The introduction of the JsonOrderedSerialize trait is a clean solution. There are a couple of minor suggestions for documentation and comment clarity (mentioned in the findings summary) that the author might consider, but they are not blockers. The code appears to be in good shape for merging. As an AI, I am not authorized to approve pull requests; please ensure further human review and approval before merging.

@github-actions github-actions bot added the area/install Issues related to `bootc install` label Jun 24, 2025
@champtar champtar changed the title More reproducibility fixes More reproducibility fixes / switch to canon-json Jun 24, 2025
Replace all serde_json::to_{string,vec,writer} with
equivalent canon_json::CanonJsonSerialize to make the
output stable / reproducible.

Signed-off-by: Etienne Champetier <[email protected]>
);
let cached_manifest = serde_json::to_string(manifest).context("Serializing manifest")?;
let cached_manifest = manifest
.to_canon_json_string()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: this isn't a new problem, but we probably should have always been storing the original manifest instead of re-serializing.

// serde is guaranteed not to output newlines here
let buf = serde_json::to_vec(&v)?;
// canon_json is guaranteed not to output newlines here
let buf = v.to_canon_json_vec()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the progress stuff needs reproducibility, but I guess it doesn't hurt either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but ideally i would like to just ban serde_json::to_{vec,string,writer} with some linter (haven't looked how to do it yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgwalters ok to add those 3 functions to clippy disallowed-methods ? (in a follow up PR)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to it but not really excited by it either...it doesn't seem to me we will have that many places where we really need the reproducibility that we need to explicitly ban it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will leave it as is then

@cgwalters cgwalters enabled auto-merge June 24, 2025 12:10
@cgwalters cgwalters merged commit 5c948a5 into bootc-dev:main Jun 24, 2025
30 of 34 checks passed
@champtar champtar deleted the repro2 branch June 24, 2025 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/install Issues related to `bootc install`

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants