
Check artifact integrity before execution #8833

Merged
AndreiEres merged 19 commits into master from AndreiEres/check-artifact-integrity
Jun 17, 2025

Conversation

Contributor

@AndreiEres AndreiEres commented Jun 12, 2025

Fixes #677
Fixes #2399

Description

To detect potential corruption of PVF artifacts on disk, we store their checksums and verify that they match before execution. On a mismatch, we recreate the artifact.
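The store-verify-recreate flow can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names Artifacts, prepare, and load_verified are invented, and std's DefaultHasher stands in for the Blake3/Twox hash used in the real implementation.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in checksum type; the real code uses a 32-byte Blake3/Twox digest.
type ArtifactChecksum = u64;

fn compute_checksum(data: &[u8]) -> ArtifactChecksum {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the artifact cache: artifact id -> (bytes, stored checksum).
struct Artifacts {
    map: HashMap<String, (Vec<u8>, ArtifactChecksum)>,
}

impl Artifacts {
    /// Store the artifact together with its checksum, computed at preparation time.
    fn prepare(&mut self, id: &str, blob: Vec<u8>) {
        let checksum = compute_checksum(&blob);
        self.map.insert(id.to_string(), (blob, checksum));
    }

    /// Verify the stored checksum before execution; on a mismatch, re-prepare.
    fn load_verified(&mut self, id: &str, reprepare: impl Fn() -> Vec<u8>) -> Vec<u8> {
        let (blob, stored) = self.map.get(id).expect("artifact exists").clone();
        if compute_checksum(&blob) != stored {
            // Corruption detected: recreate the artifact from source.
            let fresh = reprepare();
            self.prepare(id, fresh.clone());
            return fresh;
        }
        blob
    }
}
```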

Integration

In Candidate Validation, we treat the error similarly to PossiblyInvalidError::RuntimeConstruction due to their close nature.

Review Notes

The Blake3 hashing algorithm was already in use here. I believe we can switch to Twox, as suggested in the issue, because the checksum does not need to be cryptographic, and we do not reveal the checksum in logs.

@AndreiEres AndreiEres added the T0-node This PR/Issue is related to the topic “node”. label Jun 12, 2025
Contributor Author

@AndreiEres AndreiEres left a comment

If we come across a corrupted artifact, we should prepare it again. Could this be a possible vulnerability, @s0me0ne-unkn0wn?

unistd::{ForkResult, Pid},
};
use polkadot_node_core_pvf_common::{
executor_interface::{prepare, prevalidate},
Contributor Author

It looks messy, but I just merged two imports of polkadot_node_core_pvf_common. In fact, only compute_checksum is newly imported.


/// Compute the checksum of the given artifact.
pub fn compute_checksum(data: &[u8]) -> ArtifactChecksum {
blake3::hash(data).into()
Contributor Author

Should we switch to twox?

Contributor

I personally don't have any preference here. Blake3's throughput is more than enough for us, so why not use it, especially given that we're already using it?

Contributor Author

I checked Twox; it seems much faster, so I decided to switch after all.

@AndreiEres AndreiEres marked this pull request as ready for review June 12, 2025 12:06
Contributor Author

/cmd prdoc --audience node_dev --bump patch

@AndreiEres AndreiEres changed the title [WIP] Check artifact integrity before execution Check artifact integrity before execution Jun 12, 2025
Contributor

@alexggh alexggh left a comment

Looks good to me.

I added some comments. I would also be interested to know whether this retry path is ever exercised by an integration test.

)
})?;

if artifact_checksum != compute_checksum(&compiled_artifact_blob) {
Contributor

How long does this take for 10 MiB? For 100 MiB?

Contributor

Blake3's throughput is ~3 Gb/sec on hardware close to our reference spec, AFAIR.

Contributor Author

According to the crate's benchmark data, 10 MiB with Blake3 takes 1-2 ms. Twox should be at least 3x faster.

Contributor

Ok, so we are not worried about this eating up too much time.
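Numbers like those quoted above are easy to sanity-check on one's own machine. The sketch below is illustrative only: time_checksum is an invented helper, and std's DefaultHasher stands in for Blake3/Twox, so absolute timings will differ from the crate benchmarks.

```rust
use std::hash::{Hash, Hasher};
use std::time::{Duration, Instant};

// Stand-in for the real Blake3/Twox checksum function.
fn checksum(data: &[u8]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Hash `size_mib` mebibytes of data and report the checksum and elapsed time.
fn time_checksum(size_mib: usize) -> (u64, Duration) {
    let data = vec![0xABu8; size_mib * 1024 * 1024];
    let start = Instant::now();
    let sum = checksum(&data);
    (sum, start.elapsed())
}
```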

Comment thread polkadot/node/core/pvf/src/execute/queue.rs
Comment thread polkadot/node/core/pvf/src/execute/queue.rs
Comment thread polkadot/node/core/candidate-validation/src/lib.rs
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment

Looks good, left some comments but none of them are blockers!

Ok(buf)
}

pub type ArtifactChecksum = [u8; 32];
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn Jun 13, 2025

nit: Elsewhere in the code we're quite idiomatic and usually go with

#[repr(transparent)]
pub struct ArtifactChecksum(H256)

with corresponding AsRef implementations if needed, etc. I don't insist it should be implemented like that in this very case; it just seems to be one of our "best practices".
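A minimal, std-only sketch of the suggested newtype idiom (H256 lives in the Substrate primitives crates, so a plain [u8; 32] stands in for it here):

```rust
// Newtype wrapper around the raw digest. #[repr(transparent)] guarantees the
// same memory layout as the inner array, so conversions are free.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(transparent)]
pub struct ArtifactChecksum([u8; 32]);

impl From<[u8; 32]> for ArtifactChecksum {
    fn from(bytes: [u8; 32]) -> Self {
        Self(bytes)
    }
}

impl AsRef<[u8]> for ArtifactChecksum {
    fn as_ref(&self) -> &[u8] {
        &self.0
    }
}
```

The payoff over a bare type alias is that the compiler then rejects accidental mixing of checksums with other 32-byte values such as code hashes.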


Ok((pvd, pov, execution_timeout))

let artifact_checksum = framed_recv_blocking(stream)?;
let artifact_checksum =
Contributor

I'm NOT encouraging changing this right away, but...

  1. Why do we want to encode a raw 32-byte sequence? Why not transfer it as a raw 32-byte sequence?
  2. If we ought to encode, why don't we encode the entire tuple instead of doing it field by field?

Maybe a good candidate for a refactoring issue? I bet a single recv() and a single decode() are somewhat more performant than one-by-ones.
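The refactoring idea can be sketched with simple length-prefixed framing. This is illustrative only: the real code uses framed_send_blocking/framed_recv_blocking and SCALE encoding from the Polkadot codebase, and send_artifact/recv_artifact are invented names. The point is that packing (checksum, blob) into one frame costs one write/read round trip instead of two.

```rust
use std::io::{self, Read, Write};

// Write one length-prefixed frame.
fn framed_send(w: &mut impl Write, payload: &[u8]) -> io::Result<()> {
    w.write_all(&(payload.len() as u32).to_le_bytes())?;
    w.write_all(payload)
}

// Read one length-prefixed frame.
fn framed_recv(r: &mut impl Read) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// Pack checksum and blob into a single frame instead of two.
fn send_artifact(w: &mut impl Write, checksum: &[u8; 32], blob: &[u8]) -> io::Result<()> {
    let mut payload = Vec::with_capacity(32 + blob.len());
    payload.extend_from_slice(checksum);
    payload.extend_from_slice(blob);
    framed_send(w, &payload)
}

/// Receive one frame and split it back into (checksum, blob).
fn recv_artifact(r: &mut impl Read) -> io::Result<([u8; 32], Vec<u8>)> {
    let payload = framed_recv(r)?;
    let mut checksum = [0u8; 32];
    checksum.copy_from_slice(&payload[..32]);
    Ok((checksum, payload[32..].to_vec()))
}
```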

Contributor Author

Let's do it in another PR.

Comment thread polkadot/node/core/pvf/tests/it/main.rs
@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15683054285
Failed job name: test-linux-stable

@AndreiEres AndreiEres added this pull request to the merge queue Jun 17, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2025
@AndreiEres AndreiEres added this pull request to the merge queue Jun 17, 2025
Merged via the queue into master with commit 310e81d Jun 17, 2025
257 of 259 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/check-artifact-integrity branch June 17, 2025 14:12
Contributor

alexggh commented Jun 18, 2025

@AndreiEres Can we backport this to stable2506? I see no reason to wait until 2509.

@AndreiEres AndreiEres added the A4-backport-stable2506 Pull request must be backported to the stable2506 release branch label Jun 18, 2025
@paritytech-release-backport-bot

Successfully created backport PR for stable2506:

paritytech-release-backport-bot Bot pushed a commit that referenced this pull request Jun 18, 2025
(cherry picked from commit 310e81d)
EgorPopelyaev pushed a commit that referenced this pull request Jun 23, 2025
Backport #8833 into `stable2506` from AndreiEres.

This backport includes a major version bump due to internal API changes
that only affect the polkadot binary. Since stable2506 hasn’t been
released yet and no other downstream users are impacted, the change is
considered safe.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

Co-authored-by: Andrei Eres <eresav@me.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
alvicsam pushed a commit that referenced this pull request Oct 17, 2025

Labels

A4-backport-stable2506: Pull request must be backported to the stable2506 release branch
T0-node: This PR/Issue is related to the topic “node”.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PVF: Re-check file integrity before voting against; document
PVF: Compromised artifact file integrity can lead to disputes

5 participants