@pawanjay176
Member

@pawanjay176 pawanjay176 commented Oct 17, 2025

Issue Addressed

N/A

Proposed Changes

  1. In the batch retry logic, we were failing to set the batch state to AwaitingDownload before attempting a retry. This PR sets it to AwaitingDownload before the retry and sets it back to Downloading if the retry succeeded in sending out a request.
  2. Remove all peer scoring logic from retrying and rely on just deprioritizing the failed peer. I finally concede the point to @dapplion 😄
  3. Change block_components_by_range_request to accept block_peers and column_peers. This ensures that we use the full synced peer set for requesting columns, which avoids splitting the column peers among multiple head chains. During forward sync, we want the block peers to be the peers from the syncing chain and the column peers to be all synced peers from the peerdb.
    Also fixes a typo and calls attempt_send_awaiting_download_batches from more places.
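
As a rough illustration of items 2 and 3, the sketch below picks block peers from the syncing chain and column peers from the full synced set, and moves previously failed peers to the back of each list instead of scoring them down. Everything here is hypothetical: `PeerId` is a plain integer, and `RangeRequestPeers`, `deprioritize_failed`, and `select_peers` are made-up names for illustration, not the actual Lighthouse types or the real signature of block_components_by_range_request.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a libp2p peer id.
type PeerId = u32;

/// Peers to use for one by-range request (illustrative only).
#[derive(Debug, PartialEq)]
struct RangeRequestPeers {
    block_peers: Vec<PeerId>,
    column_peers: Vec<PeerId>,
}

/// Order candidates so that peers which failed before are tried last.
/// No peer is penalised or removed; failed peers are only moved back.
fn deprioritize_failed(candidates: &HashSet<PeerId>, failed: &HashSet<PeerId>) -> Vec<PeerId> {
    let mut peers: Vec<PeerId> = candidates.iter().copied().collect();
    // `false` sorts before `true`, so non-failed peers come first.
    // The id is a tie-breaker to keep the order deterministic.
    peers.sort_by_key(|p| (failed.contains(p), *p));
    peers
}

/// Block peers come from the syncing chain, column peers from all synced peers.
fn select_peers(
    chain_peers: &HashSet<PeerId>,
    all_synced_peers: &HashSet<PeerId>,
    failed: &HashSet<PeerId>,
) -> RangeRequestPeers {
    RangeRequestPeers {
        block_peers: deprioritize_failed(chain_peers, failed),
        column_peers: deprioritize_failed(all_synced_peers, failed),
    }
}
```

With chain peers {1, 2}, synced peers {1, 2, 3, 4}, and peer 1 marked as failed, this yields block peers [2, 1] and column peers [2, 3, 4, 1], so columns can still be requested from peers outside the syncing chain.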

@pawanjay176 pawanjay176 added the ready-for-review The code is ready for review label Oct 17, 2025
Collaborator

@dapplion dapplion left a comment

Looks good! Just two suggestions to document behaviour

if !*exceeded_retries {
    // Set the batch back to `AwaitingDownload` before retrying.
    // This is to ensure that the batch doesn't get stuck in `Downloading` state.
    batch.download_failed(None)?;
Member

@jimmygchen jimmygchen Oct 20, 2025

Should we check the returned outcome here?
What happens if it returns Ok(BatchState::Failed)?

Looking at the code it looks like we may hit this?

other => {
    self.state = other;
    Err(WrongState(format!(
        "Starting download for batch in wrong state {:?}",
        self.state
    )))
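
To make the concern concrete, here is a minimal model of how a failure call that counts attempts could leave the batch in a failed state, after which starting a download errors out. This is a simplified sketch and not the Lighthouse implementation: the `Batch` struct, its `failed_attempts` and `max_attempts` fields, and the return types are assumptions made for illustration (the real `download_failed` takes a peer argument and reports an outcome type).

```rust
/// Simplified batch states (the real state machine has more variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchState {
    AwaitingDownload,
    Downloading,
    Failed,
}

#[derive(Debug, PartialEq)]
struct WrongState(String);

/// Hypothetical batch with an attempt counter and a retry limit.
struct Batch {
    state: BatchState,
    failed_attempts: u8,
    max_attempts: u8,
}

impl Batch {
    /// Record a failed download. Once the limit is reached the batch
    /// becomes `Failed` instead of going back to `AwaitingDownload`.
    fn download_failed(&mut self) -> Result<BatchState, WrongState> {
        match self.state {
            BatchState::Downloading => {
                self.failed_attempts += 1;
                self.state = if self.failed_attempts >= self.max_attempts {
                    BatchState::Failed
                } else {
                    BatchState::AwaitingDownload
                };
                Ok(self.state)
            }
            other => Err(WrongState(format!(
                "Download failed for batch in wrong state {:?}",
                other
            ))),
        }
    }

    /// Only a batch in `AwaitingDownload` may start downloading.
    fn start_downloading(&mut self) -> Result<(), WrongState> {
        match self.state {
            BatchState::AwaitingDownload => {
                self.state = BatchState::Downloading;
                Ok(())
            }
            other => Err(WrongState(format!(
                "Starting download for batch in wrong state {:?}",
                other
            ))),
        }
    }
}
```

In this model, ignoring the returned state and retrying anyway would hit the wrong-state error as soon as the attempt limit is reached, which is why the outcome of the failure call needs to be inspected.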

Member Author

@pawanjay176 pawanjay176 Oct 20, 2025

Great catch. I realised I was also incrementing the batch failed count when we shouldn't be, since we want to keep the batch retries separate from the column retries introduced in the original retry PR.
Fixed in ac8adb4. Please take a look.

@jimmygchen jimmygchen added the v8.0.0 Q4 2025 Fusaka Mainnet Release label Oct 20, 2025
//
// DataColumn retries has a retry limit so calling `downloading_to_awaiting_download`
// is safe.
if let BatchOperationOutcome::Failed { blacklist } =
Member

BatchOperationOutcome::Failed is no longer reachable after changing the call to downloading_to_awaiting_download.

Also, I think the function name clearly conveys what it does but not its intended usage, though I struggle to come up with something better.

batch.download_failed_skip_attempt_count seems a bit too wordy
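
For readers following along, a transition of this kind might look roughly like the sketch below: it moves the batch from Downloading back to AwaitingDownload without touching the attempt counter, and the retry only returns to Downloading if a request was actually sent. The method name `downloading_to_awaiting_download` comes from the discussion, but its body, the `Batch` fields, the `String` error type, and the `retry_columns` helper are assumptions for illustration, not the code merged in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchState {
    AwaitingDownload,
    Downloading,
}

/// Hypothetical batch; `failed_attempts` tracks batch-level failures only.
struct Batch {
    state: BatchState,
    failed_attempts: u8,
}

impl Batch {
    /// Move `Downloading` -> `AwaitingDownload` without counting an attempt.
    fn downloading_to_awaiting_download(&mut self) -> Result<(), String> {
        match self.state {
            BatchState::Downloading => {
                self.state = BatchState::AwaitingDownload;
                Ok(())
            }
            other => Err(format!("Expected Downloading, found {:?}", other)),
        }
    }

    /// Move `AwaitingDownload` -> `Downloading`.
    fn start_downloading(&mut self) -> Result<(), String> {
        match self.state {
            BatchState::AwaitingDownload => {
                self.state = BatchState::Downloading;
                Ok(())
            }
            other => Err(format!("Expected AwaitingDownload, found {:?}", other)),
        }
    }
}

/// Retry a column request. `send_request` returns true if a request went out.
/// The batch only returns to `Downloading` when the send succeeded, so a
/// failed send leaves it in `AwaitingDownload` instead of stuck in `Downloading`.
fn retry_columns(batch: &mut Batch, send_request: impl FnOnce() -> bool) -> Result<(), String> {
    batch.downloading_to_awaiting_download()?;
    if send_request() {
        batch.start_downloading()?;
    }
    Ok(())
}
```

Because the counter is never incremented on this path, column retries stay separate from batch retries, and any bound on them has to come from the column retry limit itself.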

Member

I think handling BatchOperationOutcome::Failed is a good idea though.

Member

@jimmygchen jimmygchen left a comment

Looks good to me too! Thanks for the cleanup

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Oct 20, 2025
@mergify mergify bot added the queued label Oct 20, 2025
mergify bot added a commit that referenced this pull request Oct 20, 2025
@mergify

mergify bot commented Oct 20, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You can check the last failing draft PR here: #8242.

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@mergify mergify bot removed the queued label Oct 20, 2025
@jimmygchen
Member

@mergify requeue

@mergify

mergify bot commented Oct 20, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@mergify mergify bot added the queued label Oct 20, 2025
mergify bot added a commit that referenced this pull request Oct 20, 2025
@mergify

mergify bot commented Oct 20, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You can check the last failing draft PR here: #8243.


@mergify mergify bot removed the queued label Oct 20, 2025
@jimmygchen
Member

@mergify requeue

@mergify

mergify bot commented Oct 20, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@mergify mergify bot added the queued label Oct 20, 2025
mergify bot added a commit that referenced this pull request Oct 20, 2025
@mergify

mergify bot commented Oct 20, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You can check the last failing draft PR here: #8244.


@mergify mergify bot removed the queued label Oct 20, 2025
@jimmygchen
Member

@mergify requeue

@mergify

mergify bot commented Oct 20, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@mergify mergify bot added the queued label Oct 20, 2025
mergify bot added a commit that referenced this pull request Oct 20, 2025
@mergify mergify bot merged commit 092aaae into sigp:unstable Oct 20, 2025
38 checks passed
@mergify mergify bot removed the queued label Oct 20, 2025
jchavarri pushed a commit to jchavarri/lighthouse that referenced this pull request Oct 21, 2025
Co-Authored-By: Pawan Dhananjay <[email protected]>

Labels

ready-for-merge This PR is ready to merge. syncing v8.0.0 Q4 2025 Fusaka Mainnet Release
