
Conversation


@yuzefovich yuzefovich commented May 22, 2025

We just saw a sentry report that was issued due to an InternalError raised after Dequeue'ing from a disk queue. It's not clear what the error was (since it was redacted), but it might have been a DiskFull error. We already have special handling for it on the Enqueue path, but the Dequeue path can also trigger this error (on the first call to Dequeue after some Enqueue calls, in order to flush the buffered batches), so this commit audits all disk queue methods to use the helper for error propagation.

The only place where we do disk usage accounting is `diskQueue.writeFooterAndFlush`, so I traced which methods could end up calling it (Enqueue and Dequeue, but also Close) and their call sites - this is how the affected places were chosen. Additionally, I didn't want to introduce error propagation via panics where it wasn't there already, so one spot wasn't modified.

Fixes: #147132.

Release note: None

@yuzefovich yuzefovich requested a review from a team as a code owner May 22, 2025 17:48
@yuzefovich yuzefovich requested review from rytaft and removed request for a team May 22, 2025 17:48
@yuzefovich yuzefovich added backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x backport-25.2.x Flags PRs that need to be backported to 25.2 labels May 22, 2025

blathers-crl bot commented May 22, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


@yuzefovich
Member Author

friendly ping @cockroachdb/sql-queries-prs

Collaborator

@rytaft rytaft left a comment


Sorry for missing this! It's not totally clear to me why you chose these specific locations to check for the DiskFull error and not others (only some of them are clearly Dequeue-related as far as I can tell). I guess it's all PartitionedQueue methods?

Is there ever a case in colexec where you'd want to treat DiskFull errors as an internal error? If not, maybe you should get rid of this helper function and just put the logic inside colexecerror.InternalError().

Alternatively, maybe you could add a defer inside the PartitionedQueue functions to check for DiskFull errors so that callers wouldn't need to know to check for it.
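A minimal sketch of that deferred check, under the assumption that the queue keeps returning errors through its return values (all types here are hypothetical stand-ins; only the Dequeue and DiskFull names come from the discussion):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins; only the Dequeue/DiskFull names appear in
// the actual discussion.
var errDiskFull = errors.New("disk full")

// expected wraps an error so that upper layers treat it as a normal
// user-facing condition rather than an internal assertion failure.
type expected struct{ error }

type diskQueue struct{ full bool }

// Dequeue may flush buffered batches before reading, which is the
// path that can surface DiskFull. The deferred check rewrites the
// error on the way out, so no caller needs to know about DiskFull.
func (q *diskQueue) Dequeue() (err error) {
	defer func() {
		if errors.Is(err, errDiskFull) {
			err = expected{err}
		}
	}()
	if q.full {
		return errDiskFull // hit while flushing buffered batches
	}
	return nil
}

func main() {
	err := (&diskQueue{full: true}).Dequeue()
	_, ok := err.(expected)
	fmt.Println(ok) // true
}
```

With this shape the classification lives in one place per method instead of at every call site, which is the ergonomic win the suggestion is after.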

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


michae2 commented Oct 23, 2025

[question from triage] @yuzefovich wondering if this is worth picking up again?

@yuzefovich yuzefovich removed backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.2.x Flags PRs that need to be backported to 25.2 labels Oct 23, 2025
@yuzefovich yuzefovich marked this pull request as draft October 23, 2025 17:41
@yuzefovich yuzefovich marked this pull request as ready for review November 17, 2025 18:49
@yuzefovich yuzefovich added backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 labels Nov 17, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Nov 17, 2025
@yuzefovich
Member Author

Sorry for the delay on responding to the comments.

I looked into changing how we do error propagation from the disk queue (i.e. changing from an explicit error return argument to the panic-then-catch approach we use elsewhere in the vectorized engine), and I decided not to proceed with it. (The change touched many LOC, it didn't seem safer or cleaner than the current approach, and it seemed not worth pursuing just for this particular case.)

> It's not totally clear to me why you chose these specific locations to check for the DiskFull error and not others (only some of them are clearly Dequeue-related as far as I can tell). I guess it's all PartitionedQueue methods?

I extended the commit message to indicate how the particular locations were chosen. There is really no downside to using the HandleErrorFromDiskQueue helper in place of colexecerror.InternalError, since both propagate errors via panics; and if I missed some spots by mistake, the downside is minor too (the user could still see a scary stack trace, and we'd get a sentry report that we shouldn't).

> Is there ever a case in colexec where you'd want to treat DiskFull errors as an internal error? If not, maybe you should get rid of this helper function and just put the logic inside colexecerror.InternalError().

That's an interesting idea. I do think we always want to treat DiskFull errors as "expected", but for now I'm a bit hesitant to push any complexity or special cases into the colexecerror methods - right now the contract of InternalError is clear: the error will be annotated with a stack trace and result in a sentry report. It's unclear why we'd special-case only DiskFull errors.

> Alternatively, maybe you could add a defer inside the PartitionedQueue functions to check for DiskFull errors so that callers wouldn't need to know to check for it.

The difficulty with this is that we currently propagate all errors out of the disk queue implementation via the return arguments, and we'd have to change how we do error propagation so that we could panic in the defer. I prototyped going in this direction but abandoned the approach.

@yuzefovich yuzefovich requested review from a team and DrewKimball and removed request for a team November 17, 2025 19:02
@yuzefovich yuzefovich added the O-AI-Review-Not-Helpful AI reviewer produced result which was incorrect or unhelpful label Nov 17, 2025
Collaborator

@michae2 michae2 left a comment


:lgtm: Thanks!

@michae2 reviewed 2 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

@yuzefovich
Member Author

TFTRs!

bors r+


craig bot commented Nov 17, 2025

@craig craig bot merged commit e837db7 into cockroachdb:master Nov 17, 2025
25 of 26 checks passed

blathers-crl bot commented Nov 17, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #147132: branch-release-25.3, branch-release-25.4.




Development

Successfully merging this pull request may close these issues.

colexec: v24.1.2: temp storage error is incorrectly propagated as InternalError
