
Conversation


@yuzefovich yuzefovich commented May 22, 2025

We just saw a sentry report that was issued due to an InternalError raised after Dequeue'ing from a disk queue. It's not clear what the error was (since it was redacted), but it might have been a DiskFull error. We already have special handling for it on the Enqueue path, but the Dequeue path can also trigger this error (on the first call to Dequeue after some Enqueue calls, in order to flush the buffered batches), so this commit audits all disk queue methods to use the helper for error propagation.

The only place where we do disk usage accounting is `diskQueue.writeFooterAndFlush`, so I traced which methods could end up calling it (Enqueue and Dequeue, but also Close) and their call sites - this is how the affected places were chosen. Additionally, I didn't want to introduce error propagation via panics where it wasn't there already, so one spot wasn't modified.

Fixes: #147132.

Release note: None

@yuzefovich yuzefovich requested a review from a team as a code owner May 22, 2025 17:48
@yuzefovich yuzefovich requested review from rytaft and removed request for a team May 22, 2025 17:48
@yuzefovich yuzefovich added backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x backport-25.2.x Flags PRs that need to be backported to 25.2 labels May 22, 2025

blathers-crl bot commented May 22, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


@yuzefovich
Member Author

friendly ping @cockroachdb/sql-queries-prs

Collaborator

@rytaft rytaft left a comment


Sorry for missing this! It's not totally clear to me why you chose these specific locations to check for the DiskFull error and not others (only some of them are clearly Dequeue-related as far as I can tell). I guess it's all PartitionedQueue methods?

Is there ever a case in colexec where you'd want to treat DiskFull errors as an internal error? If not, maybe you should get rid of this helper function and just put the logic inside colexecerror.InternalError().

Alternatively, maybe you could add a defer inside the PartitionedQueue functions to check for DiskFull errors so that callers wouldn't need to know to check for it.
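A minimal sketch of that deferred check, under the assumption that the queue keeps returning errors through its return values (all types here are hypothetical stand-ins; only the Dequeue and DiskFull names come from the discussion):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins; only the Dequeue/DiskFull names appear in
// the actual discussion.
var errDiskFull = errors.New("disk full")

// expected wraps an error so that upper layers treat it as a normal
// user-facing condition rather than an internal assertion failure.
type expected struct{ error }

type diskQueue struct{ full bool }

// Dequeue may flush buffered batches before reading, which is the
// path that can surface DiskFull. The deferred check rewrites the
// error on the way out, so no caller needs to know about DiskFull.
func (q *diskQueue) Dequeue() (err error) {
	defer func() {
		if errors.Is(err, errDiskFull) {
			err = expected{err}
		}
	}()
	if q.full {
		return errDiskFull // hit while flushing buffered batches
	}
	return nil
}

func main() {
	err := (&diskQueue{full: true}).Dequeue()
	_, ok := err.(expected)
	fmt.Println(ok) // true
}
```

With this shape the classification lives in one place per method instead of at every call site, which is the ergonomic win the suggestion is after.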

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


michae2 commented Oct 23, 2025

[question from triage] @yuzefovich wondering if this is worth picking up again?

@yuzefovich yuzefovich removed backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.2.x Flags PRs that need to be backported to 25.2 labels Oct 23, 2025
@yuzefovich yuzefovich marked this pull request as draft October 23, 2025 17:41
@yuzefovich yuzefovich marked this pull request as ready for review November 17, 2025 18:49
@yuzefovich yuzefovich added backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 labels Nov 17, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Nov 17, 2025
@yuzefovich
Member Author

Sorry for the delay on responding to the comments.

I looked into changing how we do error propagation from the disk queue (i.e. changing from an explicit error return argument to the panic-then-catch approach we use elsewhere in the vectorized engine), and I decided not to proceed with it. (The change touched many LOC, it didn't seem safer or cleaner than the current approach, and it seemed not worth pursuing just for this particular case.)

> It's not totally clear to me why you chose these specific locations to check for the DiskFull error and not others (only some of them are clearly Dequeue-related as far as I can tell). I guess it's all PartitionedQueue methods?

I extended the commit message to indicate how the particular locations were chosen. There is really no downside to using the HandleErrorFromDiskQueue helper in place of colexecerror.InternalError, since both propagate errors via panics; and if I missed some spots by mistake, the downside is minor too (the user could still see a scary stack trace, and we'd get a sentry report that we shouldn't).

> Is there ever a case in colexec where you'd want to treat DiskFull errors as an internal error? If not, maybe you should get rid of this helper function and just put the logic inside colexecerror.InternalError().

That's an interesting idea. I do think we always want to treat DiskFull errors as "expected", but for now I'm a bit hesitant to push any complexity or special cases into the colexecerror methods - right now the contract of InternalError is clear: the error will be annotated with a stack trace and result in a sentry report. It's unclear why we'd special-case only DiskFull errors.

> Alternatively, maybe you could add a defer inside the PartitionedQueue functions to check for DiskFull errors so that callers wouldn't need to know to check for it.

The difficulty with this is that we currently propagate all errors out of the disk queue implementation via the return arguments, and we'd have to change how we do error propagation so that we could panic in the defer. I prototyped going in this direction but abandoned the approach.

@yuzefovich yuzefovich requested review from a team and DrewKimball and removed request for a team November 17, 2025 19:02
@yuzefovich yuzefovich added the O-AI-Review-Not-Helpful AI reviewer produced result which was incorrect or unhelpful label Nov 17, 2025
Collaborator

@michae2 michae2 left a comment


:lgtm: Thanks!

@michae2 reviewed 2 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

@yuzefovich
Member Author

TFTRs!

bors r+


craig bot commented Nov 17, 2025

@craig craig bot merged commit e837db7 into cockroachdb:master Nov 17, 2025
25 of 26 checks passed

blathers-crl bot commented Nov 17, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #147132: branch-release-25.3, branch-release-25.4.




Development

Successfully merging this pull request may close these issues.

colexec: v24.1.2: temp storage error is incorrectly propagated as InternalError
