colexecdisk: propagate DiskFull errors as expected #147168
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
friendly ping @cockroachdb/sql-queries-prs
rytaft
left a comment
Sorry for missing this! It's not totally clear to me why you chose these specific locations to check for the DiskFull error and not other locations (only some of them are clearly Deque related as far as I can tell). I guess it's all PartitionedQueue methods?
Is there ever a case in colexec where you'd want to treat DiskFull errors as an internal error? If not, maybe you should get rid of this helper function and just put the logic inside colexecerror.InternalError().
Alternatively, maybe you could add a defer inside the PartitionedQueue functions to check for DiskFull errors so that callers wouldn't need to know to check for it.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
[question from triage] @yuzefovich wondering if this is worth picking up again?
We just saw a sentry report that was issued due to InternalError raised after Dequeue'ing from a disk queue. It's not clear what the error was (since it was redacted), but it might have been a DiskFull error. We already have special handling for it on the Enqueue path, but the Dequeue path can also trigger this error (on the first call to Dequeue after some Enqueue calls - in order to flush the buffered batches), so this commit audits all disk queue methods to use the helper for error propagation. The only place where we do disk usage accounting is `diskQueue.writeFooterAndFlush`, so I traced which methods could end up calling it (both Enqueue and Dequeue, but also Close) and their call sites - this is how the affected places were chosen. Additionally, I didn't want to introduce the error propagation via panics if it wasn't there already, so one spot wasn't modified. Release note: None
de85006 to a077113
Potential Bug(s) Detected
The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
Sorry for the delay in responding to the comments. I looked into changing how we do error propagation from the disk queue (i.e. switching from an explicit error return argument to the panic-then-catch approach we use elsewhere in the vectorized engine), and I decided not to proceed with that approach. (The change touched many LOC, didn't seem safer or cleaner than the current approach, and didn't seem worth pursuing just for this particular case.)
I extended the commit message to indicate how the particular locations were chosen. There is really no downside in using the
That's an interesting idea. I do think that we always want to treat DiskFull errors as "expected", but for now I'm a bit hesitant on pushing any complexity / special cases into
The difficulty with this is that we currently propagate all errors out of the disk queue implementation via the return arguments, and we'd have to change how we do error propagation so that we could add a panic-throw in the defer. I prototyped going in this direction but abandoned the approach.
michae2
left a comment
@michae2 reviewed 2 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)
TFTRs! bors r+
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #147132: branch-release-25.3, branch-release-25.4. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
We just saw a sentry report that was issued due to InternalError raised after Dequeue'ing from a disk queue. It's not clear what the error was (since it was redacted), but it might have been a DiskFull error. We already have special handling for it on the Enqueue path, but the Dequeue path can also trigger this error (on the first call to Dequeue after some Enqueue calls - in order to flush the buffered batches), so this commit audits all disk queue methods to use the helper for error propagation.
The only place where we do disk usage accounting is `diskQueue.writeFooterAndFlush`, so I traced which methods could end up calling it (both Enqueue and Dequeue, but also Close) and their call sites - this is how the affected places were chosen. Additionally, I didn't want to introduce the error propagation via panics if it wasn't there already, so one spot wasn't modified.
Fixes: #147132.
Release note: None