-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add snapshot WebSocket API #583
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still haven't tested by myself yet but looks pretty straight forward and promising.
Left some comments for things I noticed so far. I'm planning to do more review next week 👀
Since this is a new connection the server doesn't know the collector was previously paused, so instead wait for the server to re-pause this collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works as advertised, and reconnects automatically, but I did run into a panic on Ctrl+C:
2024/08/16 14:03:51 V [server1] Uploading snapshot to websocket
^C2024/08/16 14:03:58 I Exiting...
2024/08/16 14:03:58 V Log file fsnotify watcher received stop signal
2024/08/16 14:03:58 E [server1] Error closing websocket: close tcp 127.0.0.1:38368->127.0.0.1:5200: use of closed network connection
2024/08/16 14:03:58 V Stopping log tail for /var/log/postgresql/postgresql-15-main.log (stop requested)
2024/08/16 14:03:58 V Log file fsnotify watcher received stop signal
2024/08/16 14:03:58 V Stopping log tail for /var/log/postgresql/postgresql-16-main.log (stop requested)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe487ca]
goroutine 9032 [running]:
github.com/gorilla/websocket.(*Conn).Close(...)
/home/maciek/code/collector/vendor/github.com/gorilla/websocket/conn.go:352
github.com/pganalyze/collector/runner.connect.func1()
/home/maciek/code/collector/runner/websocket.go:63 +0x16a
created by github.com/pganalyze/collector/runner.connect in goroutine 126
/home/maciek/code/collector/runner/websocket.go:59 +0x78f
In passing, increase the default websocket buffer size from 4KB to 10KB
It looks like the test suite failed because it's trying to open a WebSocket connection to the production servers, but that side of the code hasn't been merged yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this with https://github.com/pganalyze/pganalyze/pull/4234 and it worked fine. I left some minor comments on the code, but it looks fine, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good - a few minor things pointed out, but seems close to be ready to merge.
gorilla/websocket had to be updated: gorilla/websocket#877
Okay, I believe all of the review comments have been addressed. I also fixed the issue where quitting the collector through Control-C showed a websocket warning. |
This ensures we don't try connecting to the websocket while doing a log test
@@ -397,7 +400,7 @@ func main() { | |||
CollectSystemInformation: !noSystemInformation, | |||
StateFilename: stateFilename, | |||
WriteStateUpdate: (!dryRun && !dryRunLogs && !testRun) || forceStateUpdate, | |||
ForceEmptyGrant: dryRun || dryRunLogs || benchmark, | |||
ForceEmptyGrant: dryRun || dryRunLogs || testRunLogs || benchmark, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed for the test suite to pass (otherwise the test suite would try and fail to establish a websocket connection). It's not clear from the main.go
documentation what the difference is between --test-logs
and --dry-run-logs
. Maybe one of them should be dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so I think the different is:
--test-logs
is supposed to get a log grant, and then either do a log download or log tail--dry-run-logs
is supposed to work without an API server, so skips the grant
However, I'm not sure if the distinction matters.
Really the main use case is --test-logs
, which we document, to verify whether the log setup is correct. I suppose doing a log grant could be beneficial for Enterprise environments (since I think it would trigger the server side to check for correct LOG_ENCRYPTION_KEY / bucket setup?). Though that's soon to be removed anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the right course of action is here. Is there a blocker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a blocker. We might want to remove --dry-run-logs
after #597 is merged, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I also don't think this is a blocker. I haven't seen much benefit of --dry-run-logs
, so I'm good with dropping it.
That said, the regular --dry-run
is useful though (it helps get clarity on what data the collector sent), and I've used that a few times for debugging with a customer, so we shouldn't be messing with that, I think :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't explain my reasoning here. I'm thinking that there's little benefit to actually submitting test snapshots to the API, so in dropping --dry-run-logs
my intention was to make --test-logs
a dry run command. Then we only need to contact the API for a top-level --test
in order to check for a valid grant config.
Not sure where exactly, but it appears there is a problem with the fallback, at least in my local test right now (on the latest tip of this branch) I got the following error, running against the regular server without websocket support:
(note the "Error - can't upload without valid S3 grant", which I don't get when running on |
That was actually a bug from removing an existing write to |
Ah, makes sense - now we know why its needed :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the ForceEmptyGrant
change is okay (comment inline), this looks good.
@@ -397,7 +400,7 @@ func main() { | |||
CollectSystemInformation: !noSystemInformation, | |||
StateFilename: stateFilename, | |||
WriteStateUpdate: (!dryRun && !dryRunLogs && !testRun) || forceStateUpdate, | |||
ForceEmptyGrant: dryRun || dryRunLogs || benchmark, | |||
ForceEmptyGrant: dryRun || dryRunLogs || testRunLogs || benchmark, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the right course of action is here. Is there a blocker?
This introduces a persistent WebSocket connection for all server communication. Additional details:
grant_logs
is still used for every log snapshot, since it's needed to generate a S3 upload URL. This will be deprecated in a separate PR.ServerMessage
s.state.GrantConfig
struct has been replaced with the same structure inside the protobuf, so the server and collector stay in sync.ServerMessage
includes aPause
message that will let us disable duplicate collector instances for the same server, and anExplainRun
message for the Explain Upload feature that's currently in-progress.