Add snapshot WebSocket API #583

Merged
merged 33 commits into from
Oct 3, 2024

Member

@seanlinsley seanlinsley commented Aug 15, 2024

This introduces a persistent WebSocket connection for all server communication. Additional details:

  • If unable to start a WebSocket connection, the collector falls back to the legacy API.
  • grant_logs is still used for every log snapshot, since it's needed to generate an S3 upload URL. This will be deprecated in a separate PR.
  • All messages are zlib-compressed protobuf. The collector sends snapshots to the server, and the server sends ServerMessages.
  • The existing state.GrantConfig struct has been replaced with the same structure inside the protobuf, so the server and collector stay in sync.
  • For future use, ServerMessage includes a Pause message that will let us disable duplicate collector instances for the same server, and an ExplainRun message for the Explain Upload feature that's currently in progress.
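As a rough illustration of the "zlib-compressed protobuf" framing described above, here is a minimal sketch using only the Go standard library. The function names compressMessage and decompressMessage are hypothetical, and the payload is an opaque byte slice standing in for protobuf-encoded bytes; this is not the collector's actual code.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// compressMessage zlib-compresses an already-serialized message body.
// In the collector the body would be protobuf-encoded; here it is treated
// as an opaque byte slice (assumption for illustration).
func compressMessage(body []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(body); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the checksum.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressMessage reverses compressMessage.
func decompressMessage(frame []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(frame))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	body := []byte("stand-in for protobuf-encoded snapshot bytes")
	frame, err := compressMessage(body)
	if err != nil {
		panic(err)
	}
	out, err := decompressMessage(frame)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(body, out))
}
```

Both directions (snapshots to the server, ServerMessages to the collector) could share the same pair of helpers, since the framing is described as identical for all messages.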

@seanlinsley seanlinsley requested a review from a team August 15, 2024 16:25
Contributor

@keiko713 keiko713 left a comment

I haven't tested this myself yet, but it looks pretty straightforward and promising.
I left some comments on things I've noticed so far. I'm planning to do more review next week 👀

Since this is a new connection the server doesn't know the collector was
previously paused, so instead wait for the server to re-pause this collector.
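
The reconnect behavior in this commit message could look roughly like the following. It is a sketch under assumptions: the pauseState type and its methods are hypothetical names, not taken from runner/websocket.go. The local pause flag is cleared on every new connection and only set again when the server sends another Pause.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pauseState tracks whether the server has asked this collector to pause.
// The type and method names are hypothetical.
type pauseState struct {
	paused atomic.Bool
}

// onConnect runs after every successful (re)connect. The server has no
// record of an earlier Pause on a new connection, so the local flag is cleared.
func (p *pauseState) onConnect() { p.paused.Store(false) }

// onPauseMessage applies a Pause instruction received from the server.
func (p *pauseState) onPauseMessage(pause bool) { p.paused.Store(pause) }

// shouldSend reports whether snapshots may be sent right now.
func (p *pauseState) shouldSend() bool { return !p.paused.Load() }

func main() {
	var p pauseState
	p.onConnect()
	p.onPauseMessage(true)
	fmt.Println("after pause, may send:", p.shouldSend())
	p.onConnect() // reconnect: wait for the server to re-pause
	fmt.Println("after reconnect, may send:", p.shouldSend())
}
```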
Contributor

@msakrejda msakrejda left a comment

Works as advertised, and reconnects automatically, but I did run into a panic on Ctrl+C:

2024/08/16 14:03:51 V [server1] Uploading snapshot to websocket
^C2024/08/16 14:03:58 I Exiting...
2024/08/16 14:03:58 V Log file fsnotify watcher received stop signal
2024/08/16 14:03:58 E [server1] Error closing websocket: close tcp 127.0.0.1:38368->127.0.0.1:5200: use of closed network connection
2024/08/16 14:03:58 V Stopping log tail for /var/log/postgresql/postgresql-15-main.log (stop requested)
2024/08/16 14:03:58 V Log file fsnotify watcher received stop signal
2024/08/16 14:03:58 V Stopping log tail for /var/log/postgresql/postgresql-16-main.log (stop requested)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe487ca]

goroutine 9032 [running]:
github.com/gorilla/websocket.(*Conn).Close(...)
	/home/maciek/code/collector/vendor/github.com/gorilla/websocket/conn.go:352
github.com/pganalyze/collector/runner.connect.func1()
	/home/maciek/code/collector/runner/websocket.go:63 +0x16a
created by github.com/pganalyze/collector/runner.connect in goroutine 126
	/home/maciek/code/collector/runner/websocket.go:59 +0x78f
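
The trace shows Close being called on a nil connection from a goroutine during shutdown. One common way to guard against that class of bug is sketched below; it is not necessarily how the PR fixed it. The connHolder type is hypothetical, and io.Closer stands in for the WebSocket connection type so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// connHolder guards a connection that may be absent (never established, or
// already torn down) when shutdown runs. Hypothetical type for illustration.
type connHolder struct {
	mu   sync.Mutex
	conn io.Closer
}

// set stores the current connection. Callers should not pass a typed nil
// pointer, since that would produce a non-nil interface value.
func (h *connHolder) set(c io.Closer) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.conn = c
}

// close closes the connection at most once and is a no-op when there is none.
func (h *connHolder) close() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.conn == nil {
		return nil
	}
	err := h.conn.Close()
	h.conn = nil
	return err
}

// fakeConn counts Close calls so the behavior can be observed.
type fakeConn struct{ closed int }

func (f *fakeConn) Close() error { f.closed++; return nil }

func main() {
	h := &connHolder{}
	fmt.Println("close without connection:", h.close())
	fc := &fakeConn{}
	h.set(fc)
	h.close()
	h.close() // second call is a no-op
	fmt.Println("close calls seen by connection:", fc.closed)
}
```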

@seanlinsley
Member Author

It looks like the test suite failed because it's trying to open a WebSocket connection to the production servers, but that side of the code hasn't been merged yet.

Contributor

@msakrejda msakrejda left a comment

I tested this with https://github.com/pganalyze/pganalyze/pull/4234 and it worked fine. I left some minor comments on the code, but it looks fine, too.

Member

@lfittl lfittl left a comment

Overall looks good - a few minor things pointed out, but this seems close to being ready to merge.

@seanlinsley
Member Author

Okay, I believe all of the review comments have been addressed. I also fixed the issue where quitting the collector through Control-C showed a websocket warning.

@@ -397,7 +400,7 @@ func main() {
 	CollectSystemInformation: !noSystemInformation,
 	StateFilename: stateFilename,
 	WriteStateUpdate: (!dryRun && !dryRunLogs && !testRun) || forceStateUpdate,
-	ForceEmptyGrant: dryRun || dryRunLogs || benchmark,
+	ForceEmptyGrant: dryRun || dryRunLogs || testRunLogs || benchmark,
Member Author

@seanlinsley seanlinsley Sep 30, 2024

This was needed for the test suite to pass (otherwise the test suite would try and fail to establish a websocket connection). It's not clear from the main.go documentation what the difference is between --test-logs and --dry-run-logs. Maybe one of them should be dropped?

Member

Hmm, so I think the difference is:

  • --test-logs is supposed to get a log grant, and then either do a log download or log tail
  • --dry-run-logs is supposed to work without an API server, so skips the grant

However, I'm not sure if the distinction matters.

Really the main use case is --test-logs, which we document, to verify whether the log setup is correct. I suppose doing a log grant could be beneficial for Enterprise environments (since I think it would trigger the server side to check for correct LOG_ENCRYPTION_KEY / bucket setup?). Though that's soon to be removed anyway.
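
To make the effect of the changed line concrete, here is a small sketch of the updated ForceEmptyGrant expression. The runFlags struct is hypothetical, and the mapping of --test-logs to testRunLogs and --dry-run-logs to dryRunLogs is an assumption based on the variable names in the diff.

```go
package main

import "fmt"

// runFlags groups the boolean flags that appear in the diff. The struct is
// hypothetical; main.go may hold these as separate variables.
type runFlags struct {
	dryRun      bool
	dryRunLogs  bool // assumed to correspond to --dry-run-logs
	testRunLogs bool // assumed to correspond to --test-logs
	benchmark   bool
}

// forceEmptyGrant reproduces the updated expression from the diff: with this
// change, a --test-logs run also skips fetching a grant from the server.
func forceEmptyGrant(f runFlags) bool {
	return f.dryRun || f.dryRunLogs || f.testRunLogs || f.benchmark
}

func main() {
	fmt.Println("--test-logs:", forceEmptyGrant(runFlags{testRunLogs: true}))
	fmt.Println("--dry-run-logs:", forceEmptyGrant(runFlags{dryRunLogs: true}))
	fmt.Println("no flags:", forceEmptyGrant(runFlags{}))
}
```

With the new expression the two log flags behave the same with respect to the grant, which is why the question of dropping one of them comes up.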

Contributor

I'm not sure what the right course of action is here. Is there a blocker?

Member Author

I don't think this is a blocker. We might want to remove --dry-run-logs after #597 is merged, though.

Member

@lfittl lfittl Oct 2, 2024

FWIW, I also don't think this is a blocker. I haven't seen much benefit of --dry-run-logs, so I'm good with dropping it.

That said, the regular --dry-run is useful (it helps get clarity on what data the collector sent), and I've used it a few times for debugging with a customer, so I don't think we should be messing with that :)

Member Author

@seanlinsley seanlinsley Oct 5, 2024

Sorry, I didn't explain my reasoning here. I'm thinking that there's little benefit to actually submitting test snapshots to the API, so in dropping --dry-run-logs my intention was to make --test-logs a dry run command. Then we only need to contact the API for a top-level --test in order to check for a valid grant config.

@lfittl
Member

lfittl commented Oct 1, 2024

> If unable to start a WebSocket connection, the collector falls back to the legacy API.

Not sure where exactly, but there appears to be a problem with the fallback. In my local test just now (on the latest tip of this branch), running against the regular server without websocket support, I got the following error:

% ./pganalyze-collector --test
2024/10/01 09:42:57 I Running collector test with pganalyze-collector 0.58.0
2024/10/01 09:42:57 W [server2] Checking log_line_prefix: database (%d) and user (%u) not found: pganalyze will not be able to correctly classify some log lines
2024/10/01 09:42:57 I [server2] Testing statistics collection...
2024/10/01 09:42:57 I [server1] Testing statistics collection...
2024/10/01 09:42:57 W [server1] Error starting websocket: websocket: bad handshake &{404 Not Found 404 HTTP/1.1 1 1 map[Content-Length:[36] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 01 Oct 2024 13:42:57 GMT]] {0x140005d40c0} 36 [] false false map[] 0x14000764d00 <nil>}
2024/10/01 09:42:57 W [server2] Error starting websocket: websocket: bad handshake &{404 Not Found 404 HTTP/1.1 1 1 map[Content-Length:[36] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 01 Oct 2024 13:42:57 GMT]] {0x140004d2b70} 36 [] false false map[] 0x14000578400 <nil>}
2024/10/01 09:42:57 I [server2]   Test collection successful for PostgreSQL 16.3 on aarch64-darwin, compiled by clang-14.0.0, 64-bit
2024/10/01 09:42:57 E [server2] Error uploading snapshot: Error - can't upload without valid S3 grant
2024/10/01 09:42:57 E [server2] Could not process server: Error - can't upload without valid S3 grant
2024/10/01 09:42:58 I [server1]   Test collection successful for PostgreSQL 16.3 (Debian 16.3-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024/10/01 09:42:58 E [server1] Error uploading snapshot: Error - can't upload without valid S3 grant
2024/10/01 09:42:58 E [server1] Could not process server: Error - can't upload without valid S3 grant
2024/10/01 09:42:58 I [server2] Testing activity snapshots...
2024/10/01 09:42:58 I [server1] Testing activity snapshots...
2024/10/01 09:42:58 I [server2]   Test snapshot successful
2024/10/01 09:42:58 I [server1]   Test snapshot successful

(note the "Error - can't upload without valid S3 grant", which I don't get when running on main)

@seanlinsley
Member Author

That was actually a bug from removing an existing write to server.Grant that you asked for in the previous review, which I've added back in 6f2bb88. There's no harm in potentially writing to server.Grant multiple times, so I think it's fine to leave the code as-is.
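
A simplified sketch of why the write matters on the fallback path follows. All types and function names here (grant, server, prepareUpload, upload, dialFails, legacyGrantOK) are hypothetical stand-ins, not the collector's real API: when the WebSocket handshake fails, the grant fetched from the legacy API has to be stored on the server struct, or a later upload has no valid S3 grant to use.

```go
package main

import (
	"errors"
	"fmt"
)

// grant and server are hypothetical stand-ins for the collector's state types.
type grant struct {
	Valid bool
	S3URL string
}

type server struct {
	Grant        grant
	UseWebSocket bool
}

// prepareUpload tries the WebSocket first and falls back to the legacy API.
func prepareUpload(s *server, dial func() error, legacyGrant func() (grant, error)) error {
	if err := dial(); err == nil {
		s.UseWebSocket = true
		return nil
	}
	g, err := legacyGrant()
	if err != nil {
		return err
	}
	// Without this write, the legacy upload below fails for lack of a grant.
	// Writing it more than once is harmless.
	s.Grant = g
	return nil
}

// upload mimics the legacy path's precondition on a valid grant.
func upload(s *server) error {
	if s.UseWebSocket {
		return nil
	}
	if !s.Grant.Valid {
		return errors.New("can't upload without valid S3 grant")
	}
	return nil
}

// dialFails simulates a server without WebSocket support.
func dialFails() error { return errors.New("websocket: bad handshake") }

// legacyGrantOK simulates a successful legacy grant request; the URL is a placeholder.
func legacyGrantOK() (grant, error) {
	return grant{Valid: true, S3URL: "https://example.invalid/upload"}, nil
}

func main() {
	s := &server{}
	if err := prepareUpload(s, dialFails, legacyGrantOK); err != nil {
		panic(err)
	}
	fmt.Println("upload error:", upload(s))
}
```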

@lfittl
Member

lfittl commented Oct 1, 2024

> That was actually a bug from removing an existing write to server.Grant that you asked for in the previous review, which I've added back in 6f2bb88. There's no harm in potentially writing to server.Grant multiple times, so I think it's fine to leave the code as-is.

Ah, makes sense - now we know why it's needed :)

Contributor

@msakrejda msakrejda left a comment

Assuming the ForceEmptyGrant change is okay (comment inline), this looks good.

@seanlinsley seanlinsley mentioned this pull request Oct 1, 2024
2 tasks
@seanlinsley seanlinsley merged commit d7a5290 into main Oct 3, 2024
3 checks passed
@seanlinsley seanlinsley deleted the websocket branch October 3, 2024 00:36
4 participants