Firestore backend improvements #28473
Conversation
The backend test suite was not validating that simultaneous CAS operations result in only one attempt succeeding. The test now runs multiple concurrent CAS operations and ensures that only a single one succeeds. This gap allowed the Firestore backend to pass the compliance test without performing CAS atomically.
1) CAS now utilizes a transaction to ensure the operation is atomic

The original implementation did not use transactions, which violated the atomicity guarantees of the CAS operation. The backend compliance test was able to catch this once it was updated to run concurrent CAS operations.

2) Update is limited to updating a value

The original implementation of Update was actually doing a get and then an upsert. However, nothing prevents a delete from occurring between the get and the upsert, which means Update would upsert the value instead of failing. Instead of get-then-upsert, we now update the document using the `(firestore.DocumentRef) Update` method.

3) Watching items from the collection filters out any audit events

If Teleport is configured to use the same collection for backend state and audit events, the collection watcher ends up consuming all audit events as empty backend items. To avoid this, the watcher now filters out any documents that have an empty key. Since it is not possible for backend resources to be written without a key, this only excludes audit events, which have a different schema.

4) SearchEvents now filters out backend resources

Similar to the above, the Firestore events implementation now excludes any documents that have an empty session ID, to prevent backend resources from being included in queries for audit events when the collection is shared.
espadolini left a comment:
Instead of increasing the expiry interval to make the tests work, should we add an optional method to backends to trigger a manual expiry deletion?
	require.NoError(t, err)
	require.Equal(t, []byte("2"), out.Value)

	for i := 0; i < 10; i++ {
Did you have a chance to run this test before and after your changes? From what I remember the Firestore emulator is strongly consistent, so this will never fail in CI, but it can in production.
The very first commit of this PR was solely changing this test to capture the CAS failure.
Also we don't actually run Firestore tests in CI 😢
@@ -110,73 +130,87 @@ type Constructor func(options ...ConstructionOption) (backend.Backend, clockwork
// backend under test.
func RunBackendComplianceSuite(t *testing.T, newBackend Constructor) {
Can t.Parallel() also be added to the test itself?
	}

-	clock := clockwork.NewFakeClock()
+	clock := clockwork.NewRealClock()
	"collection_name":                       collection,
	"project_id":                            projectID,
	"endpoint":                              endpoint,
	"purge_expired_documents_poll_interval": 300 * time.Millisecond,
Can you explain why 300ms was chosen?
We could probably adjust this a bit if needed, but I took the value from a similar setting in the dynamodb tests: https://github.com/gravitational/teleport/blob/master/lib/backend/dynamo/dynamodbbk_test.go#L52
@rosstimothy See the table below for backport results.
if _, err := docSnap.Ref.Delete(ctx, firestore.LastUpdateTime(docSnap.UpdateTime)); err != nil && status.Code(err) == codes.FailedPrecondition {
	// If the document has been updated, then attempt one additional get to see if the
	// resource was updated and is no longer expired.
	docSnap, err := b.svc.Collection(b.CollectionName).Doc(docSnap.Ref.ID).Get(ctx)
	if err != nil {
		return nil, ConvertGRPCError(err)
	}
	r, err := newRecordFromDoc(docSnap)
	if err != nil {
		return nil, trace.Wrap(err)
	}

	if !r.isExpired(b.clock.Now()) {
		values = append(values, r.backendItem())
	}
low-priority nit: this doesn't violate the backend model, but there may be cases where it would result in what I'd call "annoying" behavior:
Say that I have some range `R`, and the following sequence of operations happens:

1. An initial range request occurs across range `R` at time `T1`.
2. A write happens to key `R1` at time `T2`.
3. A write happens to key `R2` at time `T3`.
4. A delete of the old value of `R2` is attempted by the original range request, falling back to a read of `R2` at time `T4`.

In this sequence of events, the resulting range contains the value of `R2` as set at `T3`, but omits the value of `R1` as set at `T2`.
This is ultimately fine, and can happen for any number of legitimate reasons, since many of our backends don't guarantee that writes to different keys are well-ordered in the event stream, and/or because range requests are evaluated sequentially from the "latest" state, so values that appear later in the range may also be "newer".
However, all else being equal, fewer inconsistencies of this nature are probably better, and in this case I think we could have fewer network round-trips, simpler logic, and more consistency by just partially or completely omitting this logic.
We already treat expiry as a lazy/best-effort affair, and given the limitations of our backends we're going to keep doing that at least to some extent. I'd suggest either completely omitting optimistic cleanup, or simply returning the expired value if optimistic cleanup fails for any reason.
Additionally, it might be worth putting a per-request cap on the amount of optimistic cleanup we're willing to do... if a cluster was offline for a while, we might start up to a backend state with 20_000 expired nodes. We don't really want a GetRange call waiting on 20_000 API requests I think. Better to just let those 20_000 nodes get whittled down asynchronously.
Anyhow, like I said... not actually a violation of the backend model, just a potential annoyance, so take it or leave it.
* Test concurrent compare and swaps
* Firestore backend improvements
* speed up backend test suite
* conditionally delete expired items on get
* fix: cleanup tests
Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.
The change to use `Update` in #28473 caused a 0 value to be written to the expires field of the document. When using `Create` or `Set`, a zero expiry value would result in an empty expires field for that document. Now that the expires field contained a 0 value, the purgeExpiredDocuments loop was detecting the document as needing to be removed. To prevent removing resources that have no expiry, the purge loop now explicitly ignores any resources that have a 0 expiry value. Fixes #30236
Partially addresses #28118