Firestore backend improvements by rosstimothy · Pull Request #28473 · gravitational/teleport

rosstimothy · 2023-06-29T15:40:20Z

CAS now utilizes a transaction to ensure the operation is atomic

The original implementation did not use transactions which violated
the atomic guarantees of the CAS operation. The backend compliance
test was able to catch this when it was updated to run concurrent
CAS operations.

Update is limited to updating a value

The original implementation of Update was actually doing a get and
then upsert. However, there are no guarantees that prevent a delete
from occurring between get and upsert, which means Update would
upsert the value instead of failing. Instead of get and then upsert
we now update the document using the (firestore.DocumentRef) Update
method.

Watching items from the collection filters out any audit events

If Teleport is configured to use the same collection for backend state
and audit events the collection watcher ends up consuming all audit
events as empty backend items. To avoid this the watcher is now filtering
out any collections which have an empty key since it is not possible
for backend resources to be written without a key this will only
exclude audit events which have a different schema.

SearchEvents now filters out backend resources

Similar to above, the Firestore events implementation now excludes
any documents which have an empty session id to prevent backend
resources from getting included in queries for audit events if the
collection is being shared.

Partially addresses #28118

The backend test suite was not validating that simultaneous CAS operations result in only one attempt succeeding. The test now runs multiple concurrent CAS operations and ensures that only a single operation succeeds. This shortcoming with the test allowed the Firestore backend to pass the compliance test while not perfoming CAS in an atomic manner.

1) CAS now utilizes a transaction to ensure the operation is atomic The original implementation did not use transactions which violated the atomic guarantees of the CAS operation. The backend compliance test was able to catch this when it was updated to run concurrent CAS opertations. 2) Update is limited to updating a value The original implementation of Update was actually doing a get and then upsert. However, there are no guarantees that prevent a delete from occurring between get and upsert, which means Update would upsert the value instead of failing. Instead of get and then upsert we now update the document using the (firestore.DocumentRef) Update method. 3) Watching items from the collection filters out any audit events If Teleport is configured to use the same collection for backend state and audit events the collection watcher ends up consuming all audit events as empty backend items. To avoid this the watcher is now filtering out any collections which have an empty key since it is not possible for backend resources to be written without a key this will only exclude audit events which have a different schema. 4) SearchEvents now filters out backend resources Similar to above, the Firestore events implementation now excludes any documents which have an empty session id to prevent backend resources from getting included in queries for audit events if the collection is being shared.

espadolini

Instead of increasing the expiry interval to make the tests work, should we add an optional method to backends to trigger a manual expiry deletion?

rosstimothy · 2023-07-05T15:14:50Z

PTAL @fspmarshall @jakule @GavinFrazar

jakule · 2023-07-05T18:16:48Z

 	require.NoError(t, err)
 	require.Equal(t, []byte("2"), out.Value)
+
+	for i := 0; i < 10; i++ {


Did you have a chance to run this test before and after your changes? From what I remember Firestore emulator is strongly consistent, so this will never fail in CI, but it can in production.

The very first commit of this PR was solely changing this test to capture the CAS failure.

Also we don't actually run Firestore tests in CI 😢

jakule · 2023-07-05T18:17:52Z

@@ -110,73 +130,87 @@ type Constructor func(options ...ConstructionOption) (backend.Backend, clockwork
 // backend under test.
 func RunBackendComplianceSuite(t *testing.T, newBackend Constructor) {


Can t.Parallel() also be added to the test itself?

See #28473 (comment)

jakule · 2023-07-05T18:18:19Z

 		}

-		clock := clockwork.NewFakeClock()
+		clock := clockwork.NewRealClock()


jakule · 2023-07-05T18:20:00Z

+		"collection_name":                       collection,
+		"project_id":                            projectID,
+		"endpoint":                              endpoint,
+		"purge_expired_documents_poll_interval": 300 * time.Millisecond,


Can you add explain why 300ms?

We could probably adjust this a bit if needed but I took the value from a similarish value in the dynamodb tests https://github.com/gravitational/teleport/blob/master/lib/backend/dynamo/dynamodbbk_test.go#L52

public-teleport-github-review-bot · 2023-07-05T21:43:29Z

@rosstimothy See the table below for backport results.

Branch	Result
branch/v11	Create PR
branch/v12	Create PR
branch/v13	Create PR

fspmarshall · 2023-07-05T21:51:52Z

+			if _, err := docSnap.Ref.Delete(ctx, firestore.LastUpdateTime(docSnap.UpdateTime)); err != nil && status.Code(err) == codes.FailedPrecondition {
+				// If the document has been updated, then attempt one additional get to see if the
+				// resource was updated and is no longer expired.
+				docSnap, err := b.svc.Collection(b.CollectionName).Doc(docSnap.Ref.ID).Get(ctx)
+				if err != nil {
+					return nil, ConvertGRPCError(err)
+				}
+				r, err := newRecordFromDoc(docSnap)
+				if err != nil {
+					return nil, trace.Wrap(err)
+				}
+
+				if !r.isExpired(b.clock.Now()) {
+					values = append(values, r.backendItem())
+				}


low-priority nit: this doesn't violate the backend model, but there may be cases where it would result in what I'd call "annoying" behavior:

Say that I have some range R, and the following sequence of operations happens:

An initial range request occurs across range R at time T1.

A write happens to key R1 at time T2.

A write happens to key R2 at time T3.

A delete of the old value of R2 is attempted by the original range request, falling back to a read of R2 at time T4.

In this sequence of events, the resulting range contains the value of R2 as set at T3, but omits the value of R1 as set at T2.

This is ultimately fine, and can happen for any number of legitimate reasons, since many of our backends either don't guarantee that writes to different keys are well-ordered in the event stream, and/or because range requests are evaluated sequentially from the "latest" state, so values that appear later in the range may also be "newer".

However, all else being equal, less inconsistencies of this nature are probably better, and in this case I think we could have less network round-trips, simpler logic, and more consistency by just partially or completely omitting this logic.

We already treat expiry as a lazy/best-effort affair, and given the limitations of our backends we're going to keep doing that at least to some extent. I'd suggest either completely omitting optimistic cleanup, or simply returning the expired value if optimistic cleanup fails for any reason.

Additionally, it might be worth putting a per-request cap on the amount of optimistic cleanup we're willing to do... if a cluster was offline for a while, we might start up to a backend state with 20_000 expired nodes. We don't really want a GetRange call waiting on 20_000 API requests I think. Better to just let those 20_000 nodes get whittled down asynchronously.

Anyhow, like I said... not actually a violation of the backend model, just a potential annoyance, so take it or leave it.

* Test concurrent compare and swaps The backend test suite was not validating that simultaneous CAS operations result in only one attempt succeeding. The test now runs multiple concurrent CAS operations and ensures that only a single operation succeeds. This shortcoming with the test allowed the Firestore backend to pass the compliance test while not perfoming CAS in an atomic manner. * Firestore backend improvements 1) CAS now utilizes a transaction to ensure the operation is atomic The original implementation did not use transactions which violated the atomic guarantees of the CAS operation. The backend compliance test was able to catch this when it was updated to run concurrent CAS opertations. 2) Update is limited to updating a value The original implementation of Update was actually doing a get and then upsert. However, there are no guarantees that prevent a delete from occurring between get and upsert, which means Update would upsert the value instead of failing. Instead of get and then upsert we now update the document using the (firestore.DocumentRef) Update method. 3) Watching items from the collection filters out any audit events If Teleport is configured to use the same collection for backend state and audit events the collection watcher ends up consuming all audit events as empty backend items. To avoid this the watcher is now filtering out any collections which have an empty key since it is not possible for backend resources to be written without a key this will only exclude audit events which have a different schema. 4) SearchEvents now filters out backend resources Similar to above, the Firestore events implementation now excludes any documents which have an empty session id to prevent backend resources from getting included in queries for audit events if the collection is being shared. * speed up backend test suite * conditionally delete expired items on get * fix: cleanup tests

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

The change to use `Update` in #28473 caused a 0 value to be written to the expires field of the document. When using `Create` or `Set` a zero expiry value would result in an empty expires field for that document. Now that the expires column was containing a 0 value the purgeExpiredDocuments loop was detecting the document as needing to be removed. To prevent removing resources that have no expiry the purge loop now explicitly ignores any resources that have a 0 expiry value. Fixes #30236

rosstimothy added 2 commits June 29, 2023 12:42

rosstimothy force-pushed the tross/fix_firestore branch from 477cc94 to 9edbc43 Compare June 29, 2023 16:42

speed up backend test suite

fd5a13a

rosstimothy force-pushed the tross/fix_firestore branch from a89f298 to fd5a13a Compare June 29, 2023 19:07

rosstimothy added backport/branch/v11 labels Jun 29, 2023

rosstimothy marked this pull request as ready for review June 29, 2023 19:34

rosstimothy requested review from espadolini and fspmarshall June 29, 2023 19:34

github-actions Bot added audit-log Issues related to Teleports Audit Log size/md labels Jun 29, 2023

github-actions Bot requested review from GavinFrazar and jakule June 29, 2023 19:34

espadolini approved these changes Jun 30, 2023

View reviewed changes

Comment thread lib/backend/firestore/firestorebk_test.go Outdated

Comment thread lib/backend/firestore/firestorebk_test.go

Comment thread lib/backend/test/suite.go Outdated

espadolini reviewed Jun 30, 2023

View reviewed changes

Comment thread lib/backend/test/suite.go Outdated

rosstimothy added 2 commits June 30, 2023 10:18

conditionally delete expired items on get

3659f5d

fix: cleanup tests

4cc81d3

espadolini approved these changes Jul 3, 2023

View reviewed changes

Comment thread lib/backend/firestore/firestorebk.go

jakule approved these changes Jul 5, 2023

View reviewed changes

public-teleport-github-review-bot Bot removed request for GavinFrazar and fspmarshall July 5, 2023 18:20

rosstimothy added this pull request to the merge queue Jul 5, 2023

Merged via the queue into master with commit a91ad82 Jul 5, 2023

rosstimothy deleted the tross/fix_firestore branch July 5, 2023 21:41

rosstimothy mentioned this pull request Jul 5, 2023

[v13] Firestore backend improvements #28737

Merged

This was referenced Jul 5, 2023

[v12] Firestore backend improvements #28738

Merged

[v11] Firestore backend improvements #28739

Merged

fspmarshall reviewed Jul 5, 2023

View reviewed changes

rosstimothy added a commit that referenced this pull request Aug 4, 2023

Stablize backend test suite

d257a2a

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

rosstimothy mentioned this pull request Aug 4, 2023

Stablize backend test suite #30068

Merged

github-merge-queue Bot pushed a commit that referenced this pull request Aug 4, 2023

Stablize backend test suite (#30068)

194c86e

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023

Stablize backend test suite

ddd3f1b

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023

Stablize backend test suite

8220b97

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023

Stablize backend test suite

9899b55

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023

Stablize backend test suite (#30076)

f400bc8

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023

Stablize backend test suite (#30074)

cccdc96

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023

Stablize backend test suite (#30075)

1de5baf

Removes the parallel subtests added in #28473 to reduce the chance of the backend tests failing.

rosstimothy mentioned this pull request Aug 10, 2023

Fix resources being deleted from Firestore on update #30272

Merged

		@@ -110,73 +130,87 @@ type Constructor func(options ...ConstructionOption) (backend.Backend, clockwork
		// backend under test.
		func RunBackendComplianceSuite(t *testing.T, newBackend Constructor) {

Conversation

rosstimothy commented Jun 29, 2023

Uh oh!

espadolini left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rosstimothy commented Jul 5, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

public-teleport-github-review-bot Bot commented Jul 5, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants