Skip to content

Firestore backend improvements #28473

Merged
rosstimothy merged 5 commits intomasterfrom
tross/fix_firestore
Jul 5, 2023
Merged

Firestore backend improvements #28473
rosstimothy merged 5 commits intomasterfrom
tross/fix_firestore

Conversation

@rosstimothy
Copy link
Copy Markdown
Contributor

  1. CAS now utilizes a transaction to ensure the operation is atomic

The original implementation did not use transactions which violated
the atomic guarantees of the CAS operation. The backend compliance
test was able to catch this when it was updated to run concurrent
CAS operations.

  1. Update is limited to updating a value

The original implementation of Update was actually doing a get and
then upsert. However, there are no guarantees that prevent a delete
from occurring between get and upsert, which means Update would
upsert the value instead of failing. Instead of get and then upsert
we now update the document using the (firestore.DocumentRef) Update
method.

  1. Watching items from the collection filters out any audit events

If Teleport is configured to use the same collection for backend state
and audit events the collection watcher ends up consuming all audit
events as empty backend items. To avoid this the watcher is now filtering
out any collections which have an empty key since it is not possible
for backend resources to be written without a key this will only
exclude audit events which have a different schema.

  1. SearchEvents now filters out backend resources

Similar to above, the Firestore events implementation now excludes
any documents which have an empty session id to prevent backend
resources from getting included in queries for audit events if the
collection is being shared.

Partially addresses #28118

The backend test suite was not validating that simultaneous CAS
operations result in only one attempt succeeding. The test now
runs multiple concurrent CAS operations and ensures that only a
single operation succeeds. This shortcoming with the test allowed
the Firestore backend to pass the compliance test while not perfoming
CAS in an atomic manner.
1) CAS now utilizes a transaction to ensure the operation is atomic

The original implementation did not use transactions which violated
the atomic guarantees of the CAS operation. The backend compliance
test was able to catch this when it was updated to run concurrent
CAS opertations.

2) Update is limited to updating a value

The original implementation of Update was actually doing a get and
then upsert. However, there are no guarantees that prevent a delete
from occurring between get and upsert, which means Update would
upsert the value instead of failing. Instead of get and then upsert
we now update the document using the (firestore.DocumentRef) Update
method.

3) Watching items from the collection filters out any audit events

If Teleport is configured to use the same collection for backend state
and audit events the collection watcher ends up consuming all audit
events as empty backend items. To avoid this the watcher is now filtering
out any collections which have an empty key since it is not possible
for backend resources to be written without a key this will only
exclude audit events which have a different schema.

4) SearchEvents now filters out backend resources

Similar to above, the Firestore events implementation now excludes
any documents which have an empty session id to prevent backend
resources from getting included in queries for audit events if the
collection is being shared.
@rosstimothy rosstimothy force-pushed the tross/fix_firestore branch from 477cc94 to 9edbc43 Compare June 29, 2023 16:42
@rosstimothy rosstimothy force-pushed the tross/fix_firestore branch from a89f298 to fd5a13a Compare June 29, 2023 19:07
@rosstimothy rosstimothy marked this pull request as ready for review June 29, 2023 19:34
@github-actions github-actions Bot added audit-log Issues related to Teleports Audit Log size/md labels Jun 29, 2023
@github-actions github-actions Bot requested review from GavinFrazar and jakule June 29, 2023 19:34
Copy link
Copy Markdown
Contributor

@espadolini espadolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of increasing the expiry interval to make the tests work, should we add an optional method to backends to trigger a manual expiry deletion?

Comment thread lib/backend/firestore/firestorebk_test.go Outdated
Comment thread lib/backend/firestore/firestorebk_test.go
Comment thread lib/backend/test/suite.go Outdated
Comment thread lib/backend/test/suite.go Outdated
Comment thread lib/backend/firestore/firestorebk.go
@rosstimothy
Copy link
Copy Markdown
Contributor Author

PTAL @fspmarshall @jakule @GavinFrazar

Comment thread lib/backend/test/suite.go
require.NoError(t, err)
require.Equal(t, []byte("2"), out.Value)

for i := 0; i < 10; i++ {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a chance to run this test before and after your changes? From what I remember Firestore emulator is strongly consistent, so this will never fail in CI, but it can in production.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very first commit of this PR was solely changing this test to capture the CAS failure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we don't actually run Firestore tests in CI 😢

Comment thread lib/backend/test/suite.go
@@ -110,73 +130,87 @@ type Constructor func(options ...ConstructionOption) (backend.Backend, clockwork
// backend under test.
func RunBackendComplianceSuite(t *testing.T, newBackend Constructor) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can t.Parallel() also be added to the test itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

clock := clockwork.NewFakeClock()
clock := clockwork.NewRealClock()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😿

"collection_name": collection,
"project_id": projectID,
"endpoint": endpoint,
"purge_expired_documents_poll_interval": 300 * time.Millisecond,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add explain why 300ms?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably adjust this a bit if needed but I took the value from a similarish value in the dynamodb tests https://github.com/gravitational/teleport/blob/master/lib/backend/dynamo/dynamodbbk_test.go#L52

@rosstimothy rosstimothy added this pull request to the merge queue Jul 5, 2023
Merged via the queue into master with commit a91ad82 Jul 5, 2023
@rosstimothy rosstimothy deleted the tross/fix_firestore branch July 5, 2023 21:41
@public-teleport-github-review-bot
Copy link
Copy Markdown

@rosstimothy See the table below for backport results.

Branch Result
branch/v11 Create PR
branch/v12 Create PR
branch/v13 Create PR

Comment on lines +429 to +443
if _, err := docSnap.Ref.Delete(ctx, firestore.LastUpdateTime(docSnap.UpdateTime)); err != nil && status.Code(err) == codes.FailedPrecondition {
// If the document has been updated, then attempt one additional get to see if the
// resource was updated and is no longer expired.
docSnap, err := b.svc.Collection(b.CollectionName).Doc(docSnap.Ref.ID).Get(ctx)
if err != nil {
return nil, ConvertGRPCError(err)
}
r, err := newRecordFromDoc(docSnap)
if err != nil {
return nil, trace.Wrap(err)
}

if !r.isExpired(b.clock.Now()) {
values = append(values, r.backendItem())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low-priority nit: this doesn't violate the backend model, but there may be cases where it would result in what I'd call "annoying" behavior:

Say that I have some range R, and the following sequence of operations happens:

  • An initial range request occurs across range R at time T1.
  • A write happens to key R1 at time T2.
  • A write happens to key R2 at time T3.
  • A delete of the old value of R2 is attempted by the original range request, falling back to a read of R2 at time T4.

In this sequence of events, the resulting range contains the value of R2 as set at T3, but omits the value of R1 as set at T2.

This is ultimately fine, and can happen for any number of legitimate reasons, since many of our backends either don't guarantee that writes to different keys are well-ordered in the event stream, and/or because range requests are evaluated sequentially from the "latest" state, so values that appear later in the range may also be "newer".

However, all else being equal, less inconsistencies of this nature are probably better, and in this case I think we could have less network round-trips, simpler logic, and more consistency by just partially or completely omitting this logic.

We already treat expiry as a lazy/best-effort affair, and given the limitations of our backends we're going to keep doing that at least to some extent. I'd suggest either completely omitting optimistic cleanup, or simply returning the expired value if optimistic cleanup fails for any reason.

Additionally, it might be worth putting a per-request cap on the amount of optimistic cleanup we're willing to do... if a cluster was offline for a while, we might start up to a backend state with 20_000 expired nodes. We don't really want a GetRange call waiting on 20_000 API requests I think. Better to just let those 20_000 nodes get whittled down asynchronously.

Anyhow, like I said... not actually a violation of the backend model, just a potential annoyance, so take it or leave it.

ravicious pushed a commit that referenced this pull request Jul 11, 2023
* Test concurrent compare and swaps

The backend test suite was not validating that simultaneous CAS
operations result in only one attempt succeeding. The test now
runs multiple concurrent CAS operations and ensures that only a
single operation succeeds. This shortcoming with the test allowed
the Firestore backend to pass the compliance test while not perfoming
CAS in an atomic manner.

* Firestore backend improvements

1) CAS now utilizes a transaction to ensure the operation is atomic

The original implementation did not use transactions which violated
the atomic guarantees of the CAS operation. The backend compliance
test was able to catch this when it was updated to run concurrent
CAS opertations.

2) Update is limited to updating a value

The original implementation of Update was actually doing a get and
then upsert. However, there are no guarantees that prevent a delete
from occurring between get and upsert, which means Update would
upsert the value instead of failing. Instead of get and then upsert
we now update the document using the (firestore.DocumentRef) Update
method.

3) Watching items from the collection filters out any audit events

If Teleport is configured to use the same collection for backend state
and audit events the collection watcher ends up consuming all audit
events as empty backend items. To avoid this the watcher is now filtering
out any collections which have an empty key since it is not possible
for backend resources to be written without a key this will only
exclude audit events which have a different schema.

4) SearchEvents now filters out backend resources

Similar to above, the Firestore events implementation now excludes
any documents which have an empty session id to prevent backend
resources from getting included in queries for audit events if the
collection is being shared.

* speed up backend test suite

* conditionally delete expired items on get

* fix: cleanup tests
rosstimothy added a commit that referenced this pull request Aug 4, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-merge-queue Bot pushed a commit that referenced this pull request Aug 4, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-actions Bot pushed a commit that referenced this pull request Aug 4, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
github-merge-queue Bot pushed a commit that referenced this pull request Aug 5, 2023
Removes the parallel subtests added in #28473 to reduce the chance
of the backend tests failing.
rosstimothy added a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-merge-queue Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-actions Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-actions Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-actions Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-merge-queue Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-merge-queue Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
github-merge-queue Bot pushed a commit that referenced this pull request Aug 10, 2023
The change to use `Update` in #28473 caused a 0 value to be written
to the expires field of the document. When using `Create` or `Set`
a zero expiry value would result in an empty expires field for that
document. Now that the expires column was containing a 0 value the
purgeExpiredDocuments loop was detecting the document as needing
to be removed. To prevent removing resources that have no expiry
the purge loop now explicitly ignores any resources that have a 0
expiry value.

Fixes #30236
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

audit-log Issues related to Teleports Audit Log size/md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants