
Implement RFD 24 for alternative DynamoDB event indexing #6583

Merged
merged 22 commits on May 6, 2021

Conversation

@xacrimon (Contributor) commented Apr 23, 2021:

This implements the solution outlined in RFD 24 by switching the indexing strategy for time search.
This needs to be backported to branch/v6 for inclusion in v6.2 after this PR has been merged.

@xacrimon added this to the 6.2 "Buffalo" milestone Apr 23, 2021
@xacrimon self-assigned this Apr 23, 2021
@xacrimon added the aws (Used for AWS Related Issues) label Apr 23, 2021
@xacrimon force-pushed the joel/graceful-event-two branch from df1abfc to db242cd on April 24, 2021 13:54
@xacrimon marked this pull request as ready for review April 24, 2021 14:03
@xacrimon requested review from fspmarshall and quinqu April 24, 2021 14:04
@xacrimon (Contributor, Author) commented Apr 24, 2021:

@fspmarshall @awly Please be very adversarial when reviewing this, since it modifies the schema and touches event data.

@xacrimon requested review from awly and removed request for quinqu April 26, 2021 16:43
@xacrimon force-pushed the joel/graceful-event-two branch 2 times, most recently from 09cc40b to 8946827 on April 26, 2021 19:16
@xacrimon (Contributor, Author) commented:

Theoretically, the update loop in the migration code could be replaced by a batch write in the future, but I have opted not to do that here: it would complicate my next events-iteration PR, which reworks some of this code. It therefore makes more sense to apply that optimization in that PR instead.
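
For reference, a hedged sketch of what such a batch write could look like. This is explicitly not part of this PR; the helper name and the whole-item re-put approach are assumptions, since BatchWriteItem only supports put and delete requests (at most 25 per call), not updates.

// Hypothetical sketch of the batch-write variant discussed above; not part of this PR.
// Assumed imports:
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/gravitational/trace"
func batchPutMigrated(svc *dynamodb.DynamoDB, tableName string, items []map[string]*dynamodb.AttributeValue) error {
    const maxBatchSize = 25 // BatchWriteItem accepts at most 25 requests per call.
    for start := 0; start < len(items); start += maxBatchSize {
        end := start + maxBatchSize
        if end > len(items) {
            end = len(items)
        }
        requests := make([]*dynamodb.WriteRequest, 0, end-start)
        for _, item := range items[start:end] {
            // Items must be re-put whole, since BatchWriteItem has no update operation.
            requests = append(requests, &dynamodb.WriteRequest{
                PutRequest: &dynamodb.PutRequest{Item: item},
            })
        }
        out, err := svc.BatchWriteItem(&dynamodb.BatchWriteItemInput{
            RequestItems: map[string][]*dynamodb.WriteRequest{tableName: requests},
        })
        if err != nil {
            return trace.Wrap(err)
        }
        // A real implementation would retry out.UnprocessedItems; omitted in this sketch.
        _ = out.UnprocessedItems
    }
    return nil
}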

@fspmarshall (Contributor) left a comment:

Did an initial pass.

I think we definitely want manual confirmation that a cluster with multiple days of back events migrates correctly before putting this into a release.

Also, please update comments generally to explain what happens in the event of concurrent initialization from multiple auth servers.

Comment on lines 357 to 359
log.Info("Creating new DynamoDB index...")
if !hasIndexV2 {
err = l.createV2GSI()
Contributor:

Suggested change:

-log.Info("Creating new DynamoDB index...")
-if !hasIndexV2 {
-    err = l.createV2GSI()
+if !hasIndexV2 {
+    log.Info("Creating new DynamoDB index...")
+    err = l.createV2GSI()

Contributor Author:

Fixed

if !hasIndexV1 {
migrateIndices = false

_, err := dataBackend.Get(ctx, migrateRFD24FinishedKey)
Contributor:

Instead of a custom key, would it be possible to delete the V1 index after migration and use that as the marker that migration is complete? That way we don't have to write additional code to clean up the marker key in some future version. Doing this would also eliminate the need to pass around a reference to the teleport backend, which would be preferable, especially since relying on backend keys like this isn't resilient to migrations across backends via tctl get all/--bootstrap.

Contributor Author:

That seems preferable, will implement.

Contributor Author:

Done

Comment on lines 700 to 740
    // AWS returns a `lastEvaluatedKey` in case the response is truncated, i.e. needs to be fetched with
    // multiple requests. According to their documentation, the final response is signaled by not setting
    // this value - therefore we use it as our break condition.
    lastEvaluatedKey = out.LastEvaluatedKey
    if len(lastEvaluatedKey) == 0 {
        sort.Sort(events.ByTimeAndIndex(values))
        return values, nil
    }
Contributor:

IIUC this check causes the function to only return events from the first date in the range, since we now query each date individually. I believe we should continue to the next date when we hit this condition, instead of returning.

Related: The g.Error("DynamoDB response size exceeded limit.") log line needs to be moved up into the date iteration loop.

Also related: There are enough nested loops here that loop labels would really help readability.

@xacrimon (Contributor, Author) commented Apr 27, 2021:

Review was on an outdated version. Fixed and added loop labels.
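
For illustration, the corrected control flow looks roughly like the sketch below: a labeled outer loop over the dates in the range and an inner pagination loop that follows LastEvaluatedKey, only falling through once every date is exhausted. The event type, the queryPage callback, and the limit handling are hypothetical stand-ins, not the actual implementation.

// Sketch of the corrected flow; event and queryPage are illustrative stand-ins
// for the decoded audit event type and a single Query call against one date partition.
// Assumed imports:
//   "sort"
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/gravitational/trace"
type event struct {
    Time  int64 // event timestamp (illustrative)
    Index int64 // per-session event index (illustrative)
}

func collectEvents(
    dates []string,
    limit int,
    queryPage func(date string, startKey map[string]*dynamodb.AttributeValue) ([]event, map[string]*dynamodb.AttributeValue, error),
) ([]event, error) {
    var values []event
dateLoop:
    for _, date := range dates {
        var startKey map[string]*dynamodb.AttributeValue
        for {
            page, nextKey, err := queryPage(date, startKey)
            if err != nil {
                return nil, trace.Wrap(err)
            }
            values = append(values, page...)
            if limit > 0 && len(values) >= limit {
                // The "response size exceeded limit" handling belongs here,
                // inside the date loop.
                break dateLoop
            }
            // An empty LastEvaluatedKey marks the final page for this date;
            // move on to the next date rather than returning early.
            if len(nextKey) == 0 {
                continue dateLoop
            }
            startKey = nextKey
        }
    }
    // Sort across all dates before returning, mirroring events.ByTimeAndIndex
    // in the real code.
    sort.Slice(values, func(i, j int) bool {
        if values[i].Time != values[j].Time {
            return values[i].Time < values[j].Time
        }
        return values[i].Index < values[j].Index
    })
    return values, nil
}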

// the schema to add a string key `date`.
//
// This does not remove the old global secondary index.
// This must be done at a later point in time when all events have been migrated as per RFD 24.
Contributor:

This comment conflicts with the above migration logic, which is removing the v1 index.

Contributor Author:

Moved to using the V1 index as a checkpoint; it is now deleted after event migration completes.
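
For context, dropping a GSI goes through UpdateTable with a delete action. A minimal sketch of removing the old index once migration finishes is below; the standalone function shape is illustrative (the real code is a method on the events backend), and in this PR's terms indexName would be indexTimeSearch.

// Sketch: delete the V1 GSI after migration so that its absence acts as the
// "migration complete" marker.
// Assumed imports:
//   "github.com/aws/aws-sdk-go/aws"
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/gravitational/trace"
func removeGSI(svc *dynamodb.DynamoDB, tableName, indexName string) error {
    _, err := svc.UpdateTable(&dynamodb.UpdateTableInput{
        TableName: aws.String(tableName),
        GlobalSecondaryIndexUpdates: []*dynamodb.GlobalSecondaryIndexUpdate{{
            Delete: &dynamodb.DeleteGlobalSecondaryIndexAction{
                IndexName: aws.String(indexName),
            },
        }},
    })
    return trace.Wrap(err)
}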

Comment on lines 1025 to 1033
    if err := l.createV2GSI(); err != nil {
        return trace.Wrap(err)
    }
Contributor:

When creating a new table, the V2 GSI should be part of the initial table configuration rather than added separately. As written, it will be possible for a table to exist which has neither the V1 nor the V2 GSI present, which will break the migration logic (and is generally undesirable).

Contributor Author:

Fixed

Comment on lines 839 to 897
    // If we hit this time, we give up waiting.
    waitStart := time.Now()
    endWait := waitStart.Add(time.Hour * 24)

    // Wait until the index is created and active.
    for time.Now().Before(endWait) {
        indexExists, err := l.indexExists(l.Tablename, indexTimeSearchV2)
        if err != nil {
            return trace.Wrap(err)
        }

        if indexExists {
            log.Info("DynamoDB index created")
            break
        }

        time.Sleep(time.Second * 5)
        elapsed := time.Since(waitStart).Seconds()
        log.Infof("Creating new DynamoDB index, %f seconds elapsed...", elapsed)
    }
Contributor:

24 hours is definitely too long to block (for context, table creation blocks for up to ~8 minutes). Also, this wait needs to be cancellable by the context.Context passed into New(...).

@xacrimon (Contributor, Author) commented Apr 27, 2021:

Solved now by:

  • Reducing the timeout to 10 minutes
  • Treating the index as existing even in the "UPDATING" state, which is safe to do; I just hadn't considered it previously
  • Making the wait cancellable (see the sketch below)
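
A minimal sketch of a context-cancellable wait along those lines; the timeout and polling interval mirror the discussion above, while the standalone helper shape is illustrative (poll would wrap l.indexExists(l.Tablename, indexTimeSearchV2)).

// Sketch of a cancellable, time-limited wait for the new index.
// Assumed imports:
//   "context"
//   "time"
//   log "github.com/sirupsen/logrus"
//   "github.com/gravitational/trace"
func waitForIndex(ctx context.Context, poll func() (bool, error)) error {
    // Cap the wait at 10 minutes and honor cancellation of the parent context.
    ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
    defer cancel()

    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    start := time.Now()
    for {
        exists, err := poll()
        if err != nil {
            return trace.Wrap(err)
        }
        if exists {
            log.Info("DynamoDB index created")
            return nil
        }

        select {
        case <-ctx.Done():
            // Cancelled by the caller or the timeout elapsed.
            return trace.Wrap(ctx.Err())
        case <-ticker.C:
            log.Infof("Creating new DynamoDB index, %.0f seconds elapsed...", time.Since(start).Seconds())
        }
    }
}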

Comment on lines 373 to 390
    go func() {
        log.Info("Starting event migration to v6.2 format")
        err := l.migrateDateAttribute(ctx)
        if err != nil {
            log.WithError(err).Error("Encountered error migrating events to v6.2 format")
            return
        }

        item := backend.Item{
            Key:   migrateRFD24FinishedKey,
            Value: make([]byte, 0),
        }

        _, err = dataBackend.Put(ctx, item)
        if err != nil {
            log.WithError(err).Error("Migrated all events to v6.2 format successfully but failed to write flag to backend.")
        }
    }()
Contributor:

If event migration fails, it won't be attempted again until teleport restarts. This isn't ideal. Instead, event migration should continue to be attempted until either it succeeds, or another auth server successfully performs the migration (this should be done on a fairly long and jittered interval).

Contributor Author:

Added migration retries on a jittered interval.

@xacrimon (Contributor, Author) commented:

> Did an initial pass.
>
> I think we definitely want manual confirmation that a cluster with multiple days of back events migrates correctly before putting this into a release.
>
> Also, please update comments generally to explain what happens in the event of concurrent initialization from multiple auth servers.

You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.

@fspmarshall (Contributor) commented:

> You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.

No, my experience is limited to etcd and local backends. @quinqu could probably assist on this front (or will know who can).

@xacrimon (Contributor, Author) commented Apr 27, 2021:

> You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.
>
> No, my experience is limited to etcd and local backends. @quinqu could probably assist on this front (or will know who can).

Found a cluster I had lying around on DO with about a week's worth of events. Will apply this migration and validate the results.

@xacrimon requested a review from fspmarshall April 27, 2021 17:56
@xacrimon (Contributor, Author) commented:

@fspmarshall Added comments explaining what happens in case of concurrent migration from multiple auth servers.

@xacrimon (Contributor, Author) commented:

Successfully migrated a homelab cluster containing a week's worth of various events without issues.

@awly (Contributor) left a comment:

Note: there's a migration shim for regular backends called on startup

    if err := bk.Migrate(ctx); err != nil {
        return nil, trace.Wrap(err)
    }

If this dynamo events backend implements a backend.Backend, consider triggering the migration there.

@@ -43,6 +43,35 @@ import (
log "github.com/sirupsen/logrus"
)

// isoDateLayout is the time formatting layout used by the date attribute on events.
const isoDateLayout = "2006-01-02"
Contributor:

is this ISO 8601?
if yes, rename to iso8601DateFormat
if no, clarify what iso means here

@xacrimon (Contributor, Author) commented Apr 28, 2021:

Done, renamed to iso8601DateFormat
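
For clarity, the layout string is Go's reference-time spelling of an ISO 8601 calendar date (YYYY-MM-DD), used to derive the date attribute from an event timestamp. A tiny illustrative example follows; the helper name and the UTC normalization are assumptions of the sketch, not the PR's exact code.

// Sketch: deriving the date partition value from an event timestamp.
// Assumed import: "time"
const iso8601DateFormat = "2006-01-02"

func eventDate(t time.Time) string {
    // For example, "2021-04-28" for an event emitted on April 28, 2021 (UTC).
    return t.UTC().Format(iso8601DateFormat)
}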

Comment on lines 325 to 328
delay := utils.HalfJitter(time.Minute * 5)
log.WithError(err).Errorf("Failed RFD 24 migration, making another attempt in %f seconds", delay.Seconds())
Contributor:

delay is never actually used to wait

also, the delay interval seems rather long.
do we want to block the auth server startup for up to 5min because the first attempt at index creation failed?

Contributor Author:

We could reduce this to something shorter, like 1 minute, without too much extra work being done, I think. We may hit API errors from Dynamo more often on duplicate attempts at index modification, but that's safe as long as we retry.

@xacrimon (Contributor, Author) commented Apr 28, 2021:

Reduced to 1 minute in a recent commit, which feels like a good compromise.
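
A hedged sketch of a retry loop that actually waits out the jittered delay and stops if the context is cancelled; the 1-minute base interval follows the comment above, utils.HalfJitter appears in the diff, and the standalone function shape and migrate callback are illustrative.

// Sketch: retry the RFD 24 migration on a jittered interval until it
// succeeds or the context is cancelled.
// Assumed imports:
//   "context"
//   "time"
//   log "github.com/sirupsen/logrus"
//   "github.com/gravitational/teleport/lib/utils"
//   "github.com/gravitational/trace"
func retryMigration(ctx context.Context, migrate func(context.Context) error) error {
    for {
        err := migrate(ctx)
        if err == nil {
            return nil
        }

        delay := utils.HalfJitter(time.Minute)
        log.WithError(err).Errorf("Failed RFD 24 migration, making another attempt in %v.", delay)

        select {
        case <-ctx.Done():
            return trace.Wrap(ctx.Err())
        case <-time.After(delay):
        }
    }
}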

}

// Table is already up to date.
// We use the existence of the V1 index has a completion flag
Contributor:

Suggested change:

-// We use the existence of the V1 index has a completion flag
+// We use the existence of the V1 index as a completion flag

Contributor Author:

Fixed


// Table is already up to date.
// We use the existence of the V1 index has a completion flag
// for migration. We remove it and the end of the migration which
Contributor:

Suggested change:

-// for migration. We remove it and the end of the migration which
+// for migration. We remove it at the end of the migration which

Contributor Author:

Fixed

Comment on lines 390 to 404
log.Info("Removing old DynamoDB index")
err = l.removeV1GSI()
if err != nil {
log.WithError(err).Error("Migrated all events to v6.2 format successfully but failed remove old index.")
} else {
break
}
Contributor:

should this only happen when migrateDateAttribute succeeded?

Contributor Author:

It should, control flow has been fixed.

IndexName: aws.String(indexTimeSearch),
ExclusiveStartKey: lastEvaluatedKey,
dateLoop:
for _, date := range dates {
Contributor:

This does O(N) queries. Is there any way we can do a single query over only CreatedAt without CreatedAtDate?

@xacrimon (Contributor, Author) commented Apr 28, 2021:

Unfortunately the index partitioning disallows this due to how DynamoDB works internally. Querying over only CreatedAt is what we do at the moment on master, but due to partitioning constraints it means all data ends up in one hardcoded partition, which is the precise thing RFD 24 works around since partitions have data limits. I am currently unaware of a way to do this and did not find one the last time I investigated; I will have another look through the Dynamo API docs in case anything has changed.

Contributor Author:

Had another look through the API and this seems to be impossible with current API limitations. BatchGetRequest looked interesting at first, but it turns out it only supports point lookups for range keys, which won't work for our case.

}

for _, gsi := range tableDescription.Table.GlobalSecondaryIndexes {
if *gsi.IndexName == indexName && (*gsi.IndexStatus == dynamodb.IndexStatusActive || *gsi.IndexStatus == dynamodb.IndexStatusUpdating) {
Contributor:

can any of these fields be nil?

@xacrimon (Contributor, Author) commented Apr 28, 2021:

They cannot according to the SDK if I am reading correctly. The name is mandatory and the index status is a string enum that's always populated.
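
For reference, a minimal sketch of the surrounding check via DescribeTable; the standalone function shape is illustrative, while the status comparison mirrors the snippet above (a GSI counts as existing when it is ACTIVE or UPDATING).

// Sketch: report whether a GSI with the given name exists and is usable.
// Assumed imports:
//   "github.com/aws/aws-sdk-go/aws"
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/gravitational/trace"
func indexExists(svc *dynamodb.DynamoDB, tableName, indexName string) (bool, error) {
    out, err := svc.DescribeTable(&dynamodb.DescribeTableInput{
        TableName: aws.String(tableName),
    })
    if err != nil {
        return false, trace.Wrap(err)
    }

    for _, gsi := range out.Table.GlobalSecondaryIndexes {
        // Per the SDK, IndexName and IndexStatus are populated on returned
        // index descriptions; the aws helpers also tolerate nil pointers.
        status := aws.StringValue(gsi.IndexStatus)
        if aws.StringValue(gsi.IndexName) == indexName &&
            (status == dynamodb.IndexStatusActive || status == dynamodb.IndexStatusUpdating) {
            return true, nil
        }
    }
    return false, nil
}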


// For every item processed by this scan iteration we send an update action
// that adds the new date attribute.
for _, item := range scanOut.Items {
Contributor:

Can this loop check the existing attributes on the event and skip events that were already migrated?

@xacrimon (Contributor, Author) commented Apr 28, 2021:

The scan already filters out events that have the new attribute on the Dynamo side, using the scan filter expression attribute_not_exists(CreatedAtDate). All events that this loop iterates over are not yet migrated.
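
For illustration, the scan side of that looks roughly like the snippet below; the table name, start key, and page size variables are placeholders, while the filter expression is the one quoted above.

// Sketch: scan only items that still lack the CreatedAtDate attribute, so
// already-migrated events are filtered out on the DynamoDB side.
// Assumed imports:
//   "github.com/aws/aws-sdk-go/aws"
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/gravitational/trace"
scanOut, err := svc.Scan(&dynamodb.ScanInput{
    TableName:         aws.String(tableName),
    FilterExpression:  aws.String("attribute_not_exists(CreatedAtDate)"),
    ExclusiveStartKey: startKey,       // nil on the first page
    Limit:             aws.Int64(100), // illustrative page size
})
if err != nil {
    return trace.Wrap(err)
}
// scanOut.Items now contains only events that still need migrating.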


_, err = l.svc.UpdateItem(c)
if err != nil {
log.Infof("item fail data %q", item)
Contributor:

nit: this looks like a leftover from debugging

Contributor Author:

Removed

CHANGELOG.md Outdated
This release of teleport contains minor features and bugfixes.

* Changed DynamoDB events backend indexing strategy. [#6583](https://github.com/gravitational/teleport/pull/6583)
Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server and no other nodes online. It may take some time and progress will be periodically written to syslog. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.
Contributor:

Suggested change:

-Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server and no other nodes online. It may take some time and progress will be periodically written to syslog. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.
+Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server online. It may take some time and progress will be periodically written to the auth server log. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.

Contributor Author:

Updated

@awly (Contributor) commented Apr 27, 2021:

@xacrimon could you also add some unit tests? even if it's using a stub dynamodb implementation

@xacrimon requested a review from awly April 28, 2021 16:17
@xacrimon (Contributor, Author) commented Apr 28, 2021:

> @xacrimon could you also add some unit tests? even if it's using a stub dynamodb implementation

@awly I could, but I am unsure how to go about it without either a) not testing anything significant with regard to the migration logic, or b) reimplementing large parts of DynamoDB in a mock, with all its validation logic, which would take significant time since an off-the-shelf solution supporting all the APIs we use does not exist as far as I am aware.

I will go ahead and write some tests using a stub implementation regardless since it's an improvement anyhow.
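
One common way to stub this with aws-sdk-go v1 is to embed dynamodbiface.DynamoDBAPI in a test double and override only the calls the migration exercises; a hedged sketch follows, with field and type names made up for the example.

// Sketch of a test stub: the embedded interface satisfies every DynamoDB
// method (unoverridden ones panic if called), and only the calls under test
// are overridden.
// Assumed imports:
//   "github.com/aws/aws-sdk-go/service/dynamodb"
//   "github.com/aws/aws-sdk-go/service/dynamodb/dynamodbiface"
type stubDynamo struct {
    dynamodbiface.DynamoDBAPI

    scanPages [][]map[string]*dynamodb.AttributeValue // canned Scan results
    updates   []*dynamodb.UpdateItemInput             // recorded UpdateItem calls
}

func (s *stubDynamo) Scan(in *dynamodb.ScanInput) (*dynamodb.ScanOutput, error) {
    if len(s.scanPages) == 0 {
        return &dynamodb.ScanOutput{}, nil
    }
    page := s.scanPages[0]
    s.scanPages = s.scanPages[1:]
    out := &dynamodb.ScanOutput{Items: page}
    if len(s.scanPages) > 0 && len(page) > 0 {
        // Simulate pagination by handing back a non-empty LastEvaluatedKey.
        out.LastEvaluatedKey = page[len(page)-1]
    }
    return out, nil
}

func (s *stubDynamo) UpdateItem(in *dynamodb.UpdateItemInput) (*dynamodb.UpdateItemOutput, error) {
    s.updates = append(s.updates, in)
    return &dynamodb.UpdateItemOutput{}, nil
}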

@russjones added the rfd (Request for Discussion) label Apr 28, 2021
@xacrimon force-pushed the joel/graceful-event-two branch from 441ec37 to 7bbb53f on May 6, 2021 16:34
@klizhentas (Contributor) left a comment:

Bot.

Labels: aws (Used for AWS Related Issues), backport-required
5 participants