Implement RFD 24 for alternative DynamoDB event indexing #6583

xacrimon · 2021-04-23T17:26:36Z

This implements the solution outlined in RFD 24 by switching the indexing strategy for time search.
This needs to be backported to branch/v6 for inclusion in v6.2 after this PR has been merged.

xacrimon · 2021-04-24T14:04:50Z

@fspmarshall @awly Please be very adversarial when reviewing this since this is making modifications to schema and touches event data.

xacrimon · 2021-04-26T19:34:36Z

Theoretically, the update loop in the migration code could be replaced by a batch write in the future, but I have opted not to do that here because it would complicate my next events iteration PR, which reworks some of this code. Therefore, it makes more sense to apply that optimization in that PR instead.

fspmarshall

Did an initial pass.

I think we definitely want manual confirmation that a cluster with multiple days of back events migrates correctly before putting this into a release.

Also, please update comments generally to explain what happens in the event of concurrent initialization from multiple auth servers.

fspmarshall · 2021-04-26T17:59:32Z

lib/events/dynamoevents/dynamoevents.go

+		log.Info("Creating new DynamoDB index...")
+		if !hasIndexV2 {
+			err = l.createV2GSI()


Suggested change

-		log.Info("Creating new DynamoDB index...")
-		if !hasIndexV2 {
-			err = l.createV2GSI()
+		if !hasIndexV2 {
+			log.Info("Creating new DynamoDB index...")
+			err = l.createV2GSI()

fspmarshall · 2021-04-26T18:01:54Z

lib/events/dynamoevents/dynamoevents.go

+	if !hasIndexV1 {
+		migrateIndices = false
+
+		_, err := dataBackend.Get(ctx, migrateRFD24FinishedKey)


Instead of a custom key, would it be possible to delete the V1 index after migration, using that as the marker that migration is complete? That way we don't have to write additional code to clean up the marker key in some future version. Doing this would also eliminate the need to pass around a reference to the teleport backend, which would be preferable, especially since relying on backend keys like this isn't resilient to migrations across backends via tctl get all/--bootstrap.

That seems preferable, will impl

fspmarshall · 2021-04-26T18:08:52Z

lib/events/dynamoevents/dynamoevents.go

+			// AWS returns a `lastEvaluatedKey` in case the response is truncated, i.e. needs to be fetched with
+			// multiple requests. According to their documentation, the final response is signaled by not setting
+			// this value - therefore we use it as our break condition.
+			lastEvaluatedKey = out.LastEvaluatedKey
+			if len(lastEvaluatedKey) == 0 {
+				sort.Sort(events.ByTimeAndIndex(values))
+				return values, nil
+			}


IIUC this check causes the function to only return events from the first date in the range, since we now query each date individually. I believe that we should be continuing to next date when we hit this condition, instead of returning.

Related: The g.Error("DynamoDB response size exceeded limit.") log line needs to be moved up into the date iteration loop.

Also related: There are enough nested loops here that loop labels would really help readability.

Review was on an outdated version. Fixed and added loop labels
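The per-date pagination described in this thread can be sketched as follows. This is a minimal illustration, not the code from this PR: `page`, `queryFunc`, `fakeQuery` and `searchDates` are hypothetical stand-ins, and events are simplified to integers. The point is that an empty continuation key ends only the current date, so the loop continues with the next date instead of returning.

```go
package main

import (
	"fmt"
	"sort"
)

// page is a hypothetical stand-in for one query response.
type page struct {
	items   []int  // events, simplified to integers
	nextKey string // empty when this is the last page for the date
}

// queryFunc is a hypothetical stand-in for a query against one date.
type queryFunc func(date, startKey string) page

// searchDates collects up to limit events across all dates.
func searchDates(dates []string, limit int, query queryFunc) []int {
	var values []int
dateLoop:
	for _, date := range dates {
		startKey := ""
		for {
			out := query(date, startKey)
			for _, item := range out.items {
				values = append(values, item)
				if len(values) >= limit {
					// The size limit applies to the whole search.
					break dateLoop
				}
			}
			startKey = out.nextKey
			if startKey == "" {
				// Last page for this date: move on to the next date
				// instead of returning early.
				continue dateLoop
			}
		}
	}
	sort.Ints(values)
	return values
}

// fakeQuery serves canned pages so the sketch runs without DynamoDB.
func fakeQuery(date, startKey string) page {
	pages := map[string]map[string]page{
		"2021-04-26": {"": {items: []int{3, 1}, nextKey: "k1"}, "k1": {items: []int{2}}},
		"2021-04-27": {"": {items: []int{5, 4}}},
	}
	return pages[date][startKey]
}

func main() {
	dates := []string{"2021-04-26", "2021-04-27", "2021-04-28"}
	fmt.Println(searchDates(dates, 100, fakeQuery))
}
```

The labeled `continue` and `break` make it explicit which loop each exit applies to, which is the readability benefit the review asked for.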

fspmarshall · 2021-04-26T18:19:19Z

lib/events/dynamoevents/dynamoevents.go

+// the schema to add a string key `date`.
+//
+// This does not remove the old global secondary index.
+// This must be done at a later point in time when all events have been migrated as per RFD 24.


This comment conflicts with the above migration logic, which is removing the v1 index.

Moved the V1 index to use as a checkpoint, is now deleted after event migration.
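Using the V1 index as the checkpoint could be sketched roughly like this. It is an illustrative assumption about the decision logic, not the code in this PR: `planMigration` and the step names are hypothetical, and the sketch assumes a freshly created table already carries the V2 index.

```go
package main

import "fmt"

// Hypothetical step names, for illustration only.
const (
	stepCreateV2 = "create V2 index"
	stepMigrate  = "add date attribute to events"
	stepRemoveV1 = "remove V1 index"
)

// planMigration returns the remaining steps given which indexes exist.
// The V1 index acts as the completion flag: once it is gone, the table
// is treated as up to date.
func planMigration(hasV1, hasV2 bool) []string {
	if !hasV1 {
		// Assumption: a table without V1 is either new or fully migrated.
		return nil
	}
	var steps []string
	if !hasV2 {
		steps = append(steps, stepCreateV2)
	}
	// V1 is removed last, so an interrupted migration is picked up
	// again on the next start.
	return append(steps, stepMigrate, stepRemoveV1)
}

func main() {
	fmt.Println(planMigration(true, false))
	fmt.Println(planMigration(true, true))
	fmt.Println(planMigration(false, true))
}
```

Because the flag is only cleared as the final step, no separate marker key has to be written or cleaned up later.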

fspmarshall · 2021-04-26T18:39:25Z

lib/events/dynamoevents/dynamoevents.go

+	if err := l.createV2GSI(); err != nil {
+		return trace.Wrap(err)
+	}


When creating a new table, the V2 GSI should be part of the initial table configuration, rather than added separately. As written, it will be possible for a table to exist which has neither the V1 nor the V2 GSI present, which will break the migration logic (and is generally undesirable).

fspmarshall · 2021-04-26T19:24:58Z

lib/events/dynamoevents/dynamoevents.go

+	// If we hit this time, we give up waiting.
+	waitStart := time.Now()
+	endWait := waitStart.Add(time.Hour * 24)
+
+	// Wait until the index is created and active.
+	for time.Now().Before(endWait) {
+		indexExists, err := l.indexExists(l.Tablename, indexTimeSearchV2)
+		if err != nil {
+			return trace.Wrap(err)
+		}
+
+		if indexExists {
+			log.Info("DynamoDB index created")
+			break
+		}
+
+		time.Sleep(time.Second * 5)
+		elapsed := time.Since(waitStart).Seconds()
+		log.Infof("Creating new DynamoDB index, %f seconds elapsed...", elapsed)
+	}


24 hours is definitely too long to block (for context, table creation blocks for up to ~8 minutes). Also, this wait needs to be cancellable by the context.Context passed into New(...).

Solved now by:

- Reducing timeout to 10 minutes
- Marking the index as existing even in an "UPDATE" state which is safe to do, just didn't consider it previously
- Made the wait cancellable

fspmarshall · 2021-04-26T19:52:29Z

lib/events/dynamoevents/dynamoevents.go

+		go func() {
+			log.Info("Starting event migration to v6.2 format")
+			err := l.migrateDateAttribute(ctx)
+			if err != nil {
+				log.WithError(err).Error("Encountered error migrating events to v6.2 format")
+				return
+			}
+
+			item := backend.Item{
+				Key:   migrateRFD24FinishedKey,
+				Value: make([]byte, 0),
+			}
+
+			_, err = dataBackend.Put(ctx, item)
+			if err != nil {
+				log.WithError(err).Error("Migrated all events to v6.2 format successfully but failed to write flag to backend.")
+			}
+		}()


If event migration fails, it won't be attempted again until teleport restarts. This isn't ideal. Instead, event migration should continue to be attempted until either it succeeds, or another auth server successfully performs the migration (this should be done on a fairly long and jittered interval).

Added migration retries on a jittered interval.

xacrimon · 2021-04-26T19:52:46Z

> Did an initial pass.
>
> I think we definitely want manual confirmation that a cluster with multiple days of back events migrates correctly before putting this into a release.
>
> Also, please update comments generally to explain what happens in the event of concurrent initialization from multiple auth servers.

You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.

fspmarshall · 2021-04-26T20:00:44Z

> You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.

No, my experience is limited to etcd and local backends. @quinqu could probably assist on this front (or will know who can).

xacrimon · 2021-04-27T15:54:56Z

> > You don't happen to have a test cluster anywhere do you? I'm afraid I don't have any homelab setup to run this on.
>
> No, my experience is limited to etcd and local backends. @quinqu could probably assist on this front (or will know who can).

Found a cluster I had lying around on DO with about a week's worth of events. Will apply this migration and validate results.

xacrimon · 2021-04-27T17:57:35Z

@fspmarshall Added comments explaining what happens in case of concurrent migration from multiple auth servers.

xacrimon · 2021-04-27T21:45:23Z

Successfully migrated a homelab cluster containing 1 week's worth of various events without issues.

awly

Note: there's a migration shim for regular backends called on startup

teleport/lib/service/service.go

Lines 3153 to 3155 in 4b11dc4

	if err := bk.Migrate(ctx); err != nil {
		return nil, trace.Wrap(err)
	}

If this dynamo events backend implements a backend.Backend, consider triggering the migration there.

awly · 2021-04-27T21:08:42Z

lib/events/dynamoevents/dynamoevents.go

@@ -43,6 +43,35 @@ import (
 	log "github.com/sirupsen/logrus"
 )

+// isoDateLayout is the time formatting layout used by the date attribute on events.
+const isoDateLayout = "2006-01-02"


is this ISO 8601?
if yes, rename to iso8601DateFormat
if no, clarify what iso means here

Done, renamed to iso8601DateFormat

awly · 2021-04-27T21:14:50Z

lib/events/dynamoevents/dynamoevents.go

+			delay := utils.HalfJitter(time.Minute * 5)
+			log.WithError(err).Errorf("Failed RFD 24 migration, making another attempt in %f seconds", delay.Seconds())


delay is never actually used to wait

also, the delay interval seems rather long.
do we want to block the auth server startup for up to 5min because the first attempt at index creation failed?

We could reduce this to something shorter, such as 1 minute, without much extra work being done, I think. We may hit API errors from DynamoDB more often on duplicate attempts at index modification, but that is safe as long as we retry.

Reduced to 1 minute in a recent commit which feels like a good compromise.

awly · 2021-04-27T21:16:30Z

lib/events/dynamoevents/dynamoevents.go

+	}
+
+	// Table is already up to date.
+	// We use the existence of the V1 index has a completion flag


Suggested change

-	// We use the existence of the V1 index has a completion flag
+	// We use the existence of the V1 index as a completion flag

awly · 2021-04-27T21:16:44Z

lib/events/dynamoevents/dynamoevents.go

+
+	// Table is already up to date.
+	// We use the existence of the V1 index has a completion flag
+	// for migration. We remove it and the end of the migration which


Suggested change

-	// for migration. We remove it and the end of the migration which
+	// for migration. We remove it at the end of the migration which

awly · 2021-04-27T21:22:53Z

lib/events/dynamoevents/dynamoevents.go

+			log.Info("Removing old DynamoDB index")
+			err = l.removeV1GSI()
+			if err != nil {
+				log.WithError(err).Error("Migrated all events to v6.2 format successfully but failed remove old index.")
+			} else {
+				break
+			}


should this only happen when migrateDateAttribute succeeded?

It should, control flow has been fixed.

awly · 2021-04-27T21:33:09Z

lib/events/dynamoevents/dynamoevents.go

-			IndexName:                 aws.String(indexTimeSearch),
-			ExclusiveStartKey:         lastEvaluatedKey,
+dateLoop:
+	for _, date := range dates {


This does O(N) queries. Is there any way we can do a single query over only CreatedAt without CreatedAtDate?

Unfortunately, the index partitioning disallows this because of how DynamoDB works internally. A single query is what we do at the moment on master, but partitioning constraints mean all data ends up in one hardcoded partition, which is precisely the problem RFD 24 works around, since partitions have data limits. I am currently unaware of a way to do this and did not find one the last time I investigated. I will have another look through the Dynamo API docs in case anything has changed.

Had another look through the API and this seems to be impossible with current API limitations. BatchGetRequest looked interesting at first but it turns out it only supports point lookups for range keys which won't work for our case.

awly · 2021-04-27T21:35:02Z

lib/events/dynamoevents/dynamoevents.go

+	}
+
+	for _, gsi := range tableDescription.Table.GlobalSecondaryIndexes {
+		if *gsi.IndexName == indexName && (*gsi.IndexStatus == dynamodb.IndexStatusActive || *gsi.IndexStatus == dynamodb.IndexStatusUpdating) {


can any of these fields be nil?

They cannot according to the SDK if I am reading correctly. The name is mandatory and the index status is a string enum that's always populated.

awly · 2021-04-27T21:44:13Z

lib/events/dynamoevents/dynamoevents.go

+
+		// For every item processed by this scan iteration we send an update action
+		// that adds the new date attribute.
+		for _, item := range scanOut.Items {


Can this loop check the existing attributes on the event and skip events that were already migrated?

The scan already filters out events that have the new attribute on the Dynamo side using the query scan filter attribute_not_exists(CreatedAtDate). All events that this loop iterates over are not migrated.
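The effect of that filter can be illustrated with an in-memory sketch. Everything here is a simplification for illustration: `event` and `addDateAttribute` are hypothetical, and `CreatedAt` is assumed to hold Unix seconds as a string. Skipping items that already have `CreatedAtDate` mirrors what `attribute_not_exists(CreatedAtDate)` does on the DynamoDB side, and it makes a repeated or concurrent pass harmless.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const iso8601DateFormat = "2006-01-02"

// event is a hypothetical simplified item: attribute name -> value.
type event map[string]string

// addDateAttribute sets CreatedAtDate on every event that lacks it and
// returns how many events were updated.
func addDateAttribute(events []event) (int, error) {
	updated := 0
	for _, e := range events {
		if _, ok := e["CreatedAtDate"]; ok {
			// Already migrated; the real scan filter would never
			// return this item in the first place.
			continue
		}
		// Assumption: CreatedAt is a Unix timestamp in seconds.
		secs, err := strconv.ParseInt(e["CreatedAt"], 10, 64)
		if err != nil {
			return updated, fmt.Errorf("parsing CreatedAt: %w", err)
		}
		e["CreatedAtDate"] = time.Unix(secs, 0).UTC().Format(iso8601DateFormat)
		updated++
	}
	return updated, nil
}

func main() {
	events := []event{
		{"CreatedAt": "1619485200"},
		{"CreatedAt": "1619568000", "CreatedAtDate": "2021-04-28"},
	}
	n, err := addDateAttribute(events)
	fmt.Println(n, err, events[0]["CreatedAtDate"])
}
```

Running the function a second time over the same events updates nothing, which is the property that lets several auth servers attempt the migration without corrupting data.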

awly · 2021-04-27T21:45:30Z

lib/events/dynamoevents/dynamoevents.go

+
+			_, err = l.svc.UpdateItem(c)
+			if err != nil {
+				log.Infof("item fail data %q", item)


nit: this looks like a leftover from debugging

awly · 2021-04-27T21:46:51Z

CHANGELOG.md

+This release of teleport contains minor features and bugfixes.
+
+* Changed DynamoDB events backend indexing strategy. [#6583](https://github.com/gravitational/teleport/pull/6583)
+  Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server and no other nodes online. It may take some time and progress will be periodically written to syslog. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.


Suggested change

-  Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server and no other nodes online. It may take some time and progress will be periodically written to syslog. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.
+  Warning! This will trigger a data migration on the first start after upgrade. For optimal performance perform this migration with only one auth server online. It may take some time and progress will be periodically written to the auth server log. Once Teleport starts and is accessible via Web UI, the rest of the cluster may be started.

awly · 2021-04-27T21:48:12Z

@xacrimon could you also add some unit tests? even if it's using a stub dynamodb implementation

xacrimon · 2021-04-28T16:25:43Z

> @xacrimon could you also add some unit tests? even if it's using a stub dynamodb implementation

@awly I could, but I am unsure how to go about this without either a) not testing anything significant with regard to the migration logic or b) reimplementing large parts of DynamoDB in a mock implementation with all its validation logic, which would take significant time, since an off-the-shelf solution supporting all the APIs we use does not exist as far as I am aware.

I will go ahead and write some tests using a stub implementation regardless since it's an improvement anyhow.

klizhentas

Bot.

xacrimon added the backport-required label Apr 23, 2021

xacrimon added this to the 6.2 "Buffalo" milestone Apr 23, 2021

xacrimon self-assigned this Apr 23, 2021

xacrimon added the aws Used for AWS Related Issues. label Apr 23, 2021

xacrimon force-pushed the joel/graceful-event-two branch from df1abfc to db242cd Compare April 24, 2021 13:54

xacrimon marked this pull request as ready for review April 24, 2021 14:03

xacrimon requested review from klizhentas, r0mant and russjones as code owners April 24, 2021 14:03

xacrimon requested review from fspmarshall and quinqu April 24, 2021 14:04

xacrimon requested review from awly and removed request for quinqu April 26, 2021 16:43

xacrimon force-pushed the joel/graceful-event-two branch 2 times, most recently from 09cc40b to 8946827 Compare April 26, 2021 19:16

fspmarshall reviewed Apr 26, 2021

xacrimon requested a review from fspmarshall April 27, 2021 17:56

awly reviewed Apr 27, 2021

xacrimon requested a review from awly April 28, 2021 16:17

russjones added the rfd Request for Discussion label Apr 28, 2021

fspmarshall approved these changes May 5, 2021

xacrimon added 22 commits May 6, 2021 18:33

implement rfd 24

3beae47

fix lints

0378646

use index existance as checkpoint

bbc7c13

adjust logs

8cfdcd0

added loop labels

72b8a5b

create v2 gsi as part of table creation

d939970

adjust index creation logic

1df7aba

retry with jitter

c4c6723

fix typo

0b801c0

various fixes and improvements

7c77be5

parameter formatting

8759c1d

update daysBetween to return correct dates regardless of time

02bab9a

print delay correctly

0977651

add comment

51fecd8

add some unit tests and revert some earlier changes since we now use the v1 index for checkpointing

00d6aca

supply time location

22117ab

update other test

fe40ad0

added event migration unit test

e658b46

andrew feedback

4ee80eb

adjust check

062d147

sort correctly

ca2efd2

refactor sleeps to work with context cancellation

7bbb53f

xacrimon force-pushed the joel/graceful-event-two branch from 441ec37 to 7bbb53f Compare May 6, 2021 16:34

klizhentas approved these changes May 6, 2021

xacrimon merged commit 1316e67 into master May 6, 2021

xacrimon deleted the joel/graceful-event-two branch May 6, 2021 17:18

This was referenced May 6, 2021

Backport #6583 "Implement RFD 24 for alternative DynamoDB event indexing" #6762

Merged

Adjust break behaviour in DynamoDB driver to avoid loosing events #6781

Closed

webvictim mentioned this pull request Sep 9, 2021

RFD 24: DynamoDB Audit Event Overflow Handling #6359

Merged
