Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step #149931
Conversation
bool: {
  should: [
    ...excludeFiltersRes.filterClauses,
    ...unknownDocTypes.map((type) => ({ term: { type } })),
checkForUnknownDocs takes up to 1000 unknown document types (along with up to 100 doc ids for each of them). ATM it seems reasonable to think that we'll have ALL unknown types in this response.
A possible alternative for the query:

query: {
  bool: {
    should: [
      A,
      ...B,
    ]
  }
}
A: {
  bool: {
    must: [
      { exists: { field: 'type' } }
    ],
    must_not: KNOWN_TYPES.map((type) => ({ term: { type } })),
  }
}

B: [ /* exclude from upgrade filters */ ]

For comparison, the currently proposed query deletes those documents that:
query: {
  bool: {
    should: [
      ...A, // UNKNOWN_TYPES.map((type) => ({ term: { type } }))
      ...B, // REMOVED_TYPES.map((type) => ({ term: { type } }))
      ...C, // exclude from upgrade filters
    ]
  }
}

At first glance, the current approach seems a bit safer, as we explicitly state which document types we want to delete (as opposed to which ones we do NOT want to delete). We should perhaps assess which one is cheaper in terms of performance.
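The two shapes being compared can be sketched as query builders. This is an illustrative TypeScript sketch; the function names and the `QueryDsl` alias are invented for the example and are not the actual Kibana implementation.

```typescript
type QueryDsl = Record<string, unknown>;

// Alternative: delete every document whose type is NOT in the known list.
function buildExcludeKnownTypesQuery(
  knownTypes: string[],
  excludeFilters: QueryDsl[]
): QueryDsl {
  return {
    bool: {
      should: [
        {
          bool: {
            must: [{ exists: { field: 'type' } }],
            must_not: knownTypes.map((type) => ({ term: { type } })),
          },
        },
        ...excludeFilters,
      ],
    },
  };
}

// Currently proposed: explicitly list the types we DO want to delete.
function buildDeleteUnwantedTypesQuery(
  unknownTypes: string[],
  removedTypes: string[],
  excludeFilters: QueryDsl[]
): QueryDsl {
  return {
    bool: {
      should: [
        ...unknownTypes.map((type) => ({ term: { type } })),
        ...removedTypes.map((type) => ({ term: { type } })),
        ...excludeFilters,
      ],
    },
  };
}
```

Both builders produce a `bool.should` query, but only the first one needs `exists` + `must_not` to express "everything except the known types".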
Note that for the active delete we are NOT write locking the system indices. Thus, if a delete fails halfway through, or another actor injects unknown documents into the system indices, restarting Kibana MUST be able to finish the cleanup and start properly. Two possible solutions:
rudolf
left a comment
Left some nits, but overall the code looks good and I think we just need unit test coverage for the model.
index,
query: deleteQuery,
wait_for_completion: true,
refresh: true,
Is there a reason we want to force a refresh here? In general refresh: 'wait_for' feels safer?
Actually, the wait_for_completion does not seem enough. The active_delete.test.ts performs a search operation on the index, and if I remove the refresh: true the test fails, as it still gets back some documents that should be deleted.
can you try with refresh: 'wait_for'? If you remove refresh it defaults to false which would explain why the test search still finds deleted documents.
That does not seem to be a valid parameter value (boolean | undefined). Leaving refresh: true as discussed.
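The options settled on above could be captured in a small builder, as a note to future readers. This is a hypothetical sketch (the `buildCleanupDeleteParams` helper is invented for illustration, not actual Kibana code); it only records the conclusions of the thread: `wait_for_completion: true`, and `refresh: true` because deleteByQuery accepts a boolean `refresh` only, not `'wait_for'`.

```typescript
interface CleanupDeleteParams {
  index: string;
  query: Record<string, unknown>;
  wait_for_completion: boolean;
  refresh: boolean;
}

function buildCleanupDeleteParams(
  index: string,
  query: Record<string, unknown>
): CleanupDeleteParams {
  return {
    index,
    query,
    // block until the delete finishes instead of returning a task
    wait_for_completion: true,
    // deleteByQuery only accepts a boolean refresh ('wait_for' is not valid
    // here); forcing a refresh ensures a search right after the delete no
    // longer returns the deleted documents
    refresh: true,
  };
}
```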
export const logFilePath = Path.join(__dirname, 'active_delete.test.log');
const currentVersion = Env.createDefault(REPO_ROOT, getEnvOptions()).packageInfo.version;

describe('active delete', () => {
nit: can we use a longer description so that it's easier to understand what this test does in e.g. a skipped test issue
(result) => result._source?.type === 'basic'
).length;

expect(basicDocumentCount).toEqual(3);
it's really nice to know these assertions will be stable unlike some of our other tests where the constants need to be tweaked the whole time 😍
root: {
  level: 'off',
},
do we need this? we set the root logger level to info some lines lower, so feels like this could be removed?
At some point the schema validation was failing and I thought "logging.root" was mandatory. Will remove it 👍🏼
],
};
}
} else if (stateP.controlState === 'CLEANUP_UNKNOWN_AND_EXCLUDED') {
nit, this was already "out of sync" but I think it helps a little bit to read the code if the control states sort of follow the execution flow. CLEANUP_UNKNOWN_AND_EXCLUDED and PREPARE_COMPATIBLE_MIGRATION would always happen after WAIT_FOR_YELLOW_SOURCE so maybe we can move these code blocks below it? (similar thing in next.ts)
// saved objects which are no longer used. These saved objects will still be
// kept in the outdated index for backup purposes, but won't be available in
// the upgraded index.
export const unpersistedSearchSessionsQuery = {
nit: can we add a comment like: // TODO: move to an excludeOnUpgrade hook in data plugin (need to double check if it's actually the data plugin that registers this type)
I confirm it's the data plugin. Seems pretty straightforward to move it there, I'll update it.
checkForUnknownDocs({ client, indexName, knownTypes, excludeOnUpgradeQuery }),
TaskEither.chain(
  (
    unknownDocsRes: {} | UnknownDocsFound
we can polish this in a follow-up PR too because it's not the PR that introduced it, but the fact that checkForUnknownDocs returns an unnamed type {} makes the code here a bit harder to follow. It would be nicer if the types were NoUnknownDocsFound | UnknownDocsFound
The unknownDocTypes variable further below is also a bit awkward which makes me wonder if checkForUnknownDocs should maybe return only one "success" response with an Option to show if docs were found or not.
but I don't think it's worth trying to do this as part of this PR
Created follow-up issue to tackle this:
#150286
Pinging @elastic/kibana-core (Team:Core)
davismcphee
left a comment
Search session SO type change LGTM!
.deleteByQuery({
  index: indexName,
  query,
  wait_for_completion: true,
this would mean the request fails if it takes longer than the default timeout to complete. I think we set that to 120s, but it's probably worth validating that when a deleteByQuery takes too long we get a retryable_es_client error. Because this could easily happen, but as long as we retry it should be fine.
There is a risk that we create a new deleteByQuery before the existing one completes, but with our exponential backoff that should be fine. Also, I don't think we should usually have a huge number of docs to delete.
Fair point, thanks for the insight!
I take it we still need that flag, cause not adding it and carrying on with the flow could have worse consequences.
I'll validate the timeout doubt.
yes, without it we'd get back a task and we'd need to add a step like the other *_WAIT_FOR_TASK steps. I didn't want to have to add another step, but it does actually feel safer. Otherwise, if a cluster is slow, we could have several Kibanas piling deleteByQueries on top of each other, making matters worse.
Delete should be quicker than the updateByQuery we run in UPDATE_TARGET_MAPPINGS but for that action we did have this problem a few times in production.
So it's starting to feel like it'd be safer to just create a task and wait for it and know we don't have anything to worry about.
Let's play it safe then. I added a CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step in my latest push.
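The "start a task, then wait for it" flow that the new CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step implies can be sketched roughly as follows. The `TaskClient` interface is a stand-in for the Elasticsearch client, invented for the example; it is not the actual migrator action API.

```typescript
interface TaskClient {
  // deleteByQuery with wait_for_completion: false returns a task id immediately
  startDeleteByQuery(index: string, query: object): Promise<{ taskId: string }>;
  // a dedicated step then polls the task until it completes
  waitForTask(
    taskId: string,
    timeout: string
  ): Promise<{ completed: boolean; failures: string[] }>;
}

async function cleanupWithTask(client: TaskClient, index: string, query: object) {
  const { taskId } = await client.startDeleteByQuery(index, query);
  // waiting on the task (instead of blocking the request) means a slow
  // cluster cannot end up with several Kibana instances piling concurrent
  // deleteByQuery operations on top of each other
  return client.waitForTask(taskId, '60s');
}
```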
index: indexName,
query,
wait_for_completion: true,
refresh: true,
nit: can we add a comment why this is necessary because in the future someone might think "we don't need to force the refresh"... note to our future selves
Thanks for addressing my nits 😄 (did not re-test locally)

Sure! whatever you nit 🥁
client,
index: state.targetIndex,
transformedDocs: state.transformedDocBatches[state.currentBatch],
operations: state.bulkOperationBatches[state.currentBatch],
I was thinking, we no longer ONLY update documents in this step, but we delete them as well.
What happens if there are conflicts...
- ...document updated by another instance? We have optimistic concurrency control for that one.
- ...document deleted by another instance? Does this cause the bulk operation to fail? I don't see a conflicts: 'proceed' flag here.
This part of the flow affects corrupt and transform errors, and it is run systematically even by up-to-date deployments, so if a deployment fails to delete some of them it will retry on next startup.
The question here is whether it is safe to leave some corrupt and un-transformed documents in the system indices, and start "normally". WDYT @rudolf ?
Looking at the Bulk API docs, it seems that errors do not halt the operation; they are collected and returned instead.
Given that we don't set _seq_no or _primary_term for our delete operations in the bulk, I think it is safe to assume that if a delete fails it is because it couldn't find the corresponding document.
good you raised this... I think the biggest problem is if delete "conflicts" cause the whole migration to fail. Even if a restart should fix everything, seeing a FATAL error is never nice.
Like Ahmad explains, deleteByQuery throws version conflict exceptions for a double delete. But if we don't specify a _seq_no / _primary_term, I'm not sure that bulk delete would do the same. It would probably return a not found error, so ideally we should handle and ignore that in the model.
The docs seem to indicate that such errors are collected and the bulk operation continues (we don't even have the conflicts: 'proceed' flag here), but I'll double check that to confirm.
After sleeping on it, I realised the other consequence of this change is that we now perform this cleanup of corrupt and transform errors systematically at startup (even on up-to-date deployments), whereas we used to do that on version upgrade only. This is less true for the transform errors, given that in a normal scenario all documents' versions should match the current one.
And, as a matter of fact, this is an improvement over the current implementation, where corrupt docs and transform errors are not cleaned up, and the migrator currently ignores the discardCorruptObjects flag at this point (I believe it will simply fail to start if transform errors occur).
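The "not found is fine" handling discussed above can be sketched as a filter over a Bulk API response: delete items that failed because the document was already gone (404) are treated as successes, while anything else is surfaced. The `BulkItem` shape is simplified for the example and is not the real client type.

```typescript
interface BulkItem {
  delete?: { status: number; error?: { type: string } };
  index?: { status: number; error?: { type: string } };
}

function significantBulkErrors(items: BulkItem[]): BulkItem[] {
  return items.filter((item) => {
    const op = item.delete ?? item.index;
    if (!op || !op.error) return false; // successful item
    // a delete without _seq_no/_primary_term that races another instance
    // reports 404: the document is gone either way, so treat it as success
    if (item.delete && op.status === 404) return false;
    return true; // anything else is a real failure
  });
}
```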
rudolf
left a comment
I can see you're still working on this so you probably know about these, but thought I'd leave some comments
return {
  ...stateP,
  controlState: 'FATAL',
  reason: extractUnknownDocFailureReason(
    stateP.migrationDocLinks.resolveMigrationFailures,
    res.left.unknownDocs
  ),
};
should this be throwBadResponse? didn't check out the code locally but I would expect res.left.unknownDocs to throw a type error
The other part of the model.ts where we check for unknown docs also fails with a FATAL. What is the difference in terms of behavior between FATAL state and throwBadResponse? Should I update both?
});

it('resolves with `Either.left`, if the delete query fails', async () => {
  const versionConflicts = [
should be moved to the wait for delete by query action test
Yeah that one still needs some work / refactor.
...stateP,
controlState: 'FATAL',
reason:
  `Migration failed because it was unable to delete unwanted documents from the ${stateP.sourceIndex.value} system index:\n` +
we should ignore version conflicts and proceed
I first implemented a custom retryDelay mechanism here, to make sure the delete is re-run if there are failures.
Then I got rid of it, I thought that by failing with FATAL we'd restart and re-run the delete (checking again for unwanted documents).
If we ignore the failures and carry on with the migration, we're going to update the aliases on the next step.
Thus, if for some reason we fail to delete one of the SOs with an unknown type, Kibana is not going to be able to start, nor to delete that document at startup, right?
yeah I should have thought more carefully through this before saying we should proceed...
I don't think we should throw a FATAL even if everything will be fine in the end after we restart. Upgrades are a sensitive procedure and it's alarming to a user to see FATAL in the logs. This also feels like something that could happen fairly frequently; it's not just an obscure edge case.
So I think the behaviour you had where we retry would be better. Looking at the diff you used to retry 5 times but I think we can retry retryAttempts times which defaults to 15.
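The suggested behaviour (retry the cleanup up to `retryAttempts` times, defaulting to 15, instead of transitioning straight to FATAL) could look roughly like this. The `retryCleanup` helper and its `attempt` callback are invented for the sketch; the real implementation lives in the migrator model.

```typescript
async function retryCleanup(
  attempt: () => Promise<{ failures: number }>,
  maxAttempts: number = 15
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    const { failures } = await attempt();
    if (failures === 0) return true; // all unwanted documents are gone
  }
  return false; // only after exhausting retries would we go FATAL
}
```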
💚 Build Succeeded
💔 All backports failed
In the context of migrations, elastic#147371 avoids reindexing during an upgrade, provided that `diffMappings === false`.

This _alternative path_ skips some key steps that are performed before reindexing:
* `CHECK_UNKNOWN_DOCUMENTS`
* `CALCULATE_EXCLUDE_FILTERS`

These steps enrich a search query that is used during reindexing, effectively filtering out undesired documents.

If the mappings [match](elastic#147371) (or they are [compatible](elastic#149326)) and we _no longer reindex_, this cleanup operation does not happen, leaving undesired documents in our system indices.

The goal of this PR is to add an extra step in the state machine (`CLEANUP_UNKNOWN_AND_EXCLUDED`), which will actively clean up a system index if we're going the _skip reindexing_ path.

Fixes elastic#150299

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

(cherry picked from commit 754e868)
stateP.targetIndexMappings
)
) &&
Math.random() < 10
This extra check slipped through, it is harmless and it will be removed in #149326
# Backport

This will backport the following commits from `main` to `8.7`:
- Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step (#149931)