Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step#149931

Merged
gsoldevila merged 43 commits into elastic:main from gsoldevila:kbn-147237-active-cleanup
Feb 27, 2023

Conversation

Contributor

@gsoldevila gsoldevila commented Jan 31, 2023

In the context of migrations, #147371 avoids reindexing during an upgrade, provided that diffMappings === false.

This alternative path skips some key steps that are performed before reindexing:

  • CHECK_UNKNOWN_DOCUMENTS
  • CALCULATE_EXCLUDE_FILTERS

These steps enrich a search query that is used during reindexing, effectively filtering out undesired documents.

If the mappings match (or they are compatible) and we no longer reindex, this cleanup operation does not happen, leaving undesired documents in our system indices.

The goal of this PR is to add an extra step to the state machine (CLEANUP_UNKNOWN_AND_EXCLUDED), which actively cleans up a system index when we take the skip reindexing path.

(image: migration state machine diagram)

Fixes #150299
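The cleanup query the new step issues can be sketched as follows (a minimal illustration with simplified clause shapes; the migrator's actual helpers use different names):

```typescript
// Hypothetical sketch: combine the plugin-defined excludeOnUpgrade filters
// with one term clause per unknown type into a single delete-by-query body.
type Clause = Record<string, unknown>;

function buildCleanupQuery(
  unknownDocTypes: string[],
  excludeFilterClauses: Clause[]
): Clause {
  return {
    bool: {
      should: [
        // filters registered by plugins via excludeOnUpgrade
        ...excludeFilterClauses,
        // unknown types detected by checkForUnknownDocs
        ...unknownDocTypes.map((type) => ({ term: { type } })),
      ],
    },
  };
}
```

A document matching any of these should clauses gets deleted; documents of known types that match no exclude filter are left untouched.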

@gsoldevila gsoldevila added bug Fixes for quality problems that affect the customer experience Team:Core Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc. release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Feature:Migrations v8.7.0 labels Jan 31, 2023
bool: {
should: [
...excludeFiltersRes.filterClauses,
...unknownDocTypes.map((type) => ({ term: { type } })),
Contributor Author

checkForUnknownDocs takes up to 1000 unknown document types (along with up to 100 doc ids for each of them). At the moment it seems reasonable to assume that this response will contain ALL unknown types.

Contributor Author

gsoldevila commented Jan 31, 2023

A possible alternative for the deleteQuery would be to delete those documents that:

  • A) have a type, and that type is not any of the known types
  • B) OR they match any of the plugin-defined excludeFromUpgrade filters
query: {
  bool: {
     should: [
         A,
         ...B,
     ]
  }
}

A: {
  bool: {
     must: [
       { exists: { field: 'type' } }
     ],
     mustNot: KNOWN_TYPES.map((type) => ({ term: { type } }))
  }
}

B: [ // exclude from upgrade filters ]

For comparison, the currently proposed query is deleting those documents that:

  • A) Have an unknown type
  • B) OR Have a deleted type
  • C) OR they match any of the plugin-defined excludeFromUpgrade filters
query: {
  bool: {
     should: [
         ...A,      // UNKNOWN_TYPES.map((type) => ({ term: { type } }))
         ...B,      // REMOVED_TYPES.map((type) => ({ term: { type } }))
         ...C,      // exclude from upgrade filters
     ]
  }
}

At first glance, the current approach seems a bit safer, as we explicitly state which document types we want to delete (as opposed to which ones we do NOT want to delete). We should perhaps assess which one is cheaper in terms of performance.
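For a concrete comparison, the two candidate shapes can be sketched like this (identifiers such as KNOWN_TYPES are illustrative; note that raw ES DSL spells the clause must_not):

```typescript
type Clause = Record<string, unknown>;

// Alternative: delete any document whose type is NOT one of the known types.
function buildExclusionQuery(knownTypes: string[], excludeFilters: Clause[]): Clause {
  return {
    bool: {
      should: [
        {
          bool: {
            must: [{ exists: { field: 'type' } }],
            must_not: knownTypes.map((type) => ({ term: { type } })),
          },
        },
        ...excludeFilters,
      ],
    },
  };
}

// Currently proposed: explicitly enumerate the unknown and removed types.
function buildEnumerationQuery(
  unknownTypes: string[],
  removedTypes: string[],
  excludeFilters: Clause[]
): Clause {
  return {
    bool: {
      should: [
        ...unknownTypes.map((type) => ({ term: { type } })),
        ...removedTypes.map((type) => ({ term: { type } })),
        ...excludeFilters,
      ],
    },
  };
}
```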

Contributor Author

gsoldevila commented Jan 31, 2023

Note that for the active delete we are NOT write-locking the system indices. Thus, if a delete fails halfway through, or another actor injects unknown documents into the system indices, restarting Kibana MUST be able to finish the cleanup and start properly.

Two possible solutions:

  • Running the CLEANUP_UNKNOWN_AND_EXCLUDED step systematically, even on versionMigrationCompleted === true deployments (no stack upgrade). This would ensure we perform an active cleanup at each startup (possibly time consuming).
    • This could also be interesting from a functional standpoint, as it would allow cleaning up more often.
    • We would need to be careful with unpersisted search sessions if we go down this path.
  • Alternatively, update the version aliases AFTER the active cleanup. This way, subsequent starts after failed attempts (or other nodes running the migration) would also run the active cleanup steps. This seems like a more efficient alternative, but it is perhaps less safe in terms of guaranteeing consistency at startup. <== CURRENTLY IMPLEMENTED IN THE PR

Contributor

@rudolf rudolf left a comment


Left some nits, but overall the code looks good and I think we just need unit test coverage for the model.

index,
query: deleteQuery,
wait_for_completion: true,
refresh: true,
Contributor

Is there a reason we want to force a refresh here? In general refresh: 'wait_for' feels safer?

Contributor Author

Actually, wait_for_completion alone does not seem to be enough. active_delete.test.ts performs a search operation on the index, and if I remove refresh: true the test fails, as it still gets back some documents that should have been deleted.

Contributor

can you try with refresh: 'wait_for'? If you remove refresh it defaults to false, which would explain why the test search still finds the deleted documents.

Contributor Author

That does not seem to be a valid parameter value (the accepted type is boolean | undefined). Leaving refresh: true as discussed.
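The resulting request parameters can be sketched like this (a sketch of the shape discussed in this thread; for _delete_by_query, refresh only accepts a boolean, which is why 'wait_for' is rejected):

```typescript
// Sketch: delete-by-query parameters for the cleanup step. index and query
// are placeholders for the migrator's actual values.
function buildDeleteByQueryParams(index: string, query: Record<string, unknown>) {
  return {
    index,
    query,
    // Block until the delete finishes before moving to the next step.
    wait_for_completion: true,
    // Force a refresh once deletes complete: without it, a follow-up search
    // (e.g. in active_delete.test.ts) can still return the deleted documents.
    refresh: true,
  };
}
```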

export const logFilePath = Path.join(__dirname, 'active_delete.test.log');
const currentVersion = Env.createDefault(REPO_ROOT, getEnvOptions()).packageInfo.version;

describe('active delete', () => {
Contributor

nit: can we use a longer description so that it's easier to understand what this test does in e.g. a skipped test issue

(result) => result._source?.type === 'basic'
).length;

expect(basicDocumentCount).toEqual(3);
Contributor

it's really nice to know these assertions will be stable unlike some of our other tests where the constants need to be tweaked the whole time 😍

Comment on lines +141 to +143
root: {
level: 'off',
},
Contributor

do we need this? We set the root logger level to info a few lines lower, so it feels like this could be removed?

Contributor Author

At some point the schema validation was failing and I thought "logging.root" was mandatory. Will remove it 👍🏼

],
};
}
} else if (stateP.controlState === 'CLEANUP_UNKNOWN_AND_EXCLUDED') {
Contributor

nit, this was already "out of sync" but I think it helps a little bit to read the code if the control states sort of follow the execution flow. CLEANUP_UNKNOWN_AND_EXCLUDED and PREPARE_COMPATIBLE_MIGRATION would always happen after WAIT_FOR_YELLOW_SOURCE so maybe we can move these code blocks below it? (similar thing in next.ts)

// saved objects which are no longer used. These saved objects will still be
// kept in the outdated index for backup purposes, but won't be available in
// the upgraded index.
export const unpersistedSearchSessionsQuery = {
Contributor

nit: can we add a comment like: // TODO: move to an excludeOnUpgrade hook in the data plugin (need to double check that it's actually the data plugin that registers this type)

Contributor Author

I confirm it's the data plugin. Seems pretty straightforward to move it there, I'll update it.

checkForUnknownDocs({ client, indexName, knownTypes, excludeOnUpgradeQuery }),
TaskEither.chain(
(
unknownDocsRes: {} | UnknownDocsFound
Contributor

we can polish this in a follow-up PR too because it's not the PR that introduced it, but the fact that checkForUnknownDocs returns an unnamed type {} makes the code here a bit harder to follow. It would be nicer if the types were NoUnknownDocsFound | UnknownDocsFound

The unknownDocTypes variable further below is also a bit awkward which makes me wonder if checkForUnknownDocs should maybe return only one "success" response with an Option to show if docs were found or not.

but I don't think it's worth trying to do this as part of this PR

Contributor Author

Created follow-up issue to tackle this:
#150286

@gsoldevila gsoldevila marked this pull request as ready for review February 6, 2023 12:34
@gsoldevila gsoldevila requested review from a team as code owners February 6, 2023 12:34
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@gsoldevila gsoldevila requested a review from rudolf February 6, 2023 12:34
Contributor

@davismcphee davismcphee left a comment


Search session SO type change LGTM!

@gsoldevila gsoldevila added backport:prev-minor v8.8.0 and removed backport:skip This PR does not require backporting labels Feb 16, 2023
.deleteByQuery({
index: indexName,
query,
wait_for_completion: true,
Contributor

this would mean the request fails if it takes longer than the default timeout to complete. I think we set that to 120s, but it's probably worth validating that when a deleteByQuery takes too long we get a retryable_es_client error, because this could easily happen, and as long as we retry it should be fine.

There is a risk that we create a new deleteByQuery before the existing one has completed, but with our exponential backoff that should be fine. Also, I don't think we should usually have a huge number of docs to delete.

Contributor Author

Fair point, thanks for the insight!

I take it we still need that flag, because not adding it and carrying on with the flow could have worse consequences.

I'll validate the timeout doubt.

Contributor

yes, without it we'd get back a task and we'd need to add a step like the other *_WAIT_FOR_TASK steps. I didn't want to have to add another step, but it does actually feel safer. Otherwise, if a cluster is slow, we could have several Kibanas piling deleteByQuery operations on top of each other, making matters worse.

Delete should be quicker than the updateByQuery we run in UPDATE_TARGET_MAPPINGS but for that action we did have this problem a few times in production.

So it's starting to feel like it'd be safer to just create a task and wait for it and know we don't have anything to worry about.
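A hedged sketch of that task-based flow (EsClientLike is an assumed stand-in for the Elasticsearch JS client; step wiring and the 60s timeout are illustrative):

```typescript
// Minimal stand-in interface for the Elasticsearch client (assumption).
interface EsClientLike {
  deleteByQuery(params: Record<string, unknown>): Promise<{ task?: string | number }>;
  tasks: {
    get(params: Record<string, unknown>): Promise<{ completed: boolean }>;
  };
}

async function startCleanupTask(
  client: EsClientLike,
  index: string,
  query: Record<string, unknown>
): Promise<string> {
  const { task } = await client.deleteByQuery({
    index,
    query,
    refresh: true,
    // Return immediately with a task id instead of holding the HTTP request
    // open; avoids piling up concurrent deletes on a slow cluster.
    wait_for_completion: false,
  });
  return String(task);
}

async function waitForCleanupTask(client: EsClientLike, taskId: string) {
  // A dedicated *_WAIT_FOR_TASK step would poll until the task finishes.
  return client.tasks.get({ task_id: taskId, wait_for_completion: true, timeout: '60s' });
}
```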

Contributor Author

Let's play it safe then. I added a CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step in my latest push.

index: indexName,
query,
wait_for_completion: true,
refresh: true,
Contributor

@rudolf rudolf Feb 16, 2023


nit: can we add a comment why this is necessary because in the future someone might think "we don't need to force the refresh"... note to our future selves

@jloleysens
Contributor

Thanks for addressing my nits 😄 (did not re-test locally)

@gsoldevila
Contributor Author

> Thanks for addressing my nits 😄 (did not re-test locally)

Sure! whatever you nit 🥁

client,
index: state.targetIndex,
transformedDocs: state.transformedDocBatches[state.currentBatch],
operations: state.bulkOperationBatches[state.currentBatch],
Contributor Author

I was thinking, we no longer ONLY update documents in this step, but we delete them as well.
What happens if there are conflicts...

  • ...document updated by another instance? We have optimistic concurrency control for that one.
  • ...document deleted by another instance? does this cause the bulk operation to fail? I don't see a conflicts: 'proceed' flag here.

Contributor Author

This part of the flow affects corrupt and transform errors, and it is run systematically even by up-to-date deployments, so if a deployment fails to delete some of them it will retry on next startup.

The question here is whether it is safe to leave some corrupt and un-transformed documents in the system indices, and start "normally". WDYT @rudolf ?

Contributor Author

Looking at the Bulk API docs, it seems that errors do not halt the operation; they are collected and returned instead.

Given that we don't send _seq_no or _primary_term for our delete operations in the bulk request, I think it is safe to assume that if a delete fails it is because it couldn't find the corresponding document.

Contributor

good you raised this... I think the biggest problem is if delete "conflicts" cause the whole migration to fail. Even if a restart should fix everything seeing a FATAL error is never nice.

Like Ahmad explains, deleteByQuery throws version conflict exceptions for a double delete. But if we don't specify _seq_no / _primary_term, I'm not sure that bulk delete would do the same. It would probably return a not_found error, so ideally we should handle and ignore that in the model.
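One way to sketch that handling (a hypothetical helper, not the migrator's actual code): treat status 404 on a bulk delete item as benign (the document was already removed, e.g. by another node) and surface only the remaining item-level errors:

```typescript
// Simplified item shapes; the real Bulk API response carries more fields.
interface BulkItemResult {
  status: number;
  error?: { type: string; reason?: string };
}
type BulkResponseItem = Partial<Record<'index' | 'delete', BulkItemResult>>;

function collectFatalBulkErrors(items: BulkResponseItem[]): BulkItemResult[] {
  const fatal: BulkItemResult[] = [];
  for (const item of items) {
    for (const op of ['index', 'delete'] as const) {
      const res = item[op];
      if (!res) continue;
      // A delete of an already-removed document reports status 404; per the
      // discussion above, that is safe to ignore.
      if (op === 'delete' && res.status === 404) continue;
      if (res.error) fatal.push(res);
    }
  }
  return fatal;
}
```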

Contributor Author

@gsoldevila gsoldevila Feb 17, 2023


The docs seem to indicate that such errors are collected and the bulk operation continues (we don't even have a conflicts: 'proceed' flag here), but I'll double check to confirm.

After sleeping on it, I realised the other consequence of this change: we now perform this cleanup of corrupt documents and transform errors systematically at startup (even on up-to-date deployments), whereas we used to do it on version upgrades only. This is less true for transform errors, given that in a normal scenario all documents' versions should match the current one.

And, as a matter of fact, this is an improvement over the current implementation, where corrupt docs and transform errors are not cleaned up, and the migrator currently ignores the discardCorruptObjects flag at this point (I believe it will simply fail to start if transform errors occur).

@gsoldevila gsoldevila self-assigned this Feb 20, 2023
Contributor

@rudolf rudolf left a comment


I can see you're still working on this so you probably know about these, but thought I'd leave some comments

Comment on lines +493 to +500
return {
...stateP,
controlState: 'FATAL',
reason: extractUnknownDocFailureReason(
stateP.migrationDocLinks.resolveMigrationFailures,
res.left.unknownDocs
),
};
Contributor

should this be throwBadResponse? didn't check out the code locally but I would expect res.left.unknownDocs to throw a type error

Contributor Author

@gsoldevila gsoldevila Feb 21, 2023


The other part of the model.ts where we check for unknown docs also fails with a FATAL. What is the difference in terms of behavior between FATAL state and throwBadResponse? Should I update both?

});

it('resolves with `Either.left`, if the delete query fails', async () => {
const versionConflicts = [
Contributor

should be moved to the wait for delete by query action test

Contributor Author

Yeah that one still needs some work / refactor.

...stateP,
controlState: 'FATAL',
reason:
`Migration failed because it was unable to delete unwanted documents from the ${stateP.sourceIndex.value} system index:\n` +
Contributor

we should ignore version conflicts and proceed

Contributor Author

I first implemented a custom retryDelay mechanism here, to make sure the delete is re-run if there are failures. Then I got rid of it; I thought that by failing with FATAL we'd restart and re-run the delete (checking again for unwanted documents).

If we ignore the failures and carry on with the migration, we're going to update the aliases in the next step. Thus, if for some reason we fail to delete one of the SOs with an unknown type, Kibana is not going to be able to start, nor to delete that document at startup, right?

Contributor

yeah, I should have thought this through more carefully before saying we should proceed...

I don't think we should throw a FATAL even if everything will be fine in the end after we restart. Upgrades are a sensitive procedure and it's alarming for a user to see FATAL in the logs. This also feels like something that could happen fairly frequently; it's not just an obscure edge case.

So I think the behaviour you had where we retry would be better. Looking at the diff, you used to retry 5 times, but I think we can retry retryAttempts times, which defaults to 15.
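A sketch of that retry behaviour (the backoff constants and helper names are illustrative assumptions, not the migrator's actual implementation):

```typescript
// Exponential backoff: 1s, 2s, 4s, ... capped at maxDelayMs.
function getRetryDelay(attempt: number, maxDelayMs = 64_000): number {
  return Math.min(1000 * 2 ** attempt, maxDelayMs);
}

// Re-run the action up to retryAttempts times (defaulting to 15, as
// mentioned above) instead of going FATAL on the first failure.
async function withRetries<T>(
  action: () => Promise<T>,
  retryAttempts = 15,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < retryAttempts; attempt++) {
    try {
      return await action();
    } catch (e) {
      lastError = e;
      await sleep(getRetryDelay(attempt));
    }
  }
  throw lastError;
}
```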

@kibana-ci

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-root-server-internal | 22 | 24 | +2 |
| @kbn/core-saved-objects-migration-server-internal | 79 | 81 | +2 |
| total | | | +4 |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-saved-objects-migration-server-internal | 45 | 46 | +1 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-root-server-internal | 23 | 25 | +2 |
| @kbn/core-saved-objects-migration-server-internal | 112 | 116 | +4 |
| total | | | +6 |

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core | 0 | 1 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core | 6 | 7 | +1 |
| securitySolution | 428 | 430 | +2 |
| total | | | +3 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core | 6 | 8 | +2 |
| securitySolution | 506 | 508 | +2 |
| total | | | +4 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @gsoldevila

@gsoldevila gsoldevila merged commit 754e868 into elastic:main Feb 27, 2023
@kibanamachine
Contributor

💔 All backports failed

| Status | Branch | Result |
| --- | --- | --- |
| failed | 8.7 | Backport failed because of merge conflicts |

Manual backport

To create the backport manually run:

node scripts/backport --pr 149931

Questions?

Please refer to the Backport tool documentation

gsoldevila added a commit to gsoldevila/kibana that referenced this pull request Feb 27, 2023
In the context of migrations,
elastic#147371 avoids reindexing during
an upgrade, provided that `diffMappings === false`.

This _alternative path_ skips some key steps that are performed before
reindexing:
* `CHECK_UNKNOWN_DOCUMENTS`
* `CALCULATE_EXCLUDE_FILTERS`

These steps enrich a search query that is used during reindexing,
effectively filtering out undesired documents.

If the mappings [match](elastic#147371)
(or they are
[compatible](elastic#149326)) and we _no
longer reindex_, this cleanup operation does not happen, leaving
undesired documents in our system indices.

The goal of this PR is to add an extra step in the state machine
(`CLEANUP_UNKNOWN_AND_EXCLUDED`), which will actively cleanup a system
index if we're going the _skip reindexing_ path.

![image](https://user-images.githubusercontent.com/25349407/216979691-fef40638-f990-4850-bac8-ee3e58330a7f.png)

Fixes elastic#150299

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 754e868)
stateP.targetIndexMappings
)
) &&
Math.random() < 10
Contributor Author

This extra check slipped through, it is harmless and it will be removed in #149326

gsoldevila added a commit that referenced this pull request Feb 27, 2023
# Backport

This will backport the following commits from `main` to `8.7`:
- [Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step
(#149931)](#149931)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

bmorelli25 pushed a commit to bmorelli25/kibana that referenced this pull request Mar 10, 2023
@rudolf rudolf added the Epic:ScaleMigrations Scale upgrade migrations to millions of saved objects label Jun 2, 2023

Labels

bug Fixes for quality problems that affect the customer experience · Epic:ScaleMigrations Scale upgrade migrations to millions of saved objects · Feature:Migrations · release_note:skip Skip the PR/issue when compiling release notes · Team:Core Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc. · v8.7.0 · v8.8.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cleanup when mappings are compatible (unknown types and excludeOnUpgrade)

9 participants