Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step #149931
Conversation
bool: {
  should: [
    ...excludeFiltersRes.filterClauses,
    ...unknownDocTypes.map((type) => ({ term: { type } })),
checkForUnknownDocs takes up to 1000 unknown document types (along with up to 100 doc ids for each of them). ATM it seems reasonable to think that we'll have ALL unknown types in this response.
A possible alternative for the query:

query: {
  bool: {
    should: [
      A,
      ...B,
    ]
  }
}
A: {
  bool: {
    must: [
      { exists: { field: 'type' } }
    ],
    must_not: KNOWN_TYPES.map((type) => ({ term: { type } })),
  }
}

B: [ /* exclude from upgrade filters */ ]

For comparison, the currently proposed query deletes those documents that:
query: {
  bool: {
    should: [
      ...A, // UNKNOWN_TYPES.map((type) => ({ term: { type } }))
      ...B, // REMOVED_TYPES.map((type) => ({ term: { type } }))
      ...C, // exclude from upgrade filters
    ]
  }
}

At first glance, the current approach seems a bit safer, as we explicitly state which document types we want to delete (as opposed to which ones we do NOT want to delete). We should perhaps assess which one is cheaper in terms of performance.
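The two shapes being compared can be sketched as query builders. This is an illustrative TypeScript sketch; the function names and the `QueryDsl` alias are invented for the example and are not the actual Kibana implementation.

```typescript
type QueryDsl = Record<string, unknown>;

// Alternative: delete every document whose type is NOT in the known list.
function buildExcludeKnownTypesQuery(
  knownTypes: string[],
  excludeFilters: QueryDsl[]
): QueryDsl {
  return {
    bool: {
      should: [
        {
          bool: {
            must: [{ exists: { field: 'type' } }],
            must_not: knownTypes.map((type) => ({ term: { type } })),
          },
        },
        ...excludeFilters,
      ],
    },
  };
}

// Currently proposed: explicitly list the types we DO want to delete.
function buildDeleteUnwantedTypesQuery(
  unknownTypes: string[],
  removedTypes: string[],
  excludeFilters: QueryDsl[]
): QueryDsl {
  return {
    bool: {
      should: [
        ...unknownTypes.map((type) => ({ term: { type } })),
        ...removedTypes.map((type) => ({ term: { type } })),
        ...excludeFilters,
      ],
    },
  };
}
```

Both builders produce a `bool.should` query, but only the first one needs `exists` + `must_not` to express "everything except the known types".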
Note that for the active delete we are NOT write locking the system indices. Thus, if a delete fails halfway through, or another actor injects unknown documents into the system indices, restarting Kibana MUST be able to finish the cleanup and start properly. Two possible solutions:
rudolf
left a comment
Left some nits, but overall the code looks good and I think we just need unit test coverage for the model.
index,
query: deleteQuery,
wait_for_completion: true,
refresh: true,
Is there a reason we want to force a refresh here? In general refresh: 'wait_for' feels safer?
Actually, the wait_for_completion does not seem enough. The active_delete.test.ts performs a search operation on the index, and if I remove the refresh: true the test fails, as it still gets back some documents that should be deleted.
can you try with refresh: 'wait_for'? If you remove refresh it defaults to false which would explain why the test search still finds deleted documents.
That does not seem to be a valid parameter value (boolean | undefined). Leaving refresh: true as discussed.
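The options settled on above could be captured in a small builder, as a note to future readers. This is a hypothetical sketch (the `buildCleanupDeleteParams` helper is invented for illustration, not actual Kibana code); it only records the conclusions of the thread: `wait_for_completion: true`, and `refresh: true` because deleteByQuery accepts a boolean `refresh` only, not `'wait_for'`.

```typescript
interface CleanupDeleteParams {
  index: string;
  query: Record<string, unknown>;
  wait_for_completion: boolean;
  refresh: boolean;
}

function buildCleanupDeleteParams(
  index: string,
  query: Record<string, unknown>
): CleanupDeleteParams {
  return {
    index,
    query,
    // block until the delete finishes instead of returning a task
    wait_for_completion: true,
    // deleteByQuery only accepts a boolean refresh ('wait_for' is not valid
    // here); forcing a refresh ensures a search right after the delete no
    // longer returns the deleted documents
    refresh: true,
  };
}
```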
export const logFilePath = Path.join(__dirname, 'active_delete.test.log');
const currentVersion = Env.createDefault(REPO_ROOT, getEnvOptions()).packageInfo.version;

describe('active delete', () => {
nit: can we use a longer description so that it's easier to understand what this test does in e.g. a skipped test issue
(result) => result._source?.type === 'basic'
).length;

expect(basicDocumentCount).toEqual(3);
it's really nice to know these assertions will be stable unlike some of our other tests where the constants need to be tweaked the whole time 😍
root: {
  level: 'off',
},
do we need this? we set the root logger level to info some lines lower, so feels like this could be removed?
At some point the schema validation was failing and I thought "logging.root" was mandatory. Will remove it 👍🏼
],
};
}
} else if (stateP.controlState === 'CLEANUP_UNKNOWN_AND_EXCLUDED') {
nit, this was already "out of sync" but I think it helps a little bit to read the code if the control states sort of follow the execution flow. CLEANUP_UNKNOWN_AND_EXCLUDED and PREPARE_COMPATIBLE_MIGRATION would always happen after WAIT_FOR_YELLOW_SOURCE so maybe we can move these code blocks below it? (similar thing in next.ts)
// saved objects which are no longer used. These saved objects will still be
// kept in the outdated index for backup purposes, but won't be available in
// the upgraded index.
export const unpersistedSearchSessionsQuery = {
nit: can we add a comment like: // TODO: move to an excludeOnUpgrade hook in data plugin (need to double check if it's actually the data plugin that registers this type)
I confirm it's the data plugin. Seems pretty straightforward to move it there, I'll update it.
checkForUnknownDocs({ client, indexName, knownTypes, excludeOnUpgradeQuery }),
TaskEither.chain(
  (
    unknownDocsRes: {} | UnknownDocsFound
we can polish this in a follow-up PR too because it's not the PR that introduced it, but the fact that checkForUnknownDocs returns an unnamed type {} makes the code here a bit harder to follow. It would be nicer if the types were NoUnknownDocsFound | UnknownDocsFound
The unknownDocTypes variable further below is also a bit awkward which makes me wonder if checkForUnknownDocs should maybe return only one "success" response with an Option to show if docs were found or not.
but I don't think it's worth trying to do this as part of this PR
Created follow-up issue to tackle this:
#150286
Pinging @elastic/kibana-core (Team:Core)
davismcphee
left a comment
Search session SO type change LGTM!
.deleteByQuery({
  index: indexName,
  query,
  wait_for_completion: true,
this would mean the request fails if it takes longer than the default timeout to complete. I think we set that to 120s, but it's probably worth validating that when a deleteByQuery takes too long we get a retryable_es_client error. Because this could easily happen, but as long as we retry it should be fine.
There is a risk that we create a new deleteByQuery before the existing one completes, but with our exponential backoff that should be fine. Also, I don't think we should usually have a huge number of docs to delete.
Fair point, thanks for the insight!
I take it we still need that flag, cause not adding it and carrying on with the flow could have worse consequences.
I'll validate the timeout doubt.
yes, without it we'd get back a task and we'd need to add a step like the other *_WAIT_FOR_TASK steps. I didn't want to have to add another step, but it does actually feel safer. Otherwise, if a cluster is slow, we could have several Kibanas piling deleteByQueries on top of each other, making matters worse.
Delete should be quicker than the updateByQuery we run in UPDATE_TARGET_MAPPINGS but for that action we did have this problem a few times in production.
So it's starting to feel like it'd be safer to just create a task and wait for it and know we don't have anything to worry about.
Let's play it safe then. I added a CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step in my latest push.
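The "start a task, then wait for it" flow that the new CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step implies can be sketched roughly as follows. The `TaskClient` interface is a stand-in for the Elasticsearch client, invented for the example; it is not the actual migrator action API.

```typescript
interface TaskClient {
  // deleteByQuery with wait_for_completion: false returns a task id immediately
  startDeleteByQuery(index: string, query: object): Promise<{ taskId: string }>;
  // a dedicated step then polls the task until it completes
  waitForTask(
    taskId: string,
    timeout: string
  ): Promise<{ completed: boolean; failures: string[] }>;
}

async function cleanupWithTask(client: TaskClient, index: string, query: object) {
  const { taskId } = await client.startDeleteByQuery(index, query);
  // waiting on the task (instead of blocking the request) means a slow
  // cluster cannot end up with several Kibana instances piling concurrent
  // deleteByQuery operations on top of each other
  return client.waitForTask(taskId, '60s');
}
```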
index: indexName,
query,
wait_for_completion: true,
refresh: true,
nit: can we add a comment why this is necessary because in the future someone might think "we don't need to force the refresh"... note to our future selves
Thanks for addressing my nits 😄 (did not re-test locally)

Sure! whatever you nit 🥁
client,
index: state.targetIndex,
transformedDocs: state.transformedDocBatches[state.currentBatch],
operations: state.bulkOperationBatches[state.currentBatch],
I was thinking, we no longer ONLY update documents in this step, but we delete them as well.
What happens if there are conflicts...
- ...document updated by another instance? We have optimistic concurrency control for that one.
- ...document deleted by another instance? Does this cause the bulk operation to fail? I don't see a conflicts: 'proceed' flag here.
This part of the flow affects corrupt and transform errors, and it is run systematically even by up-to-date deployments, so if a deployment fails to delete some of them it will retry on next startup.
The question here is whether it is safe to leave some corrupt and un-transformed documents in the system indices, and start "normally". WDYT @rudolf ?
Looking at the Bulk API docs, it seems that errors do not halt the operation; they are collected and returned instead.
Given that we don't set _seq_no or _primary_term for our delete operations in the bulk, I think it is safe to assume that if a delete fails it is because it couldn't find the corresponding document.
good you raised this... I think the biggest problem is if delete "conflicts" cause the whole migration to fail. Even if a restart should fix everything, seeing a FATAL error is never nice.
Like Ahmad explains, deleteByQuery throws version conflict exceptions for a double delete. But if we don't specify a _seq_no / _primary_term, I'm not sure that bulk delete would do the same. It would probably return a not found error, so ideally we should handle and ignore that in the model.
The docs seem to indicate that such errors are collected and the bulk operation continues (we don't even have the conflicts: 'proceed' flag here), but I'll double check that to confirm.
After sleeping on it, I realised the other consequence of this change is that we now perform this cleanup of corrupt and transform errors systematically at startup (even on up-to-date deployments), whereas we used to do that on version upgrade only. This is less true for the transform errors, given that in a normal scenario all documents' versions should match the current one.
And, as a matter of fact, this is an improvement over the current implementation, where corrupt docs and transform errors are not cleaned up, and the migrator currently ignores the discardCorruptObjects flag at this point (I believe it will simply fail to start if transform errors occur).
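The "not found is fine" handling discussed above can be sketched as a filter over a Bulk API response: delete items that failed because the document was already gone (404) are treated as successes, while anything else is surfaced. The `BulkItem` shape is simplified for the example and is not the real client type.

```typescript
interface BulkItem {
  delete?: { status: number; error?: { type: string } };
  index?: { status: number; error?: { type: string } };
}

function significantBulkErrors(items: BulkItem[]): BulkItem[] {
  return items.filter((item) => {
    const op = item.delete ?? item.index;
    if (!op || !op.error) return false; // successful item
    // a delete without _seq_no/_primary_term that races another instance
    // reports 404: the document is gone either way, so treat it as success
    if (item.delete && op.status === 404) return false;
    return true; // anything else is a real failure
  });
}
```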
rudolf
left a comment
I can see you're still working on this so you probably know about these, but thought I'd leave some comments
return {
  ...stateP,
  controlState: 'FATAL',
  reason: extractUnknownDocFailureReason(
    stateP.migrationDocLinks.resolveMigrationFailures,
    res.left.unknownDocs
  ),
};
should this be throwBadResponse? didn't check out the code locally but I would expect res.left.unknownDocs to throw a type error
The other part of the model.ts where we check for unknown docs also fails with a FATAL. What is the difference in terms of behavior between FATAL state and throwBadResponse? Should I update both?
});

it('resolves with `Either.left`, if the delete query fails', async () => {
  const versionConflicts = [
should be moved to the wait for delete by query action test
Yeah that one still needs some work / refactor.
...stateP,
controlState: 'FATAL',
reason:
  `Migration failed because it was unable to delete unwanted documents from the ${stateP.sourceIndex.value} system index:\n` +
we should ignore version conflicts and proceed
I first implemented a custom retryDelay mechanism here, to make sure the delete is re-run if there are failures.
Then I got rid of it, I thought that by failing with FATAL we'd restart and re-run the delete (checking again for unwanted documents).
If we ignore the failures and carry on with the migration, we're going to update the aliases on the next step.
Thus, if for some reason we fail to delete one of the SOs with an unknown type, Kibana is not going to be able to start, nor to delete that document at startup, right?
yeah I should have thought more carefully through this before saying we should proceed...
I don't think we should throw a FATAL even if everything will be fine in the end after we restart. Upgrades are a sensitive procedure and it's alarming to a user to see FATAL in the logs. This also feels like something that could happen fairly frequently; it's not just an obscure edge case.
So I think the behaviour you had where we retry would be better. Looking at the diff you used to retry 5 times but I think we can retry retryAttempts times which defaults to 15.
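The suggested behaviour (retry the cleanup up to `retryAttempts` times, defaulting to 15, instead of transitioning straight to FATAL) could look roughly like this. The `retryCleanup` helper and its `attempt` callback are invented for the sketch; the real implementation lives in the migrator model.

```typescript
async function retryCleanup(
  attempt: () => Promise<{ failures: number }>,
  maxAttempts: number = 15
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    const { failures } = await attempt();
    if (failures === 0) return true; // all unwanted documents are gone
  }
  return false; // only after exhausting retries would we go FATAL
}
```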
💚 Build Succeeded
💔 All backports failed
In the context of migrations, elastic#147371 avoids reindexing during an upgrade, provided that `diffMappings === false`.

This _alternative path_ skips some key steps that are performed before reindexing:
* `CHECK_UNKNOWN_DOCUMENTS`
* `CALCULATE_EXCLUDE_FILTERS`

These steps enrich a search query that is used during reindexing, effectively filtering out undesired documents.

If the mappings [match](elastic#147371) (or they are [compatible](elastic#149326)) and we _no longer reindex_, this cleanup operation does not happen, leaving undesired documents in our system indices.

The goal of this PR is to add an extra step in the state machine (`CLEANUP_UNKNOWN_AND_EXCLUDED`), which will actively clean up a system index if we're going the _skip reindexing_ path.

Fixes elastic#150299

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

(cherry picked from commit 754e868)
stateP.targetIndexMappings
)
) &&
Math.random() < 10
This extra check slipped through, it is harmless and it will be removed in #149326
# Backport

This will backport the following commits from `main` to `8.7`:
- Introduce CLEANUP_UNKNOWN_AND_EXCLUDED step (#149931)