Add pipeline to clean docs during data stream reindex #121617
Pinging @elastic/es-data-management (Team:Data Management)
Hi @parkertimmins, I've created a changelog YAML for you.
    final BytesReference pipeline = BytesReference.bytes(currentPipelineDefinition());
    client.execute(
        PutPipelineTransportAction.TYPE,
        new PutPipelineRequest(
We probably ought to set the parent task id here so that it's cancellable (although I'm not 100% sure it's worth it).
    {
        builder.startObject();
        {
            builder.startObject("set");
Is this going to set every @timestamp to 0? Shouldn't this be a script processor that checks if it exists first?
Oh never mind! This is what override is for.
It would be interesting to compare which performs better: a set with override: false or a set with an if condition checking for a null @timestamp
It's hard to imagine that a script would be faster, but it would be interesting. We could always change that later though.
And if the script is faster, that would probably shame @joegallo into action, making the set processor faster.
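For reference, the two variants discussed in this thread should touch exactly the same documents. The plain-Java sketch below models that on a `Map`. It is not the ingest implementation and says nothing about relative performance. `TimestampDefaulting`, `setWithoutOverride` and `setIfNull` are hypothetical names, and the condition text in the comment is an assumption. The sketch also assumes that `override: false` leaves a field alone only when it already holds a non-null value, as the set processor documentation describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the two pipeline variants; not Elasticsearch code.
class TimestampDefaulting {
    static final String FIELD = "@timestamp";

    // Models a set processor with "override": false:
    // the field is written only when it is absent or null.
    static Map<String, Object> setWithoutOverride(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        if (out.get(FIELD) == null) {
            out.put(FIELD, 0);
        }
        return out;
    }

    // Models a set processor guarded by a condition such as
    // "if": "ctx['@timestamp'] == null" (the condition text is an assumption).
    static Map<String, Object> setIfNull(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        boolean conditionHolds = out.get(FIELD) == null;
        if (conditionHolds) {
            out.put(FIELD, 0);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> missing = new HashMap<>();
        missing.put("message", "no timestamp");
        Map<String, Object> present = new HashMap<>();
        present.put(FIELD, "2025-02-04T00:00:00Z");

        // Both variants should agree on both kinds of document.
        System.out.println(setWithoutOverride(missing).equals(setIfNull(missing)));
        System.out.println(setWithoutOverride(present).equals(setIfNull(present)));
    }
}
```

If the model holds, the choice between the two is purely about speed, which would have to be measured on a real cluster.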
Looks pretty good, but a couple of things remain:
dakrone left a comment
this generally looks good, but I'm curious why we don't use our existing infrastructure for installing pipelines?
    }

    @Override
    protected String getOrigin() {
Since this no longer needs to be run with the user's permissions, it seemed better not to require the user to have put-pipeline, and instead to make a new user with system perms (or something like that) and only give it to this registry.
I definitely agree.
          }
        }
      ],
      "version": 1000
Since version is now handled by the index template registry, the way to keep it from overwriting a custom template is to use a higher version number.
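A rough model of the versioning rule described here, assuming the registry installs its pipeline only when nothing usable is installed or the installed version is lower than the registry's own. `PipelineVersionGate`, `shouldInstall` and `REGISTRY_VERSION` are hypothetical names, and the value 1 is made up for illustration. Treating a pipeline with no version the same as a missing one is also an assumption.

```java
// Hypothetical model of version-gated installation; not the registry's actual code.
class PipelineVersionGate {
    // Made-up value standing in for the registry's own pipeline version.
    static final int REGISTRY_VERSION = 1;

    // installedVersion is null when no pipeline exists or it carries no version.
    // Assumption: both cases let the registry install its own pipeline.
    static boolean shouldInstall(Integer installedVersion) {
        return installedVersion == null || installedVersion < REGISTRY_VERSION;
    }

    public static void main(String[] args) {
        System.out.println("nothing installed: " + shouldInstall(null));
        System.out.println("custom pipeline at version 1000: " + shouldInstall(1000));
    }
}
```

Under that assumption, a custom pipeline with a large version such as 1000 is never replaced until the registry's version passes it.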
masseyke left a comment
LGTM
dakrone left a comment
LGTM, I left one comment about which origin to use
    public static final String APM_ORIGIN = "apm";
    public static final String OTEL_ORIGIN = "otel";
    public static final String REINDEX_DATA_STREAM_ORIGIN = "reindex_data_stream";
    public static final String MIGRATE_ORIGIN = "migrate";
Let's use the existing STACK_ORIGIN origin, I think there's an aversion to adding too many of these if I remember correctly.
Add the pipeline "reindex-data-stream-pipeline" to the reindex request within ReindexDataStreamIndexAction. This cleans up documents as needed before inserting into the destination index. Currently, the pipeline only sets a timestamp field with a value of 0 if the document is missing a timestamp field. This is needed because existing indices which are added to a data stream may not contain a timestamp, but reindex validates that a timestamp field exists when creating data stream destination indices. This pipeline is managed by ES, but can be overridden by users if necessary. To do this, the version field of the pipeline should be set to a value higher than the MigrateRegistry version.
💔 Backport failed
You can use sqren/backport to manually backport by running
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
Add the pipeline "reindex-data-stream-pipeline" to the reindex request within ReindexDataStreamIndexAction. This cleans up documents as needed before inserting into the destination index. Currently, the pipeline only sets a timestamp field with a value of 0 if the document is missing a timestamp field. This is needed because existing indices which are added to a data stream may not contain a timestamp, but reindex validates that a timestamp field exists when creating data stream destination indices. This pipeline is managed by ES, but can be overridden by users if necessary. To do this, the version field of the pipeline should be set to a value higher than the MigrateRegistry version.
(cherry picked from commit 29965bc)
# Conflicts:
# x-pack/plugin/migrate/src/internalClusterTest/java/org/elasticsearch/xpack/migrate/action/ReindexDatastreamIndexTransportActionIT.java
…#121730)
* Add pipeline to clean docs during data stream reindex (#121617)
Add the pipeline "reindex-data-stream-pipeline" to the reindex request within ReindexDataStreamIndexAction. This cleans up documents as needed before inserting into the destination index. Currently, the pipeline only sets a timestamp field with a value of 0 if the document is missing a timestamp field. This is needed because existing indices which are added to a data stream may not contain a timestamp, but reindex validates that a timestamp field exists when creating data stream destination indices. This pipeline is managed by ES, but can be overridden by users if necessary. To do this, the version field of the pipeline should be set to a value higher than the MigrateRegistry version.
(cherry picked from commit 29965bc)
# Conflicts:
# x-pack/plugin/migrate/src/internalClusterTest/java/org/elasticsearch/xpack/migrate/action/ReindexDatastreamIndexTransportActionIT.java
* Remove timeouts that weren't present in 8x.
#121729)
* Add pipeline to clean docs during data stream reindex (#121617)
Add the pipeline "reindex-data-stream-pipeline" to the reindex request within ReindexDataStreamIndexAction. This cleans up documents as needed before inserting into the destination index. Currently, the pipeline only sets a timestamp field with a value of 0 if the document is missing a timestamp field. This is needed because existing indices which are added to a data stream may not contain a timestamp, but reindex validates that a timestamp field exists when creating data stream destination indices. This pipeline is managed by ES, but can be overridden by users if necessary. To do this, the version field of the pipeline should be set to a value higher than the MigrateRegistry version.
(cherry picked from commit 29965bc)
# Conflicts:
# x-pack/plugin/migrate/src/internalClusterTest/java/org/elasticsearch/xpack/migrate/action/ReindexDatastreamIndexTransportActionIT.java
* Remove timeouts that weren't present in 8x.
It is possible for documents in a data stream to not have a @timestamp field, for example if an existing index is added to a data stream. But this will cause the reindex operation to fail, as the destination index will already contain the _data_stream_timestamp value from the source mappings, causing each document to be checked for an @timestamp field.
This change adds a pipeline, tentatively called reindex-data-stream, which adds a @timestamp field with a value of 0 to destination docs if a timestamp is missing. If a user creates a pipeline with this name, and without a version field, the user's pipeline will be used instead of the built-in pipeline.
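To make the failure mode concrete, here is a small plain-Java model of the flow described above, with no Elasticsearch dependencies. `validateTimestamp` stands in for the data stream timestamp check and `applyPipeline` for the new pipeline. Those names, the `ReindexTimestampSketch` class and the exception type are all hypothetical, so treat this as a sketch of the idea rather than the actual reindex code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the reindex flow; none of this is Elasticsearch code.
class ReindexTimestampSketch {
    static final String TIMESTAMP = "@timestamp";

    // Stand-in for the check the destination index performs on each document.
    static void validateTimestamp(Map<String, Object> doc) {
        if (doc.get(TIMESTAMP) == null) {
            throw new IllegalArgumentException("document is missing " + TIMESTAMP);
        }
    }

    // Stand-in for the pipeline: default @timestamp to 0 only when it is missing.
    static Map<String, Object> applyPipeline(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        out.putIfAbsent(TIMESTAMP, 0);
        return out;
    }

    // Returns the number of documents indexed; throws on the first invalid one.
    static int reindex(List<Map<String, Object>> source, boolean usePipeline) {
        int indexed = 0;
        for (Map<String, Object> doc : source) {
            Map<String, Object> toIndex = usePipeline ? applyPipeline(doc) : doc;
            validateTimestamp(toIndex);
            indexed++;
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, Object> legacy = new HashMap<>();
        legacy.put("message", "doc from an index added to the data stream");
        Map<String, Object> regular = new HashMap<>();
        regular.put(TIMESTAMP, "2025-02-04T00:00:00Z");
        List<Map<String, Object>> source = new ArrayList<>();
        source.add(legacy);
        source.add(regular);

        System.out.println("with pipeline: " + reindex(source, true) + " docs indexed");
        try {
            reindex(source, false);
        } catch (IllegalArgumentException e) {
            System.out.println("without pipeline: " + e.getMessage());
        }
    }
}
```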