[ML] Adds progress reporting for transforms (#41278) #41529

benwtrent · 2019-04-25T12:57:53Z

This admittedly large PR adds progress reporting to data frame transforms. Most of the size is due to refactoring caused by yak-shaving[0] :(.

Design decisions

I opted to put the progress reporting into its own object (so we can add more fields in the future) and put it directly under the State object. I kept it separate from the checkpoint information for simplicity's sake, as checkpoint status and progress are two distinct pieces of information.
Also, due to yak-shaving, much refactoring was done in this PR. All of it would have to be done eventually when we cancel the task on _stop, and I needed part of it for gathering progress information when the ES node executes the task.
The task now starts automatically when the node executor kicks it off. This is part of the yak-shaving refactoring. If _start creates the task (and the executor sees that it is a new task), it makes sense for it to start automatically without having to call start() on the allocated task on the node.
Progress information is now stored in the state, because gathering the "remaining docs" via a query could be very costly. Specifically, range queries against terms are very expensive.
Getting the total number of docs is a simple enough query.
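As a rough illustration of the "store the total, then decrement" approach described above, here is a minimal Python sketch. It is not the code in this PR: the class name `TransformProgress`, its field names, and the choice to clamp at zero are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TransformProgress:
    """Hypothetical progress object; names are illustrative, not from the PR."""
    total_docs: int      # counted once, up front, with the source query
    docs_remaining: int  # decremented as documents are processed

    def docs_processed(self, count: int) -> None:
        # The index may have changed since the total was counted, so clamp
        # at zero instead of letting the remainder go negative (assumption).
        self.docs_remaining = max(0, self.docs_remaining - count)

    @property
    def percent_complete(self) -> float:
        if self.total_docs == 0:
            return 100.0  # assumption: nothing to process counts as done
        done = self.total_docs - self.docs_remaining
        return 100.0 * done / self.total_docs


progress = TransformProgress(total_docs=1000, docs_remaining=1000)
progress.docs_processed(250)
print(progress.percent_complete)  # 25.0
```

Keeping only a counter in the state means no extra query is needed to report progress; the trade-off is that the figure is an estimate if the source index changes while the transform runs.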

Considerations

This is "good enough" progress reporting. No guarantees are made, as the index could be updated so that the cursor actually hits more or fewer docs than initially counted.

Future work

Have the total docs query take checkpointing into account. Right now, it only uses the data frame source query. As new checkpoints are executed, the query will have to change to give an accurate count of the total docs expected to be processed in that checkpoint.
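One possible shape for such a checkpoint-aware count, sketched in Python as a plain query body. This is speculative: the PR does not say how a checkpoint would bound the query, so the time field, the bounds, and the helper name `checkpoint_count_query` are all hypothetical. Only the `bool`/`filter` and `range` query clauses are standard Elasticsearch query DSL.

```python
def checkpoint_count_query(source_query, time_field, lower, upper):
    """Wrap the source query with a range filter for one checkpoint.

    Hypothetical helper: assumes a checkpoint can be described as a
    half-open interval [lower, upper) on a single time field.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    source_query,  # the transform's existing source query
                    {"range": {time_field: {"gte": lower, "lt": upper}}},
                ]
            }
        }
    }


body = checkpoint_count_query(
    {"match_all": {}},
    "timestamp",  # assumed field name
    "2019-04-25T00:00:00Z",
    "2019-04-25T01:00:00Z",
)
print(body["query"]["bool"]["filter"][1])
```

A body like this could be sent to the `_count` API so that the total reflects only the documents expected in that checkpoint rather than the whole source.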

[0] https://en.wiktionary.org/wiki/yak_shaving

Backport of #41278

* [ML] Adds progress reporting for transforms
* fixing after master merge
* Addressing PR comments
* removing unused imports
* Adjusting afterKey handling and percentage to be 100*
* Making sure it is a linked hashmap for serialization
* removing unused import
* addressing PR comments
* removing unused import
* simplifying code, only storing total docs and decrementing
* adjusting for rewrite
* removing initial progress gathering from executor

elasticmachine · 2019-04-25T12:57:55Z

Pinging @elastic/ml-core

benwtrent · 2019-04-25T14:34:59Z

run elasticsearch-ci/1

benwtrent added >non-issue backport v7.2.0 :ml/Transform Transform labels Apr 25, 2019

benwtrent merged commit 08843ba into elastic:7.x Apr 25, 2019

benwtrent deleted the feature/ml-df-calculate-docs-left-7.x branch April 25, 2019 16:23
