@benwtrent
Member

This admittedly large PR adds progress reporting to data frame transforms. Most of its size comes from refactoring caused by yak-shaving[0] :(.

Design decisions

  • I opted to put the progress reporting into its own object (so we can add more fields in the future) and put it directly under the State object. I kept it separate from the checkpoint information for simplicity's sake, since checkpoint status and progress are two distinct pieces of information.
  • Also, due to yak-shaving, much refactoring was done in this PR. All of it would have had to be done eventually, when we cancel the task on _stop, and I needed part of it now to gather progress information when the ES node executes the task.
  • The task now starts automatically when the node executor kicks it off. This is part of the yak-shaving refactoring. It makes sense that if _start creates the task (and the executor sees that it is a new task), it should start automatically without having to call start() on the allocated task on the node.
  • Progress information is now stored in the state, because gathering the "remaining docs" via a query could be very costly. Specifically, range queries against terms are very expensive.
  • Getting the total number of docs, by contrast, is a simple enough query.
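
To illustrate the approach described above (count the total docs once, keep the progress in the state, and decrement as docs are processed), here is a minimal sketch. It is not the code from this PR: the class name `TransformProgressSketch` and all of its methods are hypothetical, and the clamping at zero is an assumption made to cope with an index that changes while the transform runs.

```java
// Hypothetical sketch: names are illustrative and not taken from the PR's code.
final class TransformProgressSketch {
    // Total docs counted once up front from the source query.
    private final long totalDocs;
    // Docs still expected to be processed; decremented as work completes.
    private long remainingDocs;

    TransformProgressSketch(long totalDocs) {
        if (totalDocs < 0) {
            throw new IllegalArgumentException("totalDocs must be non-negative");
        }
        this.totalDocs = totalDocs;
        this.remainingDocs = totalDocs;
    }

    // Record that a batch of docs was processed.
    void docsProcessed(long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        // Assumption: clamp at zero, since the index may have gained docs
        // after the initial count was taken.
        remainingDocs = Math.max(0L, remainingDocs - count);
    }

    long getTotalDocs() {
        return totalDocs;
    }

    long getRemainingDocs() {
        return remainingDocs;
    }

    // Percentage in the range [0, 100], i.e. 100 * (fraction complete).
    double getPercentComplete() {
        if (totalDocs == 0) {
            return 100.0; // nothing to process counts as done
        }
        return 100.0 * (totalDocs - remainingDocs) / totalDocs;
    }
}
```

With a total of 200 docs, reporting 50 processed docs leaves 150 remaining. Because only two counters are stored, no additional query is needed to report progress while the task runs.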

Considerations

  • This is "good enough" progress reporting. No guarantees are made, as the index could be updated so that the cursor actually hits more or fewer docs than initially counted.

Future work

  • Have the total docs query take checkpointing into account. Right now, it only utilizes the dataframe source query. As new checkpoints are executed, the query will have to change to give an accurate count of the total docs expected to be processed in that checkpoint.

[0] https://en.wiktionary.org/wiki/yak_shaving

Backport of #41278

* [ML] Adds progress reporting for transforms

* fixing after master merge

* Addressing PR comments

* removing unused imports

* Adjusting afterKey handling and percentage to be 100*

* Making sure it is a linked hashmap for serialization

* removing unused import

* addressing PR comments

* removing unused import

* simplifying code, only storing total docs and decrementing

* adjusting for rewrite

* removing initial progress gathering from executor
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member Author

run elasticsearch-ci/1

@benwtrent benwtrent merged commit 08843ba into elastic:7.x Apr 25, 2019
@benwtrent benwtrent deleted the feature/ml-df-calculate-docs-left-7.x branch April 25, 2019 16:23