[ML] Improving parsing of large uploaded files #62970
Merged
jgowdyelastic merged 6 commits into elastic:master on Apr 14, 2020
Conversation
Contributor
Pinging @elastic/ml-ui (:ml)
Member, Author
cc @droberts195
Contributor
peteharverson approved these changes on Apr 9, 2020
Tested and LGTM. One minor comment.
Member, Author
@elasticmachine merge upstream

Member, Author
@elasticmachine merge upstream
Contributor
💚 Build Succeeded
jgowdyelastic added a commit to jgowdyelastic/kibana that referenced this pull request on Apr 14, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
jgowdyelastic added a commit that referenced this pull request on Apr 15, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
gmmorris added a commit to gmmorris/kibana that referenced this pull request on Apr 15, 2020:
* master: (29 commits)
  Add test:jest_integration npm script (elastic#62938)
  [data.search.aggs] Remove service getters from agg types (AggConfig part) (elastic#62548)
  [Discover] Fix broken setting of bucketInterval (elastic#62939)
  Disable adding conditions when in alert management context. (elastic#63514)
  [Alerting] fixes to allow pre-configured actions to be executed (elastic#63432)
  adding useMemo (elastic#63504)
  [Maps] fix double fetch when filter pill is added (elastic#63024)
  [Lens] Fix missing formatting bug in "break down by" (elastic#63288)
  [SIEM] [Cases] Removed double pasted line (elastic#63507)
  [Reporting] Improve functional test steps (elastic#63259)
  [SIEM][CASE] Tests for server's configuration API (elastic#63099)
  [SIEM] [Cases] Case container unit tests (elastic#63376)
  [ML] Improving parsing of large uploaded files (elastic#62970)
  [ML] Listing global calendars on the job management page (elastic#63124)
  [Ingest][Endpoint] Add Ingest rest api response types for use in Endpoint (elastic#63373)
  Add help text to form fields (elastic#63165)
  [ML] Converts utils Mocha tests to Jest (elastic#63132)
  [Metrics UI] Refactor With* containers to hooks (elastic#59503)
  [NP] Migrate logstash server side code to NP (elastic#63135)
  Clicking cancel in saved query save modal doesn't close it (elastic#62774)
  ...
wayneseymour pushed a commit that referenced this pull request on Apr 15, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
The data from the file is no longer read in one go. Instead it is read as an ArrayBuffer, with the first 5MB decoded and stored for sending to the find_file_structure endpoint. At the point of import, the data is chopped up into 100MB chunks for processing into ndjson docs.
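A minimal sketch of this read step is shown below, assuming the standard browser File and TextDecoder APIs; the constant and function names are illustrative, not the actual Kibana code:

```typescript
// Illustrative sketch only – names are hypothetical, not the real implementation.
const MAX_PREVIEW_BYTES = 5 * 1024 * 1024; // first 5MB sent to find_file_structure

async function readFileForUpload(file: File): Promise<{ data: ArrayBuffer; preview: string }> {
  // Keep the raw bytes rather than decoding the whole file to a string up front.
  const data = await file.arrayBuffer();

  // Decode only the first 5MB as text for the structure analysis request.
  const preview = new TextDecoder('utf-8').decode(data.slice(0, MAX_PREVIEW_BYTES));

  return { data, preview };
}
```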
When dividing the data, there is a good chance a partial line will be left at the end of each chunk. The length of this partial line is measured and the start of the next chunk is rolled back to include it.
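The boundary handling can be sketched like this, operating on already-decoded text for simplicity; the chunk size constant and function name are assumptions for illustration:

```typescript
// Illustrative sketch of the chunk-boundary rollback described above.
const CHUNK_SIZE = 100 * 1024 * 1024; // ~100MB of text per import chunk

function* createChunks(data: string): Generator<string> {
  let cursor = 0;
  while (cursor < data.length) {
    let end = Math.min(cursor + CHUNK_SIZE, data.length);

    if (end < data.length) {
      // Find the last complete line in this chunk...
      const lastNewline = data.lastIndexOf('\n', end - 1);
      if (lastNewline > cursor) {
        // ...and roll the boundary back so the partial line is carried into the next chunk.
        end = lastNewline + 1;
      }
    }

    yield data.slice(cursor, end);
    cursor = end;
  }
}
```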
With this change I've managed to import a 1.43GB CSV file locally, which took 11 minutes.
Because of this, I've increased the absolute maximum file size to 1GB.
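In effect, the limit change amounts to something like the following (hypothetical constant name, shown only to make the 1GB figure concrete):

```typescript
// Hypothetical constant; illustrates the new 1GB absolute upload limit.
export const ABSOLUTE_MAX_FILE_SIZE_BYTES = 1024 * 1024 * 1024; // 1GB
```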
Further improvements can be made to reduce the browser memory footprint. Currently the whole file is still stored in memory before import; it is just now broken up into parts.
A better approach would be to process the file in chunks while the upload is happening, removing the need to process the entire file before beginning the upload, but that will involve a larger architectural change.