[ML] Improving parsing of large uploaded files #62970
Merged
jgowdyelastic merged 6 commits into elastic:master on Apr 14, 2020
Conversation
Contributor
Pinging @elastic/ml-ui (:ml)
Member, Author
cc @droberts195
Contributor
peteharverson approved these changes on Apr 9, 2020
Tested and LGTM. One minor comment.
Member, Author
@elasticmachine merge upstream

Member, Author
@elasticmachine merge upstream
Contributor
💚 Build Succeeded
jgowdyelastic added a commit to jgowdyelastic/kibana that referenced this pull request on Apr 14, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
jgowdyelastic added a commit that referenced this pull request on Apr 15, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
gmmorris added a commit to gmmorris/kibana that referenced this pull request on Apr 15, 2020:
* master: (29 commits)
  Add test:jest_integration npm script (elastic#62938)
  [data.search.aggs] Remove service getters from agg types (AggConfig part) (elastic#62548)
  [Discover] Fix broken setting of bucketInterval (elastic#62939)
  Disable adding conditions when in alert management context. (elastic#63514)
  [Alerting] fixes to allow pre-configured actions to be executed (elastic#63432)
  adding useMemo (elastic#63504)
  [Maps] fix double fetch when filter pill is added (elastic#63024)
  [Lens] Fix missing formatting bug in "break down by" (elastic#63288)
  [SIEM] [Cases] Removed double pasted line (elastic#63507)
  [Reporting] Improve functional test steps (elastic#63259)
  [SIEM][CASE] Tests for server's configuration API (elastic#63099)
  [SIEM] [Cases] Case container unit tests (elastic#63376)
  [ML] Improving parsing of large uploaded files (elastic#62970)
  [ML] Listing global calendars on the job management page (elastic#63124)
  [Ingest][Endpoint] Add Ingest rest api response types for use in Endpoint (elastic#63373)
  Add help text to form fields (elastic#63165)
  [ML] Converts utils Mocha tests to Jest (elastic#63132)
  [Metrics UI] Refactor With* containers to hooks (elastic#59503)
  [NP] Migrate logstash server side code to NP (elastic#63135)
  Clicking cancel in saved query save modal doesn't close it (elastic#62774)
  ...
wayneseymour pushed a commit that referenced this pull request on Apr 15, 2020:
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
The data from the file is no longer read in one go. Instead it is read as an ArrayBuffer, with the first 5MB decoded and stored for sending to the find_file_structure endpoint. At the point of import, the data is chopped up into 100MB chunks for processing into ndjson docs.
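A minimal sketch of this read step is shown below, assuming the standard browser File and TextDecoder APIs; the constant and function names are illustrative, not the actual Kibana code:

```typescript
// Illustrative sketch only – names are hypothetical, not the real implementation.
const MAX_PREVIEW_BYTES = 5 * 1024 * 1024; // first 5MB sent to find_file_structure

async function readFileForUpload(file: File): Promise<{ data: ArrayBuffer; preview: string }> {
  // Keep the raw bytes rather than decoding the whole file to a string up front.
  const data = await file.arrayBuffer();

  // Decode only the first 5MB as text for the structure analysis request.
  const preview = new TextDecoder('utf-8').decode(data.slice(0, MAX_PREVIEW_BYTES));

  return { data, preview };
}
```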
When dividing the data, there is a good chance a partial line will be left at the end of each chunk. The length of this partial line is measured and the start of the next chunk is rolled back to include it.
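The boundary handling can be sketched like this, operating on already-decoded text for simplicity; the chunk size constant and function name are assumptions for illustration:

```typescript
// Illustrative sketch of the chunk-boundary rollback described above.
const CHUNK_SIZE = 100 * 1024 * 1024; // ~100MB of text per import chunk

function* createChunks(data: string): Generator<string> {
  let cursor = 0;
  while (cursor < data.length) {
    let end = Math.min(cursor + CHUNK_SIZE, data.length);

    if (end < data.length) {
      // Find the last complete line in this chunk...
      const lastNewline = data.lastIndexOf('\n', end - 1);
      if (lastNewline > cursor) {
        // ...and roll the boundary back so the partial line is carried into the next chunk.
        end = lastNewline + 1;
      }
    }

    yield data.slice(cursor, end);
    cursor = end;
  }
}
```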
With this change I've managed to import a 1.43GB CSV file locally, which took 11 minutes.
Because of this, I've increased the absolute maximum file size to 1GB.
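In effect, the limit change amounts to something like the following (hypothetical constant name, shown only to make the 1GB figure concrete):

```typescript
// Hypothetical constant; illustrates the new 1GB absolute upload limit.
export const ABSOLUTE_MAX_FILE_SIZE_BYTES = 1024 * 1024 * 1024; // 1GB
```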
Further improvements can be made to reduce the browser memory footprint. Currently the whole file is still stored in memory before import; it is just now broken up into parts.
A better approach would be to process the file in chunks while the upload is happening, removing the need to process the entire file before beginning the upload, but that will involve a larger architectural change.