[DataFrame] Read files in parallel (4x faster) #6984

Closed
wants to merge 1 commit into from


alamb
Contributor

@alamb alamb commented Jul 16, 2023

Which issue does this PR close?

Closes #6983
Closes #6908

Rationale for this change

This code uses a single core to read the file:

    let _df = _ctx.read_parquet(FILENAME, _read_options).await.unwrap();
    let _cached = _df.cache().await;

What changes are included in this PR?

Read the file using multiple cores instead of one.
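
The description above does not say how the work is divided, so the following is only a generic sketch of the idea, not the code in this PR. It assumes the input can be split into independent chunks (for example, row groups), and it uses plain scoped threads from the standard library as a stand-in for whatever the engine actually uses. `process_chunk` and `process_in_parallel` are hypothetical names introduced for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for decoding one independent chunk of a file
/// (for example, one row group). Here it simply sums the values.
fn process_chunk(chunk: &[u64]) -> u64 {
    chunk.iter().sum()
}

/// Processes the chunks on up to `workers` scoped threads and returns
/// the per-chunk results in the original chunk order.
fn process_in_parallel(chunks: &[Vec<u64>], workers: usize) -> Vec<u64> {
    if chunks.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Number of chunks handed to each worker (ceiling division).
    let per_worker = (chunks.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one thread per contiguous group of chunks.
        let handles: Vec<_> = chunks
            .chunks(per_worker)
            .map(|group| {
                scope.spawn(move || {
                    group.iter().map(|c| process_chunk(c)).collect::<Vec<u64>>()
                })
            })
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling `process_in_parallel(&chunks, n)` with `n` set to the number of available cores spreads the chunks over that many threads. Any speedup depends on the chunks being independent and on decoding, rather than I/O, being the bottleneck.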

Tested using cargo --release.

With main (16s)

datafusion end -> 2023-07-16T09:07:29.895269-04:00 16.080133858s

With this branch (3s)

datafusion end -> 2023-07-16T08:52:05.511984-04:00 2.947019517s
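
As a quick check on the two timings above, the speedup is the baseline time divided by the new time. The helper and constant names below are illustrative only; the numbers are copied from the log lines.

```rust
/// Ratio of the baseline wall-clock time to the improved one.
fn speedup(baseline_secs: f64, improved_secs: f64) -> f64 {
    baseline_secs / improved_secs
}

/// Elapsed time reported with main, in seconds.
const MAIN_SECS: f64 = 16.080133858;
/// Elapsed time reported with this branch, in seconds.
const BRANCH_SECS: f64 = 2.947019517;
```

`speedup(MAIN_SECS, BRANCH_SECS)` comes out a little above 5 for this single pair of runs.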

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 16, 2023
@alamb alamb changed the title [DataFrame] Read files in parallel [DataFrame] Read files in parallel (4x faster) Jul 16, 2023
@alamb alamb marked this pull request as ready for review July 16, 2023 13:14
@alamb alamb marked this pull request as draft July 16, 2023 13:32
@alamb
Contributor Author

alamb commented Jul 17, 2023

I think a better approach is described on #6983 (comment)

@alamb alamb closed this Jul 17, 2023
Labels
core Core DataFusion crate
Development

Successfully merging this pull request may close these issues.

[DataFrame] Parallel Load into dataframe
1 participant