Improve `ParquetExec` and related documentation #10647

Merged

alamb merged 8 commits into apache:main from alamb:alamb/parquet_docs

May 28, 2024

Contributor

alamb commented May 24, 2024

Which issue does this PR close?

Part of #10549

Rationale for this change

While trying to write an example that uses ParquetExec, I found that its documentation was sparse and could be improved.

What changes are included in this PR?

Add docstrings to ParquetExec and related structs explaining more of what they do and how they work

Are these changes tested?

By existing CI

Are there any user-facing changes?

Just doc strings, no functional changes


          Improve ParquetExec and related documentation

70b65bd

github-actions bot added the core label

alamb mentioned this pull request

Add ParquetExec::builder(), deprecate ParquetExec::new #10636

Merged

alamb marked this pull request as ready for review

May 24, 2024 09:39

alamb added the documentation label

Contributor Author

alamb commented May 24, 2024

@thinkharderdev, @tustvold, @Ted-Jiang and @crepererum: if you have time, could you double check that this correctly describes ParquetExec to your understanding?

crepererum reviewed

datafusion/core/src/datasource/physical_plan/parquet/mod.rs Outdated

Comment on lines 110 to 111

		/// * Multi-threaded (aka multi-partition): read from one or more files in
		/// parallel. Can read concurrently from multiple row groups from a single file.

Contributor

crepererum May 24, 2024

I would call this "concurrency" instead of "multi-threading". IIRC we don't implement ANY threading in this operator and solely rely on tokio to dispatch concurrent bits for us. I think it's fine to mention that the concurrency in this operator CAN lead to multi-core usage under specific circumstances.
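To illustrate the distinction drawn in the comment above, here is a minimal sketch of concurrency without any threading, using only the Rust standard library. Everything in it is hypothetical and not part of DataFusion: `read_partition`, `YieldNow`, `NoopWake` and `run_interleaved` are illustration-only names, and the busy-polling loop stands in for a real runtime such as tokio, which would use proper wakeups and may spread tasks across cores.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the loop below simply re-polls every task.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns `Pending` once and `Ready` on the next poll.
/// It stands in for "waiting on I/O" in this sketch.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Hypothetical partition reader: records (partition, row group), then yields.
async fn read_partition(id: usize, row_groups: usize, log: &RefCell<Vec<(usize, usize)>>) {
    for rg in 0..row_groups {
        log.borrow_mut().push((id, rg));
        YieldNow(false).await;
    }
}

/// Drives two partitions to completion on the calling thread only.
fn run_interleaved() -> Vec<(usize, usize)> {
    let log = RefCell::new(Vec::new());
    {
        let mut tasks = vec![
            Box::pin(read_partition(0, 2, &log)),
            Box::pin(read_partition(1, 2, &log)),
        ];
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        // Round-robin: keep every task that is still pending.
        while !tasks.is_empty() {
            tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
        }
    }
    log.into_inner()
}
```

Calling `run_interleaved()` returns `[(0, 0), (1, 0), (0, 1), (1, 1)]`: the two partitions make progress in alternation even though no thread is ever spawned. Whether such work ends up on several cores depends on the runtime driving the futures, not on the operator.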

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

+              /// table schema. This can be used to implement "schema evolution". See
+              /// [`SchemaAdapterFactory`] for more details.
+              ///
+              /// * metadata_size_hint: controls the number of bytes read from the end of the

Contributor

crepererum May 24, 2024

FWIW this is passed on to the reader (custom or builtin) and the reader uses that to gather the metadata. The reader CAN however use another more precise source for this information or not read the metadata from object store at all (e.g. it could use an extra service, a dataset-based source or some sort of cache).
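As a rough illustration of why a size hint matters to a reader that does fetch the footer, the sketch below simulates a Parquet-style file tail (metadata bytes, then a 4-byte little-endian metadata length, then the `PAR1` magic) and counts range requests. `CountingStore`, `fake_parquet` and `fetch_metadata` are hypothetical names for this example, not DataFusion or parquet crate APIs, and the second read re-fetches the whole suffix purely to keep the code short.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a remote object that counts range requests.
struct CountingStore {
    bytes: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingStore {
    fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, requests: Cell::new(0) }
    }

    /// Returns the last `n` bytes (clamped to the object size); counts one request.
    fn read_suffix(&self, n: usize) -> &[u8] {
        self.requests.set(self.requests.get() + 1);
        let start = self.bytes.len().saturating_sub(n);
        &self.bytes[start..]
    }
}

/// Builds a fake file: data, metadata, 4-byte LE metadata length, "PAR1".
fn fake_parquet(data_len: usize, metadata: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; data_len];
    out.extend_from_slice(metadata);
    out.extend_from_slice(&(metadata.len() as u32).to_le_bytes());
    out.extend_from_slice(b"PAR1");
    out
}

/// Fetches the raw metadata bytes, using `hint` to size the first read.
fn fetch_metadata(store: &CountingStore, hint: Option<usize>) -> Vec<u8> {
    const FOOTER: usize = 8; // 4-byte length + 4-byte magic
    let first = store.read_suffix(hint.unwrap_or(FOOTER).max(FOOTER));
    let tail = &first[first.len() - FOOTER..];
    assert_eq!(&tail[4..], b"PAR1", "not a parquet-style footer");
    let meta_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
    let needed = meta_len + FOOTER;
    // If the first read already covers the metadata, no second request is made.
    let buf = if first.len() >= needed { first } else { store.read_suffix(needed) };
    buf[buf.len() - needed..buf.len() - FOOTER].to_vec()
}
```

With 17 bytes of metadata, `fetch_metadata(&store, None)` issues two requests, while `fetch_metadata(&store, Some(64))` issues one. A custom reader is free to ignore the hint entirely, for example when it already holds the metadata from a cache or an external service.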

datafusion/core/src/datasource/physical_plan/parquet/mod.rs Outdated

+              ///
+              /// * Limit pushdown: stop execution early after some number of rows are read.
+              ///
+              /// * Custom readers: controls I/O for accessing pages. See

Contributor

crepererum May 24, 2024

Suggested change

-             /// * Custom readers: controls I/O for accessing pages. See
+             /// * Custom readers: implements I/O for accessing byte ranges and the metadata object. See

It's not steering the IO process, it's actually responsible for performing (or not performing) it. For example, a custom impl. could totally NOT use an object store (which is esp. interesting for the metadata bit, see other comment below).

Contributor Author

alamb May 25, 2024

good call -- updated

datafusion/core/src/datasource/physical_plan/parquet/mod.rs Outdated

Comment on lines 139 to 141

+              /// * Step 3: The `ParquetOpener` gets the file metadata by reading the footer,
+              /// and applies any predicates and projections to determine what pages must be
+              /// read.

Contributor

crepererum May 24, 2024

It gets the metadata from the ParquetFileReaderFactory or more specifically the AsyncFileReader that this factory returns. The ParquetOpener doesn't care where the metadata comes from.
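The relationship described in this comment can be sketched with simplified, synchronous stand-ins. The traits `FileReader` and `ReaderFactory` below are hypothetical and deliberately much smaller than the real `AsyncFileReader` and `ParquetFileReaderFactory`; `plan_row_groups` is an invented stand-in for the opener that only applies a row limit. The point is that the planning code is identical whether the metadata came from a simulated footer read or from a cache.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical, heavily simplified file metadata: rows per row group.
#[derive(Debug, Clone, PartialEq)]
struct FileMetadata {
    row_group_rows: Vec<usize>,
}

/// Hypothetical synchronous stand-in for `AsyncFileReader`.
trait FileReader {
    fn get_metadata(&mut self) -> FileMetadata;
}

/// Hypothetical stand-in for `ParquetFileReaderFactory`.
trait ReaderFactory {
    fn create_reader(&self, path: &str) -> Box<dyn FileReader>;
}

/// Reader that simulates one footer read per metadata request.
struct FooterReader {
    rows: Vec<usize>,
    footer_reads: Rc<Cell<usize>>,
}
impl FileReader for FooterReader {
    fn get_metadata(&mut self) -> FileMetadata {
        self.footer_reads.set(self.footer_reads.get() + 1); // simulated I/O
        FileMetadata { row_group_rows: self.rows.clone() }
    }
}
struct FooterFactory {
    rows: Vec<usize>,
    footer_reads: Rc<Cell<usize>>,
}
impl ReaderFactory for FooterFactory {
    fn create_reader(&self, _path: &str) -> Box<dyn FileReader> {
        Box::new(FooterReader {
            rows: self.rows.clone(),
            footer_reads: Rc::clone(&self.footer_reads),
        })
    }
}

/// Reader that serves metadata from memory and never touches the file.
struct CachedReader {
    metadata: FileMetadata,
}
impl FileReader for CachedReader {
    fn get_metadata(&mut self) -> FileMetadata {
        self.metadata.clone()
    }
}
struct CacheFactory {
    cache: HashMap<String, FileMetadata>,
}
impl ReaderFactory for CacheFactory {
    fn create_reader(&self, path: &str) -> Box<dyn FileReader> {
        let metadata = self.cache.get(path).cloned().expect("path missing from cache");
        Box::new(CachedReader { metadata })
    }
}

/// Stand-in for the opener: picks row groups until `limit` rows are covered.
/// It does not know or care where the metadata came from.
fn plan_row_groups(factory: &dyn ReaderFactory, path: &str, limit: usize) -> Vec<usize> {
    let metadata = factory.create_reader(path).get_metadata();
    let mut selected = Vec::new();
    let mut rows_seen = 0;
    for (idx, rows) in metadata.row_group_rows.iter().enumerate() {
        if rows_seen >= limit {
            break;
        }
        selected.push(idx);
        rows_seen += rows;
    }
    selected
}
```

Passing either factory to `plan_row_groups` with three 100-row row groups and a limit of 150 selects row groups 0 and 1; only the footer-based factory increments the simulated read counter.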

datafusion/core/src/datasource/physical_plan/parquet/mod.rs Outdated

+              /// Interface for creating [`AsyncFileReader`]s to read parquet files.
+              ///
+              /// This interface is used by [`ParquetOpener`] in order to create readers for
+              /// parquet files. Implementations of this trait can be used to provide custom

Contributor

crepererum May 24, 2024

What's "this trait" in this case? I guess you're referring to AsyncFileReader, not ParquetFileReaderFactory here. To avoid confusion and give the user more freedom in how and where they implement "pre-cached data, I/O ..." etc., I suggest starting a new paragraph that says:

The combined implementations of [`ParquetFileReaderFactory`] and [`AsyncFileReader`]
can be used to provide custom data access operations such as
pre-cached data, I/O coalescing, etc.

Contributor Author

alamb May 25, 2024

Excellent idea. I did so
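For readers unfamiliar with the "I/O coalescing" mentioned in the suggested wording, here is a minimal sketch of the idea: nearby byte ranges are merged so that fewer, larger requests are issued. The function name `coalesce_ranges` and the `max_gap` parameter are assumptions made for this example and are not taken from DataFusion or the parquet crate.

```rust
use std::ops::Range;

/// Merges byte ranges that overlap or lie within `max_gap` bytes of each other.
/// Empty ranges are dropped; the result is sorted by start offset.
fn coalesce_ranges(ranges: &[Range<usize>], max_gap: usize) -> Vec<Range<usize>> {
    let mut sorted: Vec<Range<usize>> = ranges
        .iter()
        .filter(|r| r.start < r.end)
        .cloned()
        .collect();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<usize>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

For example, `coalesce_ranges(&[100..200, 0..10, 12..20, 205..300], 4)` yields `[0..20, 100..200, 205..300]`, and raising `max_gap` to 5 also merges the last two ranges into `100..300`. A custom reader could apply something like this before fetching bytes, trading a few wasted bytes for fewer round trips.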

alamb added 2 commits

May 25, 2024 09:12


          Merge remote-tracking branch 'apache/main' into alamb/parquet_docs

1d60645


          Improve documentation

91a27e8

github-actions bot removed the documentation label

comphead reviewed

datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs Outdated

comphead approved these changes

Contributor

comphead left a comment

lgtm thanks @alamb

alamb and others added 3 commits

May 26, 2024 06:06


          Update datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs

0290f38

Co-authored-by: Oleks V <[email protected]>


          Merge remote-tracking branch 'apache/main' into alamb/parquet_docs

2f949e8


          Merge remote-tracking branch 'apache/main' into alamb/parquet_docs

fc7b497

crepererum approved these changes

alamb added 2 commits

May 27, 2024 08:04


          Merge remote-tracking branch 'apache/main' into alamb/parquet_docs

d6b6d10


          fix link

bd1e987

alamb merged commit 7f0e194 into apache:main

23 checks passed

alamb deleted the alamb/parquet_docs branch

May 28, 2024 00:14

findepi pushed a commit to findepi/datafusion that referenced this pull request


          Improve ParquetExec and related documentation (apache#10647)

2ebb30b

* Improve ParquetExec and related documentation

* Improve documentation

* Update datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs

Co-authored-by: Oleks V <[email protected]>

* fix link

---------

Co-authored-by: Oleks V <[email protected]>

This pull request was closed.

Labels

core