Update Configuration doc

2010YOUY01 · Jul 11, 2023 · f476acc · f476acc
1 parent 4ba25e0
commit f476acc
Show file tree

Hide file tree

Showing 4 changed files with 56 additions and 55 deletions.
diff --git a/datafusion-cli/Cargo.lock b/datafusion-cli/Cargo.lock
diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs
@@ -323,10 +323,13 @@ config_namespace! {
         /// long runner execution, all types of joins may encounter out-of-memory errors.
         pub allow_symmetric_joins_without_pruning: bool, default = true
 
-        /// When set to true, file groups will be repartitioned to achieve maximum parallelism.
-        /// Currently supported only for Parquet format in which case
-        /// multiple row groups from the same file may be read concurrently. If false then each
-        /// row group is read serially, though different files may be read in parallel.
+        /// When set to `true`, file groups will be repartitioned to achieve maximum parallelism.
+        /// Currently Parquet and CSV formats are supported.
+        ///
+        /// If set to `true`, all files will be repartitioned evenly (i.e., a single large file
+        /// might be partitioned into smaller chunks) for parallel scanning.
+        /// If set to `false`, different files will be read in parallel, but repartitioning won't
+        /// happen within a single file.
         pub repartition_file_scans: bool, default = true
 
         /// Should DataFusion repartition data using the partitions keys to execute window

diff --git a/datafusion/core/src/datasource/physical_plan/parquet.rs b/datafusion/core/src/datasource/physical_plan/parquet.rs
@@ -754,7 +754,6 @@ mod tests {
     use datafusion_physical_expr::create_physical_expr;
     use datafusion_physical_expr::execution_props::ExecutionProps;
     use futures::StreamExt;
-    use itertools::Itertools;
     use object_store::local::LocalFileSystem;
     use object_store::path::Path;
     use object_store::ObjectMeta;

diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
@@ -63,7 +63,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
 | datafusion.optimizer.repartition_file_min_size             | 10485760   | Minimum total files size in bytes to perform file scan repartitioning.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
 | datafusion.optimizer.repartition_joins                     | true       | Should DataFusion repartition data using the join keys to execute joins in parallel using the provided `target_partitions` level                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
 | datafusion.optimizer.allow_symmetric_joins_without_pruning | true       | Should DataFusion allow symmetric hash joins for unbounded data sources even when its inputs do not have any ordering or filtering If the flag is not enabled, the SymmetricHashJoin operator will be unable to prune its internal buffers, resulting in certain join types - such as Full, Left, LeftAnti, LeftSemi, Right, RightAnti, and RightSemi - being produced only at the end of the execution. This is not typical in stream processing. Additionally, without proper design for long runner execution, all types of joins may encounter out-of-memory errors.                                |
-| datafusion.optimizer.repartition_file_scans                | true       | When set to true, file groups will be repartitioned to achieve maximum parallelism. Currently supported only for Parquet format in which case multiple row groups from the same file may be read concurrently. If false then each row group is read serially, though different files may be read in parallel.                                                                                                                                                                                                                                                                                           |
+| datafusion.optimizer.repartition_file_scans                | true       | When set to `true`, file groups will be repartitioned to achieve maximum parallelism. Currently Parquet and CSV formats are supported. If set to `true`, all files will be repartitioned evenly (i.e., a single large file might be partitioned into smaller chunks) for parallel scanning. If set to `false`, different files will be read in parallel, but repartitioning won't happen within a single file.                                                                                                                                                                                          |
 | datafusion.optimizer.repartition_windows                   | true       | Should DataFusion repartition data using the partitions keys to execute window functions in parallel using the provided `target_partitions` level                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
 | datafusion.optimizer.repartition_sorts                     | true       | Should DataFusion execute sorts in a per-partition fashion and merge afterwards instead of coalescing first and sorting globally. With this flag is enabled, plans in the form below `text "SortExec: [a@0 ASC]", " CoalescePartitionsExec", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", ` would turn into the plan below which performs better in multithreaded environments `text "SortPreservingMergeExec: [a@0 ASC]", " SortExec: [a@0 ASC]", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", `                                               |
 | datafusion.optimizer.skip_failed_rules                     | false      | When set to true, the logical plan optimizer will produce warning messages if any optimization rules produce errors and then proceed to the next rule. When set to false, any rules that produce errors will cause the query to fail                                                                                                                                                                                                                                                                                                                                                                    |