diff --git a/datafusion/core/src/lib.rs b/datafusion/core/src/lib.rs
index 36e0c3ee9c6a..6af72e6be962 100644
--- a/datafusion/core/src/lib.rs
+++ b/datafusion/core/src/lib.rs
@@ -1027,8 +1027,8 @@ doc_comment::doctest!(
 #[cfg(doctest)]
 doc_comment::doctest!(
-    "../../../docs/source/user-guide/sql/write_options.md",
-    user_guide_sql_write_options
+    "../../../docs/source/user-guide/sql/format_options.md",
+    user_guide_sql_format_options
 );

 #[cfg(doctest)]
diff --git a/docs/source/user-guide/sql/ddl.md b/docs/source/user-guide/sql/ddl.md
index 71475cff9a39..fc18154becda 100644
--- a/docs/source/user-guide/sql/ddl.md
+++ b/docs/source/user-guide/sql/ddl.md
@@ -74,7 +74,7 @@ LOCATION := ( , ...) ```
-For a detailed list of write related options which can be passed in the OPTIONS key_value_list, see [Write Options](write_options).
+For a comprehensive list of format-specific options that can be specified in the `OPTIONS` clause, see [Format Options](format_options.md).

 `file_type` is one of `CSV`, `ARROW`, `PARQUET`, `AVRO` or `JSON`
diff --git a/docs/source/user-guide/sql/dml.md b/docs/source/user-guide/sql/dml.md
index 4eda59d6dea1..c29447f23cd9 100644
--- a/docs/source/user-guide/sql/dml.md
+++ b/docs/source/user-guide/sql/dml.md
@@ -49,7 +49,7 @@ The output format is determined by the first match of the following rules:
 1. Value of `STORED AS`
 2. Filename extension (e.g. `foo.parquet` implies `PARQUET` format)

-For a detailed list of valid OPTIONS, see [Write Options](write_options).
+For a detailed list of valid OPTIONS, see [Format Options](format_options.md).

 ### Examples
diff --git a/docs/source/user-guide/sql/format_options.md b/docs/source/user-guide/sql/format_options.md
new file mode 100644
index 000000000000..e8008eafb166
--- /dev/null
+++ b/docs/source/user-guide/sql/format_options.md
@@ -0,0 +1,180 @@

# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.

## Specifying Options and Order of Precedence

Format-related options can be specified in three ways, in decreasing order of precedence:

- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults

For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence.

If creating an external table, table-specific format options can be specified when the table is created using the `OPTIONS` clause:

```sql
CREATE EXTERNAL TABLE
  my_table(a bigint, b bigint)
  STORED AS csv
  LOCATION '/tmp/my_csv_table/'
  OPTIONS(
    NULL_VALUE 'NAN',
    'has_header' 'true',
    'format.delimiter' ';'
  );
```

When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (e.g., the null value, header, and delimiter settings shown above). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the `OPTIONS` setting will be ignored.
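Session-level defaults, the lowest-precedence layer, can also be set from SQL. A minimal sketch is shown below; it uses the `datafusion.execution.parquet.compression` key documented in [Configuration Settings](../configs), and any `CREATE EXTERNAL TABLE` or `COPY` option described in this guide would override it:

```sql
-- Session-wide default for Parquet compression; statement-level
-- OPTIONS (or dedicated syntax) still take precedence over it.
SET datafusion.execution.parquet.compression TO 'zstd(3)';
```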
For example, with the table defined above, running the following command:

```sql
INSERT INTO my_table VALUES(1,2);
```

This results in a new CSV file with the specified options:

```shell
$ cat /tmp/my_csv_table/bmC8zWFvLMtWX68R_0.csv
a;b
1;2
```

Finally, options can be passed when running a `COPY` command.

```sql
COPY source_table
  TO 'test/table_with_options'
  PARTITIONED BY (column3, column4)
  OPTIONS (
    format parquet,
    compression snappy,
    'compression::column1' 'zstd(5)'
  )
```

In this example, we write the entire `source_table` out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The `compression` option set to `snappy` indicates that, unless otherwise specified, all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the Parquet file will use the ZSTD compression codec with compression level `5`. In general, Parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.

# Available Options

## JSON Format Options

The following options are available when reading or writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option      | Description                                                                                                                         | Default Value |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED  |

**Example:**

```sql
CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo/'
OPTIONS('COMPRESSION' 'gzip');
```
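The same compression option applies when writing JSON with `COPY`. A rough sketch (the query and output path are only illustrative):

```sql
-- Write the query result as a gzip-compressed JSON file.
COPY (SELECT 1 AS a, 'foo' AS b)
TO '/tmp/json_out/data.json'
STORED AS JSON
OPTIONS (compression gzip);
```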
## CSV Format Options

The following options are available when reading or writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option               | Description                                                                                                                        | Default Value      |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------------ |
| COMPRESSION          | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED       |
| HAS_HEADER           | Sets if the CSV file should include column headers. If not set, uses session or system default.                                   | None               |
| DELIMITER            | Sets the character which should be used as the column delimiter within the CSV file.                                              | `,` (comma)        |
| QUOTE                | Sets the character which should be used for quoting values within the CSV file.                                                   | `"` (double quote) |
| TERMINATOR           | Sets the character which should be used as the line terminator within the CSV file.                                               | None               |
| ESCAPE               | Sets the character which should be used for escaping special characters within the CSV file.                                      | None               |
| DOUBLE_QUOTE         | Sets if quotes within quoted fields should be escaped by doubling them (e.g., `"aaa""bbb"`).                                      | None               |
| NEWLINES_IN_VALUES   | Sets if newlines in quoted values are supported. If not set, uses session or system default.                                      | None               |
| DATE_FORMAT          | Sets the format that dates should be encoded in within the CSV file.                                                              | None               |
| DATETIME_FORMAT      | Sets the format that datetimes should be encoded in within the CSV file.                                                          | None               |
| TIMESTAMP_FORMAT     | Sets the format that timestamps should be encoded in within the CSV file.                                                         | None               |
| TIMESTAMP_TZ_FORMAT  | Sets the format that timestamps with timezone should be encoded in within the CSV file.                                           | None               |
| TIME_FORMAT          | Sets the format that times should be encoded in within the CSV file.                                                              | None               |
| NULL_VALUE           | Sets the string which should be used to indicate null values within the CSV file.                                                 | None               |
| NULL_REGEX           | Sets the regex pattern to match null values when loading CSVs.                                                                    | None               |
| SCHEMA_INFER_MAX_REC | Sets the maximum number of records to scan to infer the schema.                                                                   | None               |
| COMMENT              | Sets the character which should be used to indicate comment lines in the CSV file.                                                | None               |

**Example:**

```sql
CREATE EXTERNAL TABLE t (col1 varchar, col2 int, col3 boolean)
STORED AS CSV
LOCATION '/tmp/foo/'
OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
```
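Write-side CSV settings work the same way through `COPY`. A hedged sketch (the query, output path, and format strings are illustrative; the option names come from the table above):

```sql
-- Encode dates with an explicit pattern and spell out NULLs as 'NULL'.
COPY (SELECT DATE '2024-01-01' AS d, CAST(NULL AS INT) AS v)
TO '/tmp/csv_out/data.csv'
STORED AS CSV
OPTIONS ('DATE_FORMAT' '%Y-%m-%d', 'NULL_VALUE' 'NULL');
```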
## Parquet Format Options

The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error.

| Option | Can be Column Specific? | Description | OPTIONS Key | Default Value |
| ------ | ----------------------- | ----------- | ----------- | ------------- |
| COMPRESSION | Yes | Sets the internal Parquet **compression codec** for data pages, optionally including the compression level. Applies globally if set without `::col`, or specifically to a column if set using `'compression::column_name'`. Valid values: `uncompressed`, `snappy`, `gzip(level)`, `lzo`, `brotli(level)`, `lz4`, `zstd(level)`, `lz4_raw`. | `'compression'` or `'compression::col'` | zstd(3) |
| ENCODING | Yes | Sets the **encoding** scheme for data pages. Valid values: `plain`, `plain_dictionary`, `rle`, `bit_packed`, `delta_binary_packed`, `delta_length_byte_array`, `delta_byte_array`, `rle_dictionary`, `byte_stream_split`. Use key `'encoding'` or `'encoding::col'` in OPTIONS. | `'encoding'` or `'encoding::col'` | None |
| DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | `'dictionary_enabled'` or `'dictionary_enabled::col'` | true |
| STATISTICS_ENABLED | Yes | Sets the level of statistics to write (`none`, `chunk`, `page`). | `'statistics_enabled'` or `'statistics_enabled::col'` | page |
| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | `'bloom_filter_enabled::column_name'` | None |
| BLOOM_FILTER_FPP | Yes | Sets bloom filter false positive probability (global or per column). | `'bloom_filter_fpp'` or `'bloom_filter_fpp::col'` | None |
| BLOOM_FILTER_NDV | Yes | Sets bloom filter number of distinct values (global or per column). | `'bloom_filter_ndv'` or `'bloom_filter_ndv::col'` | None |
| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | `'max_row_group_size'` | 1048576 |
| ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | `'enable_page_index'` | true |
| PRUNING | No | If true, enables row group pruning based on min/max statistics. | `'pruning'` | true |
| SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | `'skip_metadata'` | true |
| METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | `'metadata_size_hint'` | None |
| PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | `'pushdown_filters'` | false |
| REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | `'reorder_filters'` | false |
| SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | `'schema_force_view_types'` | true |
| BINARY_AS_STRING | No | If true, reads Binary columns as strings. | `'binary_as_string'` | false |
| DATA_PAGESIZE_LIMIT | No | Sets best effort maximum size of data page in bytes. | `'data_pagesize_limit'` | 1048576 |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in data page. | `'data_page_row_count_limit'` | 20000 |
| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size, in bytes. | `'dictionary_page_size_limit'` | 1048576 |
| WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | `'write_batch_size'` | 1024 |
| WRITER_VERSION | No | Sets the Parquet writer version (`1.0` or `2.0`). | `'writer_version'` | 1.0 |
| SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | `'skip_arrow_metadata'` | false |
| CREATED_BY | No | Sets the "created by" string in the Parquet file metadata. | `'created_by'` | datafusion version X.Y.Z |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | `'column_index_truncate_length'` | 64 |
| STATISTICS_TRUNCATE_LENGTH | No | Sets statistics truncate length. | `'statistics_truncate_length'` | None |
| BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | `'bloom_filter_on_write'` | false |
| ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns in a single file. | `'allow_single_file_parallelism'` | true |
| MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | `'maximum_parallel_row_group_writers'` | 1 |
| MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | `'maximum_buffered_record_batches_per_stream'` | 2 |
| KEY_VALUE_METADATA | No (Key is specific) | Adds custom key-value pairs to the file metadata. Use the format `'metadata::your_key_name' 'your_value'`. Multiple entries allowed. | `'metadata::key_name'` | None |

**Example:**

```sql
CREATE EXTERNAL TABLE t (id bigint, value double, category varchar)
STORED AS PARQUET
LOCATION '/tmp/parquet_data/'
OPTIONS(
    'COMPRESSION::category' 'snappy',
    'ENCODING::id' 'delta_binary_packed',
    'MAX_ROW_GROUP_SIZE' '1000000',
    'BLOOM_FILTER_ENABLED::id' 'true'
);
```
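Column-specific keys also accept nested field paths, and custom file metadata can be attached with `metadata::` keys, as described in the table above. A rough sketch using `COPY` (the nested column `details.user_id`, the metadata key, and the output path are illustrative; `source_table` is the table from the earlier `COPY` example):

```sql
-- Override compression for a nested field and stamp the file with a custom key/value pair.
COPY source_table
TO '/tmp/parquet_out/'
STORED AS PARQUET
OPTIONS (
  'compression::details.user_id' 'zstd(5)',
  'metadata::ingest_job' 'nightly_etl'
);
```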
diff --git a/docs/source/user-guide/sql/index.rst b/docs/source/user-guide/sql/index.rst
index 8e3f51bf8b0b..a13d40334b63 100644
--- a/docs/source/user-guide/sql/index.rst
+++ b/docs/source/user-guide/sql/index.rst
@@ -33,5 +33,5 @@ SQL Reference
    window_functions
    scalar_functions
    special_functions
-   write_options
+   format_options
    prepared_statements
diff --git a/docs/source/user-guide/sql/write_options.md b/docs/source/user-guide/sql/write_options.md
deleted file mode 100644
index 521e29436212..000000000000
--- a/docs/source/user-guide/sql/write_options.md
+++ /dev/null
@@ -1,127 +0,0 @@
-
-# Write Options
-
-DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.
-
-## Specifying Options and Order of Precedence
-
-Write related options can be specified in the following ways:
-
-- Session level config defaults
-- `CREATE EXTERNAL TABLE` options
-- `COPY` option tuples
-
-For a list of supported session level config defaults see [Configuration Settings](../configs). These defaults apply to all write operations but have the lowest level of precedence.
-
-If inserting to an external table, table specific write options can be specified when the table is created using the `OPTIONS` clause:
-
-```sql
-CREATE EXTERNAL TABLE
-  my_table(a bigint, b bigint)
-  STORED AS csv
-  COMPRESSION TYPE gzip
-  LOCATION '/test/location/my_csv_table/'
-  OPTIONS(
-    NULL_VALUE 'NAN',
-    'has_header' 'true',
-    'format.delimiter' ';'
-  )
-```
-
-When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (gzip compression, special delimiter, and header row included). There will be a single output file if the output path doesn't have folder format, i.e. ending with a `\`. Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file.
-
-Finally, options can be passed when running a `COPY` command.
-
-
-```sql
-COPY source_table
-  TO 'test/table_with_options'
-  PARTITIONED BY (column3, column4)
-  OPTIONS (
-    format parquet,
-    compression snappy,
-    'compression::column1' 'zstd(5)',
-  )
-```
-
-In this example, we write the entirety of `source_table` out to a folder of parquet files. One parquet file will be written in parallel to the folder for each partition in the query. The next option `compression` set to `snappy` indicates that unless otherwise specified all columns should use the snappy compression codec. The option `compression::col1` sets an override, so that the column `col1` in the parquet file will use `ZSTD` compression codec with compression level `5`.
In general, parquet options which support column specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`. - -## Available Options - -### Execution Specific Options - -The following options are available when executing a `COPY` query. - -| Option | Description | Default Value | -| ----------------------------------- | ---------------------------------------------------------------------------------- | ------------- | -| execution.keep_partition_by_columns | Flag to retain the columns in the output data when using `PARTITIONED BY` queries. | false | - -Note: `execution.keep_partition_by_columns` flag can also be enabled through `ExecutionOptions` within `SessionConfig`. - -### JSON Format Specific Options - -The following options are available when writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail. - -| Option | Description | Default Value | -| ----------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- | -| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED | - -### CSV Format Specific Options - -The following options are available when writing CSV files. Note: if any unsupported options is specified an error will be raised and the query will fail. - -| Option | Description | Default Value | -| --------------- | --------------------------------------------------------------------------------------------------------------------------------- | ---------------- | -| COMPRESSION | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED | -| HEADER | Sets if the CSV file should include column headers | false | -| DATE_FORMAT | Sets the format that dates should be encoded in within the CSV file | arrow-rs default | -| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file | arrow-rs default | -| TIME_FORMAT | Sets the format that times should be encoded in within the CSV file | arrow-rs default | -| RFC3339 | If true, uses RFC339 format for date and time encodings | arrow-rs default | -| NULL_VALUE | Sets the string which should be used to indicate null values within the CSV file. | arrow-rs default | -| DELIMITER | Sets the character which should be used as the column delimiter within the CSV file. | arrow-rs default | - -### Parquet Format Specific Options - -The following options are available when writing parquet files. If any unsupported option is specified an error will be raised and the query will fail. If a column specific option is specified for a column which does not exist, the option will be ignored without error. For default values, see: [Configuration Settings](https://datafusion.apache.org/user-guide/configs.html). - -| Option | Can be Column Specific? | Description | -| ---------------------------- | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | -| COMPRESSION | Yes | Sets the compression codec and if applicable compression level to use | -| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read. 
| -| DATA_PAGESIZE_LIMIT | No | Sets the best effort maximum page size in bytes | -| WRITE_BATCH_SIZE | No | Maximum number of rows written for each column in a single batch | -| WRITER_VERSION | No | Parquet writer version (1.0 or 2.0) | -| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes | -| CREATED_BY | No | Sets the "created by" property in the parquet file | -| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. | -| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. | -| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written into the file. | -| ENCODING | Yes | Sets the encoding that should be used (e.g. PLAIN or RLE) | -| DICTIONARY_ENABLED | Yes | Sets if dictionary encoding is enabled. Use this instead of ENCODING to set dictionary encoding. | -| STATISTICS_ENABLED | Yes | Sets if statistics are enabled at PAGE or ROW_GROUP level. | -| MAX_STATISTICS_SIZE | Yes | Sets the maximum size in bytes that statistics can take up. | -| BLOOM_FILTER_FPP | Yes | Sets the false positive probability (fpp) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true. | -| BLOOM_FILTER_NDV | Yes | Sets the number of distinct values (ndv) for the bloom filter. Implicitly sets bloom_filter_enabled to true. |