4 changes: 2 additions & 2 deletions datafusion/core/src/lib.rs
@@ -1021,8 +1021,8 @@ doc_comment::doctest!(

#[cfg(doctest)]
doc_comment::doctest!(
-    "../../../docs/source/user-guide/sql/write_options.md",
-    user_guide_sql_write_options
+    "../../../docs/source/user-guide/sql/format_options.md",
+    user_guide_sql_format_options
);

#[cfg(doctest)]
2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/ddl.md
@@ -74,7 +74,7 @@ LOCATION <literal>
<key_value_list> := (<literal> <literal>, <literal> <literal>, ...)
```

-For a detailed list of write related options which can be passed in the OPTIONS key_value_list, see [Write Options](write_options).
+For a comprehensive list of format-specific options that can be specified in the `OPTIONS` clause, see [Format Options](format_options.md).

`file_type` is one of `CSV`, `ARROW`, `PARQUET`, `AVRO` or `JSON`

2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/dml.md
@@ -49,7 +49,7 @@ The output format is determined by the first match of the following rules:
1. Value of `STORED AS`
2. Filename extension (e.g. `foo.parquet` implies `PARQUET` format)

-For a detailed list of valid OPTIONS, see [Write Options](write_options).
+For a detailed list of valid OPTIONS, see [Format Options](format_options.md).

### Examples

142 changes: 142 additions & 0 deletions docs/source/user-guide/sql/format_options.md
@@ -0,0 +1,142 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.

Contributor

Suggested change
DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.
DataFusion supports customizing how data is read from or written to disk as a result of `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statements. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.


## Specifying Options and Order of Precedence

Format-related options can be specified in the following ways:

- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples

Contributor

I think it would be helpful to explicitly specify the order of precedence here. Something like:

Suggested change
Format-related options can be specified in the following ways:
- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples
Format-related options can be specified in three ways, in decreasing order of precedence:
- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults

For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence.
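
As a rough sketch of a session-level default, the statements below assume that `datafusion.execution.parquet.compression` is one of the keys listed under Configuration Settings and that it can be changed with `SET`. The table name and output path are hypothetical, and the Parquet format is inferred from the file extension.

```sql
-- Assumption: this key appears in the Configuration Settings list.
-- It applies to every later write in the session unless overridden.
SET datafusion.execution.parquet.compression = 'snappy';

-- Hypothetical table and path; no OPTIONS are given, so the
-- session-level default above is the only compression setting in effect.
COPY source_table TO 'test/session_default_output.parquet';
```

Because session-level defaults have the lowest precedence, any compression setting given in a `COPY` option tuple or a `CREATE EXTERNAL TABLE` statement would be expected to override this value.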

If creating an external table, table-specific format options can be specified when the table is created using the `OPTIONS` clause:

```sql
CREATE EXTERNAL TABLE
my_table(a bigint, b bigint)
STORED AS csv
LOCATION '/test/location/my_csv_table/'
OPTIONS(
NULL_VALUE 'NAN',
'has_header' 'true',
'format.delimiter' ';'
)
```

When running `INSERT INTO my_table ...`, the options from the `CREATE EXTERNAL TABLE` statement will be respected (e.g., the null value string, delimiter, and header row setting). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the `OPTIONS` setting will be ignored.

Finally, options can be passed when running a `COPY` command.

```sql
COPY source_table
TO 'test/table_with_options'
PARTITIONED BY (column3, column4)
OPTIONS (
format parquet,
compression snappy,
'compression::column1' 'zstd(5)'
)
```

In this example, we write the entirety of `source_table` out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The first option, `format`, selects Parquet as the output format. The next option, `compression` set to `snappy`, indicates that unless otherwise specified, all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the Parquet file will use the ZSTD compression codec with compression level `5`. In general, Parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
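
To illustrate the `OPTION::COLUMN.NESTED.PATH` syntax for a nested column, here is a minimal sketch modeled on the example above. The struct column `my_struct`, its field `inner_field`, and the output path are hypothetical names chosen for illustration.

```sql
-- Hypothetical schema: source_table has a struct column `my_struct`
-- with a field `inner_field`.
COPY source_table
TO 'test/table_with_nested_options'
OPTIONS (
format parquet,
-- Default codec for every column without an override.
compression snappy,
-- Column-specific override addressed by its nested path.
'compression::my_struct.inner_field' 'zstd(5)'
)
```

Keys containing `::` or `.` are quoted here, following the quoting used for the column-specific key in the example above.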

# Available Options

## Generic Options

| Option | Description | Default Value |
| ---------- | ------------------------------------------------------------- | ---------------- |
| NULL_VALUE | Sets the string which should be used to indicate null values. | arrow-rs default |

Contributor

I think this is a CSV-specific option (not a generic option).

For example:

> create external table my_table(a int) stored as JSON location '/tmp/foo' options('NULL_VALUE' 'NULL');
Invalid or Unsupported Configuration: Config value "null_value" not found on JsonOptions


## Execution-Specific Options

The following options are available when executing a `COPY` query.

| Option | Description | Default Value |
| ------------------------- | ---------------------------------------------------------------------------------- | ------------- |
| KEEP_PARTITION_BY_COLUMNS | Flag to retain the columns in the output data when using `PARTITIONED BY` queries. | false |

Note: The `execution.keep_partition_by_columns` flag can also be enabled through `ExecutionOptions` within `SessionConfig`.
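
As a hedged illustration, the sketch below assumes the flag can be passed in the `OPTIONS` list of a `COPY` statement using the `execution.keep_partition_by_columns` key from the note above. The table name, partition column, and output path are hypothetical.

```sql
-- Assumption: the key is spelled as in the note above.
-- `source_table`, `column3`, and the output path are hypothetical.
COPY source_table
TO 'test/partitioned_output'
STORED AS PARQUET
PARTITIONED BY (column3)
OPTIONS (execution.keep_partition_by_columns true)
```

With the flag left at its default of `false`, the partition columns would be expected to appear only in the output directory names and not inside the written files.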

## JSON Format Options

The following options are available when reading or writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option | Description | Default Value |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |

**Example:**

```sql
CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.json'
OPTIONS('COMPRESSION' 'gzip');
```

## CSV Format Options

The following options are available when reading or writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option | Description | Default Value |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
| COMPRESSION | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED |
| HEADER | Sets if the CSV file should include column headers | false |
| NEWLINES_IN_VALUES | Sets if newlines in quoted values are supported | false |
| DATE_FORMAT | Sets the format that dates should be encoded in within the CSV file | arrow-rs default |
| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file | arrow-rs default |
| TIME_FORMAT | Sets the format that times should be encoded in within the CSV file | arrow-rs default |
| RFC3339 | If true, uses RFC3339 format for date and time encodings | arrow-rs default |
| NULL_VALUE | Sets the string which should be used to indicate null values within the CSV file. | arrow-rs default |
| DELIMITER | Sets the character which should be used as the column delimiter within the CSV file. | arrow-rs default |

**Example:**

```sql
CREATE EXTERNAL TABLE t
STORED AS CSV
LOCATION '/tmp/foo.csv'
OPTIONS('DELIMITER' '|', 'HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
```

## Parquet Format Options

Contributor

👍 this looks great


The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error.

| Option | Can be Column Specific? | Description |
| -------------------- | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| COMPRESSION | Yes | Sets the compression codec and if applicable compression level to use |
| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read. |
| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written into the file. |

**Example:**

```sql
CREATE EXTERNAL TABLE t
STORED AS PARQUET
LOCATION '/tmp/foo.parquet'
OPTIONS('COMPRESSION' 'snappy', 'MAX_ROW_GROUP_SIZE' '1000000', 'BLOOM_FILTER_ENABLED' 'true');
```
2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/index.rst
@@ -33,5 +33,5 @@ SQL Reference
window_functions
scalar_functions
special_functions
-write_options
+format_options
prepared_statements