26 changes: 26 additions & 0 deletions docs/sql-data-sources-orc.md
@@ -172,3 +172,29 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
<td>2.0.0</td>
</tr>
</table>

## Data Source Option

Data source options of ORC can be set via:
* the `.option`/`.options` methods of
  * `DataFrameReader`
  * `DataFrameWriter`
  * `DataStreamReader`
  * `DataStreamWriter`
@HyukjinKwon (Member) commented on May 25, 2021:

also mention:

* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)


<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<td><code>mergeSchema</code></td>
<td>None</td>
A reviewer (Member) commented:
@itholic it has the same issue. The default value isn't None but false.

<td>sets whether we should merge schemas collected from all ORC part-files. This will override <code>spark.sql.orc.mergeSchema</code>. The default value is specified in <code>spark.sql.orc.mergeSchema</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>compression</code></td>
<td>None</td>
A reviewer (Member) commented:
In this case, where the default value doesn't exist, you can follow https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration and use (none).

<td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, and zstd). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>. If None is set, it uses the value specified in <code>spark.sql.orc.compression.codec</code>.</td>
<td>write</td>
</tr>
</table>
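
For illustration only (this code is not part of the PR), a minimal PySpark sketch of passing these options through `.option` on the reader/writer and through the `OPTIONS` clause of `CREATE TABLE ... USING`; the paths, table name, and columns below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-read option: merge schemas collected from all ORC part-files,
# overriding spark.sql.orc.mergeSchema for this read only.
df = spark.read.option("mergeSchema", "true").orc("/tmp/orc/input")

# Per-write option: choose the compression codec, overriding both
# orc.compress and spark.sql.orc.compression.codec for this write only.
df.write.option("compression", "snappy").orc("/tmp/orc/output")

# The same keys can be supplied through the OPTIONS clause of
# CREATE TABLE ... USING ORC (table name and columns are made up here).
spark.sql("""
    CREATE TABLE orc_options_example (id INT, name STRING)
    USING ORC
    OPTIONS (compression 'zlib')
""")
```
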
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>.
@dongjoon-hyun (Member) commented on May 21, 2021:

Although I know that this is inherited, https://spark.apache.org/docs/latest/ looks fragile to me because it will become a broken link when we cut branch-3.2 on July 1st. In branch-3.2, it should point to the 3.2 documentation only. Shall we use a relative link instead of /latest/?

Like with this PR, we don't know what refactoring will happen in the future.

The PR author (Contributor) replied:
Thanks, @dongjoon-hyun.
I took a look at that, but it seems tricky to create a link for each release in Scaladoc.
I created a JIRA to track it separately: SPARK-35481.
I will take a separate look at it, if that's fine with you!

40 changes: 13 additions & 27 deletions python/pyspark/sql/readwriter.py
@@ -793,28 +793,13 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N
Parameters
----------
path : str or list
mergeSchema : str or bool, optional
sets whether we should merge schemas collected from all
ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
The default value is specified in ``spark.sql.orc.mergeSchema``.
pathGlobFilter : str or bool
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
recursiveFileLookup : str or bool
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa

modifiedBefore : an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter : an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa
A reviewer (Member) commented:
Ditto. Can we have a more robust link here?

in the version you use.

Examples
--------
@@ -1417,12 +1402,13 @@ def orc(self, path, mode=None, partitionBy=None, compression=None):
exists.
partitionBy : str or list, optional
names of partitioning columns
compression : str, optional
compression codec to use when saving to file. This can be one of the
known case-insensitive shorten names (none, snappy, zlib, lzo, and zstd).
This will override ``orc.compress`` and
``spark.sql.orc.compression.codec``. If None is set, it uses the value
specified in ``spark.sql.orc.compression.codec``.

Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa
in the version you use.

Examples
--------
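For illustration only (not part of the diff), a minimal sketch of the keyword-argument forms from the signatures above; it assumes an active SparkSession, and the paths and partition column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The keyword arguments on DataFrameReader.orc / DataFrameWriter.orc shown in
# the signatures above map onto the same ORC options, so these calls are
# equivalent to passing the options via .option(...).
df = spark.read.orc("/tmp/orc/input", mergeSchema=True)

(df.write
   .mode("overwrite")
   .orc("/tmp/orc/output", partitionBy="name", compression="zlib"))
```
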
20 changes: 6 additions & 14 deletions python/pyspark/sql/streaming.py
@@ -637,20 +637,12 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N

.. versionadded:: 2.3.0

Parameters
----------
mergeSchema : str or bool, optional
sets whether we should merge schemas collected from all
ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
The default value is specified in ``spark.sql.orc.mergeSchema``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of `partition discovery`_.
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa
in the version you use.

Examples
--------
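For illustration only (not part of the diff), a minimal sketch of the same ORC options on the streaming reader; the schema and path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Streaming file sources need an explicit schema (unless
# spark.sql.streaming.schemaInference is enabled); this one is made up.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# The streaming reader accepts the same ORC options as the batch reader;
# mergeSchema here overrides spark.sql.orc.mergeSchema for this stream.
stream_df = (spark.readStream
             .schema(schema)
             .option("mergeSchema", "true")
             .orc("/tmp/orc/stream-input"))

query = (stream_df.writeStream
         .format("console")
         .start())
```
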
21 changes: 4 additions & 17 deletions sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -874,23 +874,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
/**
* Loads ORC files and returns the result as a `DataFrame`.
*
* You can set the following ORC-specific option(s) for reading ORC files:
* <ul>
* <li>`mergeSchema` (default is the value specified in `spark.sql.orc.mergeSchema`): sets whether
* we should merge schemas collected from all ORC part-files. This will override
* `spark.sql.orc.mergeSchema`.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
* modification times occurring before the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
* modification times occurring after the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
* ORC-specific option(s) for reading ORC files can be found in
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
A reviewer (Member) commented:
Ditto.

* Data Source Option</a> in the version you use.
*
* @param paths input paths
* @since 2.0.0
@@ -881,14 +881,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* format("orc").save(path)
* }}}
*
* You can set the following ORC-specific option(s) for writing ORC files:
* <ul>
* <li>`compression` (default is the value specified in `spark.sql.orc.compression.codec`):
* compression codec to use when saving to file. This can be one of the known case-insensitive
* shorten names(`none`, `snappy`, `zlib`, `lzo`, and `zstd`). This will override
* `orc.compress` and `spark.sql.orc.compression.codec`. If `orc.compress` is given,
* it overrides `spark.sql.orc.compression.codec`.</li>
* </ul>
* ORC-specific option(s) for writing ORC files can be found in
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 1.5.0
*/
@@ -453,20 +453,17 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
/**
* Loads a ORC file stream, returning the result as a `DataFrame`.
*
* You can set the following ORC-specific option(s) for reading ORC files:
* You can set the following option(s):
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`mergeSchema` (default is the value specified in `spark.sql.orc.mergeSchema`): sets whether
* we should merge schemas collected from all ORC part-files. This will override
* `spark.sql.orc.mergeSchema`.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
*
* ORC-specific option(s) for reading ORC file stream can be found in
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 2.3.0
*/
def orc(path: String): DataFrame = {