diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md index 119eae56ebf7..de278cf9f924 100644 --- a/docs/sql-data-sources-parquet.md +++ b/docs/sql-data-sources-parquet.md @@ -255,8 +255,11 @@ REFRESH TABLE my_table; ## Data Source Option Data source options of Parquet can be set via: -* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter` -* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter` +* the `.option`/`.options` methods of + * `DataFrameReader` + * `DataFrameWriter` + * `DataStreamReader` + * `DataStreamWriter` @@ -286,7 +289,20 @@ Data source options of Parquet can be set via: + + + + + + + + + + + +
<table class="table">
  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
  <tr>
    <td><code>mergeSchema</code></td>
    <td>The SQL config <code>spark.sql.parquet.mergeSchema</code> which is false by default.</td>
    <td>Sets whether we should merge schemas collected from all Parquet part-files. This will override <code>spark.sql.parquet.mergeSchema</code>.</td>
    <td>read</td>
  </tr>
  <tr>
    <td><code>compression</code></td>
    <td>None</td>
    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd). This will override <code>spark.sql.parquet.compression.codec</code>. If None is set, it uses the value specified in <code>spark.sql.parquet.compression.codec</code>.</td>
    <td>write</td>
  </tr>
</table>
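For illustration, a minimal sketch of passing the table's options through the `.option`/`.options` methods listed above (assuming an existing `SparkSession` named `spark`; the paths are hypothetical):

```scala
// Read: merge schemas collected from all part-files (overrides spark.sql.parquet.mergeSchema).
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/events")

// The same settings can also be supplied as a map via .options.
val mergedViaMap = spark.read
  .options(Map("mergeSchema" -> "true"))
  .parquet("/data/events")

// Write: pick a compression codec (overrides spark.sql.parquet.compression.codec).
merged.write
  .option("compression", "zstd")
  .parquet("/data/events_zstd")
```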
+Other generic options can be found in Generic Files Source Options ### Configuration diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 31c1f2f7ca3c..59ab752001a4 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -416,53 +416,10 @@ def parquet(self, *paths, **options): Other Parameters ---------------- - mergeSchema : str or bool, optional - sets whether we should merge schemas collected from all - Parquet part-files. This will override - ``spark.sql.parquet.mergeSchema``. The default value is specified in - ``spark.sql.parquet.mergeSchema``. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. - It does not change the behavior of - `partition discovery `_. # noqa - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option - disables - `partition discovery `_. # noqa - - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedBefore (batch only) : an optional timestamp to only include files with - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedAfter (batch only) : an optional timestamp to only include files with - modification times occurring after the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - datetimeRebaseMode : str, optional - the rebasing mode for the values of the ``DATE``, ``TIMESTAMP_MICROS``, - ``TIMESTAMP_MILLIS`` logical types from the Julian to Proleptic Gregorian calendar. - - * ``EXCEPTION``: Spark fails in reads of ancient dates/timestamps - that are ambiguous between the two calendars. - * ``CORRECTED``: loading of dates/timestamps without rebasing. - * ``LEGACY``: perform rebasing of ancient dates/timestamps from the Julian - to Proleptic Gregorian calendar. - - If None is set, the value of the SQL config - ``spark.sql.parquet.datetimeRebaseModeInRead`` is used by default. - int96RebaseMode : str, optional - the rebasing mode for ``INT96`` timestamps from the Julian to - Proleptic Gregorian calendar. - - * ``EXCEPTION``: Spark fails in reads of ancient ``INT96`` timestamps - that are ambiguous between the two calendars. - * ``CORRECTED``: loading of ``INT96`` timestamps without rebasing. - * ``LEGACY``: perform rebasing of ancient ``INT96`` timestamps from the Julian - to Proleptic Gregorian calendar. - - If None is set, the value of the SQL config - ``spark.sql.parquet.int96RebaseModeInRead`` is used by default. + **options + For the extra options, refer to + `Data Source Option `_ # noqa + in the version you use. Examples -------- @@ -1259,12 +1216,13 @@ def parquet(self, path, mode=None, partitionBy=None, compression=None): exists. partitionBy : str or list, optional names of partitioning columns - compression : str, optional - compression codec to use when saving to file. This can be one of the - known case-insensitive shorten names (none, uncompressed, snappy, gzip, - lzo, brotli, lz4, and zstd). This will override - ``spark.sql.parquet.compression.codec``. If None is set, it uses the - value specified in ``spark.sql.parquet.compression.codec``. 
+ + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option `_ # noqa + in the version you use. Examples -------- diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py index 2c90d7f2dee7..94a022dfe5e1 100644 --- a/python/pyspark/sql/streaming.py +++ b/python/pyspark/sql/streaming.py @@ -676,43 +676,15 @@ def parquet(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLook Parameters ---------- - mergeSchema : str or bool, optional - sets whether we should merge schemas collected from all - Parquet part-files. This will override - ``spark.sql.parquet.mergeSchema``. The default value is specified in - ``spark.sql.parquet.mergeSchema``. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. - It does not change the behavior of `partition discovery`_. - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option - disables - `partition discovery `_. # noqa - datetimeRebaseMode : str, optional - the rebasing mode for the values of the ``DATE``, ``TIMESTAMP_MICROS``, - ``TIMESTAMP_MILLIS`` logical types from the Julian to Proleptic Gregorian calendar. - - * ``EXCEPTION``: Spark fails in reads of ancient dates/timestamps - that are ambiguous between the two calendars. - * ``CORRECTED``: loading of dates/timestamps without rebasing. - * ``LEGACY``: perform rebasing of ancient dates/timestamps from the Julian - to Proleptic Gregorian calendar. - - If None is set, the value of the SQL config - ``spark.sql.parquet.datetimeRebaseModeInRead`` is used by default. - int96RebaseMode : str, optional - the rebasing mode for ``INT96`` timestamps from the Julian to - Proleptic Gregorian calendar. - - * ``EXCEPTION``: Spark fails in reads of ancient ``INT96`` timestamps - that are ambiguous between the two calendars. - * ``CORRECTED``: loading of ``INT96`` timestamps without rebasing. - * ``LEGACY``: perform rebasing of ancient ``INT96`` timestamps from the Julian - to Proleptic Gregorian calendar. - - If None is set, the value of the SQL config - ``spark.sql.parquet.int96RebaseModeInRead`` is used by default. + path : str + the path in any Hadoop supported file system + + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option `_. # noqa + in the version you use. Examples -------- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index f7e1903da687..38495009d11c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -812,46 +812,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { /** * Loads a Parquet file, returning the result as a `DataFrame`. * - * You can set the following Parquet-specific option(s) for reading Parquet files: - *
- * <ul>
- * <li>`mergeSchema` (default is the value specified in `spark.sql.parquet.mergeSchema`): sets
- * whether we should merge schemas collected from all Parquet part-files. This will override
- * `spark.sql.parquet.mergeSchema`.</li>
- * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
- * the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
- * It does not change the behavior of partition discovery.</li>
- * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
- * modification times occurring before the specified Time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
- * modification times occurring after the specified Time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
- * disables partition discovery</li>
- * <li>`datetimeRebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.datetimeRebaseModeInRead`): the rebasing mode for the values
- * of the `DATE`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS` logical types from the Julian to
- * Proleptic Gregorian calendar:
- * <ul>
- *   <li>`EXCEPTION` : Spark fails in reads of ancient dates/timestamps that are ambiguous
- *   between the two calendars</li>
- *   <li>`CORRECTED` : loading of dates/timestamps without rebasing</li>
- *   <li>`LEGACY` : perform rebasing of ancient dates/timestamps from the Julian to Proleptic
- *   Gregorian calendar</li>
- * </ul>
- * </li>
- * <li>`int96RebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.int96RebaseModeInRead`): the rebasing mode for `INT96` timestamps
- * from the Julian to Proleptic Gregorian calendar:
- * <ul>
- *   <li>`EXCEPTION` : Spark fails in reads of ancient `INT96` timestamps that are ambiguous
- *   between the two calendars</li>
- *   <li>`CORRECTED` : loading of timestamps without rebasing</li>
- *   <li>`LEGACY` : perform rebasing of ancient `INT96` timestamps from the Julian to Proleptic
- *   Gregorian calendar</li>
- * </ul>
- * </li>
- * </ul>
+ * Parquet-specific option(s) for reading Parquet files can be found in
+ * <a href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
+ * Data Source Option</a> in the version you use.
 *
 * @since 1.4.0
 */
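As a further sketch of the reader options this hunk now defers to the documentation page for, the rebase modes are plain string options on `DataFrameReader` (assumes an existing `SparkSession` named `spark`; the path is hypothetical):

```scala
// Load ancient dates/timestamps as-is, without rebasing them from the Julian calendar.
// Accepted values for both options are EXCEPTION, CORRECTED, and LEGACY.
val legacyData = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .option("int96RebaseMode", "CORRECTED")
  .parquet("/data/legacy_events")
```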
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index fe6572cff5de..57e6824c2f8d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -860,13 +860,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * format("parquet").save(path) * }}} *
- * You can set the following Parquet-specific option(s) for writing Parquet files:
- * <ul>
- * <li>`compression` (default is the value specified in `spark.sql.parquet.compression.codec`):
- * compression codec to use when saving to file. This can be one of the known case-insensitive
- * shorten names(`none`, `uncompressed`, `snappy`, `gzip`, `lzo`, `brotli`, `lz4`, and `zstd`).
- * This will override `spark.sql.parquet.compression.codec`.</li>
- * </ul>
+ * Parquet-specific option(s) for writing Parquet files can be found in
+ * <a href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
+ * Data Source Option</a> in the version you use.
 *
 * @since 1.4.0
 */
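The `compression` option removed from this Scaladoc overrides the session-wide `spark.sql.parquet.compression.codec`, so the two can be combined as sketched below (assumes an existing `SparkSession` named `spark` and a DataFrame `df`; paths and codec choices are illustrative):

```scala
// Session-wide default codec for Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

// This write overrides the session default and uses gzip instead.
df.write
  .option("compression", "gzip")
  .parquet("/data/out_gzip")

// This write falls back to the session default (snappy).
df.write.parquet("/data/out_default")
```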
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala index 1798f6e2c88b..9f5de11f26c2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala @@ -476,44 +476,17 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo /** * Loads a Parquet file stream, returning the result as a `DataFrame`. *
- * You can set the following Parquet-specific option(s) for reading Parquet files:
+ * You can set the following option(s):
 * <ul>
 * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
 * considered in every trigger.</li>
- * <li>`mergeSchema` (default is the value specified in `spark.sql.parquet.mergeSchema`): sets
- * whether we should merge schemas collected from all
- * Parquet part-files. This will override
- * `spark.sql.parquet.mergeSchema`.</li>
- * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
- * the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
- * It does not change the behavior of partition discovery.</li>
- * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
- * disables partition discovery</li>
- * <li>`datetimeRebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.datetimeRebaseModeInRead`): the rebasing mode for the values
- * of the `DATE`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS` logical types from the Julian to
- * Proleptic Gregorian calendar:
- * <ul>
- *   <li>`EXCEPTION` : Spark fails in reads of ancient dates/timestamps that are ambiguous
- *   between the two calendars</li>
- *   <li>`CORRECTED` : loading of dates/timestamps without rebasing</li>
- *   <li>`LEGACY` : perform rebasing of ancient dates/timestamps from the Julian to Proleptic
- *   Gregorian calendar</li>
- * </ul>
- * </li>
- * <li>`int96RebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.int96RebaseModeInRead`): the rebasing mode for `INT96` timestamps
- * from the Julian to Proleptic Gregorian calendar:
- * <ul>
- *   <li>`EXCEPTION` : Spark fails in reads of ancient `INT96` timestamps that are ambiguous
- *   between the two calendars</li>
- *   <li>`CORRECTED` : loading of timestamps without rebasing</li>
- *   <li>`LEGACY` : perform rebasing of ancient `INT96` timestamps from the Julian to Proleptic
- *   Gregorian calendar</li>
- * </ul>
- * </li>
- * </ul>
 * </ul>
 *
+ * Parquet-specific option(s) for reading Parquet file stream can be found in
+ * <a href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
+ * Data Source Option</a> in the version you use.
+ *
 * @since 2.0.0
 */
def parquet(path: String): DataFrame = {
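For completeness, a small sketch of the streaming Parquet source whose Scaladoc is trimmed above (the schema, paths, and console sink are illustrative assumptions; file-based streaming sources require a user-specified schema):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Assumes an existing SparkSession named `spark`.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val parquetStream = spark.readStream
  .schema(schema)                      // required for file-based streaming sources
  .option("maxFilesPerTrigger", "10")  // cap the new files considered per micro-batch
  .parquet("/data/incoming")

val query = parquetStream.writeStream
  .format("console")
  .start()
```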