-
Notifications
You must be signed in to change notification settings - Fork 29.2k
[SPARK-20431][SQL] Specify a schema by using a DDL-formatted string #17719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -96,14 +96,17 @@ def schema(self, schema): | |
| By specifying the schema here, the underlying data source can skip the schema | ||
| inference step, and thus speed up data loading. | ||
|
|
||
| :param schema: a :class:`pyspark.sql.types.StructType` object | ||
| :param schema: a :class:`pyspark.sql.types.StructType` object or a DDL-formatted string | ||
| """ | ||
| from pyspark.sql import SparkSession | ||
| if not isinstance(schema, StructType): | ||
| raise TypeError("schema should be StructType") | ||
| spark = SparkSession.builder.getOrCreate() | ||
| jschema = spark._jsparkSession.parseDataType(schema.json()) | ||
| self._jreader = self._jreader.schema(jschema) | ||
| if isinstance(schema, StructType): | ||
| jschema = spark._jsparkSession.parseDataType(schema.json()) | ||
| self._jreader = self._jreader.schema(jschema) | ||
| elif isinstance(schema, basestring): | ||
| self._jreader = self._jreader.schema(schema) | ||
| else: | ||
| raise TypeError("schema should be StructType") | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Update this message?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yea, I'll do |
||
| return self | ||
|
|
||
| @since(1.5) | ||
|
|
@@ -137,7 +140,8 @@ def load(self, path=None, format=None, schema=None, **options): | |
|
|
||
| :param path: optional string or a list of string for file-system backed data sources. | ||
| :param format: optional string for format of the data source. Default to 'parquet'. | ||
| :param schema: optional :class:`pyspark.sql.types.StructType` for the input schema. | ||
| :param schema: optional :class:`pyspark.sql.types.StructType` for the input schema | ||
| or a DDL-formatted string. | ||
| :param options: all other string options | ||
|
|
||
| >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True, | ||
|
|
@@ -181,7 +185,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, | |
|
|
||
| :param path: string represents path to the JSON dataset, or a list of paths, | ||
| or RDD of Strings storing JSON objects. | ||
| :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema. | ||
| :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema or | ||
| a DDL-formatted string. | ||
| :param primitivesAsString: infers all primitive values as a string type. If None is set, | ||
| it uses the default value, ``false``. | ||
| :param prefersDecimal: infers all floating-point values as a decimal type. If the values | ||
|
|
@@ -324,7 +329,8 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non | |
| ``inferSchema`` option or specify the schema explicitly using ``schema``. | ||
|
|
||
| :param path: string, or list of strings, for input path(s). | ||
| :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema. | ||
| :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema | ||
| or a DDL-formatted string. | ||
| :param sep: sets the single character as a separator for each field and value. | ||
| If None is set, it uses the default value, ``,``. | ||
| :param encoding: decodes the CSV files by the given encoding type. If None is set, | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -67,6 +67,18 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { | |
| this | ||
| } | ||
|
|
||
| /** | ||
| * Specifies the schema by using the input DDL-formatted string. Some data sources (e.g. JSON) can | ||
| * infer the input schema automatically from data. By specifying the schema here, the underlying | ||
| * data source can skip the schema inference step, and thus speed up data loading. | ||
| * | ||
| * @since 2.3.0 | ||
| */ | ||
| def schema(schemaString: String): DataFrameReader = { | ||
| this.userSpecifiedSchema = Option(StructType.fromDDL(schemaString)) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change will make PySpark API inconsistent with the Scala API
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, but I probably missed your point. What's the API consistency you pointed out here?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I misread the Python codes. |
||
| this | ||
| } | ||
|
|
||
| /** | ||
| * Adds an input option for the underlying data source. | ||
| * | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give an example here to users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok