[SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames #19429
Changes from 6 commits
f5941bf
812bdf7
4e4a02b
a2ec38a
793628b
cd69fa2
68799ed
7ff1d84
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -461,6 +461,8 @@ name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can al | |
| names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data | ||
| source type can be converted into other types using this syntax. | ||
|
|
||
| To load a json file you can use: | ||
|
|
||
| <div class="codetabs"> | ||
| <div data-lang="scala" markdown="1"> | ||
| {% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} | ||
|
|
@@ -479,6 +481,25 @@ source type can be converted into other types using this syntax. | |
| </div> | ||
| </div> | ||
|
|
||
| To load a csv file you can use: | ||
|
Member
ditto
||
|
|
||
| <div class="codetabs"> | ||
| <div data-lang="scala" markdown="1"> | ||
| {% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} | ||
| </div> | ||
|
|
||
| <div data-lang="java" markdown="1"> | ||
| {% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} | ||
| </div> | ||
|
|
||
| <div data-lang="python" markdown="1"> | ||
| {% include_example manual_load_options_csv python/sql/datasource.py %} | ||
| </div> | ||
|
|
||
| <div data-lang="r" markdown="1"> | ||
| {% include_example manual_load_options_csv r/RSparkSQLExample.R %} | ||
| </div> | ||
| </div> | ||
|
Member
Let's add another newline here. It breaks rendering.
||
| ### Run SQL on files directly | ||
|
Member
Yup, that's okay. BTW, what I initially meant in #19429 (comment) was a newline in between. Let's not forget to fix this up before the release if the follow-up can't be made before then.
Contributor
Author
@HyukjinKwon should I add a newline between lines 503 and 504?
Member
Yup, a newline between 503 and 504.
||
|
|
||
| Instead of using read API to load a file into DataFrame and query it, you can also query that | ||
|
|
@@ -573,7 +594,7 @@ Note that partition information is not gathered by default when creating externa | |
|
|
||
| ### Bucketing, Sorting and Partitioning | ||
|
|
||
| For file-based data source, it is also possible to bucket and sort or partition the output. | ||
| For file-based data source, it is also possible to bucket and sort or partition the output. | ||
| Bucketing and sorting are applicable only to persistent tables: | ||
|
|
||
| <div class="codetabs"> | ||
|
|
@@ -598,7 +619,7 @@ CREATE TABLE users_bucketed_by_name( | |
| name STRING, | ||
| favorite_color STRING, | ||
| favorite_numbers array<integer> | ||
| ) USING parquet | ||
| ) USING parquet | ||
| CLUSTERED BY(name) INTO 42 BUCKETS; | ||
|
|
||
| {% endhighlight %} | ||
|
|
@@ -629,7 +650,7 @@ while partitioning can be used with both `save` and `saveAsTable` when using the | |
| {% highlight sql %} | ||
|
|
||
| CREATE TABLE users_by_favorite_color( | ||
| name STRING, | ||
| name STRING, | ||
| favorite_color STRING, | ||
| favorite_numbers array<integer> | ||
| ) USING csv PARTITIONED BY(favorite_color); | ||
|
|
@@ -664,7 +685,7 @@ CREATE TABLE users_bucketed_and_partitioned( | |
| name STRING, | ||
| favorite_color STRING, | ||
| favorite_numbers array<integer> | ||
| ) USING parquet | ||
| ) USING parquet | ||
| PARTITIONED BY (favorite_color) | ||
| CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS; | ||
|
|
||
|
|
@@ -675,7 +696,7 @@ CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS; | |
| </div> | ||
|
|
||
| `partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. | ||
| Thus, it has limited applicability to columns with high cardinality. In contrast | ||
| Thus, it has limited applicability to columns with high cardinality. In contrast | ||
| `bucketBy` distributes | ||
| data across a fixed number of buckets and can be used when a number of unique values is unbounded. | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -116,6 +116,13 @@ private static void runBasicDataSourceExample(SparkSession spark) { | |
| spark.read().format("json").load("examples/src/main/resources/people.json"); | ||
| peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet"); | ||
| // $example off:manual_load_options$ | ||
| // $example on:manual_load_options_csv$ | ||
| Dataset<Row> peopleDFCsv = spark.read().format("csv") | ||
| .option("sep", ";") | ||
| .option("inferSchema", "true") | ||
| .option("header", "true") | ||
|
Member
Could you change the indentation of lines 121-123 to 2 spaces?
||
| .load("examples/src/main/resources/people.csv"); | ||
| // $example off:manual_load_options_csv$ | ||
|
Member
Lines 125-131 are a duplicate.
||
| // $example on:direct_sql$ | ||
| Dataset<Row> sqlDF = | ||
| spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -112,6 +112,11 @@ namesAndAges <- select(df, "name", "age") | |
| write.df(namesAndAges, "namesAndAges.parquet", "parquet") | ||
| # $example off:manual_load_options$ | ||
|
|
||
| # $example on:manual_load_options_csv$ | ||
|
Member
I'd add a newline above here to stay consistent within this file.
||
| df <- read.df("examples/src/main/resources/people.csv", "csv") | ||
| namesAndAges <- select(df, "name", "age") | ||
| # $example off:manual_load_options_csv$ | ||
|
|
||
|
|
||
| # $example on:direct_sql$ | ||
| df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`") | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| name;age;job | ||
| Jorge;30;Developer | ||
| Bob;32;Developer |
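To make the effect of the `sep` and `header` options used in the examples concrete, here is a minimal sketch that parses the same three lines with Python's standard `csv` module instead of Spark. The helper name `parse_people` is hypothetical, and the integer conversion is only a rough stand-in for `inferSchema`, not Spark's actual inference logic.

```python
import csv
import io

# The three lines added to examples/src/main/resources/people.csv in this PR.
PEOPLE_CSV = "name;age;job\nJorge;30;Developer\nBob;32;Developer\n"


def parse_people(text, sep=";"):
    """Parse delimited text, treating the first line as the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    rows = []
    for row in reader:
        # Rough stand-in for inferSchema: purely numeric fields become ints.
        rows.append({k: int(v) if v.isdigit() else v for k, v in row.items()})
    return rows


rows = parse_people(PEOPLE_CSV)
for row in rows:
    print(row)
```

With the default comma separator the whole header line would be read as a single column name, which is why the examples set `sep` to `;` for this file.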
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -49,6 +49,14 @@ object SQLDataSourceExample { | |
| val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json") | ||
| peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet") | ||
| // $example off:manual_load_options$ | ||
| // $example on:manual_load_options_csv$ | ||
| val peopleDFCsv = spark.read.format("csv") | ||
| .option("sep", ";") | ||
| .option("inferSchema", "true") | ||
| .option("header", "true") | ||
| .load("examples/src/main/resources/people.csv") | ||
|
Member
Could you change the indentation of lines 54-57 to 2 spaces?
||
| // $example off:manual_load_options_csv$ | ||
|
|
||
| // $example on:direct_sql$ | ||
| val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`") | ||
| // $example off:direct_sql$ | ||
|
|
||
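The docs change also includes `python/sql/datasource.py`, but its diff is not shown above. As a rough sketch of what an equivalent PySpark call could look like, assuming the same options as the Scala and Java examples, the snippet below takes an existing `SparkSession` as an argument. The names `CSV_OPTIONS` and `load_people_csv` are hypothetical and are not taken from the PR.

```python
# Options mirroring the Scala and Java examples in this PR (assumed to carry
# over unchanged to the Python example).
CSV_OPTIONS = {"sep": ";", "inferSchema": "true", "header": "true"}


def load_people_csv(spark, path="examples/src/main/resources/people.csv"):
    """Load the semicolon-separated sample file using an existing SparkSession."""
    # DataFrameReader.format / options / load are chained as in the Scala example.
    return spark.read.format("csv").options(**CSV_OPTIONS).load(path)
```

Given a running session named `spark`, `load_people_csv(spark).select("name", "age")` would then select the two columns named in the header row.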

I'd say `JSON` instead of `json`.