
Commit ccdf21f

jomach authored and gatorsmile committed
[SPARK-20055][DOCS] Added documentation for loading csv files into DataFrames
## What changes were proposed in this pull request?

Added documentation for loading CSV files into DataFrames.

## How was this patch tested?

./dev/run-tests

Author: Jorge Machado <[email protected]>

Closes #19429 from jomach/master.
1 parent 645e108 commit ccdf21f
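
In short, the SQL programming guide gains a "To load a CSV file" section backed by runnable snippets in Scala, Java, Python, and R, plus a sample people.csv. A minimal sketch of the pattern the new examples document (the session setup and surrounding object are illustrative, not part of the commit):

import org.apache.spark.sql.SparkSession

object CsvLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvLoadSketch").getOrCreate()
    // Read a semicolon-delimited CSV with a header row, letting Spark infer types.
    val people = spark.read.format("csv")
      .option("sep", ";")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("examples/src/main/resources/people.csv")
    people.select("name", "age").show()
    spark.stop()
  }
}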

File tree (6 files changed: +56 −5 lines)

docs/sql-programming-guide.md
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
examples/src/main/python/sql/datasource.py
examples/src/main/r/RSparkSQLExample.R
examples/src/main/resources/people.csv
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala

docs/sql-programming-guide.md

Lines changed: 27 additions & 5 deletions
@@ -461,6 +461,8 @@ name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can al
 names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
 source type can be converted into other types using this syntax.
 
+To load a JSON file you can use:
+
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
@@ -479,6 +481,26 @@ source type can be converted into other types using this syntax.
 </div>
 </div>
 
+To load a CSV file you can use:
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+{% include_example manual_load_options_csv python/sql/datasource.py %}
+</div>
+
+<div data-lang="r" markdown="1">
+{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
+
+</div>
+</div>
 ### Run SQL on files directly
 
 Instead of using read API to load a file into DataFrame and query it, you can also query that
@@ -573,7 +595,7 @@ Note that partition information is not gathered by default when creating externa
 
 ### Bucketing, Sorting and Partitioning
 
-For file-based data source, it is also possible to bucket and sort or partition the output.
+For file-based data source, it is also possible to bucket and sort or partition the output.
 Bucketing and sorting are applicable only to persistent tables:
 
 <div class="codetabs">
@@ -598,7 +620,7 @@ CREATE TABLE users_bucketed_by_name(
 name STRING,
 favorite_color STRING,
 favorite_numbers array<integer>
-) USING parquet
+) USING parquet
 CLUSTERED BY(name) INTO 42 BUCKETS;
 
 {% endhighlight %}
@@ -629,7 +651,7 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
 {% highlight sql %}
 
 CREATE TABLE users_by_favorite_color(
-name STRING,
+name STRING,
 favorite_color STRING,
 favorite_numbers array<integer>
 ) USING csv PARTITIONED BY(favorite_color);
@@ -664,7 +686,7 @@ CREATE TABLE users_bucketed_and_partitioned(
 name STRING,
 favorite_color STRING,
 favorite_numbers array<integer>
-) USING parquet
+) USING parquet
 PARTITIONED BY (favorite_color)
 CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
 
@@ -675,7 +697,7 @@ CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
 </div>
 
 `partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
-Thus, it has limited applicability to columns with high cardinality. In contrast
+Thus, it has limited applicability to columns with high cardinality. In contrast
 `bucketBy` distributes
 data across a fixed number of buckets and can be used when a number of unique values is unbounded.
 
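
The closing paragraph above contrasts `partitionBy` (one directory per distinct value, so low-cardinality columns only) with `bucketBy` (a fixed bucket count, usable when distinct values are unbounded). A hedged Scala sketch of the two write paths, assuming a `usersDF` with the columns from the CREATE TABLE examples:

// partitionBy: one output directory per distinct favorite_color value.
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("users_by_favorite_color.parquet")

// bucketBy: hashes name into a fixed 42 buckets; per the guide text, bucketing
// applies only to persistent tables, hence saveAsTable rather than save.
usersDF.write
  .bucketBy(42, "name")
  .saveAsTable("users_bucketed_by_name")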

examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java

Lines changed: 7 additions & 0 deletions
@@ -116,6 +116,13 @@ private static void runBasicDataSourceExample(SparkSession spark) {
       spark.read().format("json").load("examples/src/main/resources/people.json");
     peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
     // $example off:manual_load_options$
+    // $example on:manual_load_options_csv$
+    Dataset<Row> peopleDFCsv = spark.read().format("csv")
+      .option("sep", ";")
+      .option("inferSchema", "true")
+      .option("header", "true")
+      .load("examples/src/main/resources/people.csv");
+    // $example off:manual_load_options_csv$
     // $example on:direct_sql$
     Dataset<Row> sqlDF =
       spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");

examples/src/main/python/sql/datasource.py

Lines changed: 5 additions & 0 deletions
@@ -53,6 +53,11 @@ def basic_datasource_example(spark):
     df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
     # $example off:manual_load_options$
 
+    # $example on:manual_load_options_csv$
+    df = spark.read.load("examples/src/main/resources/people.csv",
+                         format="csv", sep=";", inferSchema="true", header="true")
+    # $example off:manual_load_options_csv$
+
     # $example on:write_sorting_and_bucketing$
     df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
     # $example off:write_sorting_and_bucketing$
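
In the Python form, the keyword arguments after `format` are passed straight through as data source options; it is the same option set the Scala and Java examples spell out with `.option(...)`. For completeness, the Scala reader also accepts them as a single map via `options` (a sketch):

// Same options as the Python keyword arguments, supplied in one map.
val peopleDFCsv = spark.read
  .options(Map("sep" -> ";", "inferSchema" -> "true", "header" -> "true"))
  .format("csv")
  .load("examples/src/main/resources/people.csv")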

examples/src/main/r/RSparkSQLExample.R

Lines changed: 6 additions & 0 deletions
@@ -113,6 +113,12 @@ write.df(namesAndAges, "namesAndAges.parquet", "parquet")
 # $example off:manual_load_options$
 
 
+# $example on:manual_load_options_csv$
+df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)
+namesAndAges <- select(df, "name", "age")
+# $example off:manual_load_options_csv$
+
+
 # $example on:direct_sql$
 df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
 # $example off:direct_sql$
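
The `sep`, `inferSchema`, and `header` arguments in the R snippet matter: with CSV defaults (comma delimiter, no header) there would be no `name` or `age` column for `select(df, "name", "age")` to resolve. A hedged Scala sketch of that failure mode:

// With defaults, each ;-delimited line parses as one field, so the frame has
// a single string column named _c0 and selecting "name" fails analysis.
val misparsed = spark.read.format("csv")
  .load("examples/src/main/resources/people.csv")
misparsed.printSchema()
// root
//  |-- _c0: string (nullable = true)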
examples/src/main/resources/people.csv

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+name;age;job
+Jorge;30;Developer
+Bob;32;Developer
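
Given this header and data, the options used throughout the examples (`sep=";"`, `header=true`, `inferSchema=true`) should yield a three-column frame, with `age` inferred as an integer from the values 30 and 32. A sketch of checking that in Scala:

val people = spark.read.format("csv")
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("examples/src/main/resources/people.csv")
people.printSchema()
// Expected, assuming inference picks integer for 30/32:
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)
//  |-- job: string (nullable = true)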

examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala

Lines changed: 8 additions & 0 deletions
@@ -49,6 +49,14 @@ object SQLDataSourceExample {
     val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
     peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
     // $example off:manual_load_options$
+    // $example on:manual_load_options_csv$
+    val peopleDFCsv = spark.read.format("csv")
+      .option("sep", ";")
+      .option("inferSchema", "true")
+      .option("header", "true")
+      .load("examples/src/main/resources/people.csv")
+    // $example off:manual_load_options_csv$
+
     // $example on:direct_sql$
    val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
     // $example off:direct_sql$
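
To round the Scala example off the same way the `manual_load_options` block above it does for JSON, the loaded CSV frame could be projected and written back out in another format (output path is illustrative, not from the commit):

// Project two columns of the CSV-backed frame and persist them as Parquet.
peopleDFCsv.select("name", "age")
  .write.format("parquet")
  .save("namesAndAgesFromCsv.parquet")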
