
Commit db121a2

[SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options
## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](#22622 (comment)), this PR aims to add more detailed information and examples.

This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2.

## How was this patch tested?

Manual.

Closes #22839 from dongjoon-hyun/SPARK-25656-2.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent 1b075f2 commit db121a2

6 files changed: 68 additions, 1 deletion


docs/sql-data-sources-load-save-functions.md

Lines changed: 43 additions & 0 deletions
@@ -82,6 +82,49 @@ To load a CSV file you can use:
 </div>
 </div>

+The extra options are also used during write operations.
+For example, you can control bloom filters and dictionary encodings for ORC data sources.
+The following ORC example will create a bloom filter on `favorite_color` and use dictionary encoding for `name` and `favorite_color`.
+For Parquet, there exists `parquet.enable.dictionary`, too.
+To find more detailed information about the extra ORC/Parquet options,
+visit the official Apache ORC/Parquet websites.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+{% include_example manual_save_options_orc python/sql/datasource.py %}
+</div>
+
+<div data-lang="r" markdown="1">
+{% include_example manual_save_options_orc r/RSparkSQLExample.R %}
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+CREATE TABLE users_with_options (
+  name STRING,
+  favorite_color STRING,
+  favorite_numbers array<integer>
+) USING ORC
+OPTIONS (
+  orc.bloom.filter.columns 'favorite_color',
+  orc.dictionary.key.threshold '1.0'
+)
+{% endhighlight %}
+
+</div>
+
+</div>
+
 ### Run SQL on files directly

 Instead of using read API to load a file into DataFrame and query it, you can also query that
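
Aside, not part of the commit above: the new doc text mentions `parquet.enable.dictionary`, but only an ORC example is included. A minimal Scala sketch of the equivalent Parquet write might look like the following, assuming the `usersDF` DataFrame from the examples below and an illustrative output path:

```scala
// Sketch only (not in this commit): pass a Parquet-specific writer option
// the same way the ORC examples pass theirs. `parquet.enable.dictionary`
// controls Parquet dictionary encoding; the output path is illustrative.
usersDF.write.format("parquet")
  .option("parquet.enable.dictionary", "true")
  .save("users_with_options.parquet")
```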

examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java

Lines changed: 6 additions & 0 deletions
@@ -123,6 +123,12 @@ private static void runBasicDataSourceExample(SparkSession spark) {
       .option("header", "true")
       .load("examples/src/main/resources/people.csv");
     // $example off:manual_load_options_csv$
+    // $example on:manual_save_options_orc$
+    usersDF.write().format("orc")
+      .option("orc.bloom.filter.columns", "favorite_color")
+      .option("orc.dictionary.key.threshold", "1.0")
+      .save("users_with_options.orc");
+    // $example off:manual_save_options_orc$
     // $example on:direct_sql$
     Dataset<Row> sqlDF =
       spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");

examples/src/main/python/sql/datasource.py

Lines changed: 8 additions & 0 deletions
@@ -57,6 +57,14 @@ def basic_datasource_example(spark):
                          format="csv", sep=":", inferSchema="true", header="true")
     # $example off:manual_load_options_csv$

+    # $example on:manual_save_options_orc$
+    df = spark.read.orc("examples/src/main/resources/users.orc")
+    (df.write.format("orc")
+        .option("orc.bloom.filter.columns", "favorite_color")
+        .option("orc.dictionary.key.threshold", "1.0")
+        .save("users_with_options.orc"))
+    # $example off:manual_save_options_orc$
+
     # $example on:write_sorting_and_bucketing$
     df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
     # $example off:write_sorting_and_bucketing$

examples/src/main/r/RSparkSQLExample.R

Lines changed: 5 additions & 1 deletion
@@ -114,10 +114,14 @@ write.df(namesAndAges, "namesAndAges.parquet", "parquet")


 # $example on:manual_load_options_csv$
-df <- read.df("examples/src/main/resources/people.csv", "csv", sep=";", inferSchema=T, header=T)
+df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)
 namesAndAges <- select(df, "name", "age")
 # $example off:manual_load_options_csv$

+# $example on:manual_save_options_orc$
+df <- read.df("examples/src/main/resources/users.orc", "orc")
+write.orc(df, "users_with_options.orc", orc.bloom.filter.columns = "favorite_color", orc.dictionary.key.threshold = 1.0)
+# $example off:manual_save_options_orc$

 # $example on:direct_sql$
 df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
examples/src/main/resources/users.orc

547 Bytes (new binary file, not shown)

examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala

Lines changed: 6 additions & 0 deletions
@@ -56,6 +56,12 @@ object SQLDataSourceExample {
       .option("header", "true")
       .load("examples/src/main/resources/people.csv")
     // $example off:manual_load_options_csv$
+    // $example on:manual_save_options_orc$
+    usersDF.write.format("orc")
+      .option("orc.bloom.filter.columns", "favorite_color")
+      .option("orc.dictionary.key.threshold", "1.0")
+      .save("users_with_options.orc")
+    // $example off:manual_save_options_orc$

     // $example on:direct_sql$
     val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
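
As a side note, not part of the commit: the chained `.option` calls in the Scala example above could also be supplied in a single call through `DataFrameWriter.options`, which takes a `Map` of data source options:

```scala
// Sketch only, assuming the usersDF DataFrame from the example above.
// .options(Map(...)) is equivalent to chaining individual .option calls.
usersDF.write.format("orc")
  .options(Map(
    "orc.bloom.filter.columns" -> "favorite_color",
    "orc.dictionary.key.threshold" -> "1.0"))
  .save("users_with_options.orc")
```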
