44 changes: 44 additions & 0 deletions docs/sql-data-sources-load-save-functions.md
@@ -82,6 +82,50 @@ To load a CSV file you can use:
</div>
</div>

The extra options are also used during the write operation.
For example, you can control bloom filters and dictionary encodings for ORC data sources;
a bloom filter speeds up selective reads on the indexed columns, while dictionary encoding helps columns with few distinct values.
The following ORC example will create a bloom filter and use dictionary encoding only for `favorite_color`.
For Parquet, there is a corresponding `parquet.enable.dictionary` option (a Parquet sketch follows the ORC example below).
To find more detailed information about the extra ORC/Parquet options,
visit the official Apache ORC / Parquet websites.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example manual_save_options_orc python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example manual_save_options_orc r/RSparkSQLExample.R %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}
CREATE TABLE users_with_options (
name STRING,
favorite_color STRING,
favorite_numbers array<integer>
) USING ORC
OPTIONS (
orc.bloom.filter.columns 'favorite_color',
orc.dictionary.key.threshold '1.0',
orc.column.encoding.direct 'name'
)
{% endhighlight %}

</div>

Member Author: Could you review this, @gatorsmile? This is the example we discussed previously.

</div>
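As a companion to the ORC example above, here is a minimal sketch (not part of this patch) of the same pass-through mechanism for Parquet. Only `parquet.enable.dictionary` is named in the text above; the `usersDF` DataFrame and the output path are illustrative assumptions.

{% highlight scala %}
// Assumption: usersDF was loaded earlier, e.g.
// val usersDF = spark.read.load("examples/src/main/resources/users.parquet")

// Extra options are handed to the Parquet writer just as the ORC options above
// are handed to the ORC writer.
usersDF.write.format("parquet")
  .option("parquet.enable.dictionary", "true") // enable dictionary encoding
  .mode("overwrite")
  .save("users_with_parquet_options.parquet")  // hypothetical output path
{% endhighlight %}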

### Run SQL on files directly

Instead of using the read API to load a file into a DataFrame and query it, you can also query that
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
@@ -123,6 +123,13 @@ private static void runBasicDataSourceExample(SparkSession spark) {
.option("header", "true")
.load("examples/src/main/resources/people.csv");
// $example off:manual_load_options_csv$
// $example on:manual_save_options_orc$
usersDF.write().format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.option("orc.column.encoding.direct", "name")
.save("users_with_options.orc");
// $example off:manual_save_options_orc$
// $example on:direct_sql$
Dataset<Row> sqlDF =
spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
9 changes: 9 additions & 0 deletions examples/src/main/python/sql/datasource.py
@@ -57,6 +57,15 @@ def basic_datasource_example(spark):
format="csv", sep=":", inferSchema="true", header="true")
# $example off:manual_load_options_csv$

# $example on:manual_save_options_orc$
df = spark.read.orc("examples/src/main/resources/users.orc")
(df.write.format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.option("orc.column.encoding.direct", 'name')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use same quote? " or ' for name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep!

.save("users_with_options.orc"))
# $example off:manual_save_options_orc$

# $example on:write_sorting_and_bucketing$
df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
# $example off:write_sorting_and_bucketing$
4 changes: 4 additions & 0 deletions examples/src/main/r/RSparkSQLExample.R
@@ -118,6 +118,10 @@ df <- read.df("examples/src/main/resources/people.csv", "csv", sep=";", inferSch
namesAndAges <- select(df, "name", "age")
# $example off:manual_load_options_csv$

# $example on:manual_save_options_orc$
df <- read.df("examples/src/main/resources/users.orc", "orc")
write.orc(df, "users_with_options.orc", mode = "overwrite", orc.bloom.filter.columns = "favorite_color", orc.dictionary.key.threshold = 1.0, orc.column.encoding.direct = "name")
# $example off:manual_save_options_orc$

Member Author (@dongjoon-hyun, Oct 23, 2018): Hi, @felixcheung. Could you review this PR?
Member: we should put a space after each param (same for the csv example above): orc.bloom.filter.columns = "favorite_color", orc.dictionary.key.threshold = 1.0, orc.column.encoding.direct = "name"
Member Author: Thank you!

# $example on:direct_sql$
df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
Binary file added examples/src/main/resources/users.orc
Binary file not shown.
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala
@@ -56,6 +56,13 @@ object SQLDataSourceExample {
.option("header", "true")
.load("examples/src/main/resources/people.csv")
// $example off:manual_load_options_csv$
// $example on:manual_save_options_orc$
usersDF.write.format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.option("orc.column.encoding.direct", "name")
.save("users_with_options.orc")
// $example off:manual_save_options_orc$

Member Author: Also, cc @dbtsai. This doc is only for Spark 3.0.0, since orc.column.encoding.direct was added to the master branch.

// $example on:direct_sql$
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
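For reference, here is a hedged sketch (not part of this patch) of the same Scala write on a build that predates orc.column.encoding.direct, as the review note above suggests, followed by a simple read-back check. It reuses usersDF and the output path from the example.

{% highlight scala %}
// Variant that omits the encoding option for builds where it does not exist yet.
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .mode("overwrite")
  .save("users_with_options.orc")

// The writer options only change the file layout (bloom filters, encodings),
// not the rows, so reading the output back returns the original data.
spark.read.orc("users_with_options.orc").show()
{% endhighlight %}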