Conversation

gengliangwang (Member) commented Mar 14, 2019

What changes were proposed in this pull request?

Currently, DataFrameReader/DataFrameWriter supports setting Hadoop configurations via the .option() method.
For example, the following test case should pass with both ORC V1 and V2:

```scala
  import org.apache.hadoop.fs.{Path, PathFilter}

  class TestFileFilter extends PathFilter {
    override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
  }

  withTempPath { dir =>
    val path = dir.getCanonicalPath

    val df = spark.range(2)
    df.write.orc(path + "/p=1")
    df.write.orc(path + "/p=2")
    val extraOptions = Map(
      "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
      "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
    )
    assert(spark.read.options(extraOptions).orc(path).count() === 2)
  }
```

However, Hadoop configurations are case sensitive, while the current data source V2 API uses CaseInsensitiveStringMap in the top-level entry point TableProvider.
To create Hadoop configurations correctly, I suggest:

  1. adding a new method asCaseSensitiveMap in CaseInsensitiveStringMap;
  2. making CaseInsensitiveStringMap read-only, to avoid ambiguous conversion in asCaseSensitiveMap (a usage sketch follows this list).
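A minimal sketch of how the case-sensitive view could be consumed (the helper class and method name here are made up for illustration; this is not the code in this PR):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, for illustration only.
class HadoopConfFromOptions {
  static Configuration withOptions(Configuration base, Map<String, String> caseSensitiveOptions) {
    Configuration conf = new Configuration(base);
    // Hadoop keys are case sensitive, so the entries must keep their original casing.
    caseSensitiveOptions.forEach(conf::set);
    return conf;
  }
}
```

With the method proposed here, the second argument would come from options.asCaseSensitiveMap().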

How was this patch tested?

Unit test

Member Author:

Things become a bit ambiguous here. E.g. after put("a", "b") and put("A", "B") on an empty CaseInsensitiveStringMap, originalMap has two new entries, while delegate has only one (illustrated below).
To me, this is the simplest way to maintain originalMap. Suggestions are welcome.
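For illustration only (plain HashMaps standing in for the two internal fields, not the actual class), this is the divergence being described:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PutAmbiguityDemo {
  public static void main(String[] args) {
    Map<String, String> originalMap = new HashMap<>(); // keeps the caller's casing
    Map<String, String> delegate = new HashMap<>();    // lower-cased keys for case-insensitive lookup

    for (String[] e : new String[][] {{"a", "b"}, {"A", "B"}}) {
      originalMap.put(e[0], e[1]);
      delegate.put(e[0].toLowerCase(Locale.ROOT), e[1]);
    }

    System.out.println(originalMap.size()); // 2 -- both casings kept
    System.out.println(delegate.size());    // 1 -- only "a" -> "B"; the two views now disagree
  }
}
```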

Member Author:

Things become a bit ambiguous here too.

gengliangwang (Member Author):

@rdblue @cloud-fan

rdblue (Contributor) commented Mar 14, 2019

@gengliangwang, is it required that Hadoop configuration options are passed through this map or is there another way?

gengliangwang (Member Author) commented Mar 14, 2019

Yes, I think so.
E.g.
DataFrameReader.load() => TableProvider.getTable() => ...

SparkQA commented Mar 14, 2019

Test build #103506 has finished for PR 24094 at commit 8271d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

AFAIK hadoop conf can be set in 3 ways:

  1. global level, via SparkContext.hadoopConfiguration
  2. session level, via SparkSession.conf
  3. operation level, via DataFrameReader/Writer.option

Levels 1 and 2 are fine, as they are case sensitive. The problem is level 3, as data source V2 treats options as case-insensitive (see the sketch below).
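A rough Java illustration of the three levels (the config key's value "com.example.MyFilter" is a placeholder; the comments paraphrase the levels listed above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class HadoopConfLevels {
  static Dataset<Row> read(SparkSession spark, String path) {
    // 1. Global level: the SparkContext's Hadoop Configuration, which is case sensitive.
    spark.sparkContext().hadoopConfiguration()
        .set("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    // 2. Session level (per the list above): entries on the session conf feed the
    //    per-session Hadoop conf, which is also case sensitive.
    spark.conf().set("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    // 3. Operation level: a per-read option. In data source V2 this reaches the source
    //    through CaseInsensitiveStringMap -- the problematic path discussed here.
    return spark.read()
        .option("mapreduce.input.pathFilter.class", "com.example.MyFilter")
        .orc(path);
  }
}
```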

There are 2 solutions I can think of

  1. Do not support operation level hadoop conf for data source v2.
  2. Keep the original case sensitive map.

I think 2 is more reasonable, which is what this PR is trying to do.

Contributor:

This should be new HashMap<>(originalMap.size()); otherwise we add the data to it twice.

Contributor:

I don't think we should pollute SessionState with the case insensitive map stuff. Can we inline this method?

Member Author:

Otherwise, developers might not be aware that they should use .getOriginalMap when creating a Hadoop configuration from a CaseInsensitiveStringMap.

Contributor:

Then we should document it in CaseInsensitiveMap. Data source developers can't access SessionState.

Contributor:

I agree with @cloud-fan; I'd rather use newHadoopConfWithOptions and not add a new method here.

Contributor:

The thing that worries me most is the inconsistency between the case-insensitive map and the original map. I think we should either fail or keep the latter entry if a -> 1 and A -> 2 appear together.

One thing we can simplify: CaseInsensitiveStringMap is only read by data sources, so it can be read-only. That makes it easier to resolve conflicting entries up front, at construction time (see the sketch below).
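A minimal sketch of that idea (simplified, not the merged implementation): resolve conflicting casings once in the constructor, so the two internal maps cannot drift apart afterwards.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ReadOnlyCaseInsensitiveMapSketch {
  private final Map<String, String> original;
  private final Map<String, String> delegate;

  ReadOnlyCaseInsensitiveMapSketch(Map<String, String> originalMap) {
    original = new HashMap<>(originalMap);
    delegate = new HashMap<>(originalMap.size());
    for (Map.Entry<String, String> entry : originalMap.entrySet()) {
      // Keys that differ only in casing collapse to one entry here; the real code
      // could also choose to fail fast, as suggested above.
      delegate.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
    }
  }

  String get(String key) {
    return delegate.get(key.toLowerCase(Locale.ROOT));
  }

  Map<String, String> asCaseSensitiveMap() {
    return original; // the review below suggests returning an unmodifiable view here
  }
}
```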

Member Author:

But the put/remove/clear methods still need to be implemented... Do you mean that we can ignore the original map in those methods?

Contributor:

We can just throw an exception in these methods and say this map is read-only (see the sketch below).
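A minimal, self-contained sketch of that behaviour (the real CaseInsensitiveStringMap implements Map<String, String> directly and would also guard the remaining mutators such as putAll; extending HashMap here only keeps the example short):

```java
import java.util.HashMap;
import java.util.Map;

class ReadOnlyMapSketch extends HashMap<String, String> {
  ReadOnlyMapSketch(Map<String, String> entries) {
    for (Map.Entry<String, String> e : entries.entrySet()) {
      super.put(e.getKey(), e.getValue()); // populate once, bypassing the read-only put below
    }
  }

  @Override
  public String put(String key, String value) {
    throw new UnsupportedOperationException("This map is read-only.");
  }

  @Override
  public String remove(Object key) {
    throw new UnsupportedOperationException("This map is read-only.");
  }

  @Override
  public void clear() {
    throw new UnsupportedOperationException("This map is read-only.");
  }
}
```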

Member Author:

@cloud-fan This is a good solution!
@rdblue What do you think?

Contributor:

That works for me. It is always a good idea to avoid passing mutable state to plugin code.

SparkQA commented Mar 15, 2019

Test build #103533 has finished for PR 24094 at commit 33a15fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```java
  /**
   * Returns the original case-sensitive map.
   */
  public Map<String, String> getOriginalMap() {
```
Contributor:

What about asCaseSensitiveMap? That is clearer than using "original", which doesn't indicate why someone would call this method.

cloud-fan (Contributor):

retest this please

```java
    putAll(originalMap);
    original = new HashMap<>(originalMap);
    delegate = new HashMap<>(originalMap.size());
    for (Map.Entry<? extends String, ? extends String> entry : originalMap.entrySet()) {
```
Contributor:

is the ? extends String required? Can we just use String?

SparkQA commented Mar 18, 2019

Test build #103605 has finished for PR 24094 at commit f0f59e3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

LGTM

```java
   * Returns the original case-sensitive map.
   */
  public Map<String, String> asCaseSensitiveMap() {
    return original;
```
Contributor:

This should return a read-only version of original. You can use Collections.unmodifiableMap.
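A tiny standalone demonstration of the suggestion (the map contents are placeholders):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class UnmodifiableViewDemo {
  public static void main(String[] args) {
    Map<String, String> original = new HashMap<>();
    original.put("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    Map<String, String> view = Collections.unmodifiableMap(original);
    System.out.println(view.get("mapreduce.input.pathFilter.class")); // reads still work
    try {
      view.put("k", "v"); // writes do not
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: the returned map is read-only");
    }
  }
}
```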

rdblue (Contributor) commented Mar 18, 2019

Just one more problem: the case sensitive map should not allow modifications. Once that's fixed and tests are passing, +1.

gengliangwang changed the title from "[SPARK-27162][SQL] Add new method getOriginalMap in CaseInsensitiveStringMap" to "[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap" on Mar 18, 2019
SparkQA commented Mar 18, 2019

Test build #103616 has finished for PR 24094 at commit d252211.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 18, 2019

Test build #103626 has finished for PR 24094 at commit 08ab550.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor) commented Mar 18, 2019

+1 when tests are passing.

SparkQA commented Mar 18, 2019

Test build #103631 has finished for PR 24094 at commit 2245766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 19, 2019

Test build #103634 has finished for PR 24094 at commit 28e05f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in 28d35c8 on Mar 19, 2019
mccheah pushed a commit to palantir/spark that referenced this pull request on May 15, 2019:
[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap


Closes apache#24094 from gengliangwang/originalMap.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>