Conversation

gengliangwang (Member) commented Mar 14, 2019

What changes were proposed in this pull request?

Currently, DataFrameReader/DataFrameWriter supports setting Hadoop configurations via the .option() method.
For example, the following test case should pass with both ORC V1 and V2:

```scala
  import org.apache.hadoop.fs.{Path, PathFilter}

  class TestFileFilter extends PathFilter {
    override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
  }

  withTempPath { dir =>
    val path = dir.getCanonicalPath

    val df = spark.range(2)
    df.write.orc(path + "/p=1")
    df.write.orc(path + "/p=2")
    val extraOptions = Map(
      "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
      "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
    )
    assert(spark.read.options(extraOptions).orc(path).count() === 2)
  }
```

However, Hadoop configurations are case sensitive, while the current data source V2 API uses CaseInsensitiveStringMap in the top-level entry point TableProvider.
To create Hadoop configurations correctly, I suggest:

  1. adding a new method asCaseSensitiveMap in CaseInsensitiveStringMap;
  2. making CaseInsensitiveStringMap read-only, to avoid ambiguous conversion in asCaseSensitiveMap (a usage sketch follows this list).
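A minimal sketch of how the case-sensitive view could be consumed (the helper class and method name here are made up for illustration; this is not the code in this PR):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, for illustration only.
class HadoopConfFromOptions {
  static Configuration withOptions(Configuration base, Map<String, String> caseSensitiveOptions) {
    Configuration conf = new Configuration(base);
    // Hadoop keys are case sensitive, so the entries must keep their original casing.
    caseSensitiveOptions.forEach(conf::set);
    return conf;
  }
}
```

With the method proposed here, the second argument would come from options.asCaseSensitiveMap().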

How was this patch tested?

Unit test

Member Author:

Things become a bit ambiguous here. E.g. after put("a", "b") and put("A", "B") on an empty CaseInsensitiveStringMap, originalMap has two new entries, while delegate has only one (illustrated below).
To me, this is the simplest way to maintain originalMap. Suggestions are welcome.
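For illustration only (plain HashMaps standing in for the two internal fields, not the actual class), this is the divergence being described:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PutAmbiguityDemo {
  public static void main(String[] args) {
    Map<String, String> originalMap = new HashMap<>(); // keeps the caller's casing
    Map<String, String> delegate = new HashMap<>();    // lower-cased keys for case-insensitive lookup

    for (String[] e : new String[][] {{"a", "b"}, {"A", "B"}}) {
      originalMap.put(e[0], e[1]);
      delegate.put(e[0].toLowerCase(Locale.ROOT), e[1]);
    }

    System.out.println(originalMap.size()); // 2 -- both casings kept
    System.out.println(delegate.size());    // 1 -- only "a" -> "B"; the two views now disagree
  }
}
```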

Member Author:

Things become a bit ambiguous here too.

gengliangwang (Member Author):

@rdblue @cloud-fan

rdblue (Contributor) commented Mar 14, 2019

@gengliangwang, is it required that Hadoop configuration options are passed through this map or is there another way?

gengliangwang (Member Author) commented Mar 14, 2019

Yes, I think so.
E.g.
DataFrameReader.load() => TableProvider.getTable() => ...

SparkQA commented Mar 14, 2019

Test build #103506 has finished for PR 24094 at commit 8271d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

AFAIK hadoop conf can be set in 3 ways:

  1. global level, via SparkContext.hadoopConfiguration
  2. session level, via SparkSession.conf
  3. operation level, via DataFrameReader/Writer.option

Levels 1 and 2 are fine, as they are case sensitive. The problem is level 3, as data source V2 treats options as case-insensitive (see the sketch below).
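A rough Java illustration of the three levels (the config key's value "com.example.MyFilter" is a placeholder; the comments paraphrase the levels listed above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class HadoopConfLevels {
  static Dataset<Row> read(SparkSession spark, String path) {
    // 1. Global level: the SparkContext's Hadoop Configuration, which is case sensitive.
    spark.sparkContext().hadoopConfiguration()
        .set("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    // 2. Session level (per the list above): entries on the session conf feed the
    //    per-session Hadoop conf, which is also case sensitive.
    spark.conf().set("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    // 3. Operation level: a per-read option. In data source V2 this reaches the source
    //    through CaseInsensitiveStringMap -- the problematic path discussed here.
    return spark.read()
        .option("mapreduce.input.pathFilter.class", "com.example.MyFilter")
        .orc(path);
  }
}
```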

There are 2 solutions I can think of

  1. Do not support operation level hadoop conf for data source v2.
  2. Keep the original case sensitive map.

I think 2 is more reasonable, which is what this PR is trying to do.

Contributor:

This should be new HashMap<>(originalMap.size()); otherwise we add the data to it twice.

Contributor:

I don't think we should pollute SessionState with the case insensitive map stuff. Can we inline this method?

Member Author:

Otherwise, developers might not be aware that they should use .getOriginalMap when creating a Hadoop configuration from a CaseInsensitiveStringMap.

Contributor:

Then we should document it in CaseInsensitiveMap. Data source developers can't access SessionState.

Contributor:

I agree with @cloud-fan; I'd rather use newHadoopConfWithOptions and not add a new method here.

Contributor:

The thing that worries me most is the inconsistency between the case-insensitive map and the original map. I think we should either fail or keep the latter entry if a -> 1 and A -> 2 appear together.

One thing we can simplify: CaseInsensitiveStringMap is only read by data sources, so it can be read-only. That makes it easier to resolve conflicting entries up front, at construction time (see the sketch below).
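A minimal sketch of that idea (simplified, not the merged implementation): resolve conflicting casings once in the constructor, so the two internal maps cannot drift apart afterwards.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ReadOnlyCaseInsensitiveMapSketch {
  private final Map<String, String> original;
  private final Map<String, String> delegate;

  ReadOnlyCaseInsensitiveMapSketch(Map<String, String> originalMap) {
    original = new HashMap<>(originalMap);
    delegate = new HashMap<>(originalMap.size());
    for (Map.Entry<String, String> entry : originalMap.entrySet()) {
      // Keys that differ only in casing collapse to one entry here; the real code
      // could also choose to fail fast, as suggested above.
      delegate.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
    }
  }

  String get(String key) {
    return delegate.get(key.toLowerCase(Locale.ROOT));
  }

  Map<String, String> asCaseSensitiveMap() {
    return original; // the review below suggests returning an unmodifiable view here
  }
}
```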

Member Author:

But the put/remove/clear methods still need to be implemented... Do you mean that we can ignore the original map in those methods?

Contributor:

We can just throw an exception in these methods and say this map is read-only (see the sketch below).
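A minimal, self-contained sketch of that behaviour (the real CaseInsensitiveStringMap implements Map<String, String> directly and would also guard the remaining mutators such as putAll; extending HashMap here only keeps the example short):

```java
import java.util.HashMap;
import java.util.Map;

class ReadOnlyMapSketch extends HashMap<String, String> {
  ReadOnlyMapSketch(Map<String, String> entries) {
    for (Map.Entry<String, String> e : entries.entrySet()) {
      super.put(e.getKey(), e.getValue()); // populate once, bypassing the read-only put below
    }
  }

  @Override
  public String put(String key, String value) {
    throw new UnsupportedOperationException("This map is read-only.");
  }

  @Override
  public String remove(Object key) {
    throw new UnsupportedOperationException("This map is read-only.");
  }

  @Override
  public void clear() {
    throw new UnsupportedOperationException("This map is read-only.");
  }
}
```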

Member Author:

@cloud-fan This is a good solution!
@rdblue What do you think?

Contributor:

That works for me. It is always a good idea to avoid passing mutable state to plugin code.

SparkQA commented Mar 15, 2019

Test build #103533 has finished for PR 24094 at commit 33a15fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```java
  /**
   * Returns the original case-sensitive map.
   */
  public Map<String, String> getOriginalMap() {
```
Contributor:

What about asCaseSensitiveMap? That is clearer than using "original", which doesn't indicate why someone would call this method.

cloud-fan (Contributor):

retest this please

```java
    putAll(originalMap);
    original = new HashMap<>(originalMap);
    delegate = new HashMap<>(originalMap.size());
    for (Map.Entry<? extends String, ? extends String> entry : originalMap.entrySet()) {
```
Contributor:

is the ? extends String required? Can we just use String?

SparkQA commented Mar 18, 2019

Test build #103605 has finished for PR 24094 at commit f0f59e3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

LGTM

```java
   * Returns the original case-sensitive map.
   */
  public Map<String, String> asCaseSensitiveMap() {
    return original;
```
Contributor:

This should return a read-only version of original. You can use Collections.unmodifiableMap.
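A tiny standalone demonstration of the suggestion (the map contents are placeholders):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class UnmodifiableViewDemo {
  public static void main(String[] args) {
    Map<String, String> original = new HashMap<>();
    original.put("mapreduce.input.pathFilter.class", "com.example.MyFilter");

    Map<String, String> view = Collections.unmodifiableMap(original);
    System.out.println(view.get("mapreduce.input.pathFilter.class")); // reads still work
    try {
      view.put("k", "v"); // writes do not
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: the returned map is read-only");
    }
  }
}
```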

rdblue (Contributor) commented Mar 18, 2019

Just one more problem: the case sensitive map should not allow modifications. Once that's fixed and tests are passing, +1.

gengliangwang changed the title from "[SPARK-27162][SQL] Add new method getOriginalMap in CaseInsensitiveStringMap" to "[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap" on Mar 18, 2019
SparkQA commented Mar 18, 2019

Test build #103616 has finished for PR 24094 at commit d252211.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 18, 2019

Test build #103626 has finished for PR 24094 at commit 08ab550.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue (Contributor) commented Mar 18, 2019

+1 when tests are passing.

SparkQA commented Mar 18, 2019

Test build #103631 has finished for PR 24094 at commit 2245766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 19, 2019

Test build #103634 has finished for PR 24094 at commit 28e05f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in 28d35c8 on Mar 19, 2019
mccheah pushed a commit to palantir/spark that referenced this pull request on May 15, 2019:
[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap


Closes apache#24094 from gengliangwang/originalMap.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>