-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap #24094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things become a bit ambiguous here. E.g. after put("a", "b") and put("A", "B") to a empty CaseInsensitiveStringMap, originalMap has two new entries, while delegate has only one new entry.
To me, this is the simplest way to maintain originalMap. Suggestions are welcomed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things become a bit ambiguous here too.
|
@gengliangwang, is it required that Hadoop configuration options are passed through this map or is there another way? |
|
Yes, I think so. |
|
Test build #103506 has finished for PR 24094 at commit
|
|
AFAIK hadoop conf can be set in 3 ways:
1 and 2 are fine, as they are case sensitive. The problem is 3, as data source v2 treats options as case-insensitive. There are 2 solutions I can think of
I think 2 is more reasonable, which is this PR trying to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be new HashMap<>(originalMap.size);, otherwise we add data to it twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should pollute SessionState with the case insensitive map stuff. Can we inline this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, developers might not be aware of using .getOriginalMap if they want to create Hadoop configuration from CaseInsensitiveStringMap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we should document it in CaseInsensitiveMap. data source developers can't access SessionState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @cloud-fan, I'd rather use newHadoopConfWithOptions and not add a new method here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should pollute SessionState with the case insensitive map stuff. Can we inline this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing worries me most is the inconsistency between the case insensitive map and the original map. I think we should either fail or keep the latter entry if a -> 1, A -> 2 appears together.
One thing we can simplify is, CaseInsensitiveStringMap is read by data source and can be read-only. Then it can be easier to resolve conflicting entries at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the method put/remove/clear still need to be implemented...Do you mean that we can ignore original map in these method ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just throw exception in these methods, and say this map is readonly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan This is a good solution!
@rdblue What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me. It is always a good idea to avoid passing mutable state to plugin code.
8271d74 to
33a15fd
Compare
|
Test build #103533 has finished for PR 24094 at commit
|
| /** | ||
| * Returns the original case-sensitive map. | ||
| */ | ||
| public Map<String, String> getOriginalMap() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about asCaseSensitiveMap? That is more clear that using "original" because that doesn't indicate why someone would call this method.
|
retest this please |
| putAll(originalMap); | ||
| original = new HashMap<>(originalMap); | ||
| delegate = new HashMap<>(originalMap.size()); | ||
| for (Map.Entry<? extends String, ? extends String> entry : originalMap.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the ? extends String required? Can we just use String?
sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java
Outdated
Show resolved
Hide resolved
|
Test build #103605 has finished for PR 24094 at commit
|
|
LGTM |
| * Returns the original case-sensitive map. | ||
| */ | ||
| public Map<String, String> asCaseSensitiveMap() { | ||
| return original; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return a read-only version of original. You can use Collections.unmodifiableMap.
|
Just one more problem: the case sensitive map should not allow modifications. Once that's fixed and tests are passing, +1. |
sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java
Outdated
Show resolved
Hide resolved
|
Test build #103616 has finished for PR 24094 at commit
|
|
Test build #103626 has finished for PR 24094 at commit
|
|
+1 when tests are passing. |
|
Test build #103631 has finished for PR 24094 at commit
|
|
Test build #103634 has finished for PR 24094 at commit
|
|
thanks, merging to master! |
…veStringMap
Currently, DataFrameReader/DataFrameReader supports setting Hadoop configurations via method `.option()`.
E.g, the following test case should be passed in both ORC V1 and V2
```
class TestFileFilter extends PathFilter {
override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}
withTempPath { dir =>
val path = dir.getCanonicalPath
val df = spark.range(2)
df.write.orc(path + "/p=1")
df.write.orc(path + "/p=2")
val extraOptions = Map(
"mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
"mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)
assert(spark.read.options(extraOptions).orc(path).count() === 2)
}
}
```
While Hadoop Configurations are case sensitive, the current data source V2 APIs are using `CaseInsensitiveStringMap` in the top level entry `TableProvider`.
To create Hadoop configurations correctly, I suggest
1. adding a new method `asCaseSensitiveMap` in `CaseInsensitiveStringMap`.
2. Make `CaseInsensitiveStringMap` read-only to ambiguous conversion in `asCaseSensitiveMap`
Unit test
Closes apache#24094 from gengliangwang/originalMap.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Currently, DataFrameReader/DataFrameReader supports setting Hadoop configurations via method
.option().E.g, the following test case should be passed in both ORC V1 and V2
While Hadoop Configurations are case sensitive, the current data source V2 APIs are using
CaseInsensitiveStringMapin the top level entryTableProvider.To create Hadoop configurations correctly, I suggest
asCaseSensitiveMapinCaseInsensitiveStringMap.CaseInsensitiveStringMapread-only to ambiguous conversion inasCaseSensitiveMapHow was this patch tested?
Unit test