-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32539][INFRA] Disallow FileSystem.get(Configuration conf) in style check by default
#29357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #127089 has finished for PR 29357 at commit
|
| sparkConf: SparkConf, | ||
| hadoopConf: Configuration): Set[FileSystem] = { | ||
| // scalastyle:off FileSystemGet | ||
| val defaultFS = FileSystem.get(hadoopConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit arguable to disallow it in style. If you want to just get the default file system (when path is not available), FileSystem.get is a valid usage and a good fallback to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is arguable. The rule is motivated by a regression of another project. I think it is worth doing so if we treat it as a warning for potential mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I am fine with it.
scalastyle-config.xml
Outdated
| <customMessage><![CDATA[ | ||
| Are you sure that you want to use "FileSystem.get(Configuration conf)"? If the input | ||
| configuration is not set properly, a default FileSystem instance will be returned. Please use | ||
| "FileSystem.get(URI uri, Configuration conf)" or "Path.getFileSystem()" instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path.getFileSystem() also takes a conf parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thank you.
|
@gengliangwang can you put a real example in the PR description? e.g. |
Thanks, I have updated the description. |
xkrogen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worked on Hadoop (as a committer on HDFS) for a few years before starting my work on Spark, and I can attest that this is a common anti-pattern that can have severe implications. Even within the Hadoop codebase itself, I've had to go through and rip out old code which leveraged FileSystem.get(Configuration conf) because it breaks under some environments. Many developers don't realize it is almost always not the correct API to use.
So, strong +1 from me on this effort.
| Are you sure that you want to use "FileSystem.get(Configuration conf)"? If the input | ||
| configuration is not set properly, a default FileSystem instance will be returned. Please use | ||
| "FileSystem.get(URI uri, Configuration conf)" or "Path.getFileSystem(Configuration conf)" instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using FileSystem.get(Configuration conf) is dangerous even outside of the misconfiguration scenario you describe. It's entirely possible for Spark applications to be working with multiple FileSystem instances (say, local HDFS and a cloud blob store, or two HDFS clusters). Using FileSystem.get(Configuration conf) in this instance can also result in nasty errors.
Not sure if it's worth mentioning here, will leave that up to your judgement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xkrogen Thanks for the +1! I just update the message in style check. I think mentioning "reading a target path of non-default file system" is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
Are you sure that you want to use "FileSystem.get(Configuration conf)"? If the input
configuration is not set properly, a default FileSystem instance will be returned. It can
also lead to nasty errors when you deal with multiple file systems. Please use ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan ok, updated.
|
Seems reasonable to guard against. I know we have fixed this a few times over the years. |
|
Test build #127097 has finished for PR 29357 at commit
|
|
|
||
| val hadoopConf = spark.sessionState.newHadoopConf() | ||
| val fs = FileSystem.get(hadoopConf) | ||
| val fs = new Path(tableDir.getAbsolutePath).getFileSystem(hadoopConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this pattern at https://github.com/apache/spark/pull/29357/files#diff-2f2d1e1ffed122e56f3cb20cf7c320d6R276, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Do you mean making it
// scalastyle:off FileSystemGet
val fs = FileSystem.get(hadoopConf)
// scalastyle:on FileSystemGet
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method Path.getFileSystem(Configuration) is suggested: https://github.com/apache/spark/pull/29357/files#diff-2f2d1e1ffed122e56f3cb20cf7c320d6R273
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should not change the actual code when we just want to add a style check. But this is test and I think it's safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, @gengliangwang ~
|
Test build #127098 has finished for PR 29357 at commit
|
|
Test build #127100 has finished for PR 29357 at commit
|
|
thanks, merging to master! |
|
+1, late LGTM. |
What changes were proposed in this pull request?
Disallow
FileSystem.get(Configuration conf)in Scala style check by default and suggest developers useFileSystem.get(URI uri, Configuration conf)orPath.getFileSystem()instead.Why are the changes needed?
The method
FileSystem.get(Configuration conf)will return a default FileSystem instance if the conffs.file.implis not set. This can cause file not found exception on reading a target path of non-default file system, e.g. S3. It is hard to discover such a mistake via unit tests.If we disallow it in Scala style check by default and suggest developers use
FileSystem.get(URI uri, Configuration conf)orPath.getFileSystem(Configuration conf), we can reduce potential regression and PR review effort.Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually run scala style check and test.