-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location #29881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ping @wangyum |
|
Test build #129140 has finished for PR 29881 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test status success |
|
Test build #129141 has finished for PR 29881 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @AngersZhuuuu . Is this PR limited to HDFS? What about S3 storage?
I will test for |
|
Kubernetes integration test starting |
|
Test build #129224 has finished for PR 29881 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
| // Convert to files and expand any directories. | ||
| val jars = | ||
| hiveMetastoreJars | ||
| .split(";") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we change this from File.pathSeparator to ;?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we change this from
File.pathSeparatorto;?
Since in unix system, File.pathSeparator is : will split hdfs path hdfs://xxx/xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path separator is OS dependent.. lets don't change the behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path separator is OS dependent.. lets don't change the behaviour
But a lot place we use comma separator to split file path, right? Also I am confused that why we must need to use this Path separator here since what we config is jar file list or *.
or you mean use will pass config like hadoop classpath, but seem no hive classpath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @dongjoon-hyun @wangyum @cloud-fan WDYT about this problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, yes using comma might be a good way but now you're breaking the user applications that use :. Let's don't break.
I think path separator is taken after the regular classpath behaviour when you run jars like java -cp.
| hiveMetastoreJars | ||
| .split(";") | ||
| .flatMap { | ||
| case path if path.contains("\\") => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not follow this. Do you want to check if this is a Windows path? If that's the case, you can use Utils.isWindows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\\ can be in file names in other OSs. I don't think it makes sense to assume that this is a local path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\\can be in file names in other OSs. I don't think it makes sense to assume that this is a local path.
All right, I will raise a pr to change this in SparkContext's addJar/addFile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not follow this. Do you want to check if this is a Windows path? If that's the case, you can use
Utils.isWindows.
Here need use condition Utils.isWindows && path.contains("\\") since path maybe remote path
|
Test build #129230 has finished for PR 29881 at commit
|
| val uri = new Path(path).toUri | ||
| uri.getScheme match { | ||
| case null | "file" => | ||
| addLocalHiveJars(new File(uri.getPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming null as a local path is kind of ambiguous (local paths vs fs.default.name). Maybe you'll have to document that it should always be fully qualified URL to indicate other file systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming
nullas a local path is kind of ambiguous (local paths vsfs.default.name). Maybe you'll have to document that it should always be fully qualified URL to indicate other file systems.
Add comment in doc of conf spark.sql.hive.metastore.jars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this comment on the option "path" instead of "classpath"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this comment on the option "path" instead of "classpath"?
Can't got your point, can you make it more clear?
| case null | "file" => | ||
| addLocalHiveJars(new File(uri.getPath)) | ||
| case "local" => | ||
| new File("file:" + uri.getPath).toURL :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think you can treat local as same as file since we're not downloading or fetching it (or distribute)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think you can treat
localas same asfilesince we're not downloading or fetching it (or distribute)
Yae
| s" 2. hdfs://nameservice/path/to/jar/xxx.jar" + | ||
| s" 3. /path/to/jar/ (path without URI scheme follow conf `fs.defaultFS`'s URI schema)" + | ||
| s" 4. [http/https/ftp]://path/to/jar/xxx.jar" + | ||
| s"For URI, we can't support path wildcard, but for other URL support nested path wildcard," + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be more specific? e.g. http/https/ftp doesn't support wildcard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be more specific? e.g.
http/https/ftpdoesn't support wildcard.
Done
| | Use Hive jars of specified version downloaded from Maven repositories. | ||
| | 3. A classpath in the standard format for both Hive and Hadoop. | ||
| | 3. "path" | ||
| | Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a period at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a period at the end?
Done
|
Test build #129933 has finished for PR 29881 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #129936 has finished for PR 29881 at commit
|
|
ping @cloud-fan |
| | 3. "path" | ||
| | Use Hive jars configured by `spark.sql.hive.metastore.jars.path` | ||
| | in comma separated format. Support both local or remote paths, it should always | ||
| | be fully qualified URL to indicate other file systems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should always be fully qualified URL ... it's not true now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should always be fully qualified URL ...it's not true now.
Remove this now
| s" 2. hdfs://nameservice/path/to/jar/xxx.jar" + | ||
| s" 3. /path/to/jar/ (path without URI scheme follow conf `fs.defaultFS`'s URI schema)" + | ||
| s" 4. [http/https/ftp]://path/to/jar/xxx.jar" + | ||
| s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs support" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but for other ... -> but other ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but for other ...->but other ...
Done
| val HIVE_METASTORE_JARS_PATH = buildStaticConf("spark.sql.hive.metastore.jars.path") | ||
| .doc(s"Comma separated fully qualified URL of Hive jars, support both local and remote paths," + | ||
| s"Such as: " + | ||
| s" 1. file://path/to/jar/xxx.jar" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add \n at the end for each item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add
\nat the end for each item?
Done
| s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs support" + | ||
| s" nested path wildcard, Such as: " + | ||
| s" 1. file://path/to/jar/*, file://path/to/jar/*/*" + | ||
| s" 2. hdfs://nameservice/path/to/jar/*, hdfs://nameservice/path/to/jar/*/*" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, \n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto,
\n
Done
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130141 has finished for PR 29881 at commit
|
|
thanks, merging to master! |
…ql.hive.metastore.jars ### What changes were proposed in this pull request? This is a follow-up for apache/spark#29881. It revises the documentation of the configuration `spark.sql.hive.metastore.jars`. ### Why are the changes needed? Fix grammatical error in the doc. Also, make it more clear that the configuration is effective only when `spark.sql.hive.metastore.jars` is set as `path` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc changes. Closes #30407 from gengliangwang/reviseJarPathDoc. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
…istence when config hive jars location ### What changes were proposed in this pull request? Add notice about keep hive version consistence when config hive jars location With PR #29881, if we don't keep hive version consistence. we will got below error. ``` Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8. ```  ### Why are the changes needed? Make config doc detail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31317 from AngersZhuuuu/SPARK-32852-followup. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…istence when config hive jars location ### What changes were proposed in this pull request? Add notice about keep hive version consistence when config hive jars location With PR #29881, if we don't keep hive version consistence. we will got below error. ``` Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8. ```  ### Why are the changes needed? Make config doc detail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31317 from AngersZhuuuu/SPARK-32852-followup. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 7bd4165) Signed-off-by: HyukjinKwon <[email protected]>
…istence when config hive jars location ### What changes were proposed in this pull request? Add notice about keep hive version consistence when config hive jars location With PR apache#29881, if we don't keep hive version consistence. we will got below error. ``` Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8. ```  ### Why are the changes needed? Make config doc detail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes apache#31317 from AngersZhuuuu/SPARK-32852-followup. Authored-by: Angerszhuuuu <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
Support
spark.sql.hive.metastore.jarsuse HDFS location.When user need to use path to set hive metastore jars, you should set
spark.sql.hive.metasstore.jars=pathand set real path inspark.sql.hive.metastore.jars.pathsince we use
File.pathSeperatorto split path, butFIle.pathSeparatoris:in unix, it will split hdfs locationhdfs://nameservice/xx. So add new configspark.sql.hive.metastore.jars.pathto set comma separated paths.To keep both two way supported
Why are the changes needed?
All spark app can fetch internal version hive jars in HDFS location, not need distribute to all node.
Does this PR introduce any user-facing change?
User can use HDFS location to store hive metastore jars
How was this patch tested?
Manuel tested.