Skip to content

Conversation

@AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Sep 27, 2020

What changes were proposed in this pull request?

Support spark.sql.hive.metastore.jars use HDFS location.

When user need to use path to set hive metastore jars, you should set
spark.sql.hive.metasstore.jars=path and set real path in spark.sql.hive.metastore.jars.path
since we use File.pathSeperator to split path, but FIle.pathSeparator is : in unix, it will split hdfs location hdfs://nameservice/xx. So add new config spark.sql.hive.metastore.jars.path to set comma separated paths.
To keep both two way supported

Why are the changes needed?

All spark app can fetch internal version hive jars in HDFS location, not need distribute to all node.

Does this PR introduce any user-facing change?

User can use HDFS location to store hive metastore jars

How was this patch tested?

Manuel tested.

@AngersZhuuuu
Copy link
Contributor Author

ping @wangyum

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Test build #129140 has finished for PR 29881 at commit 834c2a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33756/

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33757/

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33756/

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33757/

@SparkQA
Copy link

SparkQA commented Sep 27, 2020

Test build #129141 has finished for PR 29881 at commit bf3c4b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-32852][SQL} spark.sql.hive.metastore.jars support HDFS location [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location Sep 27, 2020
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @AngersZhuuuu . Is this PR limited to HDFS? What about S3 storage?

@AngersZhuuuu
Copy link
Contributor Author

Hi, @AngersZhuuuu . Is this PR limited to HDFS? What about S3 storage?

I will test for s3.

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33839/

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Test build #129224 has finished for PR 29881 at commit 1063a08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33840/

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33839/

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33840/

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33846/

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33846/

// Convert to files and expand any directories.
val jars =
hiveMetastoreJars
.split(";")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we change this from File.pathSeparator to ;?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we change this from File.pathSeparator to ;?

Since in unix system, File.pathSeparator is : will split hdfs path hdfs://xxx/xxx

Copy link
Member

@HyukjinKwon HyukjinKwon Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path separator is OS dependent.. lets don't change the behaviour

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path separator is OS dependent.. lets don't change the behaviour

But a lot place we use comma separator to split file path, right? Also I am confused that why we must need to use this Path separator here since what we config is jar file list or *.

or you mean use will pass config like hadoop classpath, but seem no hive classpath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @dongjoon-hyun @wangyum @cloud-fan WDYT about this problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, yes using comma might be a good way but now you're breaking the user applications that use :. Let's don't break.
I think path separator is taken after the regular classpath behaviour when you run jars like java -cp.

hiveMetastoreJars
.split(";")
.flatMap {
case path if path.contains("\\") =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not follow this. Do you want to check if this is a Windows path? If that's the case, you can use Utils.isWindows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\\ can be in file names in other OSs. I don't think it makes sense to assume that this is a local path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\\ can be in file names in other OSs. I don't think it makes sense to assume that this is a local path.

All right, I will raise a pr to change this in SparkContext's addJar/addFile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not follow this. Do you want to check if this is a Windows path? If that's the case, you can use Utils.isWindows.

Here need use condition Utils.isWindows && path.contains("\\") since path maybe remote path

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Test build #129230 has finished for PR 29881 at commit 6214ebf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val uri = new Path(path).toUri
uri.getScheme match {
case null | "file" =>
addLocalHiveJars(new File(uri.getPath))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming null as a local path is kind of ambiguous (local paths vs fs.default.name). Maybe you'll have to document that it should always be fully qualified URL to indicate other file systems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming null as a local path is kind of ambiguous (local paths vs fs.default.name). Maybe you'll have to document that it should always be fully qualified URL to indicate other file systems.

Add comment in doc of conf spark.sql.hive.metastore.jars

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this comment on the option "path" instead of "classpath"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this comment on the option "path" instead of "classpath"?

Can't got your point, can you make it more clear?

case null | "file" =>
addLocalHiveJars(new File(uri.getPath))
case "local" =>
new File("file:" + uri.getPath).toURL :: Nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think you can treat local as same as file since we're not downloading or fetching it (or distribute)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think you can treat local as same as file since we're not downloading or fetching it (or distribute)

Yae

s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
s" 3. /path/to/jar/ (path without URI scheme follow conf `fs.defaultFS`'s URI schema)" +
s" 4. [http/https/ftp]://path/to/jar/xxx.jar" +
s"For URI, we can't support path wildcard, but for other URL support nested path wildcard," +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more specific? e.g. http/https/ftp doesn't support wildcard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more specific? e.g. http/https/ftp doesn't support wildcard.

Done

| Use Hive jars of specified version downloaded from Maven repositories.
| 3. A classpath in the standard format for both Hive and Hadoop.
| 3. "path"
| Use Hive jars configured by `spark.sql.hive.metastore.jars.path` in comma separated format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a period at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a period at the end?

Done

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Test build #129933 has finished for PR 29881 at commit d631a75.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34538/

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34538/

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34541/

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34541/

@SparkQA
Copy link

SparkQA commented Oct 17, 2020

Test build #129936 has finished for PR 29881 at commit 6e67c7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

ping @cloud-fan

| 3. "path"
| Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
| in comma separated format. Support both local or remote paths, it should always
| be fully qualified URL to indicate other file systems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should always be fully qualified URL ... it's not true now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should always be fully qualified URL ... it's not true now.

Remove this now

s" 2. hdfs://nameservice/path/to/jar/xxx.jar" +
s" 3. /path/to/jar/ (path without URI scheme follow conf `fs.defaultFS`'s URI schema)" +
s" 4. [http/https/ftp]://path/to/jar/xxx.jar" +
s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs support" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but for other ... -> but other ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but for other ... -> but other ...

Done

val HIVE_METASTORE_JARS_PATH = buildStaticConf("spark.sql.hive.metastore.jars.path")
.doc(s"Comma separated fully qualified URL of Hive jars, support both local and remote paths," +
s"Such as: " +
s" 1. file://path/to/jar/xxx.jar" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add \n at the end for each item?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add \n at the end for each item?

Done

s"Notice: `http/https/ftp` doesn't support wildcard, but for other URLs support" +
s" nested path wildcard, Such as: " +
s" 1. file://path/to/jar/*, file://path/to/jar/*/*" +
s" 2. hdfs://nameservice/path/to/jar/*, hdfs://nameservice/path/to/jar/*/*" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, \n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, \n

Done

@SparkQA
Copy link

SparkQA commented Oct 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34748/

@SparkQA
Copy link

SparkQA commented Oct 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34748/

@SparkQA
Copy link

SparkQA commented Oct 22, 2020

Test build #130141 has finished for PR 29881 at commit ea9ef2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a1629b4 Oct 22, 2020
a0x8o added a commit to a0x8o/spark that referenced this pull request Nov 18, 2020
…ql.hive.metastore.jars

### What changes were proposed in this pull request?

This is a follow-up for apache/spark#29881.
It revises the documentation of the configuration `spark.sql.hive.metastore.jars`.

### Why are the changes needed?

Fix grammatical error in the doc.
Also, make it more clear that the configuration is effective only when `spark.sql.hive.metastore.jars` is set as `path`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just doc changes.

Closes #30407 from gengliangwang/reviseJarPathDoc.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 26, 2021
…istence when config hive jars location

### What changes were proposed in this pull request?
Add notice about keep hive version consistence when config hive jars location

With PR #29881, if we don't keep hive version consistence. we will got below error.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```

![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)

### Why are the changes needed?
Make config doc detail

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #31317 from AngersZhuuuu/SPARK-32852-followup.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 26, 2021
…istence when config hive jars location

### What changes were proposed in this pull request?
Add notice about keep hive version consistence when config hive jars location

With PR #29881, if we don't keep hive version consistence. we will got below error.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```

![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)

### Why are the changes needed?
Make config doc detail

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #31317 from AngersZhuuuu/SPARK-32852-followup.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 7bd4165)
Signed-off-by: HyukjinKwon <[email protected]>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…istence when config hive jars location

### What changes were proposed in this pull request?
Add notice about keep hive version consistence when config hive jars location

With PR apache#29881, if we don't keep hive version consistence. we will got below error.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```

![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)

### Why are the changes needed?
Make config doc detail

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes apache#31317 from AngersZhuuuu/SPARK-32852-followup.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants