[SPARK-17577][SparkR][Core] SparkR support add files to Spark job and get by executors #15131
Changes from all commits: 5c49428, 65be499, 542b981, fa82b3a, acfbd8a, 9ed3c68
```diff
@@ -225,6 +225,54 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName))))
 }
 
+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+#' use spark.getSparkFiles(fileName) to find its download location.
+#'
+#' @rdname spark.addFile
+#' @param path The path of the file to be added
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.addFile("~/myfile")
+#'}
+#' @note spark.addFile since 2.1.0
+spark.addFile <- function(path) {
+  sc <- getSparkContext()
+  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path))))
+}
+
+#' Get the root directory that contains files added through spark.addFile.
+#'
+#' @rdname spark.getSparkFilesRootDirectory
+#' @return the root directory that contains files added through spark.addFile
+#' @export
+#' @examples
```

Member: add `@export`

```diff
+#'\dontrun{
+#' spark.getSparkFilesRootDirectory()
+#'}
+#' @note spark.getSparkFilesRootDirectory since 2.1.0
+spark.getSparkFilesRootDirectory <- function() {
+  callJStatic("org.apache.spark.SparkFiles", "getRootDirectory")
+}
```
```diff
+
+#' Get the absolute path of a file added through spark.addFile.
+#'
+#' @rdname spark.getSparkFiles
+#' @param fileName The name of the file added through spark.addFile
+#' @return the absolute path of a file added through spark.addFile.
+#' @export
+#' @examples
```

Member: add `@export`

```diff
+#'\dontrun{
+#' spark.getSparkFiles("myfile")
+#'}
+#' @note spark.getSparkFiles since 2.1.0
+spark.getSparkFiles <- function(fileName) {
+  callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName))
+}
```
```diff
+
 #' Run a function over a list of elements, distributing the computations with Spark
 #'
 #' Run a function over a list of elements, distributing the computations with Spark. Applies a
```
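Taken together, the three new SparkR functions can be used as follows. This is a minimal sketch, not part of the patch: it assumes an active SparkR session, and the file name `test.txt` and its contents are illustrative only.

```r
# Sketch only: assumes sparkR.session() has already been initialized.
# /tmp-style temp file and its contents are hypothetical examples.
path <- file.path(tempdir(), "test.txt")
writeLines("hello spark", path)

# Ship the file to every node participating in this Spark job.
spark.addFile(path)

# Root directory under which all added files are downloaded.
spark.getSparkFilesRootDirectory()

# Absolute path of the downloaded copy. This is readable on the driver as
# well, since addFile also fetches the file locally for driver-side closures.
readLines(spark.getSparkFiles("test.txt"))
```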
```diff
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * supported for Hadoop-supported filesystems.
    */
   def addFile(path: String, recursive: Boolean): Unit = {
-    val uri = new URI(path)
+    val uri = new Path(path).toUri
```
Member: should there be some tests we can add for this change?

Member: I do understand your concern @felixcheung. However, IMHO, it'd be okay not to test the Hadoop library itself within Spark. I will try to find some tests/documentation related to Windows paths in Hadoop and will share them to make sure. FWIW, this case was verified before by one of the committers for Windows paths, so it should be okay.

Member: Also, we could alternatively use

Member: I just checked the documentation and tests. It seems Windows paths are being tested [1], although I could not find an explicit mention of Windows paths in the documentation. [1] https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestPath.java

Contributor (Author): I agree with @HyukjinKwon. Thanks!

Member: ok thanks!
```diff
     val schemeCorrectedPath = uri.getScheme match {
       case null | "local" => new File(path).getCanonicalFile.toURI.toString
       case _ => path
```
Member: Could I please ask to change this to `new Path(path).toUri`? It handles both Windows and POSIX style paths, producing a valid URI in each case:

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

scala> val a = new Path("C:\\a\\b\\c").toUri
a: java.net.URI = /C:/a/b/c
scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("C:/a/b/c").toUri
a: java.net.URI = /C:/a/b/c
scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("/a/b/c").toUri
a: java.net.URI = /a/b/c
scala> val b = new URI(a.toString)
b: java.net.URI = /a/b/c

scala> val a = new Path("file:///a/b/c").toUri
a: java.net.URI = file:///a/b/c
scala> val b = new URI(a.toString)
b: java.net.URI = file:///a/b/c

scala> val a = new Path("http://localhost/a/b/c").toUri
a: java.net.URI = http://localhost/a/b/c
scala> val b = new URI(a.toString)
b: java.net.URI = http://localhost/a/b/c
```

Member: Just in case, I ran the tests after manually fixing this. Maybe we can wait for the result - https://ci.appveyor.com/project/HyukjinKwon/spark/build/108-pr-15131-path

Member: @yanboliang Yeap, it passes the tests at least.

Contributor (Author): Updated, thanks!
```diff
@@ -1457,8 +1457,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
-      hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
```
Member: should there be some tests we can add for this change?

Member: In my personal opinion, I thought it was okay (I thought about this for a while) because
```diff
+      env.securityManager, hadoopConfiguration, timestamp, useCache = false)
     postEnvironmentUpdate()
   }
 }
```