[SPARK-22357][CORE] SparkContext.binaryFiles ignore minPartitions parameter #21638
@@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
   def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
     val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
-    val defaultParallelism = sc.defaultParallelism
+    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
Member
We should have a test case; otherwise, we could hit the same issue again.
Member
BTW, it is easy to add such a test case. We can even test the behavior of the boundary cases. cc @srowen @HyukjinKwon @MaxGekk @jiangxb1987
Member
I think it's hard to test, technically, because
Contributor
Author
I agree it is hard to test. I would appreciate it if anyone could give me some hints on how to do this (how to verify the behavior and where to put my test cases).
Member
Would you mind following up with a test that just asserts that asking for, say, 20 partitions results in 20 partitions? This is technically too specific as a test, but is probably fine for now.
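A test along the lines suggested here might look like the sketch below. It is not the test that was eventually added to Spark: the object name, the `partitionsFor` helper, the file contents, and the two configuration values are assumptions chosen so that the open cost and the default parallelism do not dominate the split-size calculation. It needs Spark on the classpath, and it writes one small file per requested partition on the assumption that the input format does not split individual files.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object BinaryFilesMinPartitionsCheck {
  // Hypothetical helper: writes `requested` equally sized small files and returns
  // the number of partitions binaryFiles reports when asked for `requested`.
  def partitionsFor(requested: Int): Int = {
    val dir = Files.createTempDirectory("binary-files-min-partitions")
    (0 until requested).foreach { i =>
      val file = dir.resolve(f"part-$i%05d")
      Files.write(file, "someline1\nsomeline2\nsomeline3".getBytes(StandardCharsets.UTF_8))
    }

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("binaryFiles-minPartitions-check")
      .set("spark.files.openCostInBytes", "0") // assumed: keep the open cost out of the math
      .set("spark.default.parallelism", "1")   // assumed: make minPartitions the larger value
    val sc = new SparkContext(conf)
    try {
      sc.binaryFiles(dir.toString, minPartitions = requested).getNumPartitions
    } finally {
      sc.stop() // the temporary files are left behind in this sketch
    }
  }

  def main(args: Array[String]): Unit = {
    val requested = 20
    val actual = partitionsFor(requested)
    assert(actual == requested, s"expected $requested partitions, got $actual")
  }
}
```

As noted in the comment above, asserting an exact count is more specific than the contract of `minPartitions` strictly requires, so a real test may prefer a lower bound or a few boundary values.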
Contributor
Author
From the code, you can see that the calculation is just an intermediate result and that this method does not return any value. Checking the split size does not make sense for this test case because it depends on multiple variables, and this is just one of them.
Member
It is not hard to verify whether the parameter
     val files = listStatus(context).asScala
     val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
     val bytesPerCore = totalBytes / defaultParallelism
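To make the effect of the one-line change concrete, the arithmetic in this hunk can be modeled as a pure function. This is only a sketch: `SplitSizeSketch`, `bytesPerCore`, the `honorMinPartitions` flag, and the file sizes and open cost used in `main` are hypothetical illustration values, not taken from the PR. How Spark turns `bytesPerCore` into the final split size lies outside the lines shown here.

```scala
object SplitSizeSketch {
  // Hypothetical pure model of the lines in the hunk; not Spark's actual API.
  def bytesPerCore(
      fileLengths: Seq[Long],
      openCostInBytes: Long,
      scDefaultParallelism: Int,
      minPartitions: Int,
      honorMinPartitions: Boolean): Long = {
    // Before the patch only sc.defaultParallelism was used;
    // after the patch the larger of the two values is used.
    val defaultParallelism =
      if (honorMinPartitions) math.max(scDefaultParallelism, minPartitions)
      else scDefaultParallelism
    val totalBytes = fileLengths.map(_ + openCostInBytes).sum
    totalBytes / defaultParallelism
  }

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024L
    val files = Seq.fill(8)(1 * mib) // assumed input: eight 1 MiB files
    val openCost = 4 * mib           // assumed open cost per file
    val before = bytesPerCore(files, openCost, 2, 20, honorMinPartitions = false)
    val after = bytesPerCore(files, openCost, 2, 20, honorMinPartitions = true)
    println(s"before: $before bytes per core, after: $after bytes per core")
  }
}
```

With these assumed inputs the per-core budget drops from 20 MiB to 2 MiB once `minPartitions` is taken into account, which is the direction one would expect to be needed for smaller splits.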
If `sc.defaultParallelism` < 2 and `minPartitions` is not set in `BinaryFileRDD`, then previously `defaultParallelism` would be the same as `sc.defaultParallelism`, and after the change it will be `2`. Have you already considered this case, and do you feel it is the right behavior change to make?
You need to pass in `minPartitions` to use this method. What do you mean by `minPartitions` not being set?
I mentioned `BinaryFileRDD`, not this method. You can check the code to see how it handles the default value.
`BinaryFileRDD` will set `minPartitions`, which will be either `defaultMinPartitions` or the value you set via the `binaryFiles(path, minPartitions)` method. Eventually, this `minPartitions` value is passed to the `setMinPartitions()` method.
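The default-value case raised in this thread can be checked with a small model. The sketch below assumes that `SparkContext.defaultMinPartitions` is defined as `math.min(defaultParallelism, 2)`; that definition is not shown anywhere in this PR, so treat it as an assumption to verify against the Spark source. `DefaultParallelismSketch` and its method names are hypothetical.

```scala
object DefaultParallelismSketch {
  // Assumption: mirrors SparkContext.defaultMinPartitions = math.min(defaultParallelism, 2).
  def defaultMinPartitions(scDefaultParallelism: Int): Int =
    math.min(scDefaultParallelism, 2)

  // Value computed by setMinPartitions after the patch.
  def patchedParallelism(scDefaultParallelism: Int, minPartitions: Int): Int =
    math.max(scDefaultParallelism, minPartitions)

  def main(args: Array[String]): Unit = {
    for (p <- Seq(1, 2, 8)) {
      val viaDefault = patchedParallelism(p, defaultMinPartitions(p))
      println(s"sc.defaultParallelism=$p, default minPartitions -> parallelism $viaDefault")
    }
    println(s"sc.defaultParallelism=2, minPartitions=20 -> parallelism ${patchedParallelism(2, 20)}")
  }
}
```

Under that assumption the default `minPartitions` can never exceed `sc.defaultParallelism`, so the patched value differs from the old one only when a caller passes a larger `minPartitions` explicitly.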