[SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service #28967
Conversation
Regarding the benchmark: the code is in a separate commit where both solutions are tested. This benchmark is not intended to be reused; it only proves that this one-time change is well-founded and justified. The commit is on another branch, based on the same base as this PR, and the commit with the benchmark is here. The code is:

import java.io.File

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
import org.apache.spark.network.shuffle.ExecutorDiskUtils

/**
* Benchmark for NormalizedInternedPathname.
* To run this benchmark:
* {{{
* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar>
* 2. build/sbt "core/test:runMain <this class>"
* 3. generate result:
* SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
* Results will be written to "benchmarks/NormalizedInternedPathname-results.txt".
* }}}
*/
object NormalizedInternedPathnameBenchmark extends BenchmarkBase {
val seed = 0x1337
private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = {
val numLocalDir = 100
val numSubDir = 100
val numFilenames = 100
val sumPathNames = numLocalDir * numSubDir * numFilenames
val benchmark =
new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output)
val localDir = s"/a//b//c/d/e//f/g//$newBefore"
val files = (1 to numLocalDir).flatMap { localDirId =>
(1 to numSubDir).flatMap { subDirId =>
(1 to numFilenames).map { filenameId =>
(localDir + localDirId, subDirId.toString, s"filename_$filenameId")
}
}
}
val namedNewMethod = "new" -> normalizeNewMethod
val namedOldMethod = "old" -> normalizeOldMethod
val ((firstName, firstMethod), (secondName, secondMethod)) =
if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod)
benchmark.addCase(
s"Normalize with the $firstName method", numIters) { _ =>
firstMethod(files)
}
benchmark.addCase(
s"Normalize with the $secondName method", numIters) { _ =>
secondMethod(files)
}
benchmark.run()
}
private val normalizeOldMethod = (files: Seq[(String, String, String)]) => {
files.map { case (localDir, subDir, filename) =>
ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename)
}.size
}
private val normalizeNewMethod = (files: Seq[(String, String, String)]) => {
files.map { case (localDir, subDir, filename) =>
new File(s"$localDir${File.separator}$subDir${File.separator}$filename").getPath().intern()
}.size
}
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val numIters = 25
runBenchmark("Normalize pathnames new method first") {
normalizePathnames(numIters, newBefore = true)
}
runBenchmark("Normalize pathnames old method first") {
normalizePathnames(numIters, newBefore = false)
}
}
}

So it runs the new and the old method on 1,000,000 paths for 25 iterations, then does the same for 1,000,000 other paths, but this time the old method runs first and the new one second. The reason for testing both methods in both orders is the assumption that string interning might behave differently when it is first used on a string than when there is a match the second time. The benchmark result: the new method is about 7-10 times faster than the old one.
Test build #124795 has finished for PR 28967 at commit
Test build #124793 has finished for PR 28967 at commit
jenkins retest this please |
Test build #124835 has finished for PR 28967 at commit
cc @Ngone51 |
@attilapiros, I just merged the PR from another contributor where we discussed this. Shall we rebase this? Otherwise, it seems pretty solid. |
retest this please

retest this please

retest this please
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java
Ngone51 left a comment:
LGTM, good job!!
Test build #124910 has finished for PR 28967 at commit

retest this please

jenkins retest this please

Ok to test
I wonder if we can link to the spot in the JDK code (if it is from OpenJDK) so that we feel confident removing the relevant test. I usually see concern about removing tests, as the Spark codebase is quite complicated and in many cases only the UTs can find the broken part. It makes us more comfortable if the code is from the JDK (or if what we get rid of is actually what the JDK is doing for us), but it would still be ideal to know which code it is and where.

Test build #124935 has finished for PR 28967 at commit
I see your point, but here the old code in spark/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java (lines 59 to 60 in dea7bc4) was only an approximation of the JDK's normalization.
In the old tests we would have had to test how close these two transformations are, but there was the same problem: java.io.FileSystem#normalize() is package-private and cannot be called directly from tests. Now with this trick (reading back the path) we would only be testing whether the result of java.io.FileSystem#normalize() equals java.io.FileSystem#normalize() itself. For the same reason I do not see the value in linking to the exact code behind it.
Still, those are just implementation details which are not really relevant: it does not matter what the exact transformation behind the normalisation is (depending on the OS type), as in the final File the JDK normalization is applied anyway.
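The "reading back the path" trick can be illustrated with a minimal sketch (the path below is a made-up example): constructing a java.io.File routes the pathname through the JDK's package-private normalizer, so getPath() exposes the normalized form.

```java
import java.io.File;

// Constructing a File normalizes the pathname via the JDK's own
// (package-private) java.io.FileSystem#normalize(); reading it back
// with getPath() exposes that normalized form.
String raw = "/a//b//c/d";            // made-up path with duplicate separators
String normalized = new File(raw).getPath();
// On Unix-like systems the duplicate separators are collapsed:
System.out.println(normalized);       // "/a/b/c/d"
```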
Thanks for the links. That's all I'd like to see.
Yeah, I just wanted to see which code the JDK would run to normalize the path by itself. For sure, I prefer to follow the normalization provided by the JDK, which at least doesn't use a regex, which would be slower than the char manipulation. That said, I agree we can feel confident excluding the test part as well: as the code is replaced with the JDK one, we tend to trust it. That said, assuming we never create weird file names containing separators, the only place the normalization is in effect is localDirs; we could probably pay the cost only once per entry to normalize it, and avoid normalizing on all further calls. (I meant the path being changed during normalization. The normalization check itself can't be avoided, as the JDK will do it. It could be avoided in reality if we pre-created and passed File objects for localDirs, but that might be just an unnecessary micro-optimization.)
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java
HeartSaVioR left a comment:
LGTM. Great work!
Jenkins retest this please

Test build #125081 has finished for PR 28967 at commit

Jenkins retest this please

Test build #125102 has finished for PR 28967 at commit

Test build #125100 has finished for PR 28967 at commit

retest this, please

Test build #125135 has started for PR 28967 at commit

test this please

Test build #125144 has finished for PR 28967 at commit
cc. @srowen as the replaced code here was reviewed by him |
@Test
public void testNormalizeAndInternPathname() {
Hi, @attilapiros. Could you explain why we need to remove the existing test coverage in this improvement PR?
@dongjoon-hyun sure, here you are:
The createNormalizedInternedPathname was only as good as it was close to java.io.FileSystem#normalize(): in the old code the String interning could only save any memory when its result was equal to that of java.io.FileSystem#normalize(), which is called within the File constructor. If there was any difference in the string, then the path in File would use a different (not interned) string, as that string would be transformed a bit more.
When the test was created in the first place, if java.io.FileSystem#normalize() could have been called somehow within the tests, the asserts would have compared the result of createNormalizedInternedPathname against java.io.FileSystem#normalize() (instead of the hardcoded string paths). The test should check whether the same transformation is done on the incoming path depending on the OS.
So now that we can call java.io.FileSystem#normalize() indirectly, we could rewrite the old test, but that would mean having an assert where java.io.FileSystem#normalize() is checked against java.io.FileSystem#normalize() itself. This would be a trivial assert, as it is always true (like an assertTrue(true), just longer). So this is why we do not need the old tests.
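To make the tautology concrete, here is a hedged sketch of what the rewritten test would collapse into (the path is illustrative):

```java
import java.io.File;

// If the implementation under test is itself new File(...).getPath(),
// an assert against the JDK's normalization compares the JDK with itself:
String input = "/a//b//c";                     // illustrative messy path
String actual = new File(input).getPath();
String expected = new File(input).getPath();   // same JDK normalization
assert actual.equals(expected);                // trivially true, like assertTrue(true)
```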
So the test existed because Spark has been dealing with normalization by itself (createNormalizedInternedPathname), despite the fact we know File object will do the normalization. Given we get rid of the custom normalization, the test would become whether normalization in JDK File object works properly or not, say, testing JDK functionality, which feels redundant.
If we cannot rely on the JDK implementation then createNormalizedInternedPathname should be just rewritten to the optimized one and this test would then keep as it is, but I'm afraid it's good direction we don't trust JDK implementation.
but I'm afraid it's good direction we don't trust JDK implementation.
Let's assume the JDK introduces a problem and the path is not totally normalized; still, that string is interned when you use this PR, so you have saved the bytes. Your own normalization could even be better than java.io.FileSystem#normalize(), but the sole purpose of the createNormalizedInternedPathname method is to save heap.
Regarding memory saving, you are only as good as you are close to java.io.FileSystem#normalize().
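A small sketch of the heap-saving argument (the paths are made up): whatever the normalization produces, interning collapses equal path strings to one canonical instance, which is the whole point of the method.

```java
import java.io.File;

// Two independently built, equal paths intern to the same instance:
String p1 = new File("/a//b", "c").getPath().intern();
String p2 = new File("/a/b/c").getPath().intern();
// Reference equality: only one copy of "/a/b/c" is retained on the heap.
assert p1 == p2;
```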
Here are the JDK tests for normalized paths:
- For Windows: https://github.com/openjdk/jdk/blob/6bab0f539fba8fb441697846347597b4a0ade428/test/jdk/java/nio/file/Path/PathOps.java#L1292-L1380
- For Unix: https://github.com/openjdk/jdk/blob/6bab0f539fba8fb441697846347597b4a0ade428/test/jdk/java/nio/file/Path/PathOps.java#L1986-L2018
My bad, “if” was missed between “direction” and “we”. Sorry about that. I made it clear I have some sort of belief in the JDK implementation and its maintenance; otherwise I would suggest just porting the optimized code into here. That said, to avoid further miscommunication: I'm positive on removing the test, and I said so even in the previous comment.
@dongjoon-hyun Are you OK with the answer? If you're OK with it I'll move this forward.
Yep~
srowen left a comment:
Yeah, I buy it.
Retest this please

Test build #125528 has finished for PR 28967 at commit

Retest this please

Test build #125552 has finished for PR 28967 at commit

Retest this please

Test build #125588 has finished for PR 28967 at commit

Retest this please

Test build #125606 has finished for PR 28967 at commit

Retest this please

Test build #125649 has finished for PR 28967 at commit

Retest this please

Test build #125679 has finished for PR 28967 at commit
Thanks! Merged into master. |
Closes apache#28967 from attilapiros/SPARK-32149. Authored-by: “attilapiros” <[email protected]> Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
What changes were proposed in this pull request?
This PR improves file path name normalisation by removing the approximate transformation from Spark and using the path normalization from the JDK instead.
Why are the changes needed?
In the external shuffle service, during block resolution, the file paths (for disk-persisted RDD blocks and for shuffle blocks) are normalized by custom Spark code which uses an OS-dependent regexp. This code is redundant with its package-private JDK counterpart. As the code is not a perfect match, it could even happen that one method results in a slightly different (but semantically equal) path.
The reason for this redundant transformation is the interning of the normalized path to save some heap here, which is only possible if both transformations result in the same string.
Checking the JDK code, I believe there is a better solution which is a perfect match for the JDK code, as it uses that package-private method. Moreover, based on some benchmarking, even this new method seems to be more performant.
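A rough sketch of the two approaches described above (the regex here is an approximation of the old Spark code, not an exact copy, and the path is an example):

```java
import java.io.File;

String raw = "/a//b///c";                      // example un-normalized path
// Old approach (approximate): Spark-side, OS-dependent regexp, then intern.
String oldStyle = raw.replaceAll("/{2,}", "/").intern();
// New approach: let the JDK's File constructor normalize, then intern.
String newStyle = new File(raw).getPath().intern();
// When both transformations agree, interning yields the same instance.
assert oldStyle == newStyle;
```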
Does this PR introduce any user-facing change?
No
How was this patch tested?
As we are reusing the JDK code for normalisation, no new test is needed; even the existing test can be removed.
But in a separate branch I have created a benchmark where the performance of the old and the new solution can be compared. It shows the new method is about 7-10 times faster than the old one.