
@attilapiros
Contributor

@attilapiros attilapiros commented Jul 1, 2020

What changes were proposed in this pull request?

Improving file path name normalisation by removing the approximate transformation from Spark and using the path normalisation from the JDK.

Why are the changes needed?

In the external shuffle service, during block resolution, the file paths (for disk-persisted RDDs and for shuffle blocks) are normalized by custom Spark code that uses an OS-dependent regexp. This duplicates the package-private JDK counterpart, and as the code is not a perfect match, the two methods can even produce slightly different (but semantically equal) paths.

The reason for this redundant transformation is the interning of the normalized path to save some heap, which is only possible if both transformations result in the same string.

Checking the JDK code, I believe there is a better solution that is a perfect match for the JDK behaviour, as it goes through that package-private method itself. Moreover, based on some benchmarking, the new method even seems to be more performant.
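The core of the change can be sketched as follows (the helper name below is hypothetical, not the actual Spark method): build the path with java.io.File, whose constructor runs the JDK's package-private java.io.FileSystem#normalize, and intern the normalized result.

```java
import java.io.File;

public class NormalizedPathSketch {
    // Hypothetical helper illustrating the idea: instead of re-implementing
    // the normalization with an OS-dependent regexp, let the File constructor
    // invoke the package-private java.io.FileSystem#normalize and intern the
    // string it produces, so the interned path is a perfect match by construction.
    static String normalizedInternedPath(String localDir, String subDir, String filename) {
        return new File(localDir, subDir + File.separator + filename).getPath().intern();
    }

    public static void main(String[] args) {
        // On a Unix-like OS, duplicate separators are collapsed by the JDK:
        System.out.println(normalizedInternedPath("/a//b", "01", "shuffle_0_0_0.data"));
        // prints /a/b/01/shuffle_0_0_0.data
    }
}
```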

Does this PR introduce any user-facing change?

No

How was this patch tested?

As we are reusing the JDK code for normalisation, no new test is needed; even the existing test can be removed.

But in a separate branch I have created a benchmark where the performance of the old and the new solution can be compared. It shows the new method is about 7-10 times faster than the old one.

@attilapiros attilapiros changed the title [WIP][SPARK-32121][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service Jul 1, 2020
@attilapiros
Contributor Author

attilapiros commented Jul 1, 2020

Regarding the benchmark

The code is in a separate commit where both solutions are tested. This benchmark is not intended to be reused; it only serves to show that this one-time change is well-founded and justified.

The commit is on another branch based on the same commit as this PR. The commit with the benchmark is here.

The code is:

import java.io.File

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
import org.apache.spark.network.shuffle.ExecutorDiskUtils

/**
 * Benchmark for NormalizedInternedPathname.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar>
 *   2. build/sbt "core/test:runMain <this class>"
 *   3. generate result:
 *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
 *      Results will be written to "benchmarks/NormalizedInternedPathname-results.txt".
 * }}}
 */
object NormalizedInternedPathnameBenchmark extends BenchmarkBase {
  val seed = 0x1337


  private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = {
    val numLocalDir = 100
    val numSubDir = 100
    val numFilenames = 100
    val sumPathNames = numLocalDir * numSubDir * numFilenames
    val benchmark =
      new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output)
    val localDir = s"/a//b//c/d/e//f/g//$newBefore"
    val files = (1 to numLocalDir).flatMap { localDirId =>
      (1 to numSubDir).flatMap { subDirId =>
        (1 to numFilenames).map { filenameId =>
          (localDir + localDirId, subDirId.toString, s"filename_$filenameId")
        }
      }
    }
    val namedNewMethod = "new" -> normalizeNewMethod
    val namedOldMethod = "old" -> normalizeOldMethod

    val ((firstName, firstMethod), (secondName, secondMethod)) =
      if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod)


    benchmark.addCase(
      s"Normalize with the $firstName method", numIters) { _ =>
        firstMethod(files)
    }
    benchmark.addCase(
      s"Normalize with the $secondName method", numIters) { _ =>
        secondMethod(files)
    }
    benchmark.run()
  }


  private val normalizeOldMethod = (files: Seq[(String, String, String)]) => {
    files.map { case (localDir, subDir, filename) =>
      ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename)
    }.size
  }


  private val normalizeNewMethod = (files: Seq[(String, String, String)]) => {
    files.map { case (localDir, subDir, filename) =>
      // Fixed interpolation: the original used the literal names instead of the variables.
      new File(s"$localDir${File.separator}$subDir${File.separator}$filename").getPath().intern()
    }.size
  }


  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 25
    runBenchmark("Normalize pathnames new method first") {
      normalizePathnames(numIters, newBefore = true)
    }
    runBenchmark("Normalize pathnames old method first") {
      normalizePathnames(numIters, newBefore = false)
    }
  }
}

So it runs the new and the old method on 1,000,000 paths for 25 iterations, then does the same on 1,000,000 other paths, but with the old method first and the new one second. The reason for testing both methods in both orders (one first, the other second) is the assumption that string interning might behave differently the first time it is used on a string versus when a matching entry already exists.
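That interning assumption can be seen in plain Java: the first intern() of a value populates the string pool, while later calls just return the pooled reference.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two equal strings built at runtime are distinct objects...
        String a = new String("/a/b/c");
        String b = new String("/a/b/c");
        System.out.println(a == b);                   // false: different objects
        // ...but interning maps both onto one canonical pooled instance,
        // so the duplicate copy becomes garbage-collectable.
        System.out.println(a.intern() == b.intern()); // true: same reference
    }
}
```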

The benchmark result

================================================================================================
Normalize pathnames new method first
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Normalize pathnames newBefore=true:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Normalize with the new method                       252            259          10          4.0         252.2       1.0X
Normalize with the old method                      1727           2018         162          0.6        1726.6       0.1X


================================================================================================
Normalize pathnames old method first
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Normalize pathnames newBefore=false:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Normalize with the old method                      1812           2065         153          0.6        1812.3       1.0X
Normalize with the new method                       252            254           2          4.0         252.0       7.2X

So the new method is about 7-10 times faster than the old one.

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124795 has finished for PR 28967 at commit d92d9c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124793 has finished for PR 28967 at commit eb8d446.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124835 has finished for PR 28967 at commit 23e3cbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros attilapiros changed the title [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service [SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service Jul 2, 2020
@attilapiros
Contributor Author

cc @Ngone51

@HyukjinKwon
Member

@attilapiros, I just merged the PR from another contributor where we discussed this. Shall we rebase this? Otherwise, it seems pretty solid.


@HyukjinKwon
Member

retest this please


@Ngone51
Member

Ngone51 commented Jul 2, 2020

retest this please


@Ngone51
Member

Ngone51 commented Jul 2, 2020

retest this please

Member

@Ngone51 Ngone51 left a comment


LGTM, good job!!

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124910 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

retest this please

@attilapiros
Contributor Author

jenkins retest this please

@attilapiros
Contributor Author

Ok to test

@HeartSaVioR
Contributor

I wonder whether we can link to the spot in the JDK code (if it is from OpenJDK) so that we feel confident about removing the relevant test.

I'm usually concerned about removing tests, as the Spark codebase is quite complicated and in many cases only UTs can find the broken part. It makes us more comfortable if the code is from the JDK (or if what we get rid of is exactly what the JDK is doing for us), but it would still be ideal to see which code it is and where.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124935 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

I wonder whether we can link to the spot in the JDK code (if it is from OpenJDK) so that we feel confident about removing the relevant test.

I'm usually concerned about removing tests, as the Spark codebase is quite complicated and in many cases only UTs can find the broken part. It makes us more comfortable if the code is from the JDK (or if what we get rid of is exactly what the JDK is doing for us), but it would still be ideal to see which code it is and where.

I see your point, but here the old createNormalizedInternedPathname was only as good as its imitation of java.io.FileSystem#normalize(), because of the interning of the result String.
This is kind of mentioned in the old method's javadoc:

* String copy. Unfortunately, we cannot just reuse the normalization code that java.io.File
* uses, since it is in the package-private class java.io.FileSystem.

In the old tests we would have had to test how close these two transformations are, but there was the same problem: java.io.FileSystem#normalize() cannot be called there either.

Now with this trick (reading back the path) we would test whether the result of java.io.FileSystem#normalize() is the same as java.io.FileSystem#normalize().

For the same reason I do not see much value in linking to the exact code behind it.
They are here:

Still, those are just implementation details and not really relevant: it does not matter what the exact (OS-dependent) transformation behind the normalisation is, since in the final File constructor the JDK one will be executed anyway, and if we differ from it then interning the String does not help at all.
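To illustrate the point (a sketch of JDK behaviour on a Unix-like OS): the File constructor always runs the platform's normalize, so whatever string goes in, the path stored by File is the JDK-normalized one.

```java
import java.io.File;

public class JdkNormalizeDemo {
    public static void main(String[] args) {
        // The File(String) constructor invokes the package-private
        // java.io.FileSystem#normalize; on a Unix-like OS this collapses
        // duplicate separators.
        File f = new File("/a//b///c");
        System.out.println(f.getPath()); // prints /a/b/c
    }
}
```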

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 6, 2020

Thanks for the links. That's all I'd like to see.

This duplicates the package-private JDK counterpart, and as the code is not a perfect match, the two methods can even produce slightly different (but semantically equal) paths.

Yeah, I just wanted to see which code the JDK would run to normalize the path by itself (so the comment here, that the old createNormalizedInternedPathname was only as good as its imitation of java.io.FileSystem#normalize(), is the answer for me), and honestly I didn't know the method name would be just "normalize". (I should have just tried finding it myself. My bad.)

For sure, I prefer to follow the normalization provided by the JDK, which at least doesn't use a regex, which would be slower than char manipulation. That said, I agree we can feel confident excluding the test part as well: as the code is replaced with the JDK one, we tend to trust it.

That said, assuming we never create weird file names containing separators, the only place the normalization has an effect is localDirs - we could probably pay the normalization cost only once per entry and avoid normalizing on all further calls. (I mean the path actually being changed during normalization. The normalization check itself can't be avoided, as the JDK will do it anyway. It could be avoided in reality if we pre-created and passed File objects for localDirs, but that might be just an unnecessary micro-optimization.)

Contributor

@HeartSaVioR HeartSaVioR left a comment


LGTM. Great work!

@attilapiros
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125081 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125102 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125100 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

retest this, please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125135 has started for PR 28967 at commit e7657cd.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125144 has finished for PR 28967 at commit e7657cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

cc @srowen as the replaced code here was reviewed by him

}

@Test
public void testNormalizeAndInternPathname() {
Member


Hi, @attilapiros. Could you explain why we need to remove the existing test coverage in this improvement PR?

Contributor Author


@dongjoon-hyun sure, here you are:

The createNormalizedInternedPathname was only as good as it was close to java.io.FileSystem#normalize(): in the old code the String interning could only save any memory when its result was equal to that of java.io.FileSystem#normalize(), which was called within the File constructor. If there was any difference in the string, the path in File would use a different (not interned) string, as that string would be transformed a bit further.

When the test was created, if java.io.FileSystem#normalize() could somehow have been called within the tests, the asserts would have used its result as the expected value of createNormalizedInternedPathname (instead of hardcoded string paths). The test should check that the same OS-dependent transformation is done on the incoming path.

So now that we can indirectly call java.io.FileSystem#normalize(), we could rewrite the old test, but that would mean an assert checking whether java.io.FileSystem#normalize() is really java.io.FileSystem#normalize(). This would be a trivial assert, as it is always true (like an assertTrue(true), just longer). This is why we do not need the old tests.
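The triviality of such a rewritten test can be shown directly (a sketch, assuming a Unix-like OS): normalization is idempotent, so asserting the JDK's result against itself always passes.

```java
import java.io.File;

public class IdempotentNormalizeDemo {
    public static void main(String[] args) {
        // First pass: the File constructor normalizes the raw path.
        String once = new File("/a//b//c").getPath();
        // Second pass: normalizing an already-normalized path is a no-op,
        // so the comparison below can never fail (an assertTrue(true), just longer).
        String twice = new File(once).getPath();
        System.out.println(once.equals(twice)); // prints true
    }
}
```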

Contributor


So the test existed because Spark has been dealing with normalization by itself (createNormalizedInternedPathname), despite the fact that we know the File object will do the normalization. Given that we get rid of the custom normalization, the test would check whether normalization in the JDK File object works properly, i.e. testing JDK functionality, which feels redundant.

If we cannot rely on the JDK implementation, then createNormalizedInternedPathname should just be rewritten to the optimized one and this test kept as it is, but I'm afraid it's good direction we don't trust JDK implementation.

Contributor Author

@attilapiros attilapiros Jul 7, 2020


but I'm afraid it's good direction we don't trust JDK implementation.

Let's assume the JDK introduces a problem and the path is not totally normalized; still, that string is interned when you use this PR, so you have saved the bytes. Your normalized path could even be better than java.io.FileSystem#normalize(); still, the sole purpose of the createNormalizedInternedPathname method is to save heap.

Regarding memory saving, you are only as good as you are close to java.io.FileSystem#normalize().

Contributor

@HeartSaVioR HeartSaVioR Jul 7, 2020


My bad, "if" was missed between "direction" and "we". Sorry about that. To be clear, I have some belief in the JDK implementation and its maintenance; otherwise I would have suggested just porting the optimized code here. That said, to avoid further miscommunication: I'm positive on removing the test, and I said so even in my previous comment.

Contributor


@dongjoon-hyun Are you OK with the answer? If you're OK with it I'll move this forward.

Member


Yep~

Member

@srowen srowen left a comment


Yeah, I buy it.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125528 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125552 has finished for PR 28967 at commit e7657cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125588 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125606 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125649 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125679 has finished for PR 28967 at commit e7657cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Thanks! Merged into master.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…resolution within the external shuffle service

Closes apache#28967 from attilapiros/SPARK-32149.

Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>