
@attilapiros
Contributor

@attilapiros attilapiros commented Jul 1, 2020

What changes were proposed in this pull request?

Improving file path name normalisation by removing the approximate transformation from Spark and using the path normalisation from the JDK.

Why are the changes needed?

In the external shuffle service, during block resolution, the file paths (for disk-persisted RDDs and for shuffle blocks) are normalized by custom Spark code that uses an OS-dependent regexp. This duplicates the package-private JDK counterpart, and as the code is not a perfect match, the two methods can even produce slightly different (but semantically equal) paths.

The reason for this redundant transformation is the interning of the normalized path to save some heap, which is only possible if both transformations result in the same string.

Checking the JDK code, I believe there is a better solution that is a perfect match for the JDK behaviour, as it goes through that package-private method itself. Moreover, based on some benchmarking, the new method even seems to be more performant.
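The core of the change can be sketched as follows (the helper name below is hypothetical, not the actual Spark method): build the path with java.io.File, whose constructor runs the JDK's package-private java.io.FileSystem#normalize, and intern the normalized result.

```java
import java.io.File;

public class NormalizedPathSketch {
    // Hypothetical helper illustrating the idea: instead of re-implementing
    // the normalization with an OS-dependent regexp, let the File constructor
    // invoke the package-private java.io.FileSystem#normalize and intern the
    // string it produces, so the interned path is a perfect match by construction.
    static String normalizedInternedPath(String localDir, String subDir, String filename) {
        return new File(localDir, subDir + File.separator + filename).getPath().intern();
    }

    public static void main(String[] args) {
        // On a Unix-like OS, duplicate separators are collapsed by the JDK:
        System.out.println(normalizedInternedPath("/a//b", "01", "shuffle_0_0_0.data"));
        // prints /a/b/01/shuffle_0_0_0.data
    }
}
```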

Does this PR introduce any user-facing change?

No

How was this patch tested?

As we are reusing the JDK code for normalisation, no new test is needed; even the existing test can be removed.

But in a separate branch I have created a benchmark where the performance of the old and the new solution can be compared. It shows the new method is about 7-10 times faster than the old one.

@attilapiros attilapiros changed the title [WIP][SPARK-32121][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service Jul 1, 2020
@attilapiros
Contributor Author

attilapiros commented Jul 1, 2020

Regarding the benchmark

The code is in a separate commit where both solutions are tested. This benchmark is not intended to be reused; it only serves to show that this one-time change is well-founded and justified.

The commit is on another branch based on the same commit as this PR. The commit with the benchmark is here.

The code is:

import java.io.File

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
import org.apache.spark.network.shuffle.ExecutorDiskUtils

/**
 * Benchmark for NormalizedInternedPathname.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar>
 *   2. build/sbt "core/test:runMain <this class>"
 *   3. generate result:
 *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
 *      Results will be written to "benchmarks/NormalizedInternedPathname-results.txt".
 * }}}
 */
object NormalizedInternedPathnameBenchmark extends BenchmarkBase {
  val seed = 0x1337


  private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = {
    val numLocalDir = 100
    val numSubDir = 100
    val numFilenames = 100
    val sumPathNames = numLocalDir * numSubDir * numFilenames
    val benchmark =
      new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output)
    val localDir = s"/a//b//c/d/e//f/g//$newBefore"
    val files = (1 to numLocalDir).flatMap { localDirId =>
      (1 to numSubDir).flatMap { subDirId =>
        (1 to numFilenames).map { filenameId =>
          (localDir + localDirId, subDirId.toString, s"filename_$filenameId")
        }
      }
    }
    val namedNewMethod = "new" -> normalizeNewMethod
    val namedOldMethod = "old" -> normalizeOldMethod

    val ((firstName, firstMethod), (secondName, secondMethod)) =
      if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod)


    benchmark.addCase(
      s"Normalize with the $firstName method", numIters) { _ =>
        firstMethod(files)
    }
    benchmark.addCase(
      s"Normalize with the $secondName method", numIters) { _ =>
        secondMethod(files)
    }
    benchmark.run()
  }


  private val normalizeOldMethod = (files: Seq[(String, String, String)]) => {
    files.map { case (localDir, subDir, filename) =>
      ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename)
    }.size
  }


  private val normalizeNewMethod = (files: Seq[(String, String, String)]) => {
    files.map { case (localDir, subDir, filename) =>
      // Fixed interpolation: the original used the literal names instead of the variables.
      new File(s"$localDir${File.separator}$subDir${File.separator}$filename").getPath().intern()
    }.size
  }


  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 25
    runBenchmark("Normalize pathnames new method first") {
      normalizePathnames(numIters, newBefore = true)
    }
    runBenchmark("Normalize pathnames old method first") {
      normalizePathnames(numIters, newBefore = false)
    }
  }
}

So it runs the new and the old method on 1,000,000 paths for 25 iterations, then does the same on 1,000,000 other paths, but with the old method first and the new one second. The reason for testing both methods in both orders (one first, the other second) is the assumption that string interning might behave differently the first time it is used on a string versus when a matching entry already exists.
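That interning assumption can be seen in plain Java: the first intern() of a value populates the string pool, while later calls just return the pooled reference.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two equal strings built at runtime are distinct objects...
        String a = new String("/a/b/c");
        String b = new String("/a/b/c");
        System.out.println(a == b);                   // false: different objects
        // ...but interning maps both onto one canonical pooled instance,
        // so the duplicate copy becomes garbage-collectable.
        System.out.println(a.intern() == b.intern()); // true: same reference
    }
}
```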

The benchmark result

================================================================================================
Normalize pathnames new method first
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Normalize pathnames newBefore=true:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Normalize with the new method                       252            259          10          4.0         252.2       1.0X
Normalize with the old method                      1727           2018         162          0.6        1726.6       0.1X


================================================================================================
Normalize pathnames old method first
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Normalize pathnames newBefore=false:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Normalize with the old method                      1812           2065         153          0.6        1812.3       1.0X
Normalize with the new method                       252            254           2          4.0         252.0       7.2X

So the new method is about 7-10 times faster than the old one.

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124795 has finished for PR 28967 at commit d92d9c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124793 has finished for PR 28967 at commit eb8d446.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124835 has finished for PR 28967 at commit 23e3cbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros attilapiros changed the title [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service [SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service Jul 2, 2020
@attilapiros
Contributor Author

cc @Ngone51

@HyukjinKwon
Member

@attilapiros, I just merged the PR from another contributor where we discussed this. Shall we rebase this? Otherwise, it seems pretty solid.


@HyukjinKwon
Member

retest this please


@Ngone51
Member

Ngone51 commented Jul 2, 2020

retest this please


@Ngone51
Member

Ngone51 commented Jul 2, 2020

retest this please

Member

@Ngone51 Ngone51 left a comment


LGTM, good job!!

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124910 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

retest this please

@attilapiros
Contributor Author

jenkins retest this please

@attilapiros
Contributor Author

Ok to test

@HeartSaVioR
Contributor

I wonder whether we can link to the spot in the JDK code (if it is from OpenJDK) so that we feel confident about removing the relevant test.

I'm usually concerned about removing tests, as the Spark codebase is quite complicated and in many cases only UTs can find the broken part. It makes us more comfortable if the code is from the JDK (or if what we get rid of is exactly what the JDK is doing for us), but it would still be ideal to see which code it is and where.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124935 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

I wonder whether we can link to the spot in the JDK code (if it is from OpenJDK) so that we feel confident about removing the relevant test.

I'm usually concerned about removing tests, as the Spark codebase is quite complicated and in many cases only UTs can find the broken part. It makes us more comfortable if the code is from the JDK (or if what we get rid of is exactly what the JDK is doing for us), but it would still be ideal to see which code it is and where.

I see your point, but here the old createNormalizedInternedPathname was only as good as its imitation of java.io.FileSystem#normalize(), because of the interning of the result String.
This is kind of mentioned in the old method's javadoc:

* String copy. Unfortunately, we cannot just reuse the normalization code that java.io.File
* uses, since it is in the package-private class java.io.FileSystem.

In the old tests we would have had to test how close these two transformations are, but there was the same problem: java.io.FileSystem#normalize() cannot be called there either.

Now with this trick (reading back the path) we would test whether the result of java.io.FileSystem#normalize() is the same as java.io.FileSystem#normalize().

For the same reason I do not see much value in linking to the exact code behind it.
They are here:

Still, those are just implementation details and not really relevant: it does not matter what the exact (OS-dependent) transformation behind the normalisation is, since in the final File constructor the JDK one will be executed anyway, and if we differ from it then interning the String does not help at all.
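To illustrate the point (a sketch of JDK behaviour on a Unix-like OS): the File constructor always runs the platform's normalize, so whatever string goes in, the path stored by File is the JDK-normalized one.

```java
import java.io.File;

public class JdkNormalizeDemo {
    public static void main(String[] args) {
        // The File(String) constructor invokes the package-private
        // java.io.FileSystem#normalize; on a Unix-like OS this collapses
        // duplicate separators.
        File f = new File("/a//b///c");
        System.out.println(f.getPath()); // prints /a/b/c
    }
}
```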

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 6, 2020

Thanks for the links. That's all I'd like to see.

This duplicates the package-private JDK counterpart, and as the code is not a perfect match, the two methods can even produce slightly different (but semantically equal) paths.

Yeah, I just wanted to see which code the JDK would run to normalize the path by itself (so the comment here, that the old createNormalizedInternedPathname was only as good as its imitation of java.io.FileSystem#normalize(), is the answer for me), and honestly I didn't know the method name would be just "normalize". (I should have just tried finding it myself. My bad.)

For sure, I prefer to follow the normalization provided by the JDK, which at least doesn't use a regex, which would be slower than char manipulation. That said, I agree we can feel confident excluding the test part as well: as the code is replaced with the JDK one, we tend to trust it.

That said, assuming we never create weird file names containing separators, the only place the normalization has an effect is localDirs - we could probably pay the normalization cost only once per entry and avoid normalizing on all further calls. (I mean the path actually being changed during normalization. The normalization check itself can't be avoided, as the JDK will do it anyway. It could be avoided in reality if we pre-created and passed File objects for localDirs, but that might be just an unnecessary micro-optimization.)

Contributor

@HeartSaVioR HeartSaVioR left a comment


LGTM. Great work!

@attilapiros
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125081 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125102 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125100 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

retest this, please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125135 has started for PR 28967 at commit e7657cd.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125144 has finished for PR 28967 at commit e7657cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

cc @srowen as the replaced code here was reviewed by him

}

@Test
public void testNormalizeAndInternPathname() {
Member


Hi, @attilapiros. Could you explain why we need to remove the existing test coverage in this improvement PR?

Contributor Author


@dongjoon-hyun sure, here you are:

The createNormalizedInternedPathname was only as good as it was close to java.io.FileSystem#normalize(): in the old code the String interning could only save any memory when its result was equal to that of java.io.FileSystem#normalize(), which was called within the File constructor. If there was any difference in the string, the path in File would use a different (not interned) string, as that string would be transformed a bit further.

When the test was created, if java.io.FileSystem#normalize() could somehow have been called within the tests, the asserts would have used its result as the expected value of createNormalizedInternedPathname (instead of hardcoded string paths). The test should check that the same OS-dependent transformation is done on the incoming path.

So now that we can indirectly call java.io.FileSystem#normalize(), we could rewrite the old test, but that would mean an assert checking whether java.io.FileSystem#normalize() is really java.io.FileSystem#normalize(). This would be a trivial assert, as it is always true (like an assertTrue(true), just longer). This is why we do not need the old tests.
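The triviality of such a rewritten test can be shown directly (a sketch, assuming a Unix-like OS): normalization is idempotent, so asserting the JDK's result against itself always passes.

```java
import java.io.File;

public class IdempotentNormalizeDemo {
    public static void main(String[] args) {
        // First pass: the File constructor normalizes the raw path.
        String once = new File("/a//b//c").getPath();
        // Second pass: normalizing an already-normalized path is a no-op,
        // so the comparison below can never fail (an assertTrue(true), just longer).
        String twice = new File(once).getPath();
        System.out.println(once.equals(twice)); // prints true
    }
}
```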

Contributor


So the test existed because Spark has been dealing with normalization by itself (createNormalizedInternedPathname), despite the fact that we know the File object will do the normalization. Given that we get rid of the custom normalization, the test would check whether normalization in the JDK File object works properly, i.e. testing JDK functionality, which feels redundant.

If we cannot rely on the JDK implementation, then createNormalizedInternedPathname should just be rewritten to the optimized one and this test kept as it is, but I'm afraid it's good direction we don't trust JDK implementation.

Contributor Author

@attilapiros attilapiros Jul 7, 2020


but I'm afraid it's good direction we don't trust JDK implementation.

Let's assume the JDK introduces a problem and the path is not totally normalized; still, that string is interned when you use this PR, so you have saved the bytes. Your normalized path could even be better than java.io.FileSystem#normalize(); still, the sole purpose of the createNormalizedInternedPathname method is to save heap.

Regarding memory saving, you are only as good as you are close to java.io.FileSystem#normalize().

Contributor

@HeartSaVioR HeartSaVioR Jul 7, 2020


My bad, "if" was missed between "direction" and "we". Sorry about that. To be clear, I have some belief in the JDK implementation and its maintenance; otherwise I would have suggested just porting the optimized code here. That said, to avoid further miscommunication: I'm positive on removing the test, and I said so even in my previous comment.

Contributor


@dongjoon-hyun Are you OK with the answer? If you're OK with it I'll move this forward.

Member


Yep~

Member

@srowen srowen left a comment


Yeah, I buy it.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125528 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125552 has finished for PR 28967 at commit e7657cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125588 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125606 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125649 has finished for PR 28967 at commit e7657cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Retest this please

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125679 has finished for PR 28967 at commit e7657cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Thanks! Merged into master.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…resolution within the external shuffle service

Closes apache#28967 from attilapiros/SPARK-32149.

Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>