
Conversation

@gerashegalov
Contributor

What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file

How was this patch tested?

Added unit test demonstrating the issue hit in production.

@gerashegalov
Contributor Author

@witgo please take a look since you worked on #2379

Member

Isn't this going to break existing apps dependent on trimmed values?

Contributor

@gerashegalov would you please elaborate on the use case here? I saw that you're treating \n as a property value; what is the specific usage scenario?

Contributor Author

@HyukjinKwon your concern is valid, although it has a simple solution: fix up the file to your liking. Moreover, a user editing a file for --properties-file probably legitimately expects the format prescribed by the JDK: https://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)

@jerryshao the use case described in the JIRA is that our customers sometimes have unusual line delimiters that we need to pass to Hadoop's TextInputFormat. In this case it's actually a conventional Unix line separator; suppose the customer really insists that only '\n' should be the line separator, and not the default set ["\n", "\r\n", "\r"]. We had no issues configuring it via Spark until we switched from --conf to --properties-file to reduce the command-line length.
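To make the disparity concrete, here is a minimal sketch (the class name is illustrative; textinputformat.record.delimiter is Hadoop's record-delimiter key) of what happens when a '\n' value arrives via a properties file and is then trimmed:

```java
import java.io.StringReader;
import java.util.Properties;

public class TrimDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In the file on disk this line reads: textinputformat.record.delimiter=\n
        props.load(new StringReader("textinputformat.record.delimiter=\\n\n"));
        String value = props.getProperty("textinputformat.record.delimiter");
        System.out.println(value.equals("\n"));     // load() decoded the escape to a real LF
        System.out.println(value.trim().isEmpty()); // but trim() strips the LF away entirely
    }
}
```

So the same setting that works via --conf silently becomes the empty string when moved into a properties file.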

Contributor Author

We could deprecate trimming and add some other config to disable it, but since the disparity with --conf looks like a real bug, maybe we can just fix and document it?

Contributor

The changes here will break current assumptions. Some editors leave trailing whitespace in place without removing it, even though the user doesn't actually want it; with this change, users would have to check for and remove all trailing whitespace to avoid unexpected behavior.

AFAIK Hive usually uses an ASCII code, not "\n" or "\r\n" (which may be removed or converted during parsing, and is therefore quite brittle), to specify the separator. So this is more like something you could fix on your side, not necessarily on the Spark side.

Contributor Author

@jerryshao trim removes leading spaces as well, which are totally legit.

I also need more info about what you mean by ASCII in this context.

Contributor
@jerryshao Aug 28, 2018

trim removes leading spaces as well that are totally legit.

It is hard to say which solution is legit; the way you propose may be valid in your case, but it will be unexpected in other users' cases. I'm not talking about legitimacy; what I'm trying to say is that your proposal will break the convention, and that's what I'm concerned about.

By ASCII I mean you can pass in an ASCII number and translate it to the actual char in the code; that would mitigate the problem here.

Contributor Author

I have also mentioned that we could make this conditional logic: #22213 (comment)

Contributor Author

By ASCII I mean you can pass in ASCII number, and translate to actual char in the code, that will mitigate the problem here.

I think I'll just keep passing the delimiter to Hadoop via --conf and everything else in a single properties file, to avoid dealing with manual conversion of ints to chars.

@gerashegalov
Contributor Author

@jerryshao here is my new take on the problem, which should be more acceptable. The premise is that, since the JDK has already parsed out natural line delimiters '\r' and '\n', any remaining ones are user-provided escaped line delimiters.
@steveloughran would you mind taking a look as well?
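A hypothetical Java rendering of that premise (a sketch, not necessarily the exact code merged in this PR): trim whitespace from both ends the way String#trim does, but stop at CR or LF, since by this point those can only come from user-written \r or \n escapes:

```java
public class TrimExceptCrlf {
    // Like String#trim, but a leading/trailing CR or LF stops the trimming.
    static String trimExceptCRLF(String s) {
        int begin = 0;
        int end = s.length();
        // advance past leading whitespace that is not CR or LF
        while (begin < end && s.charAt(begin) <= ' '
                && s.charAt(begin) != '\r' && s.charAt(begin) != '\n') {
            begin++;
        }
        // retreat past trailing whitespace that is not CR or LF
        while (end > begin && s.charAt(end - 1) <= ' '
                && s.charAt(end - 1) != '\r' && s.charAt(end - 1) != '\n') {
            end--;
        }
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(trimExceptCRLF("  blah  ").equals("blah"));     // ordinary spaces still trimmed
        System.out.println(trimExceptCRLF("\nblah\r").equals("\nblah\r")); // escaped delimiters survive
    }
}
```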

@steveloughran
Contributor

This actually makes sense. We always forget this, but the Java properties file format is more complex than any of us remember.

By the time this trim takes place, all CR/LF chars from the source file will have been stripped, each through one of:

  • being the end of an entry: the property contains all chars up to that line break (or the line is skipped if empty or a comment)
  • being preceded by a backslash, in which case the following line has its initial whitespace stripped and is joined onto the preceding line.

Whoever wrote the Wikipedia article provided some good examples.

What this means is: by the time the Spark trim() code is reached, the only CR and LF characters in a property are those from expanding \r and \n escape pairs in the property value itself. Those occurring within a property, e.g. p1=a\nb, already get through; this change extends that to properties like p2=\r.

  • it should be easy to write some tests for trimExceptCRLF() directly, e.g. how it handles odd strings (one char, value == 0), empty string, ...
  • There's an XML format for properties too, which should also be tested to see WTF goes on there.

PS, looking up the properties spec highlights that Java 9 uses UTF-8 for the properties encoding. I don't know of any implications here.

@gerashegalov
Contributor Author

gerashegalov commented Aug 31, 2018

thanks for the comment @steveloughran. I'll add more tests for now and see how the discussion goes from there.

as for the transition to UTF-8, I think it means that to be fully correct Spark needs to switch to String#strip starting with JDK 11, with or without this PR.
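For reference, a quick sketch of where trim and strip differ: strip (new in JDK 11) removes any Unicode whitespace, while trim only removes chars at or below U+0020 (the example characters below are illustrative):

```java
public class StripVsTrim {
    public static void main(String[] args) {
        String s = "\u2005blah\u2005"; // FOUR-PER-EM SPACE: Unicode whitespace above '\u0020'
        System.out.println(s.trim().equals(s));       // trim() leaves it in place
        System.out.println(s.strip().equals("blah")); // strip() removes it (JDK 11+)
    }
}
```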

Contributor

@gerashegalov, I'm not sure how we would manually add an LF to the end of a line when editing a property file in an editor. Here in your test, program code explicitly mimics the case, but in a real scenario, how would we manually update the property file with an additional LF or CR?

Contributor Author

@jerryshao I try not to spend time on issues unrelated to our production deployments. @steveloughran and this PR already pointed at the Properties#load method documenting the format.

Line terminator characters can be included using \r and \n escape sequences. Or you can encode any character using \uxxxx

In addition you can take a look at the file generated by this code:

#test whitespace
#Thu Aug 30 20:20:33 PDT 2018
spark.my.delimiter.nonDelimSpaceFromFile=\ blah\f
spark.my.delimiter.infixDelimFromFile=\rblah\n
spark.my.delimiter.trailingDelimKeyFromFile=blah\r
spark.my.delimiter.leadingDelimKeyFromFile=\nblah

Contributor

Sorry for the stupid question. I guess I was thinking of something different.

@gerashegalov
Contributor Author

@steveloughran Regarding the XML format: java.util.Properties has dedicated storeToXML/loadFromXML methods, which Spark does not use, so we don't need to check this.

@steveloughran
Contributor

code LGTM. Clearly it's a tangible problem, especially for a one-char option like "myapp.line.separator".

@gerashegalov
Contributor Author

rebased

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

adding @vanzin as well.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95782 has finished for PR 22213 at commit 9e4ac10.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95786 has finished for PR 22213 at commit 01d0904.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin left a comment

Looks ok to me. Just have style nits.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95820 has finished for PR 22213 at commit 603fdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95819 has finished for PR 22213 at commit d341dae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95826 has finished for PR 22213 at commit 603fdc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Seems fine to me too.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95920 has finished for PR 22213 at commit 76635ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95934 has finished for PR 22213 at commit 76635ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 11, 2018

Merging to master / 2.4.

asfgit pushed a commit that referenced this pull request Sep 11, 2018
…f values

## What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file

## How was this patch tested?

Added unit test demonstrating the issue hit in production.

Closes #22213 from gerashegalov/gera/SPARK-25221.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit bcb9a8c)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in bcb9a8c Sep 11, 2018
fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
@gerashegalov
Contributor Author

Thank you for reviews @vanzin @steveloughran @jerryshao @HyukjinKwon
