
Conversation

@gerashegalov
Contributor

What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file

How was this patch tested?

Added unit test demonstrating the issue hit in production.

@gerashegalov
Contributor Author

@witgo please take a look since you worked on #2379

Member

Isn't this going to break existing apps dependent on trimmed values?

Contributor

@gerashegalov would you please elaborate on the use case here? I saw that you're treating \n as a property value; what is the specific usage scenario?

Contributor Author

@HyukjinKwon your concern is valid, although it has a simple solution: fix up the file to your liking. Moreover, a user editing a file for --properties-file probably legitimately expects the format prescribed by the JDK: https://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)

@jerryshao the use case described in the JIRA is that our customers sometimes have unusual line delimiters that we need to pass to Hadoop's TextInputFormat. In this case it's actually a conventional Unix line separator; suppose the customer really insists that only '\n' should be the line separator, and not the default set ["\n", "\r\n", "\r"]. We had no issues configuring it via Spark until we switched from --conf to --properties-file to reduce the command-line length.
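To make the disparity concrete, here is a minimal sketch (the class name is illustrative; textinputformat.record.delimiter is Hadoop's record-delimiter key) of what happens when a '\n' value arrives via a properties file and is then trimmed:

```java
import java.io.StringReader;
import java.util.Properties;

public class TrimDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In the file on disk this line reads: textinputformat.record.delimiter=\n
        props.load(new StringReader("textinputformat.record.delimiter=\\n\n"));
        String value = props.getProperty("textinputformat.record.delimiter");
        System.out.println(value.equals("\n"));     // load() decoded the escape to a real LF
        System.out.println(value.trim().isEmpty()); // but trim() strips the LF away entirely
    }
}
```

So the same setting that works via --conf silently becomes the empty string when moved into a properties file.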

Contributor Author

We could deprecate trimming and add some other config to disable it, but since the disparity with --conf looks like a real bug, maybe we can just fix and document it?

Contributor

The changes here will break current assumptions. Some editors leave trailing whitespace in place without removing it, even though the user doesn't actually want it; with this change, users would have to check for and remove all trailing whitespace to avoid unexpected behavior.

AFAIK Hive usually uses an ASCII code, not "\n" or "\r\n" (which may be removed or converted during parsing, and is therefore quite brittle), to specify the separator. So this is more like something you could fix on your side, not necessarily on the Spark side.

Contributor Author

@jerryshao trim removes leading spaces as well, which are totally legit.

I also need more info about what you mean by ASCII in this context.

Contributor
@jerryshao Aug 28, 2018

trim removes leading spaces as well that are totally legit.

It is hard to say which solution is legit; the way you propose may be valid in your case, but it will be unexpected in other users' cases. I'm not talking about legitimacy; what I'm trying to say is that your proposal will break the convention, and that's what I'm concerned about.

By ASCII I mean you can pass in an ASCII number and translate it to the actual char in the code; that would mitigate the problem here.

Contributor Author

I have also mentioned that we could make this conditional logic: #22213 (comment)

Contributor Author

By ASCII I mean you can pass in ASCII number, and translate to actual char in the code, that will mitigate the problem here.

I think I'll just keep passing the delimiter to Hadoop via --conf and everything else in a single properties file, to avoid dealing with manual conversion of ints to chars.

@gerashegalov
Contributor Author

@jerryshao here is my new take on the problem, which should be more acceptable. The premise is that, since the JDK has already parsed out natural line delimiters '\r' and '\n', any remaining ones are user-provided escaped line delimiters.
@steveloughran would you mind taking a look as well?
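A hypothetical Java rendering of that premise (a sketch, not necessarily the exact code merged in this PR): trim whitespace from both ends the way String#trim does, but stop at CR or LF, since by this point those can only come from user-written \r or \n escapes:

```java
public class TrimExceptCrlf {
    // Like String#trim, but a leading/trailing CR or LF stops the trimming.
    static String trimExceptCRLF(String s) {
        int begin = 0;
        int end = s.length();
        // advance past leading whitespace that is not CR or LF
        while (begin < end && s.charAt(begin) <= ' '
                && s.charAt(begin) != '\r' && s.charAt(begin) != '\n') {
            begin++;
        }
        // retreat past trailing whitespace that is not CR or LF
        while (end > begin && s.charAt(end - 1) <= ' '
                && s.charAt(end - 1) != '\r' && s.charAt(end - 1) != '\n') {
            end--;
        }
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(trimExceptCRLF("  blah  ").equals("blah"));     // ordinary spaces still trimmed
        System.out.println(trimExceptCRLF("\nblah\r").equals("\nblah\r")); // escaped delimiters survive
    }
}
```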

@steveloughran
Contributor

This actually makes sense. We always forget this, but the Java properties file format is more complex than any of us remember.

By the time this trim takes place, all CR/LF chars from the source file will have been stripped, each through one of:

  • being the end of an entry: the property contains all chars up to that line break (or the line is skipped if empty or a comment)
  • being preceded by a backslash, in which case the following line has its initial whitespace stripped and is joined onto the preceding line.

Whoever wrote the Wikipedia article provided some good examples.

What this means is: by the time the Spark trim() code is reached, the only CR and LF characters in a property are those from expanding \r and \n escape pairs in the property value itself. Those occurring within a property, e.g. p1=a\nb, already get through; this change extends that to properties like p2=\r.

  • it should be easy to write some tests for trimExceptCRLF() directly, e.g. how it handles odd strings (one char, value == 0), empty string, ...
  • There's an XML format for properties too, which should also be tested to see WTF goes on there.

PS, looking up the properties spec highlights that Java 9 uses UTF-8 for the properties encoding. I don't know of any implications here.

@gerashegalov
Contributor Author

gerashegalov commented Aug 31, 2018

thanks for the comment @steveloughran. I'll add more tests for now and see how the discussion goes from there.

as for the transition to UTF-8, I think it means that to be fully correct Spark needs to switch to String#strip starting with JDK 11, with or without this PR.
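For reference, a quick sketch of where trim and strip differ: strip (new in JDK 11) removes any Unicode whitespace, while trim only removes chars at or below U+0020 (the example characters below are illustrative):

```java
public class StripVsTrim {
    public static void main(String[] args) {
        String s = "\u2005blah\u2005"; // FOUR-PER-EM SPACE: Unicode whitespace above '\u0020'
        System.out.println(s.trim().equals(s));       // trim() leaves it in place
        System.out.println(s.strip().equals("blah")); // strip() removes it (JDK 11+)
    }
}
```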

Contributor

@gerashegalov, I'm not sure how we would manually add an LF to the end of a line when editing a property file in an editor. Here in your test, program code explicitly mimics the case, but in a real scenario, how would we manually update the property file with an additional LF or CR?

Contributor Author

@jerryshao I try not to spend time on issues unrelated to our production deployments. @steveloughran and this PR already pointed at the Properties#load method documenting the format.

Line terminator characters can be included using \r and \n escape sequences. Or you can encode any character using \uxxxx

In addition you can take a look at the file generated by this code:

#test whitespace
#Thu Aug 30 20:20:33 PDT 2018
spark.my.delimiter.nonDelimSpaceFromFile=\ blah\f
spark.my.delimiter.infixDelimFromFile=\rblah\n
spark.my.delimiter.trailingDelimKeyFromFile=blah\r
spark.my.delimiter.leadingDelimKeyFromFile=\nblah

Contributor

Sorry for the stupid question. I guess I was thinking of something different.

@gerashegalov
Contributor Author

@steveloughran Regarding the XML format: java.util.Properties has dedicated storeToXML/loadFromXML methods, which Spark does not use, so we don't need to check this.

@steveloughran
Contributor

code LGTM. Clearly it's a tangible problem, especially for a one-char option like "myapp.line.separator".

@gerashegalov
Contributor Author

rebased

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

adding @vanzin as well.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95782 has finished for PR 22213 at commit 9e4ac10.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95786 has finished for PR 22213 at commit 01d0904.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin left a comment

Looks ok to me. Just have style nits.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95820 has finished for PR 22213 at commit 603fdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95819 has finished for PR 22213 at commit d341dae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 8, 2018

Test build #95826 has finished for PR 22213 at commit 603fdc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Seems fine to me too.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95920 has finished for PR 22213 at commit 76635ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95934 has finished for PR 22213 at commit 76635ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 11, 2018

Merging to master / 2.4.

asfgit pushed a commit that referenced this pull request Sep 11, 2018
…f values

## What changes were proposed in this pull request?

Stop trimming values of properties loaded from a file

## How was this patch tested?

Added unit test demonstrating the issue hit in production.

Closes #22213 from gerashegalov/gera/SPARK-25221.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit bcb9a8c)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in bcb9a8c Sep 11, 2018
fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
@gerashegalov
Contributor Author

Thank you for reviews @vanzin @steveloughran @jerryshao @HyukjinKwon
