[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values #22213
Conversation
Isn't this going to break existing apps dependent on trimmed values?
@gerashegalov, could you please elaborate on the use case here? I saw that you're treating \n as a property value; what is the specific usage scenario?
@HyukjinKwon your concern is valid, although it has a simple solution: fix up the file to your liking. Moreover, a user editing a file for --properties-file most likely expects the format prescribed by the JDK: https://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
@jerryshao the use case described in the JIRA is that our customers sometimes have unusual line delimiters that we need to pass to Hadoop's TextInputFormat. In this case it's actually a conventional Unix line separator; suppose the customer insists that only '\n' should be the line separator, not the default set ["\n", "\r\n", "\r"]. We had no issues configuring it via Spark until we switched from --conf to --properties-file to reduce the command line length.
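To make the disparity concrete, here is a minimal sketch of what happens to an escaped trailing delimiter (the class name and the short key in main are just illustrative): Properties#load keeps the LF produced by the \n escape, and a subsequent trim() throws it away.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class TrailingDelimDemo {
    // Parse a single key=value line the way Properties.load parses a
    // --properties-file, and return the raw (untrimmed) value.
    static String loadValue(String line, String key) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(line));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\\n" in the source is the two-character escape \n; load() turns
        // it into a real LF in the value, which String.trim() then destroys.
        String raw = loadValue("textinputformat.record.delimiter=\\n",
            "textinputformat.record.delimiter");
        System.out.println(raw.equals("\n"));      // LF survives load()
        System.out.println(raw.trim().isEmpty());  // trim() strips it away
    }
}
```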
We could deprecate trimming and add another config to disable it, but since the disparity with --conf looks like a real bug, maybe we can just fix it and document the change?
The changes here will break current assumptions. Some editors leave trailing whitespace in place, and in most cases the user doesn't actually want it; with this change, users would have to check for and remove all trailing whitespace to avoid unexpected behavior.
AFAIK Hive usually uses an ASCII code (or similar) to specify a separator rather than a literal "\n" or "\r\n", since literals can be removed or converted during parsing (which is quite brittle). So this is more like something you could fix on your side, not necessarily in Spark.
@jerryshao trim also removes leading spaces, which are perfectly legitimate.
I also need more info on what you mean by ASCII in this context.
trim removes leading spaces as well that are totally legit.
It is hard to say which behavior is legitimate; the approach you propose may be valid in your case but unexpected in other users' cases. I'm not talking about legitimacy. What I'm trying to say is that your proposal breaks the existing convention, and that's what I'm concerned about.
By ASCII I mean you can pass in an ASCII code number and translate it to the actual character in your code, which would mitigate the problem here.
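If I understand the suggestion, a minimal sketch of that workaround could look like this (the class and method names are hypothetical, not anything Spark or Hive ships): the config carries a numeric code, so no literal whitespace ever appears in the properties file.

```java
public class AsciiDelimWorkaround {
    // Translate a numeric config value (e.g. "10" for LF) into the actual
    // delimiter character; trimming a digit string is harmless.
    static String delimiterFromCode(String configValue) {
        return String.valueOf((char) Integer.parseInt(configValue.trim()));
    }

    public static void main(String[] args) {
        System.out.println(delimiterFromCode("10").equals("\n")); // LF
        System.out.println(delimiterFromCode("13").equals("\r")); // CR
    }
}
```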
I also mentioned that we could make this conditional logic: #22213 (comment)
By ASCII I mean you can pass in an ASCII code number and translate it to the actual character in your code, which would mitigate the problem here.
I think I'll just keep passing the delimiter via --conf to Hadoop and everything else in a single properties file, to avoid dealing with manual conversion of ints to chars.
@jerryshao here is my new take on the problem, which should be more acceptable. The premise is that since the JDK has already parsed out the natural line delimiters '\r' and '\n', any remaining ones are user-provided escaped line delimiters.
This actually makes sense. We always forget this, but the Java properties file format is more complex than any of us remember. By the time this trim takes place, all CR/LF characters that were natural line terminators in the source file have already been stripped during parsing.
Whoever did the Wikipedia article gave some good examples. What this means is: by the time the Spark trim() code is reached, the only CR and LF characters left in a property value are those produced by expanding \r and \n escape sequences in the property itself, and all of those were put there deliberately.
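A small sketch of that distinction (illustrative class name, not Spark code): a natural terminator is consumed by load() as end-of-line, while an escaped one survives into the value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class NaturalVsEscapedTerminators {
    // Load properties from an in-memory string, wrapping the checked
    // IOException for convenience.
    static Properties load(String text) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(text));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A real LF separates the two entries; load() consumes it as a
        // natural line terminator, so "one" carries no newline. The escaped
        // \n in the second value, by contrast, becomes a deliberate LF.
        Properties p = load("a=one\nb=two\\n");
        System.out.println(p.getProperty("a").equals("one"));
        System.out.println(p.getProperty("b").equals("two\n"));
    }
}
```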
PS: looking up the properties spec highlights that Java 9 uses UTF-8 for the properties encoding. I don't know of any implications here.
Thanks for the comment @steveloughran. I'll add more tests for now and see how the discussion goes from there. As for the transition to UTF-8, I think it means that, to be fully correct, Spark needs to switch to using strip starting with JDK 11, with or without this PR.
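A quick illustration of the trim/strip difference (assuming JDK 11+ for String.strip; the class name is made up): trim() only drops code points at or below U+0020, while strip() consults Character.isWhitespace, which also covers Unicode spaces.

```java
public class TrimVsStrip {
    public static void main(String[] args) {
        // EN QUAD (U+2000) is whitespace to Character.isWhitespace but lies
        // above U+0020, so trim() ignores it while strip() removes it.
        String s = "\u2000value\u2000";
        System.out.println(s.trim().equals(s));        // trim() leaves it intact
        System.out.println(s.strip().equals("value")); // strip() removes it
    }
}
```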
@gerashegalov, I'm not sure how we would manually add an LF to the end of a line when editing a property file. Here your test mimics the case explicitly in program code, but in a real scenario, how would we manually put an extra LF or CR into the property file?
@jerryshao I try not to spend time on issues unrelated to our production deployments. @steveloughran and this PR already pointed at the Properties#load method documenting the format.
Line terminator characters can be included using the \r and \n escape sequences, and you can encode any character using \uxxxx escapes.
In addition you can take a look at the file generated by this code:
#test whitespace
#Thu Aug 30 20:20:33 PDT 2018
spark.my.delimiter.nonDelimSpaceFromFile=\ blah\f
spark.my.delimiter.infixDelimFromFile=\rblah\n
spark.my.delimiter.trailingDelimKeyFromFile=blah\r
spark.my.delimiter.leadingDelimKeyFromFile=\nblah
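Loading the first entry of that listing back shows both halves of the issue (illustrative class name; only the key comes from the listing): the escapes expand on load, and trim() then eats both the leading space and the trailing form feed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapedWhitespaceDemo {
    // Parse a single key=value line with Properties.load and return the
    // raw (untrimmed) value.
    static String loadValue(String line, String key) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(line));
            return p.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The entry has an escaped leading space ("\ ") and a \f escape;
        // load() expands both into real characters in the value.
        String v = loadValue("spark.my.delimiter.nonDelimSpaceFromFile=\\ blah\\f",
            "spark.my.delimiter.nonDelimSpaceFromFile");
        System.out.println(v.equals(" blah\f"));     // escapes expand on load
        System.out.println(v.trim().equals("blah")); // trim() eats both ends
    }
}
```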
Sorry for the stupid question. I guess I was thinking of something different.
@steveloughran Regarding the XML format: java.util.Properties has dedicated storeToXML/loadFromXML methods, which Spark does not use, so we don't need to check this.
Code LGTM. Clearly it's a tangible problem, especially for a one-char option like "myapp.line.separator".
731e47b to 9e4ac10
rebased
ok to test
adding @vanzin as well.
Test build #95782 has finished for PR 22213 at commit
Test build #95786 has finished for PR 22213 at commit
vanzin left a comment
Looks OK to me; just some style nits.
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
Test build #95820 has finished for PR 22213 at commit
Test build #95819 has finished for PR 22213 at commit
retest this please
Test build #95826 has finished for PR 22213 at commit
Seems fine to me too.
Test build #95920 has finished for PR 22213 at commit
retest this please
Test build #95934 has finished for PR 22213 at commit
Merging to master / 2.4.
…f values

## What changes were proposed in this pull request?
Stop trimming values of properties loaded from a file

## How was this patch tested?
Added unit test demonstrating the issue hit in production.

Closes #22213 from gerashegalov/gera/SPARK-25221.
Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit bcb9a8c)
Signed-off-by: Marcelo Vanzin <[email protected]>
Thank you for the reviews @vanzin @steveloughran @jerryshao @HyukjinKwon!
What changes were proposed in this pull request?
Stop trimming values of properties loaded from a file
How was this patch tested?
Added unit test demonstrating the issue hit in production.