-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException #28750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @maropu |
| assert(checkLimit("1234567890")) | ||
| } | ||
|
|
||
| test("StringConcat doesn't overflow on many inputs") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you please add the prefix: "SPARK-31916: " ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk Thanks.. will do.
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After your changes, the error message https://github.com/apache/spark/pull/28750/files#diff-46beff422d36d30d696d137ae1cfcf2eR151 will not show the actual size.
| val available = maxLength - length | ||
| val stringToAppend = if (available >= sLen) s else s.substring(0, available) | ||
| strings.append(stringToAppend) | ||
| length += sLen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we move line 120 into {...} to make the scope sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| strings.append(stringToAppend) | ||
| length += sLen | ||
| } | ||
| length += sLen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
length = Math.min(length.toLong + sLen, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH).toIntThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk Yeah.. Sure. Just wondering if we absolutely need to change the length in case we are not doing an append. The error message you are referring to, is this captured in some test ? I ran catalyst and sql tests and didn't see any failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message you are referring to, is this captured in some test ?
Looks like not. Could you add a test for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk Sure.. Will add a test. Can you please point me to the error message please ? The link you have, when i click .. its not taking me to the relevant code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the child class, here
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala
Line 156 in d9b3069
| s"... ${length - maxLength} more characters" |
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala
Line 151 in d9b3069
| s"plan had length ${length} and the maximum is ${maxLength}. This behavior " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. If overflow is an issue, then make it a long? Capping it at this constant would then seem to potentially underreport the actual appended total.
Maybe that creates other problems, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen You are right that, we may under report when the length becomes more than MAX_ROUNDED_ARRAY_LENGTH. I did consider changing the message to cover the case to say something like "greater than or equal to N". But thought against it as we are adding code to cover may be a extremely corner case. Please let me know if you have any suggestion here.
About changing to long, even if we did it, we can in theory hit the long limit, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure you can overflow a long - if 2^31-1 strings (max length of the array tracking them) are added, of length 2^31-1, you still don't overflow.
It's probably not worth debating so this is fine but I think a simpler change was also fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @srowen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, could you leave some comments about the intent here? I think its a bit hard to tell it at a glance.
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Test build #123622 has finished for PR 28750 at commit
|
|
Test build #123626 has finished for PR 28750 at commit
|
|
|
||
| test("SPARK-31916: StringConcat doesn't overflow on many inputs") { | ||
| val concat = new StringConcat(maxLength = 100) | ||
| 0.to(Integer.MAX_VALUE).foreach { _ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you reduce # of loops in this test? I think its a bit expensive.
| Seq(Row(Short.MinValue.toLong * -1))) | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove the single blank.
|
|
||
|
|
||
| test("SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan") { | ||
| withSQLConf("spark.sql.maxPlanStringLength" -> "0") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plz use SQLConf.MAX_PLAN_STRING_LENGTH.key instead.
| assert(concat.toString === "Truncated plan of 60 characters") | ||
| } | ||
|
|
||
| withSQLConf("spark.sql.maxPlanStringLength" -> "60") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| } | ||
|
|
||
|
|
||
| test("SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since PlanStringConcat's is in the catalyst package, could you move this tests there?
|
Test build #123670 has finished for PR 28750 at commit
|
| import org.apache.spark.sql.catalyst.optimizer.{ConvertToLocalRelation, NestedColumnAliasingSuite} | ||
| import org.apache.spark.sql.catalyst.plans.logical.Project | ||
| import org.apache.spark.sql.catalyst.util.StringUtils | ||
| import org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change?
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also update the PR description as looks like it is still for original fix.
|
Test build #123671 has finished for PR 28750 at commit
|
|
Test build #123672 has finished for PR 28750 at commit
|
|
retest this please |
|
@dilipbiswal Have you checked my last comment? #28750 (comment) The PR itself looks okay. |
|
Test build #123678 has finished for PR 28750 at commit
|
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay although I think making it a long might be also good and simpler.
|
Test build #123728 has finished for PR 28750 at commit
|
Sorry i missed that. I have added the comment now.
We could make it a long. The only thing is we may still overflow but it will take perhaps long time to hit it. I can repro the StringIndexOutOfBoundsException after changing the length to long. I made a minor tweak to change the append method to fake the input's length to be Int.MaxValue and adjust the test to increase the loop count. |
|
|
||
| // Cap the length at ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH. Otherwise, we | ||
| // will overflow length causing StringIndexOutOfBoundsException in the substring call | ||
| // above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal Sorry and my comment looked ambiguous. I just wanted to leave a note about your comment above;
I think, the original intent of keeping this length outside the if block (which i missed and @MaxGekk pointed out) is to keep the true length of the input. So when we are producing a SQL plan, we want to tell users how much was that the size of the explain string and how much we truncated based on max length specification. So caller could call append method even after we have gone past the max.
I couldn't tell why the count-up is placed outside the if (!atLimit) block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Ah .. so you wanted to explain the above comment better in the PR and not in the code ? Right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment in the code side looks okay and how about this?
// Keeps the total length of appended strings. Note that we need to cap the length at
// `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`; otherwise, we will overflow
// length causing StringIndexOutOfBoundsException in the substring call above.
length = Math.min(length.toLong + sLen, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH).toInt
|
Test build #123730 has finished for PR 28750 at commit
|
|
retest this please |
|
Test build #123797 has finished for PR 28750 at commit
|
|
Test build #123817 has finished for PR 28750 at commit
|
|
retest this please |
|
Test build #123837 has finished for PR 28750 at commit
|
…eption ### What changes were proposed in this pull request? A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA. ### Why are the changes needed? This fixes StringIndexOutOfBoundsException on an overflow. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a test in StringsUtilSuite. Closes #28750 from dilipbiswal/SPARK-31916. Authored-by: Dilip Biswal <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit b87a342) Signed-off-by: Takeshi Yamamuro <[email protected]>
|
Thanks! Merged to master/branch-3.0. |
|
@maropu Sure. Thanks a lot for your comments on the pr. |
|
Thanks! |
|
Hi @maropu Please let me know. |
|
Ah, okay. We don't need the backport. Thanks for the check, anyway! |
…eption ### What changes were proposed in this pull request? A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA. ### Why are the changes needed? This fixes StringIndexOutOfBoundsException on an overflow. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a test in StringsUtilSuite. Closes apache#28750 from dilipbiswal/SPARK-31916. Authored-by: Dilip Biswal <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit b87a342) Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException
Thanks to Jeffrey Stokes for reporting the issue and explaining the underlying problem in detail in the JIRA.
Why are the changes needed?
This fixes StringIndexOutOfBoundsException on an overflow.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test in StringsUtilSuite.