Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

@dilipbiswal dilipbiswal commented Jun 8, 2020

What changes were proposed in this pull request?

A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException

Thanks to Jeffrey Stokes for reporting the issue and explaining the underlying problem in detail in the JIRA.

Why are the changes needed?

This fixes StringIndexOutOfBoundsException on an overflow.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test in StringsUtilSuite.

@dilipbiswal
Copy link
Contributor Author

cc @maropu

assert(checkLimit("1234567890"))
}

test("StringConcat doesn't overflow on many inputs") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you please add the prefix: "SPARK-31916: " ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk Thanks.. will do.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val available = maxLength - length
val stringToAppend = if (available >= sLen) s else s.substring(0, available)
strings.append(stringToAppend)
length += sLen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we move line 120 into {...} to make the scope sure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk If i incorporate @MaxGekk's suggestion, then i will need to keep 120 where it is :-) as we need it outside the if block.

strings.append(stringToAppend)
length += sLen
}
length += sLen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

        length = Math.min(length.toLong + sLen, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH).toInt

Copy link
Contributor Author

@dilipbiswal dilipbiswal Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk Yeah.. Sure. Just wondering if we absolutely need to change the length in case we are not doing an append. The error message you are referring to, is this captured in some test ? I ran catalyst and sql tests and didn't see any failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message you are referring to, is this captured in some test ?

Looks like not. Could you add a test for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk Sure.. Will add a test. Can you please point me to the error message please ? The link you have, when i click .. its not taking me to the relevant code ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the child class, here


s"plan had length ${length} and the maximum is ${maxLength}. This behavior " +

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. If overflow is an issue, then make it a long? Capping it at this constant would then seem to potentially underreport the actual appended total.
Maybe that creates other problems, not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen You are right that, we may under report when the length becomes more than MAX_ROUNDED_ARRAY_LENGTH. I did consider changing the message to cover the case to say something like "greater than or equal to N". But thought against it as we are adding code to cover may be a extremely corner case. Please let me know if you have any suggestion here.

About changing to long, even if we did it, we can in theory hit the long limit, right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you can overflow a long - if 2^31-1 strings (max length of the array tracking them) are added, of length 2^31-1, you still don't overflow.
It's probably not worth debating so this is fine but I think a simpler change was also fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @srowen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, could you leave some comments about the intent here? I think its a bit hard to tell it at a glance.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Jun 8, 2020

Test build #123622 has finished for PR 28750 at commit 8d723d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 8, 2020

Test build #123626 has finished for PR 28750 at commit 0e71179.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("SPARK-31916: StringConcat doesn't overflow on many inputs") {
val concat = new StringConcat(maxLength = 100)
0.to(Integer.MAX_VALUE).foreach { _ =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you reduce # of loops in this test? I think its a bit expensive.

Seq(Row(Short.MinValue.toLong * -1)))
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the single blank.



test("SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan") {
withSQLConf("spark.sql.maxPlanStringLength" -> "0") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz use SQLConf.MAX_PLAN_STRING_LENGTH.key instead.

assert(concat.toString === "Truncated plan of 60 characters")
}

withSQLConf("spark.sql.maxPlanStringLength" -> "60") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}


test("SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since PlanStringConcat's is in the catalyst package, could you move this tests there?

@maropu maropu changed the title [SPARK-31916][SQL] StringConcat can overflow , leads to StringIndexOutOfBoundsException [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException Jun 8, 2020
@SparkQA
Copy link

SparkQA commented Jun 9, 2020

Test build #123670 has finished for PR 28750 at commit 3bedc7f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StringUtilsSuite extends SparkFunSuite with SQLHelper

import org.apache.spark.sql.catalyst.optimizer.{ConvertToLocalRelation, NestedColumnAliasingSuite}
import org.apache.spark.sql.catalyst.plans.logical.Project
import org.apache.spark.sql.catalyst.util.StringUtils
import org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the PR description as looks like it is still for original fix.

@SparkQA
Copy link

SparkQA commented Jun 9, 2020

Test build #123671 has finished for PR 28750 at commit df58e5c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 9, 2020

Test build #123672 has finished for PR 28750 at commit 78cf8b5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jun 9, 2020

retest this please

@maropu
Copy link
Member

maropu commented Jun 9, 2020

@dilipbiswal Have you checked my last comment? #28750 (comment) The PR itself looks okay.

@SparkQA
Copy link

SparkQA commented Jun 9, 2020

Test build #123678 has finished for PR 28750 at commit 78cf8b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay although I think making it a long might be also good and simpler.

@SparkQA
Copy link

SparkQA commented Jun 10, 2020

Test build #123728 has finished for PR 28750 at commit 1050df3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Jun 10, 2020

@maropu

Have you checked my last comment? #28750 (comment) The PR itself looks okay.

Sorry i missed that. I have added the comment now.

@viirya

Looks okay although I think making it a long might be also good and simpler.

We could make it a long. The only thing is we may still overflow but it will take perhaps long time to hit it. I can repro the StringIndexOutOfBoundsException after changing the length to long. I made a minor tweak to change the append method to fake the input's length to be Int.MaxValue and adjust the test to increase the loop count.


// Cap the length at ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH. Otherwise, we
// will overflow length causing StringIndexOutOfBoundsException in the substring call
// above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dilipbiswal Sorry and my comment looked ambiguous. I just wanted to leave a note about your comment above;

I think, the original intent of keeping this length outside the if block (which i missed and @MaxGekk pointed out) is to keep the true length of the input. So when we are producing a SQL plan, we want to tell users how much was that the size of the explain string and how much we truncated based on max length specification. So caller could call append method even after we have gone past the max.

I couldn't tell why the count-up is placed outside the if (!atLimit) block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Ah .. so you wanted to explain the above comment better in the PR and not in the code ? Right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment in the code side looks okay and how about this?

        // Keeps the total length of appended strings. Note that we need to cap the length at
        // `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`; otherwise, we will overflow
        // length causing StringIndexOutOfBoundsException in the substring call above.
        length = Math.min(length.toLong + sLen, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH).toInt

@SparkQA
Copy link

SparkQA commented Jun 10, 2020

Test build #123730 has finished for PR 28750 at commit c7e780b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jun 10, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jun 11, 2020

Test build #123797 has finished for PR 28750 at commit c7e780b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2020

Test build #123817 has finished for PR 28750 at commit 804edd0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 11, 2020

Test build #123837 has finished for PR 28750 at commit 804edd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in b87a342 Jun 12, 2020
maropu pushed a commit that referenced this pull request Jun 12, 2020
…eption

### What changes were proposed in this pull request?
A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException

Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA.

### Why are the changes needed?
This fixes StringIndexOutOfBoundsException on an overflow.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a test in StringsUtilSuite.

Closes #28750 from dilipbiswal/SPARK-31916.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit b87a342)
Signed-off-by: Takeshi Yamamuro <[email protected]>
@maropu
Copy link
Member

maropu commented Jun 12, 2020

Thanks! Merged to master/branch-3.0.
@dilipbiswal Could you make a backport PR to branch-2.4?

@dilipbiswal
Copy link
Contributor Author

@maropu Sure. Thanks a lot for your comments on the pr.

@maropu
Copy link
Member

maropu commented Jun 12, 2020

Thanks!

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Jun 12, 2020

Hi @maropu
Just checked the 2.4 codebase. We don't have this Class in 2.4. Looks like we directed appended to StringBuilder. This class was added as part of [SPARK-26504] (https://issues.apache.org/jira/browse/SPARK-26504) and fix version is 3.0.0

Please let me know.

@maropu
Copy link
Member

maropu commented Jun 12, 2020

Ah, okay. We don't need the backport. Thanks for the check, anyway!

holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020
…eption

### What changes were proposed in this pull request?
A minor fix to fix the append method of StringConcat to cap the length at MAX_ROUNDED_ARRAY_LENGTH to make sure it does not overflow and cause StringIndexOutOfBoundsException

Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA.

### Why are the changes needed?
This fixes StringIndexOutOfBoundsException on an overflow.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a test in StringsUtilSuite.

Closes apache#28750 from dilipbiswal/SPARK-31916.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit b87a342)
Signed-off-by: Takeshi Yamamuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants