
Conversation

@wbo4958 (Contributor) commented Feb 18, 2024

What changes were proposed in this pull request?

This PR utilizes BigDecimal to calculate the GPU resources.

Why are the changes needed?

To prevent precision errors, the current method of calculating GPU resources converts doubles to Longs by multiplying by 1E16, and converts the Longs back to doubles where needed. This approach introduces redundant conversions in the code, especially in test code.
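For illustration (not the actual Spark source), a minimal sketch of the double-to-Long scheme described above; the names ResourceAmountUtils, toInternalResource and ONE_ENTIRE_RESOURCE follow identifiers mentioned later in this thread, while toFractionalResource and the exact signatures are assumptions:

    // Sketch of the existing scheme: represent 1.0 of a resource as 10^16 internal "slots".
    object ResourceAmountUtils {
      final val ONE_ENTIRE_RESOURCE: Long = 10000000000000000L  // 1E16

      // Double -> Long (truncates anything beyond 16 decimal places).
      def toInternalResource(amount: Double): Long =
        (amount * ONE_ENTIRE_RESOURCE).toLong

      // Long -> Double, for code and tests that need the fractional amount back.
      def toFractionalResource(amount: Long): Double =
        amount.toDouble / ONE_ENTIRE_RESOURCE
    }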

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the CIs

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the CORE label Feb 18, 2024
@wbo4958 changed the title from "Using BigDecimal to do the resource calculation" to "[SPARK-47088] Using BigDecimal to do the resource calculation" Feb 19, 2024
@wbo4958 marked this pull request as ready for review February 19, 2024 10:22
@wbo4958 (Contributor, Author) commented Feb 19, 2024

Hi @srowen, Could you help review this PR? Thx very much.

@srowen (Member) left a comment

To take advantage of BigDecimal, you need to avoid floats entirely. You want to go from the string the user supplies to BigDecimal; otherwise, you're supplying a float to BigDecimal and still have the precision issues. Unless there's a reason that can't work, that's the right solution and most of this change goes away. There should be no "ONE_ENTIRE_RESOURCE" etc.
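For illustration, a small editor's sketch of that distinction, using java.math.BigDecimal so the exact values are visible:

    import java.math.{BigDecimal => JBigDecimal}

    // From the string the user supplied: exactly the decimal they wrote.
    val fromString = new JBigDecimal("0.1")
    // From a double: the nearest binary double, which is not exactly 0.1.
    val fromDouble = new JBigDecimal(0.1)

    println(fromString)  // 0.1
    println(fromDouble)  // 0.1000000000000000055511151231257827021181583404541015625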

@wbo4958 (Contributor, Author) commented Feb 20, 2024

Hi @srowen, the public API def resource(resourceName: String, amount: Double): this.type = { in TaskResourceRequests requires a double, and TaskResourceRequest also requires a double. So if we would like to go from the string, that probably breaks the APIs.

@srowen (Member) commented Feb 20, 2024

Darn. Yeah I don't think there is a way to do better without changing the API and that's too much change for what isn't typically an issue in practice. I can't think of a way that keeps the double API but doesn't have a round off error in some case. Let's leave this.

@tgravescs (Contributor) commented

Sorry, I've been OOO and busy with other things. I didn't fully follow the discussion in #44690.

Can you two please summarize exactly why we are changing this? Is there actually some case the previous method did not handle? Yes, I understand there could be some precision loss when dealing with floats/doubles, but I think that is going to come from the API.

Is this purely for readability at this point?

I think this goes from storing Longs to storing BigDecimals, which is not going to be more memory efficient, and I thought BigDecimal was fairly slow at operations, but maybe Java has made that better. There obviously was some conversion going on before with toInternalResource, so there was some overhead there, if this makes things more efficient. Doing some very rough, poor-man's testing in a Scala shell, creating a BigDecimal takes around 30,000 ns and doing the conversions takes around 6,000 ns.
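For context, a rough sketch of that kind of ad-hoc shell timing (single-shot System.nanoTime with no warm-up, so the numbers are only indicative):

    // Very rough single-shot timing, as one might do in a Scala shell; not a real benchmark.
    def timeNs[A](body: => A): Long = {
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }

    val createBigDecimalNs = timeNs { BigDecimal(0.111111111111111112) }
    val longConversionNs   = timeNs { (0.111111111111111112 * 1e16).toLong }

    println(s"BigDecimal creation: $createBigDecimalNs ns, Long conversion: $longConversionNs ns")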

@srowen (Member) commented Feb 20, 2024

The issue is roughly: let's say you want to schedule a task that needs 1/9th of a GPU, and so you request resource amount 1.0/9.0. Let's say that float is just a tiny bit bigger than 1/9th. After 8 of those resource requests have gone through, the amount of resource left is less than 1/9th, so the 9th can't schedule.

I think trying to do integer math after floating-point still has the issue; the value is already 'wrong' when it comes in as a double in the API.

It may not come up a lot in practice, as these values are often set as a Spark config, as a string, and a user is going to write "0.111" or something: some prefix of what is notionally a long repeating decimal. So they will ask for slightly less than 1/9th (say) of a GPU and schedule 9 as desired, leaving Spark thinking there's a tiny bit of GPU available when there isn't, but certainly nothing like 1/9th.

It's still a bit ugly and yeah the API kind of needs to take a different input type to really address this but maybe not worth breakage.
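As a concrete illustration of that scenario (editor's sketch, using the same example amount 0.111111111111111112 that comes up later in this thread), exact arithmetic shows the double is a hair above 1/9, so nine such requests add up to more than one whole GPU:

    import java.math.{BigDecimal => JBigDecimal}

    // A double intended to mean "one ninth of a GPU", but slightly above 1/9
    // once represented as a binary double.
    val perTask: Double = 0.111111111111111112

    // Exact decimal value of that double, and of nine such requests combined.
    val exactPerTask = new JBigDecimal(perTask)
    val nineTasks    = exactPerTask.multiply(new JBigDecimal(9))

    println(nineTasks.compareTo(JBigDecimal.ONE) > 0)  // true: nine requests exceed 1.0 GPU,
                                                       // so the ninth task cannot be scheduled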

@tgravescs (Contributor) commented Feb 20, 2024

so you request resource amount 1.0/9.0

How does a user specify "1.0/9.0" in the configs?

I tend to agree with your second statement that I don't see this as a problem in practice, so my question is why we are changing it to be slower and use more memory if it's not solving a problem.

@srowen (Member) commented Feb 20, 2024

It would have to be programmatically. Imagine, for example, that I generate a config with value 1.0/9.0 that works out to the string "0.1111111111111111" and when parsed as double, that's slightly bigger than 1/9. I think there is a real case here, just possibly rare; that's not unimaginable though.

(Performance is not the issue here, these calculations are trivial even in BigDecimal)

@tgravescs (Contributor) commented

Sorry, I'm not following what this PR changes from the previous implementation. The method toInternalResource handles 1.0/9.0 as well, correct? That is perhaps what I'm missing: what case does that not work for? Maybe 1/9.0 works out but you are saying others won't?

Say 1/9.0 comes out a bit bigger, like 0.111111111111111112:

scala> toInternalResource(0.111111111111111112)
res17: Long = 1111111111111111

In this PR, the key line seems to be (the round-down):

var taskAmount = BigDecimal(taskReqs.amount).setScale(ResourceAmountUtils.SCALE, RoundingMode.DOWN)

but as we have discussed, the API takes the double, so if someone is programmatically passing something in, it's going to be computed on their side, not by Spark.
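For reference, an editor's sketch comparing the PR's round-down with the previous Long conversion on the same example value (SCALE = 16 is assumed here to mirror the 1E16 factor; this is not the PR's code verbatim):

    import scala.math.BigDecimal.RoundingMode

    val SCALE  = 16                              // assumed value of ResourceAmountUtils.SCALE
    val amount = 0.111111111111111112

    // Rounding the incoming double down to 16 decimal places, as in the PR...
    val asBigDecimal = BigDecimal(amount).setScale(SCALE, RoundingMode.DOWN)
    // ...versus the previous truncating conversion to an internal Long.
    val asLong = (amount * 1e16).toLong

    println(asBigDecimal)  // 0.1111111111111111
    println(asLong)        // 1111111111111111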

@srowen (Member) commented Feb 20, 2024

I think the discussion on this got lost after a force-push and referred to an earlier implementation, but my assertion is: if you start from a double, then no matter how you represent it with integers or BigDecimal downstream, there will be some problem of this form, yes. In some cases your double representation is too small and it works; in some cases it isn't. You should be able to find counterexamples, indeed. The representation would have to be exact from start to finish to really solve this, and that seems to need an API change.

@tgravescs (Contributor) commented

OK, so we should just close this then; if it becomes an issue we could revisit the API.

@wbo4958 (Contributor, Author) commented Feb 26, 2024

Hi @srowen, may I ask what your opinion is on @tgravescs's comment?

@srowen (Member) commented Feb 26, 2024

I think I agree with it. I don't think there's a 'real' fix without changing the API, and I'm not sure it's worth it for the corner case here.

@wbo4958 (Contributor, Author) commented Feb 26, 2024

Thanks @srowen, let me close this PR. And how about #44690 for the Spark 3.5 branch? We think it's a real bug, so we hope it can be merged into Spark 3.5.

@wbo4958 closed this Feb 26, 2024