
Conversation

@wbo4958 (Contributor) commented Feb 18, 2024

What changes were proposed in this pull request?

This PR utilizes BigDecimal to calculate the GPU resources.

Why are the changes needed?

To prevent precision errors, the current method of calculating GPU resources converts doubles to Longs by multiplying by 1E16, and converts the Longs back to doubles where needed. This approach introduces redundant conversions in the code, especially in test code.
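For illustration (not the actual Spark source), a minimal sketch of the double-to-Long scheme described above; the names ResourceAmountUtils, toInternalResource and ONE_ENTIRE_RESOURCE follow identifiers mentioned later in this thread, while toFractionalResource and the exact signatures are assumptions:

    // Sketch of the existing scheme: represent 1.0 of a resource as 10^16 internal "slots".
    object ResourceAmountUtils {
      final val ONE_ENTIRE_RESOURCE: Long = 10000000000000000L  // 1E16

      // Double -> Long (truncates anything beyond 16 decimal places).
      def toInternalResource(amount: Double): Long =
        (amount * ONE_ENTIRE_RESOURCE).toLong

      // Long -> Double, for code and tests that need the fractional amount back.
      def toFractionalResource(amount: Long): Double =
        amount.toDouble / ONE_ENTIRE_RESOURCE
    }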

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the CIs

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the CORE label Feb 18, 2024
@wbo4958 changed the title from "Using BigDecimal to do the resource calculation" to "[SPARK-47088] Using BigDecimal to do the resource calculation" Feb 19, 2024
@wbo4958 marked this pull request as ready for review February 19, 2024 10:22
@wbo4958 (Contributor, Author) commented Feb 19, 2024

Hi @srowen, Could you help review this PR? Thx very much.

@srowen (Member) left a comment

To take advantage of BigDecimal, you need to avoid floats entirely. You want to go from the string the user supplies to BigDecimal; otherwise, you're supplying a float to BigDecimal and still have the precision issues. Unless there's a reason that can't work, that's the right solution and most of this change goes away. There should be no "ONE_ENTIRE_RESOURCE" etc.
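For illustration, a small editor's sketch of that distinction, using java.math.BigDecimal so the exact values are visible:

    import java.math.{BigDecimal => JBigDecimal}

    // From the string the user supplied: exactly the decimal they wrote.
    val fromString = new JBigDecimal("0.1")
    // From a double: the nearest binary double, which is not exactly 0.1.
    val fromDouble = new JBigDecimal(0.1)

    println(fromString)  // 0.1
    println(fromDouble)  // 0.1000000000000000055511151231257827021181583404541015625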

@wbo4958 (Contributor, Author) commented Feb 20, 2024

Hi @srowen, the public API def resource(resourceName: String, amount: Double): this.type = { in TaskResourceRequests requires a double, and TaskResourceRequest also requires a double. So if we would like to go from the string, that probably breaks the APIs.

@srowen (Member) commented Feb 20, 2024

Darn. Yeah I don't think there is a way to do better without changing the API and that's too much change for what isn't typically an issue in practice. I can't think of a way that keeps the double API but doesn't have a round off error in some case. Let's leave this.

@tgravescs (Contributor) commented

Sorry, I've been OOO and busy with other things. I didn't fully follow the discussion in #44690.

Can you two please summarize exactly why we are changing this? Is there actually some case the previous method did not handle? Yes, I understand there could be some precision loss when dealing with floats/doubles, but I think that is going to come from the API.

Is this purely for readability at this point?

I think this goes from storing Longs to storing BigDecimals, which is not going to be more memory efficient, and I thought BigDecimal was fairly slow at operations, but maybe Java has made that better. There obviously was some conversion going on before with toInternalResource, so there was some overhead there, if this makes things more efficient. Doing some very rough, poor-man's testing in a Scala shell, creating a BigDecimal takes around 30,000 ns and doing the conversions takes around 6,000 ns.
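For context, a rough sketch of that kind of ad-hoc shell timing (single-shot System.nanoTime with no warm-up, so the numbers are only indicative):

    // Very rough single-shot timing, as one might do in a Scala shell; not a real benchmark.
    def timeNs[A](body: => A): Long = {
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }

    val createBigDecimalNs = timeNs { BigDecimal(0.111111111111111112) }
    val longConversionNs   = timeNs { (0.111111111111111112 * 1e16).toLong }

    println(s"BigDecimal creation: $createBigDecimalNs ns, Long conversion: $longConversionNs ns")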

@srowen (Member) commented Feb 20, 2024

The issue is roughly: let's say you want to schedule a task that needs 1/9th of a GPU, and so you request resource amount 1.0/9.0. Let's say that float is just a tiny bit bigger than 1/9th. After 8 of those resource requests have gone through, the amount of resource left is less than 1/9th, so the 9th can't schedule.

I think trying to do integer math after floating-point still has the issue; the value is already 'wrong' when it comes in as a double in the API.

It may not come up a lot in practice, as these values are often set as a Spark config, as a string, and a user is going to write "0.111" or something: some prefix of what is notionally a long repeating decimal. So they will ask for slightly less than 1/9th (say) of a GPU and schedule 9 as desired, leaving Spark thinking there's a tiny bit of GPU available when there isn't, but certainly nothing like 1/9th.

It's still a bit ugly and yeah the API kind of needs to take a different input type to really address this but maybe not worth breakage.
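As a concrete illustration of that scenario (editor's sketch, using the same example amount 0.111111111111111112 that comes up later in this thread), exact arithmetic shows the double is a hair above 1/9, so nine such requests add up to more than one whole GPU:

    import java.math.{BigDecimal => JBigDecimal}

    // A double intended to mean "one ninth of a GPU", but slightly above 1/9
    // once represented as a binary double.
    val perTask: Double = 0.111111111111111112

    // Exact decimal value of that double, and of nine such requests combined.
    val exactPerTask = new JBigDecimal(perTask)
    val nineTasks    = exactPerTask.multiply(new JBigDecimal(9))

    println(nineTasks.compareTo(JBigDecimal.ONE) > 0)  // true: nine requests exceed 1.0 GPU,
                                                       // so the ninth task cannot be scheduled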

@tgravescs (Contributor) commented Feb 20, 2024

so you request resource amount 1.0/9.0

How does a user specify "1.0/9.0" in the configs?

I tend to agree with your second statement that I don't see this as a problem in practice, so my question is why we are changing it to be slower and use more memory if it's not solving a problem.

@srowen (Member) commented Feb 20, 2024

It would have to be programmatically. Imagine, for example, that I generate a config with value 1.0/9.0 that works out to the string "0.1111111111111111" and when parsed as double, that's slightly bigger than 1/9. I think there is a real case here, just possibly rare; that's not unimaginable though.

(Performance is not the issue here, these calculations are trivial even in BigDecimal)

@tgravescs (Contributor) commented

Sorry, I'm not following what this PR changes from the previous implementation. The method toInternalResource handles 1.0/9.0 as well, correct? That is perhaps what I'm missing: what case does that not work for? Maybe 1/9.0 works out but you are saying others won't?

Say 1/9.0 comes out a bit bigger, like 0.111111111111111112:

scala> toInternalResource(0.111111111111111112)
res17: Long = 1111111111111111

In this PR, the key line seems to be (the round-down):

var taskAmount = BigDecimal(taskReqs.amount).setScale(ResourceAmountUtils.SCALE, RoundingMode.DOWN)

but as we have discussed, the API takes the double, so if someone is programmatically passing something in, it's going to be computed on their side, not by Spark.
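For reference, an editor's sketch comparing the PR's round-down with the previous Long conversion on the same example value (SCALE = 16 is assumed here to mirror the 1E16 factor; this is not the PR's code verbatim):

    import scala.math.BigDecimal.RoundingMode

    val SCALE  = 16                              // assumed value of ResourceAmountUtils.SCALE
    val amount = 0.111111111111111112

    // Rounding the incoming double down to 16 decimal places, as in the PR...
    val asBigDecimal = BigDecimal(amount).setScale(SCALE, RoundingMode.DOWN)
    // ...versus the previous truncating conversion to an internal Long.
    val asLong = (amount * 1e16).toLong

    println(asBigDecimal)  // 0.1111111111111111
    println(asLong)        // 1111111111111111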

@srowen (Member) commented Feb 20, 2024

I think the discussion on this got lost after a force-push and referred to an earlier implementation, but my assertion is: if you start from a double, then no matter how you represent it with integers or BigDecimal downstream, there will be some problem of this form, yes. In some cases your double representation is too small and it works; in some cases it isn't. You should be able to find counterexamples, indeed. The representation would have to be exact from start to finish to really solve this, and that seems to need an API change.

@tgravescs (Contributor) commented

OK, so we should just close this then; if it becomes an issue we could revisit the API.

@wbo4958 (Contributor, Author) commented Feb 26, 2024

Hi @srowen, may I ask what your opinion is on @tgravescs's comment?

@srowen (Member) commented Feb 26, 2024

I think I agree with it. I don't think there's a 'real' fix without changing the API, and I'm not sure it's worth it for the corner case here.

@wbo4958 (Contributor, Author) commented Feb 26, 2024

Thanks @srowen, let me close this PR. And how about #44690 for the Spark 3.5 branch? We think it's a real bug, so we hope it can be merged into Spark 3.5.

@wbo4958 closed this Feb 26, 2024