-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Prevent stack overflow in rounding #80450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The code that rounds dates - `Rounding.round` could possbly stack overflow on sufficiently strange time zones, bringing the whole node down. These time zones shouldn't be a thing - the largest stack depth I was able to get with my tzdb was 500 and that's using 80,000 day date ranges in the time zone with the most DST transitions - `Asia/Tehran`. Even so, if you are able to cause thousands of transitions it'd be kinder for us to fail with a "your time zone is bad" message rather than a huge stackoverflow that knocks over the whole node. So this changes the date resolution code to be a loop instead of a recursive function. I looked into making it not have to repeat at all, but I couldn't put all the of the rounding things into my head to have much confidence in it.
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
nik9000
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly feel like I should be able to rewrite this method not to need to loop at all. The non-java.time version of this code doesn't loop. But I knew what I was doing back then. Now I've forgotten. So the best I can manage is un-recursive-ing it.
| ZoneId tz = ZoneId.of("Asia/Tehran"); | ||
| Rounding rounding = Rounding.builder(TimeValue.timeValueDays(80000)).timeZone(tz).build(); | ||
| assertThat(rounding.round(time("2078-11-10T02:51:22.662Z")), isDate(time("1970-01-01T00:00:00+03:30"), tz)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't write a test locally that'd fail without going to pretty intense lengths. So I just used the worse case I could find.
| } | ||
| throw new IllegalArgumentException( | ||
| this | ||
| + " failed to round " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please log the timezone and the interval in the exception message if this happens? This way if we have missed a case that we are not aware of we'll at least be able to debug it. Perhaps a logger warning too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make sure it gets in there. I think folks intended this.toString to have the zone but I expect that got lost and the lack of unit test didn't notice.
| * Asia/Tehran with an 80,000 day range. You just can't declare | ||
| * ranges much larger than that in ES right now. | ||
| */ | ||
| attempt: while (attempts < 5000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can make this constant a field that can be modified with the builder which defaults to 5000? That way we can write a unit test that ensures we correctly throw the exception in case the loop limit is reached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| } | ||
| final OffsetDateTime offsetTime = roundedLocalDateTime.atOffset(currentOffsets.get(0)); | ||
| final Instant offsetInstant = offsetTime.toInstant(); | ||
| throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing behaviour and it's unclear to me how we end up in this state. I'd be concerned that a customer is successfully getting a date now, where their application will effectively break after this change? Is there a way to reproduce this with a unit test? If not maybe we can take the approach to log an error/warning and retain the existing behaviour of returning a value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes. I made this change earlier, took a day off, and then forgot about it. I'll revert it. I don't particularly like the old behavior. And, no, didn't have a way of getting this behavior before with a unit test. I'll just set it back how it was.
|
I was so concerned with reproducing the stack overflow I didn't think about artificially setting the limit lower. I've pushed a change that does that. |
grcevski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The toString changes are awesome.
|
A question about the original issue, does it make sense to backport this fix to 7.16 as well? |
The code that rounds dates - `Rounding.round` could possbly stack overflow on sufficiently strange time zones, bringing the whole node down. These time zones shouldn't be a thing - the largest stack depth I was able to get with my tzdb was 500 and that's using 80,000 day date ranges in the time zone with the most DST transitions - `Asia/Tehran`. Even so, if you are able to cause thousands of transitions it'd be kinder for us to fail with a "your time zone is bad" message rather than a huge stackoverflow that knocks over the whole node. So this changes the date resolution code to be a loop instead of a recursive function. I looked into making it not have to repeat at all, but I couldn't put all the of the rounding things into my head to have much confidence in it.
The code that rounds dates - `Rounding.round` could possbly stack overflow on sufficiently strange time zones, bringing the whole node down. These time zones shouldn't be a thing - the largest stack depth I was able to get with my tzdb was 500 and that's using 80,000 day date ranges in the time zone with the most DST transitions - `Asia/Tehran`. Even so, if you are able to cause thousands of transitions it'd be kinder for us to fail with a "your time zone is bad" message rather than a huge stackoverflow that knocks over the whole node. So this changes the date resolution code to be a loop instead of a recursive function. I looked into making it not have to repeat at all, but I couldn't put all the of the rounding things into my head to have much confidence in it.
I hate to sneak it in after the branch cut. But I think it's probably the right thing to do it. I wouldn't stop the presses or anything because it's super rare. But it seems pretty low risk. I've hit it with a zillion tests and it's not a new algorithm or anything. |
The code that rounds dates - `Rounding.round` could possbly stack overflow on sufficiently strange time zones, bringing the whole node down. These time zones shouldn't be a thing - the largest stack depth I was able to get with my tzdb was 500 and that's using 80,000 day date ranges in the time zone with the most DST transitions - `Asia/Tehran`. Even so, if you are able to cause thousands of transitions it'd be kinder for us to fail with a "your time zone is bad" message rather than a huge stackoverflow that knocks over the whole node. So this changes the date resolution code to be a loop instead of a recursive function. I looked into making it not have to repeat at all, but I couldn't put all the of the rounding things into my head to have much confidence in it.
The code that rounds dates - `Rounding.round` could possbly stack overflow on sufficiently strange time zones, bringing the whole node down. These time zones shouldn't be a thing - the largest stack depth I was able to get with my tzdb was 500 and that's using 80,000 day date ranges in the time zone with the most DST transitions - `Asia/Tehran`. Even so, if you are able to cause thousands of transitions it'd be kinder for us to fail with a "your time zone is bad" message rather than a huge stackoverflow that knocks over the whole node. So this changes the date resolution code to be a loop instead of a recursive function. I looked into making it not have to repeat at all, but I couldn't put all the of the rounding things into my head to have much confidence in it.
The code that rounds dates -
Rounding.roundcould possbly stackoverflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions -
Asia/Tehran.Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.
So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.