Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Nov 5, 2021

The code that rounds dates - Rounding.round could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - Asia/Tehran.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.

The code that rounds dates - `Rounding.round` could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - `Asia/Tehran`.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.
@nik9000 nik9000 added >bug :Core/Infra/Core Core issues without another label v8.1.0 v8.0.0-beta1 labels Nov 5, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Nov 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly feel like I should be able to rewrite this method not to need to loop at all. The non-java.time version of this code doesn't loop. But I knew what I was doing back then. Now I've forgotten. So the best I can manage is un-recursive-ing it.

ZoneId tz = ZoneId.of("Asia/Tehran");
Rounding rounding = Rounding.builder(TimeValue.timeValueDays(80000)).timeZone(tz).build();
assertThat(rounding.round(time("2078-11-10T02:51:22.662Z")), isDate(time("1970-01-01T00:00:00+03:30"), tz));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't write a test locally that'd fail without going to pretty intense lengths. So I just used the worse case I could find.

}
throw new IllegalArgumentException(
this
+ " failed to round "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please log the timezone and the interval in the exception message if this happens? This way if we have missed a case that we are not aware of we'll at least be able to debug it. Perhaps a logger warning too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make sure it gets in there. I think folks intended this.toString to have the zone but I expect that got lost and the lack of unit test didn't notice.

* Asia/Tehran with an 80,000 day range. You just can't declare
* ranges much larger than that in ES right now.
*/
attempt: while (attempts < 5000) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can make this constant a field that can be modified with the builder which defaults to 5000? That way we can write a unit test that ensures we correctly throw the exception in case the loop limit is reached.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}
final OffsetDateTime offsetTime = roundedLocalDateTime.atOffset(currentOffsets.get(0));
final Instant offsetInstant = offsetTime.toInstant();
throw new IllegalArgumentException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing behaviour and it's unclear to me how we end up in this state. I'd be concerned that a customer is successfully getting a date now, where their application will effectively break after this change? Is there a way to reproduce this with a unit test? If not maybe we can take the approach to log an error/warning and retain the existing behaviour of returning a value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. I made this change earlier, took a day off, and then forgot about it. I'll revert it. I don't particularly like the old behavior. And, no, didn't have a way of getting this behavior before with a unit test. I'll just set it back how it was.

@nik9000
Copy link
Member Author

nik9000 commented Nov 7, 2021

I was so concerned with reproducing the stack overflow I didn't think about artificially setting the limit lower. I've pushed a change that does that.

Copy link
Contributor

@grcevski grcevski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The toString changes are awesome.

@grcevski
Copy link
Contributor

grcevski commented Nov 8, 2021

A question about the original issue, does it make sense to backport this fix to 7.16 as well?

@nik9000 nik9000 merged commit 425359b into elastic:master Nov 10, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Nov 10, 2021
The code that rounds dates - `Rounding.round` could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - `Asia/Tehran`.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Nov 10, 2021
The code that rounds dates - `Rounding.round` could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - `Asia/Tehran`.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.16
8.0

@nik9000
Copy link
Member Author

nik9000 commented Nov 10, 2021

A question about the original issue, does it make sense to backport this fix to 7.16 as well?

I hate to sneak it in after the branch cut. But I think it's probably the right thing to do it. I wouldn't stop the presses or anything because it's super rare. But it seems pretty low risk. I've hit it with a zillion tests and it's not a new algorithm or anything.

elasticsearchmachine pushed a commit that referenced this pull request Nov 10, 2021
The code that rounds dates - `Rounding.round` could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - `Asia/Tehran`.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.
elasticsearchmachine pushed a commit that referenced this pull request Nov 10, 2021
The code that rounds dates - `Rounding.round` could possbly stack
overflow on sufficiently strange time zones, bringing the whole node
down. These time zones shouldn't be a thing - the largest stack depth I
was able to get with my tzdb was 500 and that's using 80,000 day date
ranges in the time zone with the most DST transitions - `Asia/Tehran`.
Even so, if you are able to cause thousands of transitions it'd be
kinder for us to fail with a "your time zone is bad" message rather than
a huge stackoverflow that knocks over the whole node.

So this changes the date resolution code to be a loop instead of a
recursive function. I looked into making it not have to repeat at all,
but I couldn't put all the of the rounding things into my head to have
much confidence in it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v7.16.0 v8.0.0-beta1 v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants