-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable per-resource timeout customization via an annotation #232
Conversation
@kwestein Once this PR ships we will be able to fix the timeout problems you are having. |
@@ -22,6 +22,8 @@ class KubernetesResource | |||
If you have reason to believe it will succeed, retry the deploy to continue to monitor the rollout. | |||
MSG | |||
|
|||
TIMEOUT_OVERRIDE_ANNOTATION = "kubernetes-deploy.shopify.io/timeout-override-seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to support more ways to specify the duration like 10s
or 5m
? Or perhaps just make the annotation kubernetes-deploy.shopify.io/timeout-override
and state that for the time being the only acceptable value is seconds?
I am bringing this up because a timeout of 15m
(which may not be unreasonable for Redis) would be 900s
which isn't as easy for a human to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I'll update it to accept "s" or "m", and to assume that unqualified values are seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if you're going to have m
and s
, you might as well have h
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend ISO8601
format (to be used with ActiveSupport::Duration::ISO8601Parse
) for time durations. It's what I used in my open PR on resource discovery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im on board with this idea for redis 👍
in fact the Redis controller should probably use that strategy when we switch to controller-reported statuses for CRs.
Is this something that is happening soon? I'm okay with shipping this (as I understand it it will be useful for others as well) and adding the annotations to quickly fix the few apps we have seen have problems with their deployments timing out and then refactor to the ProgressDeadlineSeconds strategy.
On our side, we are not in a rush to ship this today as everyone is on holiday and we have a workaround in place if those apps do need to deploy over the break.
@stefanmb is working on it in #188, but it's a big change to both the gem and the controllers, so I wouldn't expect it super soon.
Thanks for letting me know 👍 |
It should be possible to write a test like
Unless that's too Integrationy for the unit tests. |
6265fba
to
0e4eeb0
Compare
Updated to use ISO8601 durations* and add more tests. * These are neat and powerful and standard, and it's great to get the parser for free, but I felt like the intuitive "1s/m/h" should also work without the "PT" prefix, and that being case-sensitive would be annoying. So I wrapped the parser in a custom one with those features. LMK what you think. |
class ParsingError < ArgumentError; end | ||
|
||
def initialize(value) | ||
@iso8601_str = value.to_s.strip.upcase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe if you add:
@iso8601_str = "PT" + @iso8601_str unless @iso8601_str.start_with?("P")
You get all of the features you want from just normal ActiveSupport::Duration.parse(@iso8601_str)
without needing to reach into the parser or write calculate_total_seconds
.
The only issue is that the error message on a failed parse becomes Invalid ISO 8601 duration: "PTFOO"
, which is not really what the user provided.
You could maybe solve this with a sketchy Find&Replace on the returned message... maybe
Or you could do a more complicated thing where you do:
begin
ActiveSupport::Duration.parse(@iso8601_str)
rescue ActiveSupport::Duration::ISO8601Parser::ParsingError => e
begin
ActiveSupport::Duration.parse("PT" + @iso8601_str)
rescue ActiveSupport::Duration::ISO8601Parser::ParsingError
raise ParsingError, e.message
end
end
Where you do the blind parse, and if it fails you always try jamming PT
on the front.
If that succeeds, we return that one, and if it also fails, we return the first failure and discard the second (We raise the e.message
, which is the first exception, in the second rescue block and don't even capture the second exception)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the retry-and-append solution! The only thing that bugs me about it is that we'll be retying in what I'm expecting to be the average case. I can't imagine that's meaningfully expensive, but for the purposes of reflecting intent, I'd slightly prefer to have the first try use the PT +
version.
def test_timeout_override_upper_bound_validation | ||
customized_resource = DummyResource.new(definition_extras: build_timeout_metadata("24H1S")) | ||
customized_resource.validate_definition | ||
assert customized_resource.validation_failed?, "Annotation with '0' was valid" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is not Annotation with '0'
(These tests seem great, by the way!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I like the "try the PT version first" hack on my hack.
README.md
Outdated
@@ -120,6 +121,9 @@ All templates must be YAML formatted. You can also use ERB. The following local | |||
You can add additional variables using the `--bindings=BINDINGS` option. For example, `kubernetes-deploy my-app cluster1 --bindings=color=blue,size=large` will expose `color` and `size` in your templates. | |||
|
|||
|
|||
### Customizing behaviour with annotations | |||
|
|||
- `kubernetes-deploy.shopify.io/timeout-override` (any resource): Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours. Example values: 45s / 3m / 1h / PT0.25H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding: "This feature does not work as expected for the deployment
resources because progressDeadlineSeconds takes precedence over hard timeouts and have a server side default value for deployments."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with more details on compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Approach is sound and tests look pretty exhaustive. Thanks!
end | ||
|
||
def parse! | ||
ActiveSupport::Duration.parse("PT#{@iso8601_str}") # By default assume it is just a time component |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer trying the more specific logic first.
basic_resource = DummyResource.new | ||
assert_equal 300, basic_resource.timeout | ||
|
||
customized_resource = DummyResource.new(definition_extras: build_timeout_metadata("60S")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit overkill to the multiple formats here since we already have unit tests for the parser, but it doesn't hurt.
@Shopify/cloudplatform @Shopify/redis-memcached @psycotica0-shopify @kirs
What this does
This PR makes it possible to add a
kubernetes-deploy.shopify.io/timeout-override
annotation to any individual resource. If that annotation is present and valid, its value will be used in place of the global hard timeout for that resource type.I think this is a reasonable feature even though it may let people shoot themselves/Shipit in the foot with reeeeeaaaaaly long timeouts because:
Alternatives