admin: add /runtime_modify endpoint and make runtime work without fs backing#2837
admin: add /runtime_modify endpoint and make runtime work without fs backing#2837mattklein123 merged 11 commits intoenvoyproxy:masterfrom jsedgwick:runtime_modification
Conversation
|
Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you! |
|
@mattklein123 lost you, you linked to data-plane-api there? I added a note in |
|
@mattklein123 nvm, just saw the new docs, willfix |
|
@jsedgwick this needs docs also, so you can just do docs/release in same data-plane-api PR if that is easier. |
mattklein123
left a comment
There was a problem hiding this comment.
This is great thanks. Well done. Main comment is around operational usability.
source/common/runtime/runtime_impl.h
Outdated
There was a problem hiding this comment.
I know you copied tis code, but can you constify root_path_ and override_path_
There was a problem hiding this comment.
nit: as this is a shared_ptr, I think you can just capture ptr directly and don't need ptr_copy ?
There was a problem hiding this comment.
oops, yeah that was supposed to be ptr = std::move(ptr), thx
source/server/http/admin.cc
Outdated
There was a problem hiding this comment.
My concern with the overrides is that if people are doing local tests, it would be very nice for /runtime to be able to print out the original values, and the override value, and which one is in effect (roughly). Is there a way to make that happen? I think that would make this substantially easier to use. In the future you could also imagine something like POST /runtime_reset which would reset all changed values to the original (whether on disk or in the base implementation nothing).
Can you take a look at what would be required to modify /runtime to show state in a more obvious way? For runtime_reset I would just throw a TODO somewhere.
There was a problem hiding this comment.
I hear you on that. I'm thinking of formalizing/abstracting the concept of an override and then at the endpoint you can see all three (or more, in the future, who knows) layered - disk, override on disk, override in memory.
My concern is whether it's worth the complexity to make that happen in this PR as opposed to just waiting for more structured admin stuff to land and implementing a better, proto-ized version of this. It would make a good case study in to think about how mutations would interact with structured admin.
So, WDYT about getting this in more-or-less as-is and combining your suggestions with proto-izing these endpoints? It's not about the (mild) engineering effort but more that any complexity might be immediately refactored/deleted anyway.
There was a problem hiding this comment.
I think we need to fix this as part of this change, even if it's not a super optimal solution. It's going to be too confusing if the user does not know what was overriden using /runtime_modify. They will basically have to restart Envoy to get back to a fresh state. We need to be able to view the memory overrides and set them back if desired.
Along these lines, one additional request: Please add a gauge that is something like runtime.memory_overrides_active which is set to 1 if any are set, and 0 otherwise. This will allow alarms for when someone changes something if desired.
|
@ggreenway can you have a look as well? |
|
OK. I'll add something minimal to be able to view the layers of overrides.
ack on the gauge.
On Mon, Mar 19, 2018 at 1:11 PM Matt Klein ***@***.***> wrote:
***@***.**** commented on this pull request.
In source/server/http/admin.cc
<#2837 (comment)>:
> @@ -650,7 +665,9 @@ AdminImpl::AdminImpl(const std::string& access_log_path, const std::string& prof
{"/stats", "print server stats", MAKE_ADMIN_HANDLER(handlerStats), false, false},
{"/listeners", "print listener addresses", MAKE_ADMIN_HANDLER(handlerListenerInfo), false,
false},
- {"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}},
+ {"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false},
I think we need to fix this as part of this change, even if it's not a
super optimal solution. It's going to be too confusing if the user does not
know what was overriden using /runtime_modify. They will basically have
to restart Envoy to get back to a fresh state. We need to be able to view
the memory overrides and set them back if desired.
Along these lines, one additional request: Please add a gauge that is
something like runtime.memory_overrides_active which is set to 1 if any
are set, and 0 otherwise. This will allow alarms for when someone changes
something if desired.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2837 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAc4InXJlIKnFDO0y48ZXfYxSenLTl2Fks5tf-bTgaJpZM4SuQ2_>
.
--
*James Sedgwick*
Software Engineer, Networking Infrastructure
973.738.0737 <+19737380737>
[image: Lyft] <http://www.lyft.com/>
|
source/server/http/admin.cc
Outdated
There was a problem hiding this comment.
@mattklein123 was I think commenting on this line in #2771
Sorry I didn't notice this till just now, but per the docs this should only respond to a POST. Can we fix that? Since we are probably going to have more POST methods in the future (cc @jmarantz) it might be worth it to build this into the registration system somehow, but up to you.
Can I just write a separate PR that adds POST checking whenever UrlHandler.mutates_server_state_ is true, which is the case here? That would go into AdminImpl::runCallback inside if (path_and_query.compare(0, query_index, handler.prefix_) == 0) {
The advantage of doing that there is that it solves this issue, as well as some aspects of #2763. The disadvantage is that it would require changes to existing scripts that use curl GET for mutations, when they should be POST. WDYT?
There was a problem hiding this comment.
At this point I think it's probably OK. Which are the endpoints that would be switched to POST?
There was a problem hiding this comment.
all the ones whose last initializer-arg is true:
/cpuprofiler
/healthcheck/fail
/healthcheck/ok
/logging
/quitquitquit
/reset_counters
/runtime_modify
There was a problem hiding this comment.
/healthcheck/fail
/healthcheck/ok
Is going to cause issues for Lyft, and potentially others. I'm wondering if we should leave those as GET with a deprecation cycle? (Like support both GET and POST and get people to migrate).
There was a problem hiding this comment.
SGTM. Currently the http method is ignored so GET and POST will both work. Do you want also a warning logged when POST is issued?
|
Are all these settings going to be subsumed by xds at some point or runtime will continue to exist? |
Runtime will continue to exist. It fills a parallel function as a kill switch/operational system. However, as a best practice all settings should be configurable via normal config, and not just runtime. |
|
@mattklein123 PTAL, updated to expose override "layers;" the Left Please see a couple of TODOs addressed to reviewers in the code, especially the ones around structured output for |
test/server/http/admin_test.cc
Outdated
There was a problem hiding this comment.
I'm not super-fluent in gmock for various reasons, but this looks funny. What's the lifetime of expression std::string{"layer1"} such that it makes sense to return a ref to it?
I'd feel better if you made std::string{"layer1"} a method-scoped const.
There was a problem hiding this comment.
We have to return a reference because the name() method returns const std::string&, but it's also nice to just pass the temporary string there for brevity. This is exactly what ReturnRefOfCopy is for; as opposed to ReturnRef, to which your suspicion applies, it will copy and hold on to the temporary, so you don't end up with a dangling reference. See https://github.com/google/googlemock/blob/master/googlemock/docs/CheatSheet.md#returning-a-value
There was a problem hiding this comment.
Forgot to add, full disclosure I just learned about ReturnRefOfCopy today as well! :)
mattklein123
left a comment
There was a problem hiding this comment.
In general looks great. Some comments to get started. Will need to take another pass. @jmarantz do you mind also reviewing once the next round of updates are done?
include/envoy/runtime/runtime.h
Outdated
include/envoy/runtime/runtime.h
Outdated
There was a problem hiding this comment.
nit: looks like returning vector, not map
source/common/runtime/runtime_impl.h
Outdated
There was a problem hiding this comment.
nit: "whose source is ..."
source/common/runtime/runtime_impl.h
Outdated
source/common/runtime/runtime_impl.h
Outdated
source/common/runtime/runtime_impl.h
Outdated
There was a problem hiding this comment.
Please add tests for proper set/unset of this gauge.
include/envoy/runtime/runtime.h
Outdated
There was a problem hiding this comment.
nit: can this be const OverrideLayer and OverrideLayerConstSharedPtr
There was a problem hiding this comment.
nit: move to initializer list
There was a problem hiding this comment.
What was the behavior before when we had a load error. Did we have any runtime values at all? Just wondering if this is a change in behavior or not. It seems reasonable to have admin layer, or even root, if some layer doesn't load, but just want to be clear of what behavior was before and now.
There was a problem hiding this comment.
Before this PR, no values at all. And yes, the admin layer will be put in place even if one or both of the disk layers doesn't load. Now that I think about it, should it be all or nothing for the two disk layers? What behavior do you think is best?
There was a problem hiding this comment.
I think it might be simplest to just do all or nothing for all layers. If not that, I would definitely do all or nothing for the disk layers IMO. I'm fine either way, but I think I would avoid the incremental behavior we have now.
source/common/runtime/runtime_impl.h
Outdated
There was a problem hiding this comment.
nit: can we pass std::vector<OverrideLayerSharedPtr>&& and move. I think all callers are move capable.
…t fs backing Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
|
PTAL |
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
|
@jsedgwick reviewing now. In the future please do not rebase once you open a PR, it makes it much harder to review. Once opened, just merge master and add commits. Thanks! |
mattklein123
left a comment
There was a problem hiding this comment.
Generally looks great, thanks. Just a few small comments.
|
|
||
| void LoaderImpl::loadNewSnapshot() { | ||
| ThreadLocal::ThreadLocalObjectSharedPtr ptr = createNewSnapshot(); | ||
| tls_->set([ptr = std::move(ptr)](Event::Dispatcher&)->ThreadLocal::ThreadLocalObjectSharedPtr { |
There was a problem hiding this comment.
nit: You can just capture ptr directly without move. With C++14 you might even be able to just do ptr = createNewSnapshot()?
There was a problem hiding this comment.
std::move avoids a shared_ptr copy so I did this by instinct. And no, since, we need ptr to be the TLS shared_ptr, not a unique_ptr<Snapshot>, since the TLS system requires this lambda to be copyable. IMO this is the cleanest way to get the casts/copyability right.
*edit: to be clear, you can do this without the first line but it would make the capture clause really long and unwieldy
source/common/runtime/runtime_impl.h
Outdated
| /** | ||
| * Copy-constructible so that it can snapshotted. | ||
| */ | ||
| AdminLayer(const AdminLayer& o) : AdminLayer{o.stats_} { values_ = o.values(); } |
|
|
||
| private: | ||
| RuntimeStats generateStats(Stats::Store& store); | ||
| ThreadLocal::SlotPtr tls_; |
There was a problem hiding this comment.
nit: newline before this line (whitespace between methods and members)
source/server/server.cc
Outdated
|
|
||
| Api::OsSysCallsPtr os_sys_calls(new Api::OsSysCallsImpl); | ||
| return Runtime::LoaderPtr{new Runtime::LoaderImpl( | ||
| return Runtime::LoaderPtr{new Runtime::DiskBackedLoaderImpl( |
There was a problem hiding this comment.
nit: std::make_unique (not your code but might as well fix). Same below.
test/server/http/admin_test.cc
Outdated
| "layer2" | ||
| ], | ||
| "entries": { | ||
| "extra_key": [ |
There was a problem hiding this comment.
Can we make each entry look like:
"extra_key": {
"final_value": "...",
"layers": [ ... ]
}
I think it would make it easier to see what is the current value quickly.
| } else { | ||
| response.add("usage: /runtime?format=json\n"); | ||
| rc = Http::Code::BadRequest; | ||
| rapidjson::Document document; |
There was a problem hiding this comment.
Can you add a TODO here to protoize this? I should be able to get to this soon when I finish up the other config_dump stuff.
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
|
PTAL |
Description:
Adds
/runtime_modifyendpoint to modify runtime values. Values added/modified via this endpoint are effectively another override layer. Deleting one by setting it to empty string only deletes a previously added override, not any underlying fs-backed key. The added tests demonstrate this behavior, as do added comments.This ended up mostly being a refactor of runtime_impl such that the "null" runtime actually operates on an in memory map manipulated by this endpoint instead of on an empty map. The disk backed implementation becomes an extension of the in memory implementation where the disk values are primary.
Risk Level: Low. New admin endpoint, largely a refactor.
Testing:
Added unit. Did a little manual sanity testing as well
Fixes #514