Adaptive Load Controller main loop#535
Conversation
update from master
merge from upstream
merge from upstream
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…double Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…variable_setter Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
merge from upstream
…ent.Output turns out not to include the status Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…plugin names, log thresholds only once per session Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
oschaaf
left a comment
There was a problem hiding this comment.
Generally LGTM, it's quite cool to see everything come together. I feel it reads like poetry :-) except for the special way we treat duration, which still feels a tiny bit off; maybe a code level comment can help here. Somehow I'm also wondering if duration isn't something we should allow implementing as a parameter, like qps. E.g. to allow one to research a question like "how does test duration influence convergence speed at a steady rps".
Left a few small other remarks.
|
@pamorgan Please review and assign back to me once done. |
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
| /** | ||
| * Performs an adaptive load session consisting of the Adjusting Stage and the | ||
| * Testing Stage. Adjusting Stage: Runs a series of short benchmarks, checks metrics according to | ||
| * MetricSpecs, and adjusts load up or down based on the result; returns an error if convergence |
| * computed by a StepController plugin. Metric values are obtained through MetricsPlugins. Testing | ||
| * Stage: When the optimal load is found, runs one long benchmark to validate it. | ||
| * | ||
| * @param nighthawk_service_stub A Nighthawk Service gRPC stub. |
There was a problem hiding this comment.
nit: Maybe stub is more appropriate as a member in the class than an argument to the method. I don't think we expect it to change on a multiple call scenario.
There was a problem hiding this comment.
This class is intended to be stateless helpers and it's only a class to make mocking easier. We will have a chance to design a more convenient client that might be stateful when we merge RemoteProcessImpl and NighthawkServiceClientImpl.
| absl::StatusOr<nighthawk::adaptive_load::AdaptiveLoadSessionOutput> PerformAdaptiveLoadSession( | ||
| nighthawk::client::NighthawkService::StubInterface* nighthawk_service_stub, | ||
| const nighthawk::adaptive_load::AdaptiveLoadSessionSpec& spec, Envoy::TimeSource& time_source); | ||
| class AdaptiveLoadController { |
There was a problem hiding this comment.
should this be renamed to interface now, since we now a pure methods.
There was a problem hiding this comment.
The convention for Nighthawk seems to be that everything under include should be pure but no existing interfaces have Interface in their name.
| * controller.PerformAdaptiveLoadSession(&nighthawk_service_stub, spec); | ||
| * | ||
| * @param nighthawk_service_client A helper that executes Nighthawk Service benchmarks given a | ||
| * gRPC stub. |
There was a problem hiding this comment.
What is the reason behing seperating the client from the stub?
There was a problem hiding this comment.
This is for historical reasons: We had a helper that made it easier to call the stub, then moved the helper into its own class because it's only easy to mock out methods when they are in a class.
I'm not sure what will happen with this, but we may add a more convenient Nighthawk Service client wrapper object that contains the stub, and then take this wrapper instead. Or the stub might be the best thing for the public API to take.
| * metrics from MetricsPlugins, scores the results, and consults a StepController plugin to | ||
| * determine the next load and detect convergence. All plugins are specified through the | ||
| * AdaptiveLoadSessionSpec proto. | ||
| * |
There was a problem hiding this comment.
can you add a comment on the thread safety? Can you run concurrent sessions?
| AdaptiveLoadSessionSpec spec = session_spec_proto_helper_.SetSessionSpecDefaults(input_spec); | ||
| absl::Status validation_status = session_spec_proto_helper_.CheckSessionSpec(spec); | ||
| if (!validation_status.ok()) { | ||
| return validation_status; |
There was a problem hiding this comment.
Should we log reasons of failures.
| absl::StrCat("Step controller determined that it can never converge: ", doom_reason)); | ||
| } | ||
|
|
||
| absl::StatusOr<BenchmarkResult> result_or = PerformAndAnalyzeNighthawkBenchmark( |
There was a problem hiding this comment.
Can you comment on why calling this again after complete.
There was a problem hiding this comment.
Done (this is the testing stage)
There was a problem hiding this comment.
BTW we have #539 to consider making the testing stage an optional thing, decided by the StepController, so this special call might go away.
| return absl::DeadlineExceededError(absl::StrFormat( | ||
| "Failed to converge before deadline of %.2f seconds.", time_limit_ns.count() / 1e9)); | ||
| } | ||
| } while (!step_controller->IsConverged() && !step_controller->IsDoomed(doom_reason)); |
There was a problem hiding this comment.
Usually while loops are more readable than do/while. Is there a reason not to use a simple while loop here?
There was a problem hiding this comment.
We had discussed this, I think the motivation for do/while is that IsConverged and IsDoomed don't have meaningful answers until the loop body has executed.
| } | ||
| BenchmarkResult result = result_or.value(); | ||
| *output.mutable_adjusting_stage_results()->Add() = result; | ||
| step_controller->UpdateAndRecompute(result); |
There was a problem hiding this comment.
It took me a second to follow the logic because step_controller->GetCurrentCommandLineOptions and step_controller->UpdateAndRecompute(result); are called in 2 different methods.
I think it would be easier to follow if you moved step_controller->GetCurrentCommandLineOptions to this method.
There was a problem hiding this comment.
Just moved UpdateAndRecompute into the helper, which hopefully makes the same amount of sense. I had moved GetCurrentCommandLineOptions to the helper to reduce duplication in the main loop.
| absl::flat_hash_map<std::string, MetricsPluginPtr> name_to_custom_metrics_plugin_map; | ||
| for (const envoy::config::core::v3::TypedExtensionConfig& config : | ||
| spec.metrics_plugin_configs()) { | ||
| absl::StatusOr<MetricsPluginPtr> metrics_plugin_or = LoadMetricsPlugin(config); |
There was a problem hiding this comment.
nit: usually it's better to perform the load right after validations and pass in the metric plugins to avoid the assert. No action needed if it's too much of code refactoring.
There was a problem hiding this comment.
Agreed, it's a bit awkward for historical reasons -- filed #547
…o helper Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
mum4k
left a comment
There was a problem hiding this comment.
Looks great, it is obvious that a lot of work was done here to improve readability. This PR only contains the business logic of the adaptive controller. The main functions and test cases are easy to follow. Thank you for taking all the additional steps.
This is the main function of the Adaptive Load Controller library:
Fixes #485.
Part 9 of splitting PR #483.