Skip to content

Redis fault injection#10784

Merged
mattklein123 merged 92 commits intoenvoyproxy:masterfrom
HenryYYang:redis-fault-injection
Jul 17, 2020
Merged

Redis fault injection#10784
mattklein123 merged 92 commits intoenvoyproxy:masterfrom
HenryYYang:redis-fault-injection

Conversation

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Copy link
Contributor

Description:
This PR implements fault injection for Redis; specifically delay and error faults (which themselves can have delays added). I chose not to implement a separate filter after discussing with Henry; we concluded that the faults we felt were useful didn't need many levels- just a delay on top of the original fault, if any. In addition, as the Redis protocol doesn't support headers that makes it a bit different again from Envoy's http fault injection.

Faults record metrics on the original request- and the delay fault adds extra latency which is included in the command latency for that request. Also, faults can apply only to certain commands.
Future work: Add several other faults, including cache misses and connection failures.

Risk Level: Medium
Testing:

  • Unit tests on fault manager and command splitter.
  • Testing on local machine. For 3 faults, the faults/requests after a few dozen requests reflected the desired configuration.
    Docs Changes: yes- updated Redis configuration section with notes on metrics and fault injection proper in docs/root/configuration/listeners/network_filters/redis_proxy_filter.rst .
    Release Notes: yes, under 1.15 (pending)

@HenryYYang for first pass.

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10784 was opened by FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg.

see: more, trace.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Copy link
Contributor Author

Waiting on #10794 so that I can push my branch with latest master/updated version history. Not sure why the converage build fails; looks unrelated to my changes and presumably will work on latest master.

@mattklein123
Copy link
Member

Please use --no-verify to bypass the checks. I'm not sure why the pre-push checks are checking thirdparty. I think @yanavlasov was looking to this.

@zhjdenislyft
Copy link

sorry, the logic looks good but I don't understand C++.

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Copy link
Contributor Author

sorry, the logic looks good but I don't understand C++.

The best thing for you to review is the logic in the fault manager where we get the command for a fault- specifically, the amortization strategy.

Copy link
Contributor

@HenryYYang HenryYYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments

for (FaultMapType::iterator it = range.first; it != range.second; ++it) {
envoy::extensions::filters::network::redis_proxy::v3::RedisProxy_RedisFault fault = it->second;
const uint64_t fault_injection_percentage = calculateFaultInjectionPercentage(fault);
if (random_number % (100 - amortized_fault) < fault_injection_percentage) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we doing the (100 - amortized_fault) here? This doesn't seem right to me. If we have 3 faults: 30%, 30%, 40% each. and the random number is r, we'll be doing these checks:
r % 100 < 30
r % 70 < 30
r % 40 < 40
those are different from the correct checks, which are:
r % 100 < 30
r % 100 < 60
r % 100 < 100
or
r % 100 < 30
(r % 100) - 30 < 30
(r % 100) - 60 < 40

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that was wrong. I've fixed and updated test.


int FaultManagerImpl::numberOfFaults() { return fault_map_.size(); }

uint64_t FaultManagerImpl::calculateFaultInjectionPercentage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is duplicating methods already in snapshot class

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I was worried about null/empty inputs to the snapshot method, but if I pre-calculate the adjusted numerator on fault manager construction I don't need to worry about this.


// To support delay faults, we allow faults to override the regular command latency
// recording behavior.
void delayLatencyMetric() { delay_command_latency_ = true; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need these new interface on the base class instead of just in the delay fault?

Copy link
Contributor Author

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request that is being delayed records its latency before calling the onResponse() callback. That means unless the request is somehow aware that it should not calculate it and whatever is on the other side of the callback will, we will generate the wrong latency data.

In a request the updateStats() call always precedes callbacks.onResponse():

void SplitRequestBase::updateStats(const bool success) {
  if (success) {
    command_stats_.success_.inc();
  } else {
    command_stats_.error_.inc();
  }
  if (!delay_command_latency_) {
    command_latency_->complete();
  }
}

Does that make sense?

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Copy link
Contributor

@HenryYYang HenryYYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Comment on lines +30 to +41
if (base_fault.fault_enabled().has_default_value()) {
if (base_fault.fault_enabled().default_value().denominator() ==
envoy::type::v3::FractionalPercent::HUNDRED) {
default_value_ = base_fault.fault_enabled().default_value().numerator();
} else {
auto denominator = ProtobufPercentHelper::fractionalPercentDenominatorToInt(
base_fault.fault_enabled().default_value().denominator());
default_value_ =
(base_fault.fault_enabled().default_value().numerator() * 100) / denominator;
}
}
runtime_key_ = base_fault.fault_enabled().runtime_key();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this code is needed, you should just use FractionalPercent which wraps RuntimeFractionalPercent

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime_.snapshot().getInteger(...) takes the default value as int, like most of the other featureEnabled(...) methods. The featureEnabled(...) methods that take a FractionalPercent default value return a boolean, which doesn't tell us what our amortized probability is. Then, we'd need to do the calculation for the amortized percentage with a check on the denominator.

}
};

std::vector<std::string> Fault::getCommands(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this is a redundant value copying, is this only used for initializing the commands_ field?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this only for initialization. We want to make all the commands lowercase, hence the method. Would you prefer the lowercasing be a linter check?

// 2. For each fault, calculate the amortized fault injection percentage.
absl::optional<std::pair<FaultType, std::chrono::milliseconds>>
FaultManagerImpl::getFaultForCommandInternal(std::string command) {
auto random_number = random_.random() % 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create the random number only if we have a matching command. Also what do we do if the fault_injection_percentage adds up higher than 100?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I was under the impression we were going to treat that as user error. I'll add that to the documentation section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we checking for this user error anywhere?

if (it_outer != fault_map_.end()) {
for (auto fault_ptr : it_outer->second) {
auto fault_injection_percentage = runtime_.snapshot().getInteger(
fault_ptr->runtime_key_.value(), fault_ptr->default_value_.value());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would calling value() on an unset optional throw exception here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call- I'll add a check.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll also add a test.

std::chrono::milliseconds delay_ms_;
const std::vector<std::string> commands_;
bool has_fault_enabled_;
absl::optional<int> default_value_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we do with unset default_value_ and runtime_key_?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing it for default_value_ to not be optional and be zero by default.

* all commands, we use a special ALL_KEYS entry in the map.
*/
class FaultManagerImpl : public FaultManager {
typedef std::unordered_map<std::string, std::vector<FaultPtr>> FaultMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use using instead of typedef, also should the faultmap be immutable?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fault map gets built at runtime, so not immutable. Using noted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can be changed post constructor, so it should be immutable. Also the getFaultForCommand method should be marked const.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok done

envoy::extensions::filters::network::redis_proxy::v3::RedisProxy_RedisFault base_fault);

public:
FaultType fault_type_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed these fields are public, this feels like an antipattern

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add getters

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-watchers: FYI only for changes made to api/.

🐱

Caused by: #10784 was synchronize by FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg.

see: more, trace.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
This reverts commit ef0545a.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Copy link
Contributor Author

Master branch format check is broken due to some http codec issue, so I've removed the master merge.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
@mattklein123
Copy link
Member

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg please never force push. It makes reviews very difficult. Thank you!

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Copy link
Contributor Author

Sorry! I was trying to get further back on master and broke the cardinal rule!

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 048583b into envoyproxy:master Jul 17, 2020
FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg added a commit to HenryYYang/envoy that referenced this pull request Jul 29, 2020
This reverts commit 048583b.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
mattklein123 pushed a commit that referenced this pull request Jul 30, 2020
This reverts commit 048583b.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
KBaichoo pushed a commit to KBaichoo/envoy that referenced this pull request Jul 30, 2020
This PR implements fault injection for Redis; specifically delay and error faults (which themselves can have delays added). I chose not to implement a separate filter after discussing with Henry; we concluded that the faults we felt were useful didn't need many levels- just a delay on top of the original fault, if any. In addition, as the Redis protocol doesn't support headers that makes it a bit different again from Envoy's http fault injection.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
scheler pushed a commit to scheler/envoy that referenced this pull request Aug 4, 2020
This PR implements fault injection for Redis; specifically delay and error faults (which themselves can have delays added). I chose not to implement a separate filter after discussing with Henry; we concluded that the faults we felt were useful didn't need many levels- just a delay on top of the original fault, if any. In addition, as the Redis protocol doesn't support headers that makes it a bit different again from Envoy's http fault injection.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: scheler <santosh.cheler@appdynamics.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
This reverts commit 048583b.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
This reverts commit 048583b.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg added a commit to HenryYYang/envoy that referenced this pull request Aug 7, 2020
This reverts commit 77e9ed7.

Signed-off-by: FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg <nflacco@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants