Fix issue with Envoy not reference counting across scopes under not-hot restart#3163
Fix issue with Envoy not reference counting across scopes under not-hot restart#3163ambuc wants to merge 8 commits intoenvoyproxy:masterfrom
Conversation
…ot restart Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland <jbuckland@google.com>
so that when Envoy compiles in not-hot-restart mode and elides the hot_restart_impl files, we can still use the relevant utilities Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland <jbuckland@google.com>
|
@jmarantz do you mind taking a first pass here since you are familiar with these code paths? |
Signed-off-by: James Buckland <jbuckland@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
flushing comments; still not done going over this.
| HotRestartNopImpl() {} | ||
| HotRestartNopImpl(Options& options) | ||
| : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) { | ||
| uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_); |
There was a problem hiding this comment.
it's worth noting in a comment somewhere that this strategy helps code sharing, but is potentially suboptimal with memory, as you have to have NUM_STATS * MaxSizeOfAllStats() bytes allocated for the string storage in the block, as opposed to just allocating each individually. In practice if the dynamic range is small, the waste may not be much.
Also the block hash-set algorithm may be inferior to std::unordered_set, but probably that won't be noticeable as this is really just a startup thing.
| HotRestartNopImpl() {} | ||
| HotRestartNopImpl(Options& options) | ||
| : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) { | ||
| uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_); |
There was a problem hiding this comment.
nit: const uint32_t num_bytes = ...
takes me a while to get used to this style; I never used to declare local scalars as const, but now I'm trying to pick up on it :)
| : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) { | ||
| uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_); | ||
| memory_.reset(new uint8_t[num_bytes]); | ||
| memset(memory_.get(), 0, num_bytes); |
There was a problem hiding this comment.
RELEASE_ASSERT(memory_.get());
I know I didn't do that in the test this was modeled after, but I think in server logs it's probably better to have an explicit assert if it does fail, rather than a segv.
| Stats::RawStatData::configureForTestsOnly(options_); | ||
| BlockMemoryHashSetOptions memory_hash_set_options_ = | ||
| Stats::blockMemHashOptions(options_.maxStats()); | ||
| uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(memory_hash_set_options_); |
test/common/stats/stats_impl_test.cc
Outdated
| EXPECT_EQ(s3, nullptr); | ||
| } | ||
|
|
||
| // Because the shared memory is managed manually, make sure it meets |
|
|
||
| absl::string_view key = name; | ||
| if (key.size() > Stats::RawStatData::maxNameLength()) { | ||
| key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength()); |
There was a problem hiding this comment.
should we warn here, suggesting to increase the max name length? this could lead to surprising results if two stats with a common prefix get aliased.
| return; | ||
| } | ||
| bool key_removed = stats_set_->remove(data.key()); | ||
| ASSERT(key_removed); |
There was a problem hiding this comment.
this will fail in release builds; there's a macro to avoid the warning on key_removed being written-never-read. I forgot what it was :)
| if (--data.ref_count_ > 0) { | ||
| return; | ||
| } | ||
| bool key_removed = stats_set_->remove(data.key()); |
|
|
||
| RawStatData* alloc(const std::string& name); | ||
| void free(RawStatData& data); | ||
| std::string version() { return stats_set_->version(); } |
There was a problem hiding this comment.
std::string version() const { return stats_set_->version(); }
| std::unique_ptr<uint8_t[]> mem_buffer_for_dry_run_(new uint8_t[bytes]); | ||
| RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get()); | ||
| Stats::RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get()); | ||
| auto stats_set_version = stats_set.version(); |
There was a problem hiding this comment.
s/auto/std::string/ use explicit types unless locally obvious.
jmarantz
left a comment
There was a problem hiding this comment.
OK these are almost all stylistic nits. Looks great, James!
| void sendMessage(sockaddr_un& address, RpcBase& rpc); | ||
| static std::string versionHelper(uint64_t max_num_stats, uint64_t max_stat_name_len, | ||
| RawStatDataSet& stats_set); | ||
| std::string stats_set_version); |
| public: | ||
| HotRestartNopImpl() {} | ||
| HotRestartNopImpl(Options& options) | ||
| : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) { |
There was a problem hiding this comment.
actually can we put this non-trivial impl in a .cc file?
|
|
||
| #include "spdlog/spdlog.h" | ||
|
|
||
| // Can be overridden at compile time |
There was a problem hiding this comment.
can't it also be overridden at startup time?
test/common/stats/stats_impl_test.cc
Outdated
|
|
||
| class BlockRawStatDataAllocatorTest : public testing::Test { | ||
| public: | ||
| void setup() { |
There was a problem hiding this comment.
void SetUp() override {
and remove the explicit calls to setUp()
| } | ||
|
|
||
| TEST_F(BlockRawStatDataAllocatorTest, allocFail) { | ||
| EXPECT_CALL(options_, maxStats()).WillRepeatedly(Return(2)); |
There was a problem hiding this comment.
I think this one will be OK even if it runs after SetUp, right?
There was a problem hiding this comment.
I think the calls it's expecting are made during setup(), right?
| BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) { | ||
| EXPECT_CALL(options_, maxObjNameLength()).WillRepeatedly(Return(name_len_)); | ||
| setup(); | ||
| EXPECT_EQ(name_len_, Stats::RawStatData::maxObjNameLength()); |
There was a problem hiding this comment.
here the pattern is to move these into another overriddent SetUp() method, and you can call BlockRawStatDataAllocatorTest::SetUp() in the middle like this.
test/common/stats/stats_impl_test.cc
Outdated
| class BlockRawStatDataAllocatorAlignmentTest : public BlockRawStatDataAllocatorTest, | ||
| public testing::WithParamInterface<uint64_t> { | ||
| public: | ||
| BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) { |
There was a problem hiding this comment.
Can you use num_stats_ instead of 8 here, or is this 8 something else, in which case can you make a new constant for that?
| Stats::RawStatData* stat = allocator_->alloc(fmt::format("stat {}", i)); | ||
| EXPECT_TRUE((reinterpret_cast<uintptr_t>(stat) % alignof(decltype(*stat))) == 0); | ||
| EXPECT_TRUE(used.find(stat) == used.end()); | ||
| used.insert(stat); |
There was a problem hiding this comment.
you can replace the above two lines with:
EXPECT_TRUE(used.insert(stat).second);
jmarantz
left a comment
There was a problem hiding this comment.
in your description, you have "extends that functionality to the no-op case." did you mean the "non-hot-restart" case?
Also you might want to make a dummy PR to iterate on the DCO issue you are having. My guess is the signed-off-by email address does not match exactly your canonical username in github.
Note that all the commits have to have that signed-off thing right, so once you commit and are violating DCO, the PR is basically crippled until you squash your commits, losing all comments. Super-annoying. Now that we have comment history, one strategy might be to orphan this PR and make a new one with a clean DCO-correct single commit of all the changes, where the description points to this one for comments.
But others may have better ideas for DCO resolution.
|
The DCO-bot is having issues right now; there's probably nothing you can do at this time, so ignore it until we've sorted that out. |
|
Thanks for the heads up @ggreenway -- James ignore my DCO comment :) |
|
DCO-bot should be working now. It should run on the next commit pushed (or you can test it by pushing an empty commit to this PR). The previous failure does not indicate that there's anything wrong with the DCO-sign-off in this PR. |
Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland <jbuckland@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
It looks like the coverage failure is functional, and related to your change:
[ RUN ] DynamoUtility.PartitionIdStatString
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:42: Failure
Expected equality of these values:
expected_stat_string
Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-rea.capacity.GetItem.__partition_id=c5883ca"
partition_stat_string
Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-really-long-name.capacity.GetItem.__partition_id=c5883ca"
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:43: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
Actual: false
Expected: true
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:58: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
Actual: false
Expected: true
[ FAILED ] DynamoUtility.PartitionIdStatString (0 ms)
| if (key.size() > Stats::RawStatData::maxNameLength()) { | ||
| key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength()); | ||
| } | ||
| auto value_created = stats_set_->insert(key); |
| BlockRawStatDataAllocator(const BlockMemoryHashSetOptions& options, bool init, uint8_t* memory, | ||
| Thread::BasicLockable& stat_lock) | ||
| : stat_lock_(stat_lock) { | ||
| stats_set_ = std::make_unique<RawStatDataSet>(options, init, memory); |
There was a problem hiding this comment.
nit: why not put this in the initializer list?
| std::unique_lock<Thread::BasicLockable> lock(stat_lock_); | ||
| stats_set_.reset(new RawStatDataSet(stats_set_options_, options.restartEpoch() == 0, | ||
| shmem_.stats_set_data_)); | ||
| stats_allocator_ = std::make_unique<Stats::BlockRawStatDataAllocator>( |
There was a problem hiding this comment.
Don't we need to lock around this as we did before this change?
| public: | ||
| HotRestartNopImpl() {} | ||
| HotRestartNopImpl(Options& options) | ||
| : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) { |
|
|
||
| class HotRestartNopImplTest : public testing::Test { | ||
| public: | ||
| void setup() { |
There was a problem hiding this comment.
+1, unless you need to do something before or after these methods, you should use the ones that are automatically called by gtest. Also, side note: it's generally preferred to use the constructor and destructor rather than setup and teardown. There are only a few rare exceptions to that rule: see https://github.com/google/googletest/blob/master/googletest/docs/FAQ.md#should-i-use-the-constructordestructor-of-the-test-fixture-or-the-set-uptear-down-function.
…ta::maxNameLength() was assumed to be 127 Signed-off-by: James Buckland <jbuckland@google.com>
| class Utility { | ||
| public: | ||
| // returns example max name length, 127. This number is easily mutable in the | ||
| // outside world, and depending on it makes this test fragile. Thus, the examples |
There was a problem hiding this comment.
This doesn't look like a test utility; this looks like production code.
|
Can I ask a general question about this change: Was a much simpler solution considered? Basically in the heap allocator case, just maintain an unordered map with a reference count? That would be like 10 lines of code and I think would do the trick? Is there any benefit to the more complex approach? |
|
TBH I had the same question but assumed that this done this way to share more code. But the size of this PR indicates that maybe we'd be better off just using an unordered_set. |
|
@mattklein123 This PR isn't that complex at the core, just some renames and some shuffling of code from one scope to another for visibility. Extracting the allocator from HotRestartImpl is the heart of that. Everything else is updating tests (and a few edge cases). That said, the less complex approach would mean two implementations of a map-based allocator, one of which would be baked in to HotRestartImpl, and one of which would be heap-based and freely usable. This would leave us with no freely usable non-heap-based allocator, which I don't like as much. |
|
I'd also like to add that the test failures that we've been seeing are related to test infrastructure issues (like preexisting static variables leaking across tests and cross platform types) rather than functional problems. I think using the block allocator has the following advantages:
I think the only disadvantage is that it will create limitations for the !hot restart case that aren't explicitly necessary. So users who want to have 100,000 clusters will need to manually tune the size of their stat allocation despite the fact that it's on the heap. That being said, I'm not too opinionated on which direction we go since both have merits. |
|
I haven't really looked at the change, but I see a hundreds of line diff for something that could be trivially accomplished with a small amount of code, even if the implementations end up being different. This is also in an area that has a super high regression risk. My preference in general given this tradeoff is always going to be the trivial approach. But, if you all feel that sharing the code is the better way forward I won't fight it. |
|
Closed in favor of #3212 as a simpler solution. |
…ot restart (#3249) Simpler solution to issue #2453 than pull #3163, continuing draft work in ambuc#1 and ambuc#2. Summary of changes: adds an unordered_map named stats_set_ as a member variable of HeapRawStatDataAllocator, and implements reference counting / dedup on allocated stats. Risk Level: Low. Testing: Add a test to stats_impl_test. Passes bazel test test/.... Docs Changes: N/A Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs. Fixes: #2453 Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland jbuckland@google.com
title: Fixes issue with Envoy not reference counting stats across scopes under not-hot restart.
Description: Addresses issue #2453, continuing draft work in ambuc#1. Summary of changes:
HotRestartImplno longer is a child class ofRawStatDataAllocator. It now has a member variable allocator, which is a more general-purposeBlockRawStatDataAllocator. ThisBlockRawStatDataAllocatoraccepts shared- or not- memory, and accepts / uses a mutex internally.BlockRawStatDataAllocatorin bothHotRestartImplandHotRestartNopImpl, which implements the reference counting by means of aRawStatDataSet. This leavesHotRestartImplfunctionally unchanged, and extends that functionality to the no-op case.HotRestartNopImplneeds to know a fewOptionsandBlockMemoryHashSetOptionsfor creating its stat set, which means a small change tomain_common.ccto pass in options to both kinds of restarters.This also justifies the renaming changes in #3150.
Risk Level: Medium.
Testing: Changed
stats_impl_test,thread_local_store_test,hot_restart_impl_test, andhot_restart_nop_impl_testin large ways, and made minor updates to the server mocks. Added unit tests forBlockRawStatDataAllocatorand the hot restart / no-op hot restart modes. Passesbazel test test/....Docs Changes: N/A
Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs.
Fixes: #2453