Skip to content

Fix issue with Envoy not reference counting across scopes under not-hot restart#3163

Closed
ambuc wants to merge 8 commits intoenvoyproxy:masterfrom
ambuc:refcount-stats-across-scopes
Closed

Fix issue with Envoy not reference counting across scopes under not-hot restart#3163
ambuc wants to merge 8 commits intoenvoyproxy:masterfrom
ambuc:refcount-stats-across-scopes

Conversation

@ambuc
Copy link
Contributor

@ambuc ambuc commented Apr 23, 2018

Signed-off-by: James Buckland jbuckland@google.com

title: Fixes issue with Envoy not reference counting stats across scopes under not-hot restart.

Description: Addresses issue #2453, continuing draft work in ambuc#1. Summary of changes:

  • HotRestartImpl no longer is a child class of RawStatDataAllocator. It now has a member variable allocator, which is a more general-purpose BlockRawStatDataAllocator. This BlockRawStatDataAllocator accepts shared- or not- memory, and accepts / uses a mutex internally.
  • We now use this BlockRawStatDataAllocator in both HotRestartImpl and HotRestartNopImpl, which implements the reference counting by means of a RawStatDataSet. This leaves HotRestartImpl functionally unchanged, and extends that functionality to the no-op case.
  • This means HotRestartNopImpl needs to know a few Options and BlockMemoryHashSetOptions for creating its stat set, which means a small change to main_common.cc to pass in options to both kinds of restarters.
  • Associated changes to mocks / test suites to test this reference counting behaviors in both kinds of restart scenarios.

This also justifies the renaming changes in #3150.

Risk Level: Medium.

Testing: Changed stats_impl_test, thread_local_store_test, hot_restart_impl_test, and hot_restart_nop_impl_test in large ways, and made minor updates to the server mocks. Added unit tests for BlockRawStatDataAllocator and the hot restart / no-op hot restart modes. Passes bazel test test/....

Docs Changes: N/A

Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs.

Fixes: #2453

…ot restart

Signed-off-by: James Buckland <jbuckland@google.com>
ambuc added 3 commits April 23, 2018 10:35
Signed-off-by: James Buckland <jbuckland@google.com>
so that when Envoy compiles in not-hot-restart mode and elides the hot_restart_impl files, we can still use the relevant utilities

Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland <jbuckland@google.com>
@mattklein123
Copy link
Member

@jmarantz do you mind taking a first pass here since you are familiar with these code paths?

Signed-off-by: James Buckland <jbuckland@google.com>
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing comments; still not done going over this.

HotRestartNopImpl() {}
HotRestartNopImpl(Options& options)
: stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's worth noting in a comment somewhere that this strategy helps code sharing, but is potentially suboptimal with memory, as you have to have NUM_STATS * MaxSizeOfAllStats() bytes allocated for the string storage in the block, as opposed to just allocating each individually. In practice if the dynamic range is small, the waste may not be much.

Also the block hash-set algorithm may be inferior to std::unordered_set, but probably that won't be noticeable as this is really just a startup thing.

HotRestartNopImpl() {}
HotRestartNopImpl(Options& options)
: stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const uint32_t num_bytes = ...

takes me a while to get used to this style; I never used to declare local scalars as const, but now I'm trying to pick up on it :)

: stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);
memory_.reset(new uint8_t[num_bytes]);
memset(memory_.get(), 0, num_bytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RELEASE_ASSERT(memory_.get());

I know I didn't do that in the test this was modeled after, but I think in server logs it's probably better to have an explicit assert if it does fail, rather than a segv.

Stats::RawStatData::configureForTestsOnly(options_);
BlockMemoryHashSetOptions memory_hash_set_options_ =
Stats::blockMemHashOptions(options_.maxStats());
uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(memory_hash_set_options_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

EXPECT_EQ(s3, nullptr);
}

// Because the shared memory is managed manually, make sure it meets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/shared/block/


absl::string_view key = name;
if (key.size() > Stats::RawStatData::maxNameLength()) {
key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we warn here, suggesting to increase the max name length? this could lead to surprising results if two stats with a common prefix get aliased.

return;
}
bool key_removed = stats_set_->remove(data.key());
ASSERT(key_removed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will fail in release builds; there's a macro to avoid the warning on key_removed being written-never-read. I forgot what it was :)

if (--data.ref_count_ > 0) {
return;
}
bool key_removed = stats_set_->remove(data.key());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const


RawStatData* alloc(const std::string& name);
void free(RawStatData& data);
std::string version() { return stats_set_->version(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string version() const { return stats_set_->version(); }

std::unique_ptr<uint8_t[]> mem_buffer_for_dry_run_(new uint8_t[bytes]);
RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get());
Stats::RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get());
auto stats_set_version = stats_set.version();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/auto/std::string/ use explicit types unless locally obvious.

https://google.github.io/styleguide/cppguide.html#auto

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK these are almost all stylistic nits. Looks great, James!

void sendMessage(sockaddr_un& address, RpcBase& rpc);
static std::string versionHelper(uint64_t max_num_stats, uint64_t max_stat_name_len,
RawStatDataSet& stats_set);
std::string stats_set_version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absl::string_view

public:
HotRestartNopImpl() {}
HotRestartNopImpl(Options& options)
: stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can we put this non-trivial impl in a .cc file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


#include "spdlog/spdlog.h"

// Can be overridden at compile time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't it also be overridden at startup time?


class BlockRawStatDataAllocatorTest : public testing::Test {
public:
void setup() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void SetUp() override {

and remove the explicit calls to setUp()

}

TEST_F(BlockRawStatDataAllocatorTest, allocFail) {
EXPECT_CALL(options_, maxStats()).WillRepeatedly(Return(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one will be OK even if it runs after SetUp, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the calls it's expecting are made during setup(), right?

BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) {
EXPECT_CALL(options_, maxObjNameLength()).WillRepeatedly(Return(name_len_));
setup();
EXPECT_EQ(name_len_, Stats::RawStatData::maxObjNameLength());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the pattern is to move these into another overriddent SetUp() method, and you can call BlockRawStatDataAllocatorTest::SetUp() in the middle like this.

class BlockRawStatDataAllocatorAlignmentTest : public BlockRawStatDataAllocatorTest,
public testing::WithParamInterface<uint64_t> {
public:
BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use num_stats_ instead of 8 here, or is this 8 something else, in which case can you make a new constant for that?

Stats::RawStatData* stat = allocator_->alloc(fmt::format("stat {}", i));
EXPECT_TRUE((reinterpret_cast<uintptr_t>(stat) % alignof(decltype(*stat))) == 0);
EXPECT_TRUE(used.find(stat) == used.end());
used.insert(stat);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can replace the above two lines with:

  EXPECT_TRUE(used.insert(stat).second);

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in your description, you have "extends that functionality to the no-op case." did you mean the "non-hot-restart" case?

Also you might want to make a dummy PR to iterate on the DCO issue you are having. My guess is the signed-off-by email address does not match exactly your canonical username in github.

Note that all the commits have to have that signed-off thing right, so once you commit and are violating DCO, the PR is basically crippled until you squash your commits, losing all comments. Super-annoying. Now that we have comment history, one strategy might be to orphan this PR and make a new one with a clean DCO-correct single commit of all the changes, where the description points to this one for comments.

But others may have better ideas for DCO resolution.

@ggreenway
Copy link
Member

The DCO-bot is having issues right now; there's probably nothing you can do at this time, so ignore it until we've sorted that out.

@jmarantz
Copy link
Contributor

Thanks for the heads up @ggreenway -- James ignore my DCO comment :)

@ggreenway
Copy link
Member

DCO-bot should be working now. It should run on the next commit pushed (or you can test it by pushing an empty commit to this PR). The previous failure does not indicate that there's anything wrong with the DCO-sign-off in this PR.

ambuc added 2 commits April 23, 2018 16:58
Signed-off-by: James Buckland <jbuckland@google.com>
Signed-off-by: James Buckland <jbuckland@google.com>
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the coverage failure is functional, and related to your change:

[ RUN      ] DynamoUtility.PartitionIdStatString
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:42: Failure
Expected equality of these values:
  expected_stat_string
    Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-rea.capacity.GetItem.__partition_id=c5883ca"
  partition_stat_string
    Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-really-long-name.capacity.GetItem.__partition_id=c5883ca"
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:43: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
  Actual: false
Expected: true
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:58: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
  Actual: false
Expected: true
[  FAILED  ] DynamoUtility.PartitionIdStatString (0 ms)

if (key.size() > Stats::RawStatData::maxNameLength()) {
key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength());
}
auto value_created = stats_set_->insert(key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const

BlockRawStatDataAllocator(const BlockMemoryHashSetOptions& options, bool init, uint8_t* memory,
Thread::BasicLockable& stat_lock)
: stat_lock_(stat_lock) {
stats_set_ = std::make_unique<RawStatDataSet>(options, init, memory);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not put this in the initializer list?

std::unique_lock<Thread::BasicLockable> lock(stat_lock_);
stats_set_.reset(new RawStatDataSet(stats_set_options_, options.restartEpoch() == 0,
shmem_.stats_set_data_));
stats_allocator_ = std::make_unique<Stats::BlockRawStatDataAllocator>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to lock around this as we did before this change?

public:
HotRestartNopImpl() {}
HotRestartNopImpl(Options& options)
: stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


class HotRestartNopImplTest : public testing::Test {
public:
void setup() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, unless you need to do something before or after these methods, you should use the ones that are automatically called by gtest. Also, side note: it's generally preferred to use the constructor and destructor rather than setup and teardown. There are only a few rare exceptions to that rule: see https://github.com/google/googletest/blob/master/googletest/docs/FAQ.md#should-i-use-the-constructordestructor-of-the-test-fixture-or-the-set-uptear-down-function.

…ta::maxNameLength() was assumed to be 127

Signed-off-by: James Buckland <jbuckland@google.com>
class Utility {
public:
// returns example max name length, 127. This number is easily mutable in the
// outside world, and depending on it makes this test fragile. Thus, the examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like a test utility; this looks like production code.

@mattklein123
Copy link
Member

Can I ask a general question about this change: Was a much simpler solution considered? Basically in the heap allocator case, just maintain an unordered map with a reference count? That would be like 10 lines of code and I think would do the trick? Is there any benefit to the more complex approach?

@jmarantz
Copy link
Contributor

TBH I had the same question but assumed that this done this way to share more code. But the size of this PR indicates that maybe we'd be better off just using an unordered_set.

@ambuc
Copy link
Contributor Author

ambuc commented Apr 25, 2018

@mattklein123 This PR isn't that complex at the core, just some renames and some shuffling of code from one scope to another for visibility. Extracting the allocator from HotRestartImpl is the heart of that. Everything else is updating tests (and a few edge cases).

That said, the less complex approach would mean two implementations of a map-based allocator, one of which would be baked in to HotRestartImpl, and one of which would be heap-based and freely usable. This would leave us with no freely usable non-heap-based allocator, which I don't like as much.

@mrice32
Copy link
Member

mrice32 commented Apr 25, 2018

I'd also like to add that the test failures that we've been seeing are related to test infrastructure issues (like preexisting static variables leaking across tests and cross platform types) rather than functional problems.

I think using the block allocator has the following advantages:

  1. It's a single implementation in both hot and !hot restart. We don't currently run much CI for !hot restart outside of the mac tests, so reducing the differences makes our integration testing more robust and gives us bug-for-bug parity.
  2. Smaller concern, but I think having a single implementation reduces long term maintenance costs and will ultimately be fewer lines of total code.

I think the only disadvantage is that it will create limitations for the !hot restart case that aren't explicitly necessary. So users who want to have 100,000 clusters will need to manually tune the size of their stat allocation despite the fact that it's on the heap.

That being said, I'm not too opinionated on which direction we go since both have merits.

@mattklein123
Copy link
Member

I haven't really looked at the change, but I see a hundreds of line diff for something that could be trivially accomplished with a small amount of code, even if the implementations end up being different. This is also in an area that has a super high regression risk. My preference in general given this tradeoff is always going to be the trivial approach. But, if you all feel that sharing the code is the better way forward I won't fight it.

@ambuc
Copy link
Contributor Author

ambuc commented Apr 25, 2018

Closed in favor of #3212 as a simpler solution.

@ambuc ambuc closed this Apr 25, 2018
htuch pushed a commit that referenced this pull request May 4, 2018
…ot restart (#3249)

Simpler solution to issue #2453 than pull #3163, continuing draft work in ambuc#1 and ambuc#2. Summary of changes:

adds an unordered_map named stats_set_ as a member variable of HeapRawStatDataAllocator, and implements reference counting / dedup on allocated stats.
Risk Level: Low.

Testing: Add a test to stats_impl_test. Passes bazel test test/....

Docs Changes: N/A

Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs.

Fixes: #2453

Signed-off-by: James Buckland <jbuckland@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

!ENVOY_HOT_RESTART does not reference count across scopes

5 participants