Fix issue with Envoy not reference counting across scopes under not-hot restart by ambuc · Pull Request #3163 · envoyproxy/envoy

ambuc · 2018-04-23T14:21:01Z

Signed-off-by: James Buckland jbuckland@google.com

title: Fixes issue with Envoy not reference counting stats across scopes under not-hot restart.

Description: Addresses issue #2453, continuing draft work in ambuc#1. Summary of changes:

HotRestartImpl no longer is a child class of RawStatDataAllocator. It now has a member variable allocator, which is a more general-purpose BlockRawStatDataAllocator. This BlockRawStatDataAllocator accepts shared- or not- memory, and accepts / uses a mutex internally.
We now use this BlockRawStatDataAllocator in both HotRestartImpl and HotRestartNopImpl, which implements the reference counting by means of a RawStatDataSet. This leaves HotRestartImpl functionally unchanged, and extends that functionality to the no-op case.
This means HotRestartNopImpl needs to know a few Options and BlockMemoryHashSetOptions for creating its stat set, which means a small change to main_common.cc to pass in options to both kinds of restarters.
Associated changes to mocks / test suites to test this reference counting behaviors in both kinds of restart scenarios.

This also justifies the renaming changes in #3150.

Risk Level: Medium.

Testing: Changed stats_impl_test, thread_local_store_test, hot_restart_impl_test, and hot_restart_nop_impl_test in large ways, and made minor updates to the server mocks. Added unit tests for BlockRawStatDataAllocator and the hot restart / no-op hot restart modes. Passes bazel test test/....

Docs Changes: N/A

Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs.

Fixes: #2453

…ot restart Signed-off-by: James Buckland <jbuckland@google.com>

Signed-off-by: James Buckland <jbuckland@google.com>

so that when Envoy compiles in not-hot-restart mode and elides the hot_restart_impl files, we can still use the relevant utilities Signed-off-by: James Buckland <jbuckland@google.com>

Signed-off-by: James Buckland <jbuckland@google.com>

mattklein123 · 2018-04-23T16:59:30Z

@jmarantz do you mind taking a first pass here since you are familiar with these code paths?

Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz

flushing comments; still not done going over this.

jmarantz · 2018-04-23T17:50:11Z

source/server/hot_restart_nop_impl.h

-  HotRestartNopImpl() {}
+  HotRestartNopImpl(Options& options)
+      : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
+    uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);


it's worth noting in a comment somewhere that this strategy helps code sharing, but is potentially suboptimal with memory, as you have to have NUM_STATS * MaxSizeOfAllStats() bytes allocated for the string storage in the block, as opposed to just allocating each individually. In practice if the dynamic range is small, the waste may not be much.

Also the block hash-set algorithm may be inferior to std::unordered_set, but probably that won't be noticeable as this is really just a startup thing.

jmarantz · 2018-04-23T17:51:15Z

source/server/hot_restart_nop_impl.h

-  HotRestartNopImpl() {}
+  HotRestartNopImpl(Options& options)
+      : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
+    uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);


nit: const uint32_t num_bytes = ...

takes me a while to get used to this style; I never used to declare local scalars as const, but now I'm trying to pick up on it :)

jmarantz · 2018-04-23T17:51:56Z

source/server/hot_restart_nop_impl.h

+      : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {
+    uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(stats_set_options_);
+    memory_.reset(new uint8_t[num_bytes]);
+    memset(memory_.get(), 0, num_bytes);


RELEASE_ASSERT(memory_.get());

I know I didn't do that in the test this was modeled after, but I think in server logs it's probably better to have an explicit assert if it does fail, rather than a segv.

jmarantz · 2018-04-23T17:59:16Z

test/common/stats/stats_impl_test.cc

+    Stats::RawStatData::configureForTestsOnly(options_);
+    BlockMemoryHashSetOptions memory_hash_set_options_ =
+        Stats::blockMemHashOptions(options_.maxStats());
+    uint32_t num_bytes = BlockMemoryHashSet<Stats::RawStatData>::numBytes(memory_hash_set_options_);


jmarantz · 2018-04-23T17:59:56Z

test/common/stats/stats_impl_test.cc

+  EXPECT_EQ(s3, nullptr);
+}
+
+// Because the shared memory is managed manually, make sure it meets


s/shared/block/

jmarantz · 2018-04-23T18:12:40Z

source/common/stats/stats_impl.cc

+
+  absl::string_view key = name;
+  if (key.size() > Stats::RawStatData::maxNameLength()) {
+    key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength());


should we warn here, suggesting to increase the max name length? this could lead to surprising results if two stats with a common prefix get aliased.

jmarantz · 2018-04-23T18:14:09Z

source/common/stats/stats_impl.cc

+    return;
+  }
+  bool key_removed = stats_set_->remove(data.key());
+  ASSERT(key_removed);


this will fail in release builds; there's a macro to avoid the warning on key_removed being written-never-read. I forgot what it was :)

jmarantz · 2018-04-23T18:14:26Z

source/common/stats/stats_impl.cc

+  if (--data.ref_count_ > 0) {
+    return;
+  }
+  bool key_removed = stats_set_->remove(data.key());


jmarantz · 2018-04-23T18:30:29Z

source/common/stats/stats_impl.h

+
+  RawStatData* alloc(const std::string& name);
+  void free(RawStatData& data);
+  std::string version() { return stats_set_->version(); }


std::string version() const { return stats_set_->version(); }

jmarantz · 2018-04-23T19:05:34Z

source/server/hot_restart_impl.cc

  std::unique_ptr<uint8_t[]> mem_buffer_for_dry_run_(new uint8_t[bytes]);
-  RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get());
+  Stats::RawStatDataSet stats_set(options, true /* init */, mem_buffer_for_dry_run_.get());
+  auto stats_set_version = stats_set.version();


s/auto/std::string/ use explicit types unless locally obvious.

https://google.github.io/styleguide/cppguide.html#auto

jmarantz

OK these are almost all stylistic nits. Looks great, James!

jmarantz · 2018-04-23T19:13:09Z

source/server/hot_restart_impl.h

  void sendMessage(sockaddr_un& address, RpcBase& rpc);
  static std::string versionHelper(uint64_t max_num_stats, uint64_t max_stat_name_len,
-                                   RawStatDataSet& stats_set);
+                                   std::string stats_set_version);


absl::string_view

jmarantz · 2018-04-23T19:14:01Z

source/server/hot_restart_nop_impl.h

 public:
-  HotRestartNopImpl() {}
+  HotRestartNopImpl(Options& options)
+      : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {


actually can we put this non-trivial impl in a .cc file?

jmarantz · 2018-04-23T19:14:47Z

source/server/options_impl.h


 #include "spdlog/spdlog.h"

+// Can be overridden at compile time


can't it also be overridden at startup time?

jmarantz · 2018-04-23T19:16:02Z

test/common/stats/stats_impl_test.cc


+class BlockRawStatDataAllocatorTest : public testing::Test {
+public:
+  void setup() {


void SetUp() override {

and remove the explicit calls to setUp()

jmarantz · 2018-04-23T19:16:29Z

test/common/stats/stats_impl_test.cc

+}
+
+TEST_F(BlockRawStatDataAllocatorTest, allocFail) {
+  EXPECT_CALL(options_, maxStats()).WillRepeatedly(Return(2));


I think this one will be OK even if it runs after SetUp, right?

I think the calls it's expecting are made during setup(), right?

jmarantz · 2018-04-23T19:17:42Z

test/common/stats/stats_impl_test.cc

+  BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) {
+    EXPECT_CALL(options_, maxObjNameLength()).WillRepeatedly(Return(name_len_));
+    setup();
+    EXPECT_EQ(name_len_, Stats::RawStatData::maxObjNameLength());


here the pattern is to move these into another overriddent SetUp() method, and you can call BlockRawStatDataAllocatorTest::SetUp() in the middle like this.

jmarantz · 2018-04-23T19:18:22Z

test/common/stats/stats_impl_test.cc

+class BlockRawStatDataAllocatorAlignmentTest : public BlockRawStatDataAllocatorTest,
+                                               public testing::WithParamInterface<uint64_t> {
+public:
+  BlockRawStatDataAllocatorAlignmentTest() : name_len_(8 + GetParam()) {


Can you use num_stats_ instead of 8 here, or is this 8 something else, in which case can you make a new constant for that?

jmarantz · 2018-04-23T19:19:11Z

test/common/stats/stats_impl_test.cc

+    Stats::RawStatData* stat = allocator_->alloc(fmt::format("stat {}", i));
+    EXPECT_TRUE((reinterpret_cast<uintptr_t>(stat) % alignof(decltype(*stat))) == 0);
+    EXPECT_TRUE(used.find(stat) == used.end());
+    used.insert(stat);


you can replace the above two lines with:

EXPECT_TRUE(used.insert(stat).second);

jmarantz

in your description, you have "extends that functionality to the no-op case." did you mean the "non-hot-restart" case?

Also you might want to make a dummy PR to iterate on the DCO issue you are having. My guess is the signed-off-by email address does not match exactly your canonical username in github.

Note that all the commits have to have that signed-off thing right, so once you commit and are violating DCO, the PR is basically crippled until you squash your commits, losing all comments. Super-annoying. Now that we have comment history, one strategy might be to orphan this PR and make a new one with a clean DCO-correct single commit of all the changes, where the description points to this one for comments.

But others may have better ideas for DCO resolution.

ggreenway · 2018-04-23T19:32:50Z

The DCO-bot is having issues right now; there's probably nothing you can do at this time, so ignore it until we've sorted that out.

jmarantz · 2018-04-23T19:36:57Z

Thanks for the heads up @ggreenway -- James ignore my DCO comment :)

ggreenway · 2018-04-23T20:33:57Z

DCO-bot should be working now. It should run on the next commit pushed (or you can test it by pushing an empty commit to this PR). The previous failure does not indicate that there's anything wrong with the DCO-sign-off in this PR.

Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz

It looks like the coverage failure is functional, and related to your change:

[ RUN      ] DynamoUtility.PartitionIdStatString
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:42: Failure
Expected equality of these values:
  expected_stat_string
    Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-rea.capacity.GetItem.__partition_id=c5883ca"
  partition_stat_string
    Which is: "http.egress_dynamodb_iad.dynamodb.table.locations-sandbox-partition-test-iad-mytest-really-long-name.capacity.GetItem.__partition_id=c5883ca"
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:43: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
  Actual: false
Expected: true
test/extensions/filters/http/dynamo/dynamo_utility_test.cc:58: Failure
Value of: partition_stat_string.size() == Stats::RawStatData::maxNameLength()
  Actual: false
Expected: true
[  FAILED  ] DynamoUtility.PartitionIdStatString (0 ms)

mrice32 · 2018-04-23T21:09:38Z

source/common/stats/stats_impl.cc

+  if (key.size() > Stats::RawStatData::maxNameLength()) {
+    key.remove_suffix(key.size() - Stats::RawStatData::maxNameLength());
+  }
+  auto value_created = stats_set_->insert(key);


nit: const

mrice32 · 2018-04-23T21:11:19Z

source/common/stats/stats_impl.h

+  BlockRawStatDataAllocator(const BlockMemoryHashSetOptions& options, bool init, uint8_t* memory,
+                            Thread::BasicLockable& stat_lock)
+      : stat_lock_(stat_lock) {
+    stats_set_ = std::make_unique<RawStatDataSet>(options, init, memory);


nit: why not put this in the initializer list?

mrice32 · 2018-04-23T21:13:53Z

source/server/hot_restart_impl.cc

-    std::unique_lock<Thread::BasicLockable> lock(stat_lock_);
-    stats_set_.reset(new RawStatDataSet(stats_set_options_, options.restartEpoch() == 0,
-                                        shmem_.stats_set_data_));
+    stats_allocator_ = std::make_unique<Stats::BlockRawStatDataAllocator>(


Don't we need to lock around this as we did before this change?

mrice32 · 2018-04-23T21:17:07Z

source/server/hot_restart_nop_impl.h

 public:
-  HotRestartNopImpl() {}
+  HotRestartNopImpl(Options& options)
+      : stats_set_options_(Stats::blockMemHashOptions(options.maxStats())) {


mrice32 · 2018-04-23T21:23:55Z

test/server/hot_restart_nop_impl_test.cc

+
+class HotRestartNopImplTest : public testing::Test {
+public:
+  void setup() {


+1, unless you need to do something before or after these methods, you should use the ones that are automatically called by gtest. Also, side note: it's generally preferred to use the constructor and destructor rather than setup and teardown. There are only a few rare exceptions to that rule: see https://github.com/google/googletest/blob/master/googletest/docs/FAQ.md#should-i-use-the-constructordestructor-of-the-test-fixture-or-the-set-uptear-down-function.

…ta::maxNameLength() was assumed to be 127 Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz · 2018-04-25T02:48:27Z

source/extensions/filters/http/dynamo/dynamo_utility.h

 class Utility {
 public:
+  // returns example max name length, 127. This number is easily mutable in the
+  // outside world, and depending on it makes this test fragile. Thus, the examples


This doesn't look like a test utility; this looks like production code.

mattklein123 · 2018-04-25T03:47:28Z

Can I ask a general question about this change: Was a much simpler solution considered? Basically in the heap allocator case, just maintain an unordered map with a reference count? That would be like 10 lines of code and I think would do the trick? Is there any benefit to the more complex approach?

jmarantz · 2018-04-25T12:59:59Z

TBH I had the same question but assumed that this done this way to share more code. But the size of this PR indicates that maybe we'd be better off just using an unordered_set.

ambuc · 2018-04-25T15:23:47Z

@mattklein123 This PR isn't that complex at the core, just some renames and some shuffling of code from one scope to another for visibility. Extracting the allocator from HotRestartImpl is the heart of that. Everything else is updating tests (and a few edge cases).

That said, the less complex approach would mean two implementations of a map-based allocator, one of which would be baked in to HotRestartImpl, and one of which would be heap-based and freely usable. This would leave us with no freely usable non-heap-based allocator, which I don't like as much.

mrice32 · 2018-04-25T15:54:06Z

I'd also like to add that the test failures that we've been seeing are related to test infrastructure issues (like preexisting static variables leaking across tests and cross platform types) rather than functional problems.

I think using the block allocator has the following advantages:

It's a single implementation in both hot and !hot restart. We don't currently run much CI for !hot restart outside of the mac tests, so reducing the differences makes our integration testing more robust and gives us bug-for-bug parity.
Smaller concern, but I think having a single implementation reduces long term maintenance costs and will ultimately be fewer lines of total code.

I think the only disadvantage is that it will create limitations for the !hot restart case that aren't explicitly necessary. So users who want to have 100,000 clusters will need to manually tune the size of their stat allocation despite the fact that it's on the heap.

That being said, I'm not too opinionated on which direction we go since both have merits.

mattklein123 · 2018-04-25T16:17:58Z

I haven't really looked at the change, but I see a hundreds of line diff for something that could be trivially accomplished with a small amount of code, even if the implementations end up being different. This is also in an area that has a super high regression risk. My preference in general given this tradeoff is always going to be the trivial approach. But, if you all feel that sharing the code is the better way forward I won't fight it.

ambuc · 2018-04-25T22:28:10Z

Closed in favor of #3212 as a simpler solution.

…ot restart (#3249) Simpler solution to issue #2453 than pull #3163, continuing draft work in ambuc#1 and ambuc#2. Summary of changes: adds an unordered_map named stats_set_ as a member variable of HeapRawStatDataAllocator, and implements reference counting / dedup on allocated stats. Risk Level: Low. Testing: Add a test to stats_impl_test. Passes bazel test test/.... Docs Changes: N/A Release Notes: This is user-facing in that non-hot restart stat allocation now resolves namespace properly, but no effect on user configs. Fixes: #2453 Signed-off-by: James Buckland <jbuckland@google.com>

Fix issue with Envoy not reference counting across scopes under not-h…

484d195

…ot restart Signed-off-by: James Buckland <jbuckland@google.com>

ambuc mentioned this pull request Apr 23, 2018

[DRAFT] Refcount stats across scopes ambuc/envoy#1

Closed

ambuc added 3 commits April 23, 2018 10:35

Fix unused private field in HotRestartNopImpl()

55f6bf6

Signed-off-by: James Buckland <jbuckland@google.com>

Move blockMemHashOptions from Server to Stats

5d73e76

so that when Envoy compiles in not-hot-restart mode and elides the hot_restart_impl files, we can still use the relevant utilities Signed-off-by: James Buckland <jbuckland@google.com>

Properly namespace instances of blockMemHashOptions

a3acf4d

Signed-off-by: James Buckland <jbuckland@google.com>

mattklein123 assigned jmarantz Apr 23, 2018

Possible casting issue, pushing to test on CircleCI

c090cf5

Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz reviewed Apr 23, 2018

View reviewed changes

ambuc added 2 commits April 23, 2018 16:58

Still a possible casting issue, pushing to test on CircleCI

eb94c39

Signed-off-by: James Buckland <jbuckland@google.com>

Formatting updates

b3b2a3d

Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz reviewed Apr 23, 2018

View reviewed changes

mrice32 reviewed Apr 23, 2018

View reviewed changes

Fix implicit assumption in dynamo_utility_test where Stats::RawStatDa…

0987da2

…ta::maxNameLength() was assumed to be 127 Signed-off-by: James Buckland <jbuckland@google.com>

jmarantz reviewed Apr 25, 2018

View reviewed changes

ambuc mentioned this pull request Apr 25, 2018

Fix issue with Envoy not reference counting across scopes under not-hot restart #3212

Closed

ambuc closed this Apr 25, 2018

ambuc mentioned this pull request Apr 27, 2018

Fix issue with Envoy not reference counting across scopes under not-hot restart #3249

Merged


		#include "spdlog/spdlog.h"

		// Can be overridden at compile time

Conversation

ambuc commented Apr 23, 2018

Uh oh!

mattklein123 commented Apr 23, 2018

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

ggreenway commented Apr 23, 2018

Uh oh!

jmarantz commented Apr 23, 2018

Uh oh!

ggreenway commented Apr 23, 2018

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattklein123 commented Apr 25, 2018

Uh oh!

jmarantz commented Apr 25, 2018

Uh oh!

ambuc commented Apr 25, 2018