Skip to content

stats: prevent unused counters from leaking across hot restart#6850

Merged
mattklein123 merged 18 commits intoenvoyproxy:masterfrom
fredlas:SCO_scope
May 9, 2019
Merged

stats: prevent unused counters from leaking across hot restart#6850
mattklein123 merged 18 commits intoenvoyproxy:masterfrom
fredlas:SCO_scope

Conversation

@fredlas
Copy link
Contributor

@fredlas fredlas commented May 7, 2019

During hot restart, stores counters imported from the parent in a temporary scope. When hot restart concludes, that scope is dropped. The result is that any counter the child accessed (independently of the hot restart merging) will be preserved by the separate reference that access created, but any counter not so accessed will be dropped. This brings the hot restart counter behavior back to how it was before shared memory was removed in #5910.

Risk Level: low
Testing: updated test/common/stats/stat_merger_test.cc.

More fully fixes #4974.

fredlas added 3 commits May 7, 2019 17:42
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented May 7, 2019

/assign jmarantz
/assign mattklein123

fredlas added 3 commits May 7, 2019 18:26
… passed anyways

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for tackling. A few comments to get started.

/wait

/**
* @return whether any known counter exists with this name.
*/
virtual bool counterExists(const std::string& counter_name) PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ahedberg for intersection with #6712. Can we intersect with the APIs being added there so we don't duplicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is test only, so it will definitely be best to just get rid of counterExists, and use those ones in here. That will be a very easy switch; left a todo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would've called this counterExistsForTests, but probably not worth it at this point. Maybe follow-up with a rename?

Also note in the doc that this will take a lock.

fredlas added 2 commits May 8, 2019 12:59
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
fredlas added 2 commits May 8, 2019 15:33
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented May 8, 2019

Ran all tests this time. All the CI should be fixed.

Signed-off-by: Fred Douglas <fredlas@google.com>
mattklein123
mattklein123 previously approved these changes May 8, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great, assuming @ahedberg is OK with this merging first? @jmarantz can you take a quick pass?

@mattklein123
Copy link
Member

@fredlas ASAN failure looks legit.

@ahedberg
Copy link
Contributor

ahedberg commented May 8, 2019

Thanks, this looks great, assuming @ahedberg is OK with this merging first? @jmarantz can you take a quick pass?

Yup, SGTM. Fred's TODO SGTM too.

fredlas added 2 commits May 8, 2019 16:27
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented May 8, 2019

Haha sorry, changed my mind. Turns out the state transitions don't 100% make sense to switch to an enum. Just adding the 3rd bool is super simple, so I have now added that to this PR.

@mattklein123
Copy link
Member

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: tsan (failed build)

🐱

Caused by: a #6850 (comment) was created by @mattklein123.

see: more, trace.

mattklein123
mattklein123 previously approved these changes May 8, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Will give @jmarantz a chance to take a look.

jmarantz
jmarantz previously approved these changes May 9, 2019
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

I'm OK with merging this with my nits addressed in a follow-up, but I'll let Matt make the call?

/**
* @return whether any known counter exists with this name.
*/
virtual bool counterExists(const std::string& counter_name) PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would've called this counterExistsForTests, but probably not worth it at this point. Maybe follow-up with a rename?

Also note in the doc that this will take a lock.

@mattklein123
Copy link
Member

Fine for me to wait for @jmarantz comments. They all seem like great things to do.

/wait

@fredlas
Copy link
Contributor Author

fredlas commented May 9, 2019

All of that code is going to be deleted when #6712 merges. I don't think it's worth perfecting. If #6712 is going to take a while, maybe splitting out just the getCounter part of it to be submitted separately would be the way to go.

@mattklein123
Copy link
Member

All of that code is going to be deleted when #6712 merges. I don't think it's worth perfecting. If #6712 is going to take a while, maybe splitting out just the getCounter part of it to be submitted separately would be the way to go.

I agree, but why add it in the first place given that we don't know when the other PR will merge? Per @jmarantz we can just have a method directly on the ThreadLocalStore that does what we need and is marked for testing?

@jmarantz
Copy link
Contributor

jmarantz commented May 9, 2019

Moreover it occurred to me after reflecting on the code that these new functions aren't needed, as there is already TestUtility::findCounter() and findGauge, declared in test/test_common/utility.h

fredlas added 2 commits May 9, 2019 13:57
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas fredlas dismissed stale reviews from jmarantz and mattklein123 via e91f05a May 9, 2019 17:57
@fredlas
Copy link
Contributor Author

fredlas commented May 9, 2019

Nice find, thanks! Switched to those.

Signed-off-by: Fred Douglas <fredlas@google.com>
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

}
// We accessed 0 and 1 above, but not 2. Now that StatMerger has been destroyed,
// 2 should be gone.
EXPECT_TRUE(TestUtility::findCounter(store, "newcounter0"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably write it as EXPECT_TRUE(TestUtility::findCounter(store, "newcounter0") != nullptr) or maybe EXPECT_NE(TestUtility::findCounter(store, "newcounter0"), nullptr) if that compiles, but meh :)

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome

@mattklein123 mattklein123 merged commit 6ffa8c1 into envoyproxy:master May 9, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request May 10, 2019
* master: (88 commits)
  upstream: Null-deref on TCP health checker if setsockopt fails  (envoyproxy#6793)
  ci: switch macOS CI to azure pipelines (envoyproxy#6889)
  os syscalls lib: break apart syscalls used for hot restart (envoyproxy#6880)
  Kafka codec: precompute request size before serialization, so we do n… (envoyproxy#6862)
  upstream: move static and strict_dns clusters to dedicated files (envoyproxy#6886)
  Rollforward of api: Add total_issued_requests to Upstream Locality and Endpoint Stats. (envoyproxy#6692) (envoyproxy#6784)
  fix explicit constructor in copy-initialization (envoyproxy#6884)
  stats: use tag iterator rather than constructing the tag-array and searching that. (envoyproxy#6853)
  common: use unscoped build target in generate_version_linkstamp (envoyproxy#6877)
  Addendum to envoyproxy#6778 (envoyproxy#6882)
  ci: add minimum Linux build for Azure Pipelines (envoyproxy#6881)
  grpc: utilities for inter-converting grpc::ByteBuffer and Buffer::Instance. (envoyproxy#6732)
  upstream: allow excluding hosts from lb calculations until initial health check (envoyproxy#6794)
  stats: prevent unused counters from leaking across hot restart (envoyproxy#6850)
  network filters: add `injectDataToFilterChain(data, end_stream)` method to network filter callbacks (envoyproxy#6750)
  delete things that snuck back in (envoyproxy#6873)
  config: scoped rds (2b): support delta APIs in ConfigProvider framework (envoyproxy#6781)
  string == string! (envoyproxy#6868)
  config: add mssing imports to delta_subscription_state (envoyproxy#6869)
  protobuf: add missing default case to enum (envoyproxy#6870)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@rgs1
Copy link
Member

rgs1 commented May 13, 2019

@fredlas @jmarantz fwiw, we picked up this change but we are still seeing metrics drop over a hot restart. I'll dig into it a bit more, but wanted to give you an early heads up.

@fredlas
Copy link
Contributor Author

fredlas commented May 13, 2019

By "drop", do you mean values going to 0, or metrics going from existing to not existing? Either version would be different from the issue solved in this PR, so this discussion should probably move to an issue, but it does sound like something to discuss.

@rgs1
Copy link
Member

rgs1 commented May 13, 2019

This is how it looks for cluster.${cluster}.health_check.failure during a hot restart:

image

I am wondering if the move from shared mem to copying stats over IPC would allow for such a drop & recovery to happen. Given that copying stats over takes time...

cc: @fishcakez

@fredlas
Copy link
Contributor Author

fredlas commented May 13, 2019

That problem is not related to this PR; we should discuss it elsewhere.

@mattklein123
Copy link
Member

@rgs1 I have an idea of why this is happening but per @fredlas it's unrelated to this PR. Can you open a fresh issue and we can discuss there?

@rgs1
Copy link
Member

rgs1 commented May 13, 2019

@mattklein123 yeah lets chat here: #6924

@fredlas thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stats: Consider communicating stats across hot-restart via RPC rather than shared memory

5 participants