Healing after failed refresh #1457

Closed
glazychev-art opened this issue May 10, 2023 · 2 comments

@glazychev-art
Contributor

Expected Behavior

Healing works as expected

Current Behavior

Healing doesn't work if refresh fails

Failure Information (for bugs)

This happens because refresh cancels the previous healing monitor but does nothing to restore it if the request then fails:

if cancelEventLoop, loaded := loadAndDelete(ctx); loaded {
    cancelEventLoop()
}
conn, err := next.Client(ctx).Request(ctx, request, opts...)
if err != nil {
    return nil, err
}

Perhaps we need to consider retrying the refresh right after a failure, or try to restore the previous monitoring.

Steps to Reproduce

  1. NSC connects to NSE
  2. Refresh Request from NSC fails
  3. Kill NSE to start healing
  4. Healing doesn't start

Context

  • Kubernetes Version:
  • etc.

Failure Logs

@denis-tingaikin
Member

denis-tingaikin commented May 10, 2023

@glazychev-art , @d-uzlov

An initial idea we could start from is to move retry after begin.

@d-uzlov
Contributor

d-uzlov commented May 25, 2023

Problem statement

Current heal behavior when it receives a request:

  • Stop the current monitor thread.
  • On success, create a new monitor thread.
  • On error, do nothing.

if cancelEventLoop, loaded := loadAndDelete(ctx); loaded {
    cancelEventLoop()
}
conn, err := next.Client(ctx).Request(ctx, request, opts...)
if err != nil {
    return nil, err
}


Reasons why this works (except for failed refreshes):

  • Cancelling the monitor does nothing if the healing loop has already started
  • The monitor retries requests until one succeeds
  • The monitor checks the data plane in the healing loop, before each request
  • If the data plane check is not available, heal always uses reselect

Reasons heal must stop monitoring before a request:

  • Control plane monitoring relies on the gRPC connection to the next server, which is closed and recreated during a request to update timeouts and tokens.
  • The data plane check may give a false-negative result if the connection was interrupted during a request

The heal client assumes that if a request for an already existing connection has failed, then healing is already running.
Obviously, this is not true for requests initiated by refresh, by the user, or by a custom chain element.

Therefore, we need a proper check for whether a healing loop already exists, and to start monitoring if it doesn't.


Solution 1 (Best solution)

Changes:

  • In heal: On request error, restart monitoring.
  • In heal: On request error, check whether a healing loop is already running, to avoid parallel healing loops.

Pros:

  • All changes are in heal element

Cons:

  • "Heal" logic is more complex

Solution 2

Changes:

  • Move retry into the client chain, after begin.
  • In retry: Modify retry to asynchronously use begin's event factory.
  • In heal: Transform the "healing loop" into "one heal request".
  • In heal: On request failure, check the data plane synchronously, or just request reselect if data plane monitoring is not available.
  • In begin: Wait for async retry success for requests from the user.

heal and refresh will indirectly use retry.

Pros:

  • Unified "retry" logic for the whole chain
  • heal becomes simpler

Cons:

  • retry's contract will change
  • We need changes in begin
