Healing after failed refresh #1457

Closed
glazychev-art opened this issue May 10, 2023 · 2 comments

@glazychev-art
Contributor

Expected Behavior

Healing works as expected

Current Behavior

Healing doesn't work if refresh fails

Failure Information (for bugs)

This happens because refresh cancels the previous healing monitor but does nothing to restore it if the request then fails:

if cancelEventLoop, loaded := loadAndDelete(ctx); loaded {
    cancelEventLoop()
}
conn, err := next.Client(ctx).Request(ctx, request, opts...)
if err != nil {
    return nil, err
}

Perhaps we need to consider retrying the refresh right after a failure, or try to restore the previous monitoring.

Steps to Reproduce

  1. NSC connects to NSE
  2. Refresh Request from NSC fails
  3. Kill NSE to start healing
  4. Healing doesn't start

Context

  • Kubernetes Version:
  • etc.

Failure Logs

@denis-tingaikin
Member

denis-tingaikin commented May 10, 2023

@glazychev-art , @d-uzlov

An initial idea we could start from is to move retry after begin.

@d-uzlov
Contributor

d-uzlov commented May 25, 2023

Problem statement

Current heal behavior when it receives a request:

  • Stop the current monitor thread.
  • On success, create a new monitor thread.
  • On error, do nothing.

if cancelEventLoop, loaded := loadAndDelete(ctx); loaded {
    cancelEventLoop()
}
conn, err := next.Client(ctx).Request(ctx, request, opts...)
if err != nil {
    return nil, err
}


Reasons why this works (except for failed refreshes):

  • Cancelling the monitor does nothing if the healing loop has already started
  • The monitor retries requests until one succeeds
  • The monitor checks the data plane in the healing loop, before each request
  • If the data plane check is not available, heal always uses reselect

Reasons heal must stop monitoring before a request:

  • Control plane monitoring relies on the gRPC connection to the next server, which is closed and recreated during a request to update timeouts and tokens.
  • The data plane check may give a false-negative result if the connection was interrupted during a request

The heal client assumes that if a request for an already existing connection has failed, then healing is already running.
Obviously, this is not true for requests initiated by refresh, by the user, or by a custom chain element.

Therefore, we need a proper check for whether a healing loop already exists, and to start monitoring if it doesn't.


Solution 1 (Best solution)

Changes:

  • In heal: On request error, restart monitoring.
  • In heal: On request error, check whether a healing loop is already running, to avoid parallel healing loops.

Pros:

  • All changes are in heal element

Cons:

  • "Heal" logic is more complex

Solution 2

Changes:

  • Move retry into the client chain, after begin.
  • In retry: Modify retry to asynchronously use begin's event factory.
  • In heal: Transform the "healing loop" into "one heal request".
  • In heal: On request failure, check the data plane synchronously, or just request reselect if data plane monitoring is not available.
  • In begin: Wait for async retry success for requests from the user.

heal and refresh will indirectly use retry.

Pros:

  • Unified "retry" logic for the whole chain
  • heal becomes simpler

Cons:

  • retry's contract will change
  • We need changes in begin
