
perf: High Deletion Latency When Route Data Volume Is Large #2342

Open
kachidoki opened this issue Jan 2, 2025 · 0 comments

Issue Faced

We are experiencing significant delays when deleting Routes in bulk. The problem becomes evident once the number of Routes in the cluster exceeds 10,000. During the deletion we observe noticeable latency and a clear spike in the etcd monitoring metrics.

Upon reviewing the code, I found the following segment which seems to be causing the issue:

// TODO: Maintain a reference count for each object without having to poll each time
func (u *upstreamClient) deleteCheck(ctx context.Context, obj *v1.Upstream) (bool, error) {
    // Routes are listed from the cluster (an etcd range call) on every
    // delete, while stream routes come from the in-memory cache.
    routes, _ := u.cluster.route.List(ctx)
    sroutes, _ := u.cluster.cache.ListStreamRoutes()
    if routes == nil && sroutes == nil {
        return true, nil
    }
    // Full scan over all routes to find any that still reference this upstream.
    for _, route := range routes {
        if route.UpstreamId == obj.ID {
            return false, fmt.Errorf("can not delete this upstream, route.id=%s is still using it now", route.ID)
        }
    }
    for _, sroute := range sroutes {
        if sroute.UpstreamId == obj.ID {
            return false, fmt.Errorf("can not delete this upstream, stream_route.id=%s is still using it now", sroute.ID)
        }
    }
    return true, nil
}

The line

    routes, _ := u.cluster.route.List(ctx)

fetches every route in the cluster on each deletion and then scans them all, so every delete costs a full etcd range request plus an O(n) iteration over the routes. This is unnecessary overhead, and it is exactly what the TODO at the top of the snippet warns about; a sketch of that reference-count idea follows.
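As an illustration only: a per-upstream reference count, updated whenever a route is created or deleted, would turn this check into an O(1) map lookup. The upstreamRefs type, its fields, and its hook points below are hypothetical and do not exist in the controller today; the sketch needs only the standard library's sync package.

    // Hypothetical reference counter for upstream usage; this only
    // sketches the idea behind the TODO in deleteCheck.
    type upstreamRefs struct {
        mu     sync.Mutex
        counts map[string]int // upstream ID -> number of routes referencing it
    }

    // add would be called whenever a route referencing upstreamID is created.
    func (r *upstreamRefs) add(upstreamID string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.counts[upstreamID]++
    }

    // remove would be called whenever such a route is deleted.
    func (r *upstreamRefs) remove(upstreamID string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.counts[upstreamID]--
        if r.counts[upstreamID] <= 0 {
            delete(r.counts, upstreamID)
        }
    }

    // inUse replaces the full scan in deleteCheck with an O(1) lookup.
    func (r *upstreamRefs) inUse(upstreamID string) bool {
        r.mu.Lock()
        defer r.mu.Unlock()
        return r.counts[upstreamID] > 0
    }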

[Screenshot: etcd monitoring dashboard showing latency spikes during Route deletion]

Additionally, the latency spikes visible in the etcd monitoring correspond to the etcd range calls made during the deletion of Routes.

I would like to understand why the code fetches the routes from the cluster on every check rather than reading them from the cache. Was this a deliberate design decision, or is it an oversight?
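For comparison, a cache-backed version of the check might look like the sketch below. This is an assumption on my part: it presumes the cluster cache exposes a ListRoutes() method symmetrical to the ListStreamRoutes() call already used in deleteCheck, and that the cache is kept in sync with the cluster.

    // Sketch only: answers the check from the in-memory cache so no
    // etcd range call is made per deletion. Assumes ListRoutes() exists,
    // mirroring the ListStreamRoutes() call in the original snippet.
    func (u *upstreamClient) deleteCheckFromCache(ctx context.Context, obj *v1.Upstream) (bool, error) {
        routes, err := u.cluster.cache.ListRoutes()
        if err != nil {
            return false, err
        }
        for _, route := range routes {
            if route.UpstreamId == obj.ID {
                return false, fmt.Errorf("can not delete this upstream, route.id=%s is still using it now", route.ID)
            }
        }
        sroutes, err := u.cluster.cache.ListStreamRoutes()
        if err != nil {
            return false, err
        }
        for _, sroute := range sroutes {
            if sroute.UpstreamId == obj.ID {
                return false, fmt.Errorf("can not delete this upstream, stream_route.id=%s is still using it now", sroute.ID)
            }
        }
        return true, nil
    }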

Logs

No response

Steps to Reproduce

  1. Install APISIX and the APISIX Ingress Controller.
  2. Create more than 10,000 Route CRDs.
  3. Delete the Route CRDs in bulk.

Environment

  • APISIX Ingress Controller Version:
    We are using an internal fork of the APISIX Ingress Controller based on an older release, so it differs from the latest official versions. However, the Route-deletion code that causes this performance issue is unchanged from the official versions.

  • Kubernetes Cluster Version: v1.24.4

  • OS Version: CentOS 7.6 x86
