
perf: High Deletion Latency When Route Data Volume Is Large #2342

Open
kachidoki opened this issue Jan 2, 2025 · 0 comments

Issue Faced

We are experiencing significant delays when deleting Routes in bulk. The problem becomes evident once the number of Routes in the cluster exceeds 10,000. During the deletion we observe noticeable latency and a clear spike in the etcd monitoring metrics.

Upon reviewing the code, I found the following segment which seems to be causing the issue:

// TODO: Maintain a reference count for each object without having to poll each time
func (u *upstreamClient) deleteCheck(ctx context.Context, obj *v1.Upstream) (bool, error) {
    // Routes are listed from the cluster (an etcd range call) on every
    // delete, while stream routes come from the in-memory cache.
    routes, _ := u.cluster.route.List(ctx)
    sroutes, _ := u.cluster.cache.ListStreamRoutes()
    if routes == nil && sroutes == nil {
        return true, nil
    }
    // Full scan over all routes to find any that still reference this upstream.
    for _, route := range routes {
        if route.UpstreamId == obj.ID {
            return false, fmt.Errorf("can not delete this upstream, route.id=%s is still using it now", route.ID)
        }
    }
    for _, sroute := range sroutes {
        if sroute.UpstreamId == obj.ID {
            return false, fmt.Errorf("can not delete this upstream, stream_route.id=%s is still using it now", sroute.ID)
        }
    }
    return true, nil
}

The line

    routes, _ := u.cluster.route.List(ctx)

fetches every route in the cluster on each deletion and then scans them all, so every delete costs a full etcd range request plus an O(n) iteration over the routes. This is unnecessary overhead, and it is exactly what the TODO at the top of the snippet warns about; a sketch of that reference-count idea follows.
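As an illustration only: a per-upstream reference count, updated whenever a route is created or deleted, would turn this check into an O(1) map lookup. The upstreamRefs type, its fields, and its hook points below are hypothetical and do not exist in the controller today; the sketch needs only the standard library's sync package.

    // Hypothetical reference counter for upstream usage; this only
    // sketches the idea behind the TODO in deleteCheck.
    type upstreamRefs struct {
        mu     sync.Mutex
        counts map[string]int // upstream ID -> number of routes referencing it
    }

    // add would be called whenever a route referencing upstreamID is created.
    func (r *upstreamRefs) add(upstreamID string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.counts[upstreamID]++
    }

    // remove would be called whenever such a route is deleted.
    func (r *upstreamRefs) remove(upstreamID string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.counts[upstreamID]--
        if r.counts[upstreamID] <= 0 {
            delete(r.counts, upstreamID)
        }
    }

    // inUse replaces the full scan in deleteCheck with an O(1) lookup.
    func (r *upstreamRefs) inUse(upstreamID string) bool {
        r.mu.Lock()
        defer r.mu.Unlock()
        return r.counts[upstreamID] > 0
    }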

[Screenshot: etcd monitoring dashboard showing latency spikes during Route deletion]

Additionally, the latency spikes visible in the etcd monitoring correspond to the etcd range calls made during the deletion of Routes.

I would like to understand why the code fetches the routes from the cluster on every check rather than reading them from the cache. Was this a deliberate design decision, or is it an oversight?
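For comparison, a cache-backed version of the check might look like the sketch below. This is an assumption on my part: it presumes the cluster cache exposes a ListRoutes() method symmetrical to the ListStreamRoutes() call already used in deleteCheck, and that the cache is kept in sync with the cluster.

    // Sketch only: answers the check from the in-memory cache so no
    // etcd range call is made per deletion. Assumes ListRoutes() exists,
    // mirroring the ListStreamRoutes() call in the original snippet.
    func (u *upstreamClient) deleteCheckFromCache(ctx context.Context, obj *v1.Upstream) (bool, error) {
        routes, err := u.cluster.cache.ListRoutes()
        if err != nil {
            return false, err
        }
        for _, route := range routes {
            if route.UpstreamId == obj.ID {
                return false, fmt.Errorf("can not delete this upstream, route.id=%s is still using it now", route.ID)
            }
        }
        sroutes, err := u.cluster.cache.ListStreamRoutes()
        if err != nil {
            return false, err
        }
        for _, sroute := range sroutes {
            if sroute.UpstreamId == obj.ID {
                return false, fmt.Errorf("can not delete this upstream, stream_route.id=%s is still using it now", sroute.ID)
            }
        }
        return true, nil
    }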

Logs

No response

Steps to Reproduce

  1. Install APISIX and the APISIX Ingress Controller.
  2. Create more than 10,000 Route CRDs.
  3. Delete the Route CRDs in bulk.

Environment

  • APISIX Ingress Controller Version:
    We are using an internal fork of the APISIX Ingress Controller based on an older release, so it differs from the latest official versions. However, the Route-deletion code that causes this performance issue is unchanged from the official versions.

  • Kubernetes Cluster Version: v1.24.4

  • OS Version: CentOS 7.6 x86
