[NEW] Introduce Quality-of-Service for the replication stream to reduce full sync as a result of buffer overruns #1596

Open
xbasel opened this issue Jan 21, 2025 · 2 comments


@xbasel
Member

xbasel commented Jan 21, 2025

When replica nodes are very busy processing client traffic, the replication stream can get starved, and the primary disconnects the replica once its client output buffer limit is exceeded. This often triggers a full sync.

To reduce the likelihood of full syncs caused by client output buffer overruns, we can add a quality-of-service mechanism in the replica to prioritize replication traffic during high load.

Description of the feature
This feature improves the availability of replicas and the stability of primaries by reducing the chances of full syncs.

The replica can detect replication traffic bursts by monitoring the application buffer during reads from the primary socket. For example, if the buffer has been completely filled for the last N reads, the replica can assume the kernel-level TCP receive queue isn't empty and traffic is high. In that case, it can prioritize additional socket reads on the primary file descriptor, helping to drain the primary's shared replication buffer faster and reducing the chances of full syncs.
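A rough sketch of the detection side (names like full_read_streak, REPL_READ_BUF_LEN, and the threshold value are illustrative, not existing code):

#include <sys/types.h>

/* Illustrative sketch only: track whether the last N reads from the
 * primary filled the application buffer completely. */
#define REPL_READ_BUF_LEN   (16 * 1024)  /* stand-in for the replica's read buffer size */
#define FULL_READ_STREAK_N  4            /* hypothetical threshold */

static int full_read_streak = 0;

/* Call after every read() on the primary's fd. */
void trackPrimaryRead(ssize_t nread) {
    if (nread == REPL_READ_BUF_LEN)
        full_read_streak++;   /* buffer filled: kernel rx queue likely non-empty */
    else
        full_read_streak = 0; /* short read: queue drained, reset the streak */
}

/* High replication traffic is assumed once N consecutive reads were full. */
int replicationBurstDetected(void) {
    return full_read_streak >= FULL_READ_STREAK_N;
}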

@zuiderkwast
Contributor

You mean the IP_TOS field in IP packets? https://en.wikipedia.org/wiki/Type_of_service. IIUC, the kernel can prioritize packets according to this and routers can use it too. I think we can set it on replication connections and cluster bus connections. It should be just a setsockopt call on the socket fd.
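For the record, a minimal sketch of that call (the TOS value is just an example; modern networks interpret this field as DSCP and routers may ignore it):

#include <netinet/in.h>
#include <netinet/ip.h>   /* IPTOS_LOWDELAY */
#include <sys/socket.h>

/* Example only: tag a connection's outgoing IPv4 packets with a TOS value.
 * Returns 0 on success, -1 on error (see errno). */
int setConnectionTos(int fd) {
    int tos = IPTOS_LOWDELAY;
    return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}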

@xbasel
Member Author

xbasel commented Jan 21, 2025

No. I believe the bottleneck lies with the engine itself (CPU), not the network. If the replica processes the primary connection like any other client, and there are many clients sending commands, the replication connection could become starved.

Current implementation:

fds = epoll();
for (fd : fds) {
    buf = read(fd);    /* one read per ready fd; the primary is treated like any other client */
    process(buf);
}

Proposed approach:

fds = epoll();
for (fd : fds) {
    if (fd == primary) {
        handleReplication(fd);   /* give the replication link priority */
    } else {
        buf = read(fd);
        process(buf);
    }
}

handleReplication(fd) {
    reads = 0;
    do {
        buf = read(fd);
        process(buf);
        reads++;
        /* keep draining while every read fills the buffer (the kernel rx
         * queue is likely non-empty), but cap the iterations so other
         * clients aren't starved to an extreme level */
    } while (len(buf) == MAX_BUF_LEN && reads < REPL_READ_BUDGET);
}

If the condition len(buf) == MAX_BUF_LEN is true, the rx queue in the kernel likely has more data to read. This can be observed with netstat: the Recv-Q field would stay well above 0, when it should ideally be at or near 0 at all times, especially for the replication connection.
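For completeness, the same Recv-Q signal can be sampled in code rather than via netstat; a sketch using the Linux-specific SIOCINQ ioctl (FIONREAD from <sys/ioctl.h> behaves the same for TCP sockets):

#include <sys/ioctl.h>
#include <linux/sockios.h>   /* SIOCINQ, Linux-specific */

/* Returns the number of unread bytes sitting in the kernel receive queue
 * for fd (what netstat shows as Recv-Q), or -1 on error. */
int recvQueueBytes(int fd) {
    int pending = 0;
    if (ioctl(fd, SIOCINQ, &pending) == -1)
        return -1;
    return pending;
}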

We can set IP_TOS, although I'm not sure how impactful it would be, especially since many routers might ignore it.

I believe I can demonstrate this in a PR, with a before & after comparison.
