[Sub-]handlers (sort of) break eventual consistency at high load? #641

Open

thpica opened this issue Jan 12, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@thpica

thpica commented Jan 12, 2021

Long story short

When a handler has multiple long-running subhandlers (e.g. external API calls) and events arrive at intervals longer than the default batching window (so they are not collapsed by batching) but faster than they can be handled, it happens that some subhandlers never process the latest state, even after the event stream stops.

Description

From my understanding, Kopf guarantees that handlers and subhandlers will process the latest change soon: it may skip intermediate events and jump to the latest version when the operator becomes available to do so (event batching).
An interesting side effect of this is that if the final state of a resource does not trigger a handler (or subhandler), the latest event is not guaranteed to be processed.

When a handler has multiple subhandlers (or a single one raising a TemporaryError), it will require multiple lifecycles for an event to be processed.
In the following scenario, events can come faster than they can be processed:

  • events come at a high rate for a watched resource
  • the (sub)handlers are slow (without blocking the event loop of course)
  • CPU usage is high and the event loop slows down.

When this happens, the resource may change between the subhandlers' lifecycles.
This has two very weird side effects:

  1. If the main handler has conditions on which subhandlers will be triggered, the execution plan can change as it is executed, because the main handler is called again for each lifecycle.
  2. Different subhandlers called for the same event might receive different (newer) versions of the resource.

This is very different from running a single handler without subhandlers, and it can add so much complexity to high-load systems that the only advantage of subhandlers (idempotence) seems a very weak argument.

I think the documentation warning should be much bigger and scarier, because subhandlers may seem like an elegant solution at first but can cause your system to fall apart in production under high load.

Here is a small repro script, with a bit of plumbing, using the example CRD from the documentation:
import asyncio
import functools
import logging
from typing import Callable

import kopf
import kubernetes.client
import kubernetes.config

logging.basicConfig(
    level=logging.WARNING,
    format="[%(asctime)s] %(levelname)s - %(message)s",
)


@kopf.on.field(
    "zalando.org", "v1", "ephemeralvolumeclaims", field="status.tick"
)
async def evc_tick_update(new: int, logger: logging.LoggerAdapter, **_):
    logging.warning("[PARENT] {}".format(new))

    if new % 2:

        @kopf.subhandler(id="tick_sub_0")
        async def tick_sub_0(new: int, logger: logging.LoggerAdapter, **_):
            logger.warning("[SUB 0] {}".format(new))
            await asyncio.sleep(0.5)  # Long async task, i.e: API call

        @kopf.subhandler(id="tick_sub_1")
        async def tick_sub_1(new: int, logger: logging.LoggerAdapter, **_):
            logger.warning("[SUB 1] {}".format(new))
            await asyncio.sleep(0.5)  # Long async task, i.e: API call

        @kopf.subhandler(id="tick_sub_2")
        async def tick_sub_2(new: int, logger: logging.LoggerAdapter, **_):
            logger.warning("[SUB 2] {}".format(new))
            await asyncio.sleep(0.5)  # Long async task, i.e: API call


def call_async(fn: Callable, *args, **kwargs):
    return asyncio.get_event_loop().run_in_executor(
        None, functools.partial(fn, *args, **kwargs)
    )


async def ticker():
    kubernetes.config.load_kube_config()
    api_instance = kubernetes.client.CustomObjectsApi()
    tick = 0
    for _ in range(20):
        await call_async(
            api_instance.patch_namespaced_custom_object,
            group="zalando.org",
            version="v1",
            plural="ephemeralvolumeclaims",
            namespace="default",
            name="ticked",
            body={
                "status": {"tick": tick},
            },
        )

        await asyncio.sleep(0.2)  # Avoid event batching.
        tick += 1


@kopf.on.startup()
async def startup_fn(**_):
    kubernetes.config.load_kube_config()
    api_instance = kubernetes.client.CustomObjectsApi()
    await call_async(
        api_instance.create_namespaced_custom_object,
        group="zalando.org",
        version="v1",
        namespace="default",
        plural="ephemeralvolumeclaims",
        body={
            "apiVersion": "zalando.org/v1",
            "kind": "EphemeralVolumeClaim",
            "metadata": {"name": "ticked"},
            "spec": {},
            "status": {},
        },
    )


@kopf.on.cleanup()
async def cleanup_fn(logger, **kwargs):
    kubernetes.config.load_kube_config()
    api_instance = kubernetes.client.CustomObjectsApi()
    await call_async(
        api_instance.delete_namespaced_custom_object,
        group="zalando.org",
        version="v1",
        namespace="default",
        plural="ephemeralvolumeclaims",
        name="ticked",
    )


async def main():
    kopf_task = asyncio.create_task(
        kopf.operator(namespace="default", standalone=True)
    )
    await asyncio.sleep(1)  # Wait for kopf to be initialized.

    ticker_task = asyncio.create_task(ticker())
    await asyncio.wait([kopf_task, ticker_task])


if __name__ == "__main__":
    asyncio.run(main())

Execution output:

[2021-01-12 17:45:52,461] WARNING - [PARENT] 0
[2021-01-12 17:45:52,676] WARNING - [PARENT] 1
[2021-01-12 17:45:52,677] WARNING - [SUB 0] 1
[2021-01-12 17:45:53,314] WARNING - [PARENT] 4
[2021-01-12 17:45:53,529] WARNING - [PARENT] 5
[2021-01-12 17:45:53,529] WARNING - [SUB 0] 5
[2021-01-12 17:45:54,174] WARNING - [PARENT] 8
[2021-01-12 17:45:54,392] WARNING - [PARENT] 9
[2021-01-12 17:45:54,393] WARNING - [SUB 0] 9
[2021-01-12 17:45:55,040] WARNING - [PARENT] 12
[2021-01-12 17:45:55,253] WARNING - [PARENT] 13
[2021-01-12 17:45:55,254] WARNING - [SUB 0] 13
[2021-01-12 17:45:55,897] WARNING - [PARENT] 16
[2021-01-12 17:45:56,114] WARNING - [PARENT] 17
[2021-01-12 17:45:56,114] WARNING - [SUB 0] 17
[2021-01-12 17:45:56,737] WARNING - [PARENT] 19
[2021-01-12 17:45:56,738] WARNING - [SUB 1] 19
[2021-01-12 17:45:57,363] WARNING - [PARENT] 19
[2021-01-12 17:45:57,363] WARNING - [SUB 2] 19
What happened

Handlers for the last event:

[PARENT] 17
[SUB 0] 17
[PARENT] 19
[SUB 1] 19
[PARENT] 19
[SUB 2] 19

What should have happened

[PARENT] 19
[SUB 0] 19
[PARENT] 19
[SUB 1] 19
[PARENT] 19
[SUB 2] 19

Keywords

subhandlers, idempotence, eventual consistency

@thpica thpica added the question Further information is requested label Jan 12, 2021
@nolar
Owner

nolar commented Jan 31, 2021

Yes, that is a known issue. Sadly, it is difficult to fix. Several (2 or 3) attempts were made in the past, all failed (zalando-incubator/kopf#279 (comment)). My comment from there:

Looks like a known problem (with no issue though, so this one will be it). I've made a few attempts to solve this problem previously: #9 (zalando-incubator/kopf#9), and then again in #163 (zalando-incubator/kopf#163) — but it failed spectacularly after the merge (had to revert it in #182 (zalando-incubator/kopf#182)).

The issue can happen even with the normal handlers, not necessarily the sub-handlers, if the object changes before the full handling cycle is complete. In that case, the last-seen essence of the object is persisted as the diff-base, as seen after the last successful handler. The preceding handlers, however, could have seen a different state (before the changes) and can end up with an inconsistent end-state.

The issue will be addressed for sure. Thanks for refreshing it and providing a repro!

In one of the latest attempts (#163), I tried to keep a hash/digest/checksum of the object's essence for each individual handler & subhandler once it is finished. And if, on the next cycle, the hash mismatches with the current essence, it means that the object has changed, and the handler/subhandler should be re-executed with the current-aka-latest state.

Sadly, I do not remember why it failed; I need to re-read #182 carefully and re-play the whole changeset. Maybe all the problems that prevented this "eventual consistency" fix from being applied are solved by now. But the codebase has changed a lot since Aug 2019.
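
For illustration, here is a minimal sketch of that per-handler digest idea outside of Kopf's internals (the names _processed_digests, essence_digest, needs_rerun, and mark_done are hypothetical, not real Kopf APIs):

import hashlib
import json
from typing import Any, Dict

# Hypothetical in-memory storage: handler id -> digest of the essence it last processed.
_processed_digests: Dict[str, str] = {}


def essence_digest(essence: Dict[str, Any]) -> str:
    # Stable checksum of the object's essence (canonical JSON with sorted keys).
    canonical = json.dumps(essence, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def needs_rerun(handler_id: str, essence: Dict[str, Any]) -> bool:
    # True if this handler has not yet succeeded against the current essence.
    return _processed_digests.get(handler_id) != essence_digest(essence)


def mark_done(handler_id: str, essence: Dict[str, Any]) -> None:
    # Record which essence this handler has successfully processed.
    _processed_digests[handler_id] = essence_digest(essence)

On every handling cycle, the framework would consult needs_rerun() before skipping a (sub)handler as already done, so a handler that succeeded against an older essence gets re-executed with the latest state.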

@nolar nolar added bug Something isn't working and removed question Further information is requested labels Jan 31, 2021
@thpica
Author

thpica commented Feb 11, 2021

Thank you for your detailed answer.
I understand this is not a trivial issue and will continue to investigate on my side.
I will let you know if I can find a workaround.

@nolar nolar changed the title Subhandlers (sort of) break eventual consistency at high load? [Sub-]handlers (sort of) break eventual consistency at high load? Feb 24, 2021
@hapatrick

I am pretty sure we are hitting this bug now. Does anyone have suggested workarounds? The only thing I can think of to do for now is to put a timer on the handlers in our code that are susceptible to this, so even if a change is missed, the timer will eventually trigger the handler. Obviously this is not very satisfying; if anyone has better suggestions, I'd love to hear them.
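
For reference, here is a hedged sketch of such a timer fallback using Kopf's built-in timers and the example CRD from the repro above (the 60-second interval and the reconcile body are arbitrary placeholders, not a recommendation):

import kopf


# Periodic reconciliation as a safety net: even if a change event is missed
# under load, the timer re-reads the latest state on a fixed schedule.
@kopf.timer("zalando.org", "v1", "ephemeralvolumeclaims", interval=60)
async def reconcile(status, logger, **_):
    tick = (status or {}).get("tick")
    logger.info("Reconciling against the latest observed tick: %s", tick)
    # ...idempotently re-apply the desired state here (placeholder)...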

@thpica
Author

thpica commented Nov 29, 2022

@hapatrick My workaround was to split the watched entities over multiple namespaces and replicate the controller, to reduce the load per controller.
I also made sure the handlers and subhandlers executed as fast as possible: avoid blocking code and network calls if you can.
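
To make the namespace-splitting part concrete, here is a rough sketch reusing the embedded-operator call from the repro above; the NAMESPACE environment variable and the one-replica-per-namespace deployment are assumptions about the setup, not Kopf requirements:

import asyncio
import os

import kopf

# Each operator replica watches only its assigned namespace, e.g. one
# Deployment per namespace with NAMESPACE set accordingly.
WATCHED_NAMESPACE = os.environ.get("NAMESPACE", "default")


async def main():
    await kopf.operator(namespace=WATCHED_NAMESPACE, standalone=True)


if __name__ == "__main__":
    asyncio.run(main())

The same split can be done from the CLI by starting one process per namespace, e.g. something like kopf run --namespace=<ns> handlers.py.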

I think adding timers would make it worse as the problem stems from the fact that the controller does not have enough time to process a handler and its subhandlers before the next change.
