
Do not fail Trino if Kafka event listener cannot connect #23606

Open
losipiuk wants to merge 3 commits into trinodb:master from losipiuk:lukaszos/drop-unused-variables-a1e32c

Conversation

@losipiuk
Member

@losipiuk losipiuk commented Sep 30, 2024

Description

We do not want Trino to fail startup on intermittent network failure (e.g. DNS not being available).

Not sure if that needs a release note.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## General

* Prevent network issues in the Kafka event listener from causing cluster startup problems. ({issue}`23606`)

Member


This is wrapped in a try/catch in io.trino.plugin.eventlistener.kafka.KafkaEventListener#KafkaEventListener. Why is that not sufficient?

Member


separate commit? Only for logging purposes?

Member


Can this be a long operation that could throttle query concurrency? Should it be async?

Member Author


Yeah - maybe we should try to create it when bootstrapping, and then just have a loop which tries to reinitialize if bootstrapping failed.
Then we would not call kafkaProducerSupplier here.

Member

@sopel39 sopel39 Sep 30, 2024


What does it solve, really? I think the Kafka listener should either be present or not. There is `kafka-event-listener.terminate-on-initialization-failure` if we don't want Trino startup to fail in case listener creation fails.

Member Author


Oh - I did not know this mechanism exists. It solves the problem of not terminating Trino if initialization fails, yeah. But do you want it to not try to connect again when initialization failed? It seems like it should retry until it establishes a connection, similar to what it would do if you lose the connection to Kafka while Trino is running.

@marton-bod
Contributor

All bootstrap-related errors should be caught here: https://github.com/trinodb/trino/blob/master/plugin/trino-kafka-event-listener/src/main/java/io/trino/plugin/eventlistener/kafka/KafkaEventListener.java#L52-L66
Wouldn't it be sufficient to make terminateOnInitializationFailure default to false? https://github.com/trinodb/trino/blob/master/plugin/trino-kafka-event-listener/src/main/java/io/trino/plugin/eventlistener/kafka/KafkaEventListenerConfig.java#L50

A related question: if Kafka event listener cannot connect to the brokers, isn't that sufficient reason to fail the cluster and let the user know so they can investigate? There can be critical observability infra built on top of the event-listener, so having no data coming in (but the cluster executing queries) can be misleading
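For reference, the `kafka-event-listener.terminate-on-initialization-failure` flag discussed here is a plain config property, so opting out of startup failure is a one-line change in the event listener's properties file. A minimal sketch (the file layout and any other required properties are assumptions, not from this thread):

```properties
# Hypothetical event listener properties snippet; only the flag below is
# taken from this discussion. Required properties such as broker endpoints
# are omitted.
kafka-event-listener.terminate-on-initialization-failure=false
```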

@losipiuk
Member Author

> All bootstrap-related errors should be caught here: https://github.com/trinodb/trino/blob/master/plugin/trino-kafka-event-listener/src/main/java/io/trino/plugin/eventlistener/kafka/KafkaEventListener.java#L52-L66 Wouldn't it be sufficient to make terminateOnInitializationFailure default to false? https://github.com/trinodb/trino/blob/master/plugin/trino-kafka-event-listener/src/main/java/io/trino/plugin/eventlistener/kafka/KafkaEventListenerConfig.java#L50
>
> A related question: if the Kafka event listener cannot connect to the brokers, isn't that sufficient reason to fail the cluster and let the user know so they can investigate? There can be critical observability infra built on top of the event listener, so having no data coming in (but the cluster executing queries) can be misleading.

How is lack of connectivity at startup different from losing connectivity while the cluster is up and running?

@losipiuk losipiuk force-pushed the lukaszos/drop-unused-variables-a1e32c branch from eaa9f41 to 53c2e78 Compare September 30, 2024 15:14
@losipiuk losipiuk force-pushed the lukaszos/drop-unused-variables-a1e32c branch from 53c2e78 to 29e2c38 Compare September 30, 2024 15:16
@losipiuk
Member Author

@sopel39 restructured a bit - PTAL

@sopel39
Member

sopel39 commented Sep 30, 2024

> How is lack of connectivity at startup different from losing connectivity while the cluster is up and running?

One could be misconfiguration, the other could be an intermittent connectivity issue. E.g. even if query-finished event delivery fails, the query will still succeed IIRC, so there isn't much you can do anyway.

    kafkaPublisher = new KafkaEventPublisher(config, producerFactory, stats);
}
catch (Exception e) {
    if (config.getTerminateOnInitializationFailure()) {
Member


see #23606 (comment). I don't really think we need to complicate it further

while (true) {
    try {
        // success
        KafkaProducer<String, String> kafkaProducer = kafkaProducerSupplier.get();
Member


Why do we need to recreate it? Will kafkaProducer stop working permanently in case of some intermittent issues?

private Set<String> excludedFields = Collections.emptySet();
private Map<String, String> kafkaClientOverrides = Collections.emptyMap();
private Duration requestTimeout = new Duration(10, SECONDS);
private boolean terminateOnInitializationFailure = true;
Member


We could probably just keep this commit, but I still don't understand the rationale. Does it fail in tests, or is it something else?

Contributor

@marton-bod marton-bod Oct 1, 2024


Not 100% sure this is a good change, but I can be convinced. I think that if a user explicitly configures a kafka-event-listener, they would want to know if the brokers are unreachable instead of the plugin silently "failing". They can always set this flag to false explicitly if they are fine with ignoring this problem.

@marton-bod
Contributor

@losipiuk To clarify the context, what is the problem we are trying to solve? Is it that if we lose connectivity to Kafka while Trino is already running, then our queries start failing? That shouldn't be the case right now; we should just hit a timeout exception here: https://github.com/trinodb/trino/blob/master/plugin/trino-kafka-event-listener/src/main/java/io/trino/plugin/eventlistener/kafka/KafkaEventPublisher.java#L103
Did you observe something different?
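The timeout behavior described here (a bounded wait on event delivery, so a Kafka outage cannot block queries) can be sketched with plain `java.util.concurrent` types. This is an illustrative sketch only, not the actual `KafkaEventPublisher` code; the class and method names are invented:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: wait a bounded time for an asynchronous send
// to complete, and report failure instead of propagating it into the
// query path. Not the actual KafkaEventPublisher implementation.
public final class BoundedPublish
{
    private BoundedPublish() {}

    public static boolean awaitDelivery(CompletableFuture<?> sendFuture, long timeoutMillis)
    {
        try {
            sendFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true; // event delivered in time
        }
        catch (TimeoutException | ExecutionException e) {
            return false; // log-and-drop; the query itself still succeeds
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a broker outage costs at most the configured timeout per event instead of failing or blocking the query.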

@losipiuk
Member Author

losipiuk commented Oct 1, 2024

> @losipiuk To clarify the context, what is the problem we are trying to solve?

The problem (maybe not important) is that we generally want Trino to start up even if there are networking problems at start time.
For that alone we could just change the terminateOnInitializationFailure flag to be false by default.

But with just that change, if Kafka is not running on startup, the listener will not self-heal later on once the network problem goes away.
The second commit addresses that.

Queries are not failing, of course.
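The self-heal behavior described above could, as a sketch, look like a retry loop around producer creation. All names here (`RetryingInit`, `initWithRetry`, the bounded `maxAttempts`) are illustrative assumptions, not the code from the PR's second commit; in the real listener such a retry would presumably run on a background thread so startup is not blocked:

```java
import java.util.function.Supplier;

// Illustrative sketch of retry-until-initialized; not the PR's actual code.
// Bounded by maxAttempts here for clarity; an unbounded variant would keep
// looping until the producer is created.
public final class RetryingInit
{
    private RetryingInit() {}

    public static <T> T initWithRetry(Supplier<T> factory, int maxAttempts, long backoffMillis)
            throws InterruptedException
    {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // e.g. a wrapper around kafkaProducerSupplier.get()
                return factory.get();
            }
            catch (RuntimeException e) {
                lastFailure = e;
                // back off so transient DNS or network failures can clear
                Thread.sleep(backoffMillis);
            }
        }
        throw lastFailure;
    }
}
```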

@marton-bod
Contributor

@losipiuk Thanks, the PR makes more sense to me now.

> Why do we need to recreate it? Will kafkaProducer stop working permanently in case of some intermittent issues?

I would just like to clarify Karol's comment, to make sure the change is really needed, because it does seem to complicate the code logic. Thanks

@sopel39
Member

sopel39 commented Oct 1, 2024

> The problem (maybe not important) is that we generally want Trino to start up even if there are networking problems at start time.
> For that alone we could just change the terminateOnInitializationFailure flag to be false by default.
>
> But with just that change, if Kafka is not running on startup, the listener will not self-heal later on once the network problem goes away.
> The second commit addresses that.

Is it motivated by a real use case? I still think some startup connectivity validation is needed before the cluster is considered ready for use. Otherwise, if we cannot reach Kafka later on, we should probably fail queries.

@losipiuk
Member Author

losipiuk commented Oct 1, 2024

> The problem (maybe not important) is that we generally want Trino to start up even if there are networking problems at start time.
> For that alone we could just change the terminateOnInitializationFailure flag to be false by default.
>
> But with just that change, if Kafka is not running on startup, the listener will not self-heal later on once the network problem goes away.
> The second commit addresses that.

> Is it motivated by a real use case? I still think some startup connectivity validation is needed before the cluster is considered ready for use. Otherwise, if we cannot reach Kafka later on, we should probably fail queries.

It is motivated by a real use case. But I agree that what should be done in such a case depends on how important it is for the event listener to work. I would say that most of the time we probably care more about queries working than about the event listener - but maybe there are cases where having queries work while the event listener does not can cause lots of problems.

@mosabua
Member

mosabua commented Oct 1, 2024

I added a suggested release note entry. Looks good, @losipiuk?

@losipiuk
Member Author

losipiuk commented Oct 1, 2024

Yeah - looks good. Thanks @mosabua.

@sopel39 do you feel strongly it should not go in? Maybe worth a chat offline?

@sopel39
Member

sopel39 commented Oct 1, 2024

> It is motivated by a real use case. But I agree that what should be done in such a case depends on how important it is for the event listener to work. I would say that most of the time we probably care more about queries working than about the event listener - but maybe there are cases where having queries work while the event listener does not can cause lots of problems.

A similar discussion happened around #22911 (see https://trinodb.slack.com/archives/GKZ8GS0SK/p1723336559476059?thread_ts=1723222353.289599&cid=GKZ8GS0SK) and we decided that anything that makes events non-durable by default could be a violation of auditing policy and is a breaking change.

This PR adds some complexity, which I'm not sure is needed, TBH.

@sopel39
Member

sopel39 commented Oct 1, 2024

> I would say that most of the time we probably care more about queries working than about the event listener

That is indeed true. Hence the current check is during cluster startup, not while queries are already running.

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Oct 22, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Oct 22, 2024
@mosabua
Member

mosabua commented Oct 22, 2024

Added stale-ignore since I am sure we want to get this in.


Labels

cla-signed, stale-ignore

Development

Successfully merging this pull request may close these issues.

5 participants