Skip to content

Conversation

@gaborgsomogyi
Copy link
Contributor

What changes were proposed in this pull request?

While I'm implementing SPARK-32032 I've found a bug in Kafka: https://issues.apache.org/jira/browse/KAFKA-10318. This will cause issues only later when it's fixed but it would be good to fix it now because SPARK-32032 would like to bring in AdminClient where the code blows up with the mentioned ConfigException. This would reduce the code changes in the mentioned jira. In this PR I've changed default.api.timeout.ms to request.timeout.ms which fulfils this condition.

Why are the changes needed?

Solve later problems and reduce SPARK-32032 PR size.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

@gaborgsomogyi
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Jul 28, 2020

Test build #126722 has finished for PR 29272 at commit 4c3a9f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

According to KAFKA-10318, is this Kafka 2.5.0 only bug?

@gaborgsomogyi
Copy link
Contributor Author

No, just only the latest added. It effects all previous versions where these params exist.

@HeartSaVioR
Copy link
Contributor

So my understanding is that request.timeout.ms should take precedence over default.api.timeout.ms, and it's desired for end users (in apps) to configure request.timeout.ms instead of default.api.timeout.ms, though they need to increase default.api.timeout.ms as well if they'd like to set request.timeout.ms higher than default.api.timeout.ms. Do I understand correctly?

@gaborgsomogyi
Copy link
Contributor Author

@HeartSaVioR that's exactly the main message. This part must be enforced on Kafka side just like in case of AdminClient:
they need to increase default.api.timeout.ms as well if they'd like to set request.timeout.ms higher than default.api.timeout.ms

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HeartSaVioR
Copy link
Contributor

Looks like there's no further input. I'll retrigger test and merge once it passes.

@HeartSaVioR
Copy link
Contributor

retest this, please

2 similar comments
@HeartSaVioR
Copy link
Contributor

retest this, please

@HeartSaVioR
Copy link
Contributor

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 30, 2020

Test build #126814 has finished for PR 29272 at commit 4c3a9f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126827 has finished for PR 29272 at commit 4c3a9f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

retest this, please

@HyukjinKwon
Copy link
Member

retest this please

@HyukjinKwon
Copy link
Member

@gaborgsomogyi can you sync with the master and rebase? then the GitHub Actions build should pass. Jenkins seems down for some reasons.

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126842 has finished for PR 29272 at commit 4c3a9f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126840 has finished for PR 29272 at commit 4c3a9f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

lol, somehow it passes Jenkins build. Merging.

@HeartSaVioR
Copy link
Contributor

Thanks, merged to master!

@gaborgsomogyi
Copy link
Contributor Author

Thanks everybody to help me!

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Aug 4, 2020

I have been getting reports for flaky tests based on this - I took a closer look on the explanation of Kafka configurations, and realized they're not interchangeable. My bad.

This test (in Kafka) perfectly describes how these configurations work together (I'm wondering why default.api.timeout.ms is not used there and it just passes the value into timeout parameter, but nothing changed to show the behavior):

https://github.com/apache/kafka/blob/28b7d8e21656649fb09b09f9bacfe865b0ca133c/clients/src/test/java/org/apache/kafka/clients/admin/KafkaAdminClientTest.java#L4086-L4127

    @Test
    public void testSuccessfulRetryAfterRequestTimeout() throws Exception {
        HashMap<Integer, Node> nodes = new HashMap<>();
        MockTime time = new MockTime();
        Node node0 = new Node(0, "localhost", 8121);
        nodes.put(0, node0);
        Cluster cluster = new Cluster("mockClusterId", nodes.values(),
                Arrays.asList(new PartitionInfo("foo", 0, node0, new Node[]{node0}, new Node[]{node0})),
                Collections.emptySet(), Collections.emptySet(),
                Collections.emptySet(), nodes.get(0));

        final int requestTimeoutMs = 1000;
        final int retryBackoffMs = 100;
        final int apiTimeoutMs = 3000;

        try (AdminClientUnitTestEnv env = new AdminClientUnitTestEnv(time, cluster,
                AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, String.valueOf(retryBackoffMs),
                AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, String.valueOf(requestTimeoutMs))) {
            env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());

            final ListTopicsResult result = env.adminClient()
                    .listTopics(new ListTopicsOptions().timeoutMs(apiTimeoutMs));

            // Wait until the first attempt has been sent, then advance the time
            TestUtils.waitForCondition(() -> env.kafkaClient().hasInFlightRequests(),
                    "Timed out waiting for Metadata request to be sent");
            time.sleep(requestTimeoutMs + 1);

            // Wait for the request to be timed out before backing off
            TestUtils.waitForCondition(() -> !env.kafkaClient().hasInFlightRequests(),
                    "Timed out waiting for inFlightRequests to be timed out");
            time.sleep(retryBackoffMs);

            // Since api timeout bound is not hit, AdminClient should retry
            TestUtils.waitForCondition(() -> env.kafkaClient().hasInFlightRequests(),
                    "Failed to retry Metadata request");
            env.kafkaClient().respond(prepareMetadataResponse(cluster, Errors.NONE));

            assertEquals(1, result.listings().get().size());
            assertEquals("foo", result.listings().get().iterator().next().name());
        }
    }

When we request without timeout parameter, we're now using default value of default.api.timeout.ms which is 60 seconds, longer than test timeout. If the request can't be retried, then request.timeout.ms would make the request fail in same time (which is intended). But if the request can be retried, it's no longer the same and it may hold the request until it is timed out via default.api.timeout.ms (which is not expected).

So if the change is intended to make sure default.api.timeout.ms >= request.timeout.ms (see KAFKA-10318), both configurations should be changed together.

HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2020
…s well when specifying "request.timeout.ms" on replacing "default.api.timeout.ms"

### What changes were proposed in this pull request?

This patch is a follow-up to fill the gap in #29272 which missed to also provide `default.api.timeout.ms` as well.  #29272 unintentionally changed the behavior on Kafka side timeout which is incompatible with the test timeout. (`default.api.timeout.ms` gets default value which is 60 seconds, longer than test timeout.)

### Why are the changes needed?

We realized the PR for SPARK-32468 (#29272) doesn't work as we expect. See #29272 (comment) for more details.

### Does this PR introduce _any_ user-facing change?

No, as it only touches the tests.

### How was this patch tested?

Will trigger builds from Jenkins or Github Action multiple time and confirm.

Closes #29343 from HeartSaVioR/SPARK-32468-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
…tests

### What changes were proposed in this pull request?
While I'm implementing SPARK-32032 I've found a bug in Kafka: https://issues.apache.org/jira/browse/KAFKA-10318. This will cause issues only later when it's fixed but it would be good to fix it now because SPARK-32032 would like to bring in `AdminClient` where the code blows up with the mentioned `ConfigException`. This would reduce the code changes in the mentioned jira. In this PR I've changed `default.api.timeout.ms` to `request.timeout.ms` which fulfils this condition.

### Why are the changes needed?
Solve later problems and reduce SPARK-32032 PR size.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes apache#29272 from gaborgsomogyi/SPARK-32468.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
…s well when specifying "request.timeout.ms" on replacing "default.api.timeout.ms"

### What changes were proposed in this pull request?

This patch is a follow-up to fill the gap in apache#29272 which missed to also provide `default.api.timeout.ms` as well.  apache#29272 unintentionally changed the behavior on Kafka side timeout which is incompatible with the test timeout. (`default.api.timeout.ms` gets default value which is 60 seconds, longer than test timeout.)

### Why are the changes needed?

We realized the PR for SPARK-32468 (apache#29272) doesn't work as we expect. See apache#29272 (comment) for more details.

### Does this PR introduce _any_ user-facing change?

No, as it only touches the tests.

### How was this patch tested?

Will trigger builds from Jenkins or Github Action multiple time and confirm.

Closes apache#29343 from HeartSaVioR/SPARK-32468-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants