
[C++] Auto update topic partitions #6732

Merged 7 commits into apache:master on May 10, 2020

Conversation

@BewareMyPower (Contributor) commented Apr 13, 2020

Motivation

We need to increase the number of producers or consumers when partitions are updated.

The Java client has already implemented this feature, see #3513. This PR tries to implement the same feature in the C++ client.

Modifications

  • Add a boost::asio::deadline_timer to PartitionedConsumerImpl and PartitionedProducerImpl to register a lookup task that detects partition changes periodically (see the sketches below);
  • Add an unsigned int configuration parameter to specify the period, in seconds, of the partition-change detection (default: 60 seconds);
  • Unlock the mutex_ in PartitionedConsumerImpl::receive after state_ is checked.

Explanation: When new consumers are created, handleSinglePartitionConsumerCreated is eventually called, and it tries to lock mutex_. It may happen that receive acquires the lock again and again, so that handleSinglePartitionConsumerCreated is blocked in lock.lock() for a long time.
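
For illustration, here is a minimal standalone sketch of the periodic re-lookup pattern this PR adds: a boost::asio::deadline_timer that is re-armed after each check. The PartitionsUpdater class and the checkPartitions() placeholder are hypothetical names; the real client compares the partition metadata from a lookup against its current partition count and creates producers/consumers for any new partitions.

    #include <boost/asio.hpp>
    #include <iostream>

    class PartitionsUpdater {
      public:
        PartitionsUpdater(boost::asio::io_service& io, unsigned int periodSeconds)
            : timer_(io), periodSeconds_(periodSeconds) {}

        void start() { schedule(); }

      private:
        void schedule() {
            timer_.expires_from_now(boost::posix_time::seconds(periodSeconds_));
            timer_.async_wait([this](const boost::system::error_code& ec) {
                if (ec) {
                    return;  // timer cancelled, e.g. after closeAsync()
                }
                checkPartitions();  // in the real client: a partition metadata lookup
                schedule();         // re-arm the timer for the next period
            });
        }

        void checkPartitions() {
            // Placeholder: the real task fetches partition metadata and, if the
            // partition count grew, creates producers/consumers for new partitions.
            std::cout << "checking partitions..." << std::endl;
        }

        boost::asio::deadline_timer timer_;
        unsigned int periodSeconds_;
    };

A driver would construct it with the configured period (default 60 seconds) and run the io_service, e.g. boost::asio::io_service io; PartitionsUpdater updater(io, 60); updater.start(); io.run();

The mutex change can be sketched the same way. The shape below is an assumption based on the explanation above, not the PR's exact code: check state_ under the lock, then release the lock before blocking on the message queue, so that callbacks such as handleSinglePartitionConsumerCreated can acquire mutex_ without waiting on receive:

    Result PartitionedConsumerImpl::receive(Message& msg) {
        Lock lock(mutex_);     // Lock: a unique_lock alias used in the client
        if (state_ != Ready) {
            return ResultAlreadyClosed;
        }
        lock.unlock();         // don't hold mutex_ while blocking below
        messages_.pop(msg);    // blocks until a message is available
        return ResultOk;
    }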

Verifying this change

  • Make sure that the change passes the CI checks.

This change adds tests and can be verified as follows:

Run the PartitionsUpdateTest test suite after the CMake build.

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs)

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

1 similar comment
@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run cpp-tests

1 similar comment
@BewareMyPower (Contributor, Author)

/pulsarbot run cpp-tests

@BewareMyPower (Contributor, Author)

/pulsarbot run unit-tests

@BewareMyPower (Contributor, Author)

/pulsarbot run cpp-tests

1 similar comment
@BewareMyPower (Contributor, Author)

/pulsarbot run cpp-tests

Review comment on this diff:

    assert(unsubscribedSoFar_ <= numPartitions_);
    assert(consumerIndex <= numPartitions_);
    Lock consumersLock(consumersMutex_);
    const auto numPartitions = numPartitions_;
Member:

Could this change be avoided? It seems to bring in a lot of unneeded changes, and it is a little confusing that numPartitions_ and numPartitions are mixed together in this file.

@BewareMyPower (Contributor, Author):

I'll change it soon; here I should have called a method like getNumPartitionsWithLock() (as I did in PartitionedConsumerImpl).
Because producers_/consumers_ may be modified in the timer's callback, the code that accesses producers_/consumers_ and some other members (see comments in the headers) should be protected by lock()/unlock(), except in start().
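
For illustration, a minimal sketch of such a lock-guarded getter, assuming the member names from this discussion (the PR's exact code may differ):

    int PartitionedProducerImpl::getNumPartitionsWithLock() {
        Lock producersLock(producersMutex_);  // guards against the timer's callback resizing producers_
        return numPartitions_;                // read only while holding the lock
    }

Callers that already hold the lock, such as the timer callback itself, would instead use a plain getNumPartitions() that reads the member without locking, matching the "with or without lock" getters in the commit list below.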

@jiazhai (Member) commented Apr 16, 2020

/pulsarbot run-failure-checks

@jiazhai (Member) commented Apr 16, 2020

@BewareMyPower thanks for the great work. The C++ test failures do not seem normal, though. Do the tests pass in your local environment?

@BewareMyPower (Contributor, Author)

> @BewareMyPower thanks for the great work. The C++ test failures do not seem normal, though. Do the tests pass in your local environment?

Yes. The test also passed in GitHub Actions; you can see the CI of commit 89ba471.

cpp-tests > run-tests (line 136)

[98/163] PartitionsUpdateTest.testPartitionsUpdate (13748 ms)

The strange thing is that the 163 tests never all complete. I've seen at most 141 tests completed, and sometimes only 121. The current run stops at test 98:

Thu, 16 Apr 2020 16:19:38 GMT
[96/163] BasicEndToEndTest.testPatternEmptyUnsubscribe (37 ms)
Thu, 16 Apr 2020 16:19:38 GMT
[97/163] BasicEndToEndTest.testpatternMultiTopicsHttpConsumerPubSub (2493 ms)
Thu, 16 Apr 2020 16:19:38 GMT
[98/163] PartitionsUpdateTest.testPartitionsUpdate (13718 ms)

My guess is that after another 90 minutes (when the 2-hour limit is reached), the number of completed tests will still be 98.

@BewareMyPower (Contributor, Author)

> @BewareMyPower thanks for the great work. The C++ test failures do not seem normal, though. Do the tests pass in your local environment?

Now I've found where the problem is. The tests get stuck in the loop in TEST(BasicEndToEndTest, testPartitionTopicUnAckedMessageTimeout):

    while (true) {
        // maximum wait time
        ASSERT_LE(timeWaited, unAckedMessagesTimeoutMs * 3);
        if (messagesReceived >= 10 * 2) {  // **Problem**: Never reached here
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        timeWaited += 500;
    }

It seems that MessageListener doesn't work after my commits. I will try to solve it.

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

2 similar comments
@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

1 similar comment
@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run process

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run process

@BewareMyPower (Contributor, Author)

Hi @codelipenghui @merlimat @jiazhai @sijie, could you help me with the unit test failures?

I've only changed files under the pulsar-client-cpp directory, but there are some unrelated tests that can't pass:

  • Integration - Process / process
  • Unit - Adaptors / unit-tests
  • Unit - Broker Auth SASL / unit-tests
  • Unit - Flaky / unit-test-flaky
  • Unit - Proxy / unit-tests
  • Unit / unit-tests

They all failed with No space left on device. It seems to be a GitHub Actions problem.

Failure message of Integration - Process / process:

[ERROR] Error processing tar file(exit status 1): write /pulsar/connectors/pulsar-io-redis-2.6.0-SNAPSHOT.nar: no space left on device

[ERROR] Failed to execute goal com.spotify:dockerfile-maven-plugin:1.4.13:build (default) on project pulsar-all-docker-image: Could not build image: Error processing tar file(exit status 1): write /pulsar/connectors/pulsar-io-redis-2.6.0-SNAPSHOT.nar: no space left on device -> [Help 1]

The other tests:

[ERROR] Failed to execute goal on project bc_2_0_1: Could not resolve dependencies for project org.apache.pulsar.tests:bc_2_0_1:jar:2.6.0-SNAPSHOT: Could not transfer artifact org.apache.pulsar:pulsar-client:jar:2.0.1-incubating from/to central (https://repo1.maven.org/maven2): GET request of: org/apache/pulsar/pulsar-client/2.0.1-incubating/pulsar-client-2.0.1-incubating.jar from central failed: No space left on device -> [Help 1]

I tried rolling back to an older commit that had passed these tests before, and rerunning the failure checks, but the error still happens. How should I deal with it? Should I close this PR and open a new one?

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

/pulsarbot run unit-test-flaky

@sijie (Member) commented Apr 20, 2020

@BewareMyPower those are flaky tests due to an environmental issue. Typically it is because there is not enough disk space on GitHub, so it is not able to start the container-based integration tests. You can use /pulsarbot run-failure-checks to re-run those failures. But before you trigger that, it is good to check the failure logs to see whether they are related to your code change.

@BewareMyPower (Contributor, Author)

> @BewareMyPower those are flaky tests due to an environmental issue. Typically it is because there is not enough disk space on GitHub, so it is not able to start the container-based integration tests. You can use /pulsarbot run-failure-checks to re-run those failures. But before you trigger that, it is good to check the failure logs to see whether they are related to your code change.

Thanks for your reply. I've rerun those failures many times but only see the same No space left on device error. I've also seen some other PRs with the same failure, like #6760 and #6766. The problem is that I'm not sure whether the failure is recoverable. Did the frequent reruns generate so many dirty files that the test environment no longer has enough space?

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@sijie (Member) commented Apr 20, 2020

@BewareMyPower this usually indicates a problem in GitHub Actions. We might need to see how to clean up the disk space. /cc @tuteng @codelipenghui

@BewareMyPower (Contributor, Author)

/pulsarbot run-failure-checks

@BewareMyPower (Contributor, Author)

@merlimat @sijie Finally all checks have passed :) PTAL

@BewareMyPower BewareMyPower changed the title Auto update topic partitions for C++ client [C++] Auto update topic partitions May 4, 2020
@jiazhai jiazhai merged commit 30934e1 into apache:master May 10, 2020
@jiazhai (Member) commented May 12, 2020

Conflict with #6938; will mark this as 2.5.2.

jiazhai pushed a commit that referenced this pull request May 12, 2020
* auto update topic partitions extend for consumer and producer in c++ client

* add c++ unit test for partitions update

* format code with clang-format-5.0

* stop partitions update timer after producer/consumer called closeAsync()

* fix bugs when running gtest-parallel

* fix bug: Producer::flush() may cause deadlock

* use getters to read `numPartitions` with or without lock
(cherry picked from commit 30934e1)
Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this pull request May 27, 2020
addisonj pushed a commit to instructure/pulsar that referenced this pull request Jun 12, 2020
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020