lh github actions workflow refactoring 2022 artifacts #60

Closed

Conversation

@lhotari (Owner) commented Mar 28, 2022

BewareMyPower and others added 30 commits March 22, 2022 23:35
…#14797)

Fixes apache#509

### Motivation

When a C++ producer is created successfully, it starts a send timer.
However, the callback captures a shared pointer to `ProducerImpl`
itself, which extends the lifetime of `ProducerImpl`: even after the
`Producer` object is destructed, the underlying `ProducerImpl` object
won't be destructed. It can only be destructed after `Client::close()`
is called.

### Modifications

- Pass a weak pointer of `ProducerImpl` to the send timer, and add an
  `asyncWaitSendTimeout` method that combines the `expires_from_now` and
  `async_wait` calls on the timer.
- Add `ClientTest.testReferenceCount` to verify that the reference count
  drops to 0 after the producer or consumer is destructed.
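
A minimal sketch of that weak-pointer pattern, using Boost.Asio directly. The `Producer` class, `handleSendTimeout`, and the timings here are illustrative stand-ins, not the actual `ProducerImpl` code:

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

// Stand-in for ProducerImpl: the send-timer callback captures only a
// weak_ptr, so the timer no longer extends the object's lifetime.
class Producer : public std::enable_shared_from_this<Producer> {
 public:
  explicit Producer(boost::asio::io_context& io) : timer_(io) {}

  // Mirrors the idea of asyncWaitSendTimeout: one place that combines
  // expires_from_now and async_wait, capturing a weak reference.
  void asyncWaitSendTimeout(boost::posix_time::time_duration d) {
    timer_.expires_from_now(d);
    std::weak_ptr<Producer> weakSelf{shared_from_this()};
    timer_.async_wait([weakSelf](const boost::system::error_code& ec) {
      auto self = weakSelf.lock();
      if (!self || ec) {
        return;  // producer already destroyed, or the wait was cancelled
      }
      self->handleSendTimeout();
    });
  }

 private:
  void handleSendTimeout() { std::cout << "send timeout fired\n"; }
  boost::asio::deadline_timer timer_;
};

int main() {
  boost::asio::io_context io;
  {
    auto producer = std::make_shared<Producer>(io);
    producer->asyncWaitSendTimeout(boost::posix_time::seconds(1));
  }  // last shared_ptr dropped: the pending callback holds only a weak_ptr
  io.run();  // the aborted wait runs, lock() fails, nothing is kept alive
}
```

Before the fix, the equivalent lambda captured the shared pointer by value, keeping the `ProducerImpl` alive until `Client::close()` cancelled the timer.
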
Fixes apache#14665

### Motivation

In the C++ client, a connect timeout task is created before each
asynchronous connect operation. If the connection cannot be established
within the configured timeout, the task's callback is invoked to close
the connection, and the `createProducer` or `subscribe` methods then
return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, a shared pointer, represents
the timeout task. However, after `ClientConnection::close` is called,
the shared pointer is reset and the underlying `PeriodicTask` object is
released. After that, when the `stop` method is called on the released
`PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault occurs.

The root cause is that `connectTimeoutTask_` can be accessed from two
threads while one of them could release the memory. See apache#14665 for
more explanation. This race condition also makes the Python tests flaky,
because a similar test exists in the Python test suite. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix apache#14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, `connectTimeoutTask_` always points to the same `PeriodicTask`
object, whose methods are thread safe.
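
A minimal sketch of the invariant the fix relies on. The class bodies below are simplified stand-ins for the real `PeriodicTask` and `ClientConnection`:

```cpp
#include <atomic>
#include <memory>

// Stand-in for PeriodicTask: its methods are thread safe, so any thread
// may call stop() at any time, as long as the object itself stays alive.
class PeriodicTask {
 public:
  void stop() { running_.store(false); }
 private:
  std::atomic<bool> running_{true};
};

class ClientConnection {
 public:
  ClientConnection() : connectTimeoutTask_(std::make_shared<PeriodicTask>()) {}

  void close() {
    connectTimeoutTask_->stop();
    // Before the fix, the line below nulled the member here, so a callback
    // running concurrently on an I/O thread could call stop() through a
    // released (or null) pointer and segfault:
    // connectTimeoutTask_.reset();
  }

  void handleTcpConnected() {
    // Runs on an I/O thread; with the reset removed, this always sees the
    // same valid PeriodicTask object.
    connectTimeoutTask_->stop();
  }

 private:
  std::shared_ptr<PeriodicTask> connectTimeoutTask_;
};

int main() {
  ClientConnection conn;
  conn.handleTcpConnected();  // callback path
  conn.close();               // close path: no reset, no race
}
```
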

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to run `testConnectTimeout` 10 times. In my local environment it never
failed, whereas before this patch it failed easily.
* Only run crowdin-download once a day to avoid timeouts

* Add license to website's package.json

Master issue: apache#14824 

### Motivation

The crowdin translations do not get updated often enough to warrant three downloads a day, especially given that the download and upload are currently timing out and preventing the English version of the website from being published.

### Modifications

* Make the `crowdin-download` step only run once a day.
* Add license to the yarn `package.json`.

### Verifying this change

This is a trivial change.
* WIP: 2.8.3 release notes

* Refactor titles

Co-authored-by: Anonymitaet <[email protected]>

* Fix tense; upper case first letters

* Correct version; update date

Co-authored-by: Anonymitaet <[email protected]>
…ete. (apache#14818)

### Motivation

If the ``patternAutoDiscoveryPeriod`` configuration value is small, there may be unnecessary subscribe requests.

### Modifications

*Describe the modifications you've done.*
* Java client: only schedule batch when there are pending messages

* Update variable name; handle reconnect case

* Remove code duplication

* Fix method name

* Improve method name; add tests; fix bugs raised by tests

* Remove comment that is now incorrect

* Chain batch flush task when task triggers delivery

* Prevent early batch flush by storing lastBatchSendNanos

* Refactor getPendingQueueSize

* Remove unintentionally committed comments

* Remove premature optimization to cancel and reschedule batchFlushTask

* Ensure buffered batchMessageContainer messages can fail due to send timeout

* Fix comment to match new design

* Fix batch message counting log line

* Guard against null batchMessageContainer

* Fix send timeout logic to fire when batching is disabled

Master Issue: apache#11100

### Motivation

As observed in apache#11100, the Java client producer consumes CPU even when doing nothing. This is especially noticeable when using many producers. Instead, we can schedule the batch timer only when there are messages to deliver, or when the timer has just fired and delivered messages.

If there are concerns about this optimization, I will need to take some time to complete benchmarks.

### Modifications

* Skip message batch delivery if the producer is not in `Ready` state. As a consequence, ensure that messages pending in the `batchMessageContainer` are still eligible to fail for `sendTimeout`.
* If there is no current batch flush task, schedule a single flush task when a message is added to a batch message container (assuming the message does not also trigger the delivery of the batch and the producer is in READY state).
* Schedule another batch flush task if and only if the batch flush task itself was responsible for sending messages.
* Keep track of `lastBatchSendNanoTime`, and only flush a batch message container if the `BatchingMaxPublishDelayMicros` time has passed since the last send time.
* Note that the timer task is only ever updated within a `synchronized (this)` block, so we are guaranteed a consistent view free of race conditions. (There is one race condition: an existing batch timer might get canceled while it is starting. This is of little consequence, since it results in either no delivery or a small batch.) A sketch of the scheduling idea follows this list.
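
The actual change is in the Java client, but the scheduling idea is language-independent. Below is a compact C++ sketch of it; the `Batcher` class, the thread-per-timer mechanics, and the timings are invented purely for illustration (a real implementation would reschedule for the remaining delay rather than the full one):

```cpp
#include <chrono>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

using Clock = std::chrono::steady_clock;

// Invented Batcher class: arm a flush task only when a message enters an
// otherwise idle batch, and keep the timer chain alive only while there is
// work. (The demo uses one detached thread per timer shot for brevity; the
// Java client uses its executor's timer instead.)
class Batcher {
 public:
  explicit Batcher(std::chrono::milliseconds maxDelay) : maxDelay_(maxDelay) {}

  void add(std::string msg) {
    std::lock_guard<std::mutex> lock(mutex_);
    batch_.push_back(std::move(msg));
    if (!flushTaskArmed_) {  // idle until now: arm a single flush task
      flushTaskArmed_ = true;
      armFlushTask();
    }
  }

 private:
  void armFlushTask() {
    std::thread([this] {
      std::this_thread::sleep_for(maxDelay_);
      std::lock_guard<std::mutex> lock(mutex_);
      if (batch_.empty()) {
        flushTaskArmed_ = false;  // idle again: no timer, no CPU cost
      } else if (Clock::now() - lastBatchSendTime_ >= maxDelay_) {
        std::cout << "flushing " << batch_.size() << " messages\n";
        batch_.clear();
        lastBatchSendTime_ = Clock::now();
        armFlushTask();  // re-arm only because this task just sent
      } else {
        armFlushTask();  // not due yet: a send happened recently
      }
    }).detach();
  }

  const std::chrono::milliseconds maxDelay_;
  std::mutex mutex_;
  std::deque<std::string> batch_;
  bool flushTaskArmed_ = false;            // guarded by mutex_
  Clock::time_point lastBatchSendTime_{};  // guarded by mutex_
};

int main() {
  Batcher batcher(std::chrono::milliseconds(100));
  batcher.add("a");
  batcher.add("b");  // timer already armed: nothing extra is scheduled
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
}
```

The key property is that an idle producer has no armed timer at all, which is what removes the constant CPU cost observed in apache#11100.
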

### Verifying this change

There are existing tests that cover batch message delivery. Specifically, the `BatchMessageTest` in the `pulsar-broker` module covers these changes.

### Does this pull request potentially affect one of the following parts:

This is a backwards compatible change.

### Documentation

- [x] `no-need-doc`
…pache#14810)

### Motivation
When transaction log recovery fails with a `CursorAlreadyClosedException`, we should stop the recovery op: the cursor has been closed and the transaction log has been closed, so stopping the recovery op releases the thread resources,
like apache#14781

### Modifications
When recovery fails with a `CursorAlreadyClosedException`, complete the recovery.
…ache#14801)

### Motivation
When transaction buffer recovery fails to create the reader or the writer, or fails to read the snapshot, throwing a `PulsarClientException`, we should re-recover this topic, so close the topic to re-initialize it.
### Motivation

apache#14823 fixes the flaky `testConnectTimeout`, but it also
introduces a regression of apache#14587: when the fd limit is reached,
`connectionTimeoutTask_` is not initialized with a non-null value, and
calling the `stop` method on it directly causes a segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.
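
A minimal compilable sketch of the added guard; the `PeriodicTask` body is a stand-in:

```cpp
#include <memory>

struct PeriodicTask {
  void stop() {}
};

struct ClientConnection {
  std::shared_ptr<PeriodicTask> connectTimeoutTask_;  // stays null if the
                                                      // fd limit is hit
  void close() {
    if (connectTimeoutTask_) {   // the added null check
      connectTimeoutTask_->stop();
    }
  }
};

int main() {
  ClientConnection conn;  // connectTimeoutTask_ was never initialized
  conn.close();           // safe now; previously a segmentation fault
}
```
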
apache#14835)

Master Issue: apache#14824 

### Motivation

In apache#14828, I tried to fix the Python client docs. The correction partially succeeded. Looking at the `asf-site` branch, I noticed that the docs were not in the correct location and were duplicated. I subsequently learned that Docusaurus copies all content in the `static` directory to the `build/` output directory, so there is no need to copy the docs in the `build` script; the docs just have to be output in the correct location.

### Modifications

* Move the docs created in apache#14788 from `python-client` to `api/python` so that they will properly render on the website. I already made this change on the `asf-site` branch here to ensure that the current website renders correctly: 53271f1. 
* Update the python client doc creation script.

### Verifying this change

I ran the appropriate scripts to ensure that the site is built and rendered as expected. You can view the current state of the 2.9.1 docs here: https://pulsar.apache.org/api/python/2.9.1/pulsar.html.
…pache#14830)

### Motivation
Currently, when transaction buffer recovery finds no snapshot, we don't close the reader, which can produce an OOM problem.
…apache#14807)

### Motivation
When transaction buffer recovery fails with a `CursorAlreadyClosedException`, we should stop the recovery op: the cursor has been closed and the transaction buffer has been closed, so stopping the recovery op releases the thread resources,
like apache#14781
* [Broker] Fix NPE when subscription is already removed

* Cover same case for NonPersistentTopic

Master Issue: apache#14362 

### Motivation

There is currently a race condition when we remove a subscription. The race and how to reproduce it are described in apache#14362. One consequence of the race is that we may try to remove the subscription from the topic twice, which leads to an NPE, as described in the issue.

### Modifications

* Verify that the `sub` is not null before getting its stats.

### Verifying this change

This is a trivial change.
### Motivation

apache#14896 is flaky. After diving into the code, I found it's a bug in closing the topic policy reader: we should check `ex.getCause()` instead of `ex`.

Stacktrace:
```
2022-03-27T10:42:16,795+0800 [broker-client-shared-internal-executor-58-1] WARN  org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService - Read more topic polices exception, read again.
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$AlreadyClosedException: Consumer already closed
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_291]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_291]
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_291]
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628) ~[?:1.8.0_291]
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996) ~[?:1.8.0_291]
	at org.apache.pulsar.client.impl.MultiTopicsReaderImpl.readNextAsync(MultiTopicsReaderImpl.java:140) ~[classes/:?]

```
mattisonchao and others added 9 commits March 28, 2022 16:43
### Motivation

The current non-durable cursor does not have the correct state.
For example, when a reader is created, I always see the cursor state as ``Uninitialized`` via the ``getInternalStats`` method.

```json
{
  "reader-xxxxx" : {
      "markDeletePosition" : "19785:18718",
      "readPosition" : "19807:42735",
      "waitingReadOp" : false,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : -2257,
      "cursorLedger" : -1,
      "cursorLedgerLastEntry" : -1,
      "individuallyDeletedMessages" : "[]",
      "lastLedgerSwitchTimestamp" : "2022-03-24T20:03:51.85Z",
      "state" : "Uninitialized",
      "numberOfEntriesSinceFirstNotAckedMessage" : 744993,
      "totalNonContiguousDeletedMessagesRange" : 0,
      "subscriptionHavePendingRead" : true,
      "subscriptionHavePendingReplayRead" : false,
      "properties" : { }
    }
}
```

### Modifications

- Correct the cursor state.
* [improve][client] Add log for reader acknowledge.

* Fix checkstyle

* Change log level to warn
…itive (apache#14802)

Let's get this in and unblock flaky tests
…t rid of CVE-2020-36518 (apache#14871)

* [fix][security] Upgrade JacksonXML to get rid of CVE-2020-36518

* force jackson-databind version
…cursor (apache#14900)

Fixes apache#14880

### Motivation

Non durable cursor was not closed properly.

### Modifications
For a non-durable cursor, `cursor.asyncClose` did nothing; the proper way is to call `topic.getManagedLedger().asyncDeleteCursor`.
…pache#14898)

* [C++] Fix send callback might not be invoked in key based batching

### Motivation

When the C++ client enables key-based batching, there is a chance that
the send callback is never invoked. See
https://github.com/apache/pulsar/blob/32df93f693bfdf42953bd728a12ecdea1796dcc8/pulsar-client-cpp/lib/ProducerImpl.cc#L272-L275

If a batch container has multiple batches, only one batch is processed
during `closeAsync`. Even worse, the semaphores of the other batches
won't be released.

### Modifications

- Add a `clearPendingBatches` method to clear all pending batches and
  process them. Then call this method in `closeAsync` and
  `getPendingCallbacksWhenFailed`.
- Add a test `testCloseBeforeSend` to verify when a producer has
  multiple pending batches, all callbacks can be invoked in
  `closeAsync`.

* Add processAndClear() to batch message container
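
A sketch of the `clearPendingBatches` idea; the container, `Result` values, and callback shape are invented for illustration. With key-based batching there is one pending batch per ordering key, so close must drain every batch and complete its callbacks (which is what releases the per-message semaphores):

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum Result { ResultOk, ResultAlreadyClosed };
using SendCallback = std::function<void(Result)>;

class KeyBasedBatchContainer {
 public:
  void add(const std::string& key, SendCallback cb) {
    batches_[key].push_back(std::move(cb));
  }

  // Mirrors the idea of clearPendingBatches: take *all* batches out of the
  // container and process them, instead of processing just one.
  void clearPendingBatches(Result result) {
    auto pending = std::move(batches_);
    batches_.clear();  // leave the container in a defined, empty state
    for (auto& entry : pending) {
      for (auto& cb : entry.second) {
        cb(result);  // every message's callback fires exactly once
      }
    }
  }

 private:
  std::map<std::string, std::vector<SendCallback>> batches_;
};

int main() {
  KeyBasedBatchContainer container;
  container.add("key-1", [](Result) { std::cout << "callback 1\n"; });
  container.add("key-2", [](Result) { std::cout << "callback 2\n"; });
  // Called from closeAsync / getPendingCallbacksWhenFailed in the fix:
  container.clearPendingBatches(ResultAlreadyClosed);
}
```
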
lhotari force-pushed the lh-github-actions-workflow-refactoring-2022-artifacts branch from 35672ad to fa201e4 on March 28, 2022 18:35
lhotari force-pushed the lh-github-actions-workflow-refactoring-2022-artifacts branch from ffeb60e to 3c555cb on March 28, 2022 22:38
@lhotari lhotari closed this Mar 30, 2022
lhotari pushed a commit that referenced this pull request Apr 12, 2022
lhotari pushed a commit that referenced this pull request Aug 25, 2022