Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ISSUE#4178] When there are a large number of errors in the HTTP target, RocketMQ consumption is slow #4544

Merged
merged 19 commits into from
Nov 14, 2023

Conversation

yanrongzhen
Copy link
Contributor

@yanrongzhen yanrongzhen commented Nov 6, 2023

Fixes #4178

Modifications

Provide MQ-storage-based retry strategy in http protocol.
Add retry, rocketmq-retry modules.

Copy link

codecov bot commented Nov 6, 2023

Codecov Report

Attention: 86 lines in your changes are missing coverage. Please review.

Comparison is base (7438a55) 16.14% compared to head (4fa39d6) 16.61%.
Report is 8 commits behind head on master.

❗ Current head 4fa39d6 differs from pull request most recent head 60088b9. Consider uploading reports for the commit 60088b9 to get more accurate results

Files Patch % Lines
.../eventmesh/runtime/core/protocol/RetryContext.java 0.00% 38 Missing ⚠️
...eventmesh/runtime/boot/AbstractRemotingServer.java 0.00% 9 Missing ⚠️
...time/core/protocol/producer/EventMeshProducer.java 0.00% 8 Missing ⚠️
...untime/core/protocol/producer/ProducerManager.java 0.00% 6 Missing ⚠️
...core/protocol/http/consumer/EventMeshConsumer.java 0.00% 5 Missing ⚠️
...h/storage/rocketmq/common/TopicNameHelperImpl.java 0.00% 4 Missing ⚠️
...e/eventmesh/protocol/http/HttpProtocolAdaptor.java 0.00% 2 Missing ⚠️
...he/eventmesh/runtime/boot/EventMeshGrpcServer.java 0.00% 2 Missing ⚠️
...re/protocol/http/push/AbstractHTTPPushRequest.java 0.00% 2 Missing ⚠️
.../core/protocol/http/push/AsyncHTTPPushRequest.java 0.00% 2 Missing ⚠️
... and 6 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4544      +/-   ##
============================================
+ Coverage     16.14%   16.61%   +0.47%     
- Complexity     1581     1603      +22     
============================================
  Files           747      743       -4     
  Lines         28965    28570     -395     
  Branches       2541     2477      -64     
============================================
+ Hits           4677     4748      +71     
+ Misses        23840    23369     -471     
- Partials        448      453       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@pandaapo pandaapo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain how this PR solves the problem described in issue #4178? Because most of the code I see seems to be a unified and abstract refactoring of the code related to the grpc, http, and tcp modules.


能否解释下该PR是怎么解决issue #4178所描述的问题?因为我看到的代码好像大部分是给grpc、http、tcp三个模块相关代码做统一和抽象的重构工作。

Copy link
Member

@pandaapo pandaapo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR #4179 that was linked to this issue could solve the problem, but I forgot about the reasons why the community did not merge this PR. @xwm1992 . So I don't know if there are any similar situations with this PR.


当初该issue的PR #4179能解决问题,但是我忘记是出于什么方面的考虑,社区没有合并该PR,@xwm1992 。 所以不知道该PR有没有这方面的问题。

@pandaapo
Copy link
Member

pandaapo commented Nov 7, 2023

Could you explain how this PR solves the problem described in issue #4178? Because most of the code I see seems to be a unified and abstract refactoring of the code related to the grpc, http, and tcp modules.

能否解释下该PR是怎么解决issue #4178所描述的问题?因为我看到的代码好像大部分是给grpc、http、tcp三个模块相关代码做统一和抽象的重构工作。

Waiting for your reply.

@yanrongzhen
Copy link
Contributor Author

Could you explain how this PR solves the problem described in issue #4178? Because most of the code I see seems to be a unified and abstract refactoring of the code related to the grpc, http, and tcp modules.
能否解释下该PR是怎么解决issue #4178所描述的问题?因为我看到的代码好像大部分是给grpc、http、tcp三个模块相关代码做统一和抽象的重构工作。

Waiting for your reply.

First of all, the goal of this PR is to provide the ability to retry in the MQ retry queue within the Retry module.
Here are two changes made:

  1. The EventMeshProducer, ProducerManager, and SendMessageContext of each protocol have been standardized. The reason for this change is that Retry should be a standard processing method, regardless of the specific underlying protocol, and the usage of these three protocols is completely consistent, so they are merged.

  2. In RetryContext, when the user turns on the switch, the message is sent to the retry queue asynchronously to retry.


首先, 此PR的目标是在Retry模块内部提供于MQ重试队列来进行重试的能力.
这里做了两个改动:

  1. 对于各协议的EventMeshProducer, ProducerManager, SendMessageContext进行了标准化, 此改动的原因是Retry应该是一个标准的处理方式, 而不关心具体底层的协议, 并且这三者在各协议的用法完全一致, 因此进行了合并处理.
  2. 在RetryContext中, 在用户打开开关时, 支持将消息异步发送至重试队列中, 以达到重试的效果.

@yanrongzhen yanrongzhen requested a review from pandaapo November 9, 2023 03:08
@yanrongzhen yanrongzhen requested a review from pandaapo November 9, 2023 11:24
@yanrongzhen yanrongzhen requested a review from pandaapo November 9, 2023 15:28
@yanrongzhen
Copy link
Contributor Author

yanrongzhen commented Nov 10, 2023

@pandaapo

After communicating with the maintainer, I used the plugin method to solve the dependency problem discussed before, please review it again, thanks.


经过和maintainer进行沟通后, 我使用了plugin方式来解决了之前讨论的依赖的问题, 帮忙再review一下, 谢谢

@pandaapo
Copy link
Member

I used the plugin method to solve the dependency problem discussed before

Your idea is better than mine.

pandaapo
pandaapo previously approved these changes Nov 10, 2023
@pandaapo pandaapo changed the title [ISSUE#4178] Provide MQ-storage-based retry strategy in http protocol. [ISSUE#4178] When there are a large number of errors in the HTTP target, RocketMQ consumption is slow Nov 10, 2023
@ConfigFiled(field = "server.retry.storageEnabled")
private boolean eventMeshServerRetryStorageEnabled = Boolean.FALSE;
@ConfigFiled(field = "server.retry.plugin.type")
private String eventMeshRetryPluginType = Constants.STANDALONE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may want to decouple the retry plugin and storage plugin. However, when the retry plugin is non standalone, the specific configuration of the storage medium for the retry strategy currently still uses the configuration of the EventMesh message storage medium.

我觉得你可能是想将retry plugin和storage plugin解耦。但是当retry plugin是非standalone时,目前重试策略的存储介质的具体配置仍然使用的是EventMesh消息存储介质的配置。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isolation is configured to allowing users to retry using standalone mode when the storage medium is not using standalone.


是的, 配置隔离是为了应对当存储介质不使用standalone时, 也允许用户使用standalone模式重试。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it only serves the purpose of "allowing users to retry using standalone mode when the storage medium is not using standalone", then the significance of this modification is not significant, as the previous server. retry. storageEnabled=false can also achieve this.

What I want to express is that the modifications you made to isolate are not thorough enough, because when the retry plugin is non standalone, the storage medium on which the retry strategy depends still uses the configuration of the storage plugin's storage medium, which will bring new problems. For example, the user has configured retry. plugin. type=rocketmq, storage. plugin. type=rabbitmq, but the middleware configuration used for the former is still the latter.

如果只起到“当存储介质不使用standalone时, 也允许用户使用standalone模式重试”这一个作用的话,那这个修改意义就不大了,因为之前的server.retry.storageEnabled=false也能做到。
我想表达的是你为了隔离做的这些修改,不够彻底,因为当retry plugin是非standalone时,重试策略依赖的存储介质仍然使用Storage插件的存储介质的配置,这就会带来新的问题。比如用户配置retry.plugin.type=rocketmqstorage.plugin.type=rabbitmq,但是前者用的中间件配置还是后者的。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the original intention here is to use the default strategy. There is no need to reflect it as standalone. It can be marked as default. Others, such as rocketmq and rabbitmq, can have different retry strategies. eventMeshRetryPluginType = “default”; That’s fine. @pandaapo @yanrongzhen


我理解这里的本意是走默认的策略,没必要体现是standalone,可以标识为default,其他例如rocketmq,rabbitmq,可以有不同的重试策略实现,eventMeshRetryPluginType = “default”; 这样就好。 @pandaapo @yanrongzhen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relationship between the retry strategy plugin and storage can be reflected in the document, and there is no need to strongly bind the configuration name and code logic. I think everyone can understand this.


重试策略插件与使用的存储的关系,可以体现在文档中,不用在配置名称和代码逻辑上强绑定。这点我想大家也是都能理解的。

Copy link
Member

@pandaapo pandaapo Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this submission, the author changed the writing of the retry configuration from retry.storageEnabled=false to retry.plugin.type=standalone(default). The idea is very good, which separates the configuration of the retry plugin and the storage plugin. However, as far as the current code is concerned, new issues may arise. For example, if the user's configuration is retry.plugin.type=rocketmq and storage.plugin.type=rabbitmq, but the retry plugin is actually connected to the address of rabbitmq (retry will also fail), or if the user's configuration is retry.plugin.type=rocketmq and storage.plugin.type=standalone, the retry plugin is actually 'connected' to memory (retry will also fail). So this actually makes retry.plugin.type=standalone(default) not as good as previous retry.storageEnabled=false.

I would like to know your views on this point. @yanrongzhen @xwm1992

作者在该提交中,将retry配置的写法从retry.storageEnabled=false改成了retry.plugin.type=standalone(default)。思路是很好的,这样将retry插件和storage插件的配置隔离开。但是就目前的代码而言会出现新的问题。比如用户的配置是retry.plugin.type=rocketmqstorage.plugin.type=rabbitmq,但其实retry插件连接是rabbitmq的地址(重试也将失效),或者用户配置的是retry.plugin.type=rocketmqstorage.plugin.type=standalone,其实retry插件“连接”的是内存(重试也将失效)。所以这反而使得retry.plugin.type=standalone(default)没有之前retry.storageEnabled=false好。
就这一点,我想听下你们的看法。@yanrongzhen @xwm1992

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the current modification has nothing to do with standalone. If the retry strategy is configured, the corresponding configuration will be used. If the configuration is not loaded, the default strategy will be used. For the situation the rocketmq strategy is used, but the rabbitmq is used for storage. Just mark it in the documentation guide, or add a judgment to determine whether the strategy matches the storage, or throw a clear mismatch exception.


我理解目前的修改与standalone 没什么关系了,配置了retry策略就是走对应配置的,如果配置的没加载到就是走default策略,对于使用retry策略使用的rocketmq策略,但是存储使用的rabbitmq的这种情况,在文档指引中体现标识一下就行了,或者就是加一个策略与存储是否匹配的判断,抛出一个明确的不匹配的异常也可以。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The judgment of whether it matches needs to determine whether the strings of the two plug-in types are equal, which may lead to the coupling of the configuration.


是否匹配的判断需要判断两者插件类型的字符串是否相等, 感觉可能会导致配置的耦合。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there will be coupling, so either there are relevant descriptions in the document guidelines to make it clear to users, or coupling judgment processing must be carried out in the code. The judgment conditions are: if the retry strategy is used or the corresponding The storage module is the same, or it is the default strategy.


是的,是会存在耦合的情况,所以要么文档指引中有相关描述,让使用人清楚,要么就是要在代码中进行耦合的判断处理,判断的条件就是:如果使用了重试策略要么与对应的storage模块相同,要么就是default策略。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or coupling judgment processing must be carried out in the code. The judgment conditions are: if the retry strategy is used or the corresponding The storage module is the same, or it is the default strategy.

Since that's the case, I prefer the above treatment. Otherwise, if the two configurations do not match, the retry strategy may only fail without reporting an error or providing corresponding prompts. At this point, it is difficult for users to detect, as they assume that the retry strategy is running normally according to their own configuration retry.plugin.type.

既然如此,我倾向于上述这种处理。否则的话,当两者配置不匹配时,重试策略可能只是失效,而不报错或者给出相应的提示。这时用户是很难察觉的,会以为重试策略在按照自己的配置retry.plugin.type正常运行。

@yanrongzhen
Copy link
Contributor Author

yanrongzhen commented Nov 14, 2023

I submitted the code based on the latest code review, please help to review it again, thanks.

Copy link
Member

@pandaapo pandaapo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you can verify the repair effect of this PR locally based on the reproduction steps described in issue #4178. Thank you!

希望你能在本地根据issue #4178描述的重现步骤,再验证下该PR的修复效果。感谢!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] When there are a large number of errors in the HTTP target, RocketMQ consumption is slow
3 participants