Improvement for distributed backend throttling #1791

Closed
18 tasks done
rmsamitha opened this issue May 3, 2023 · 10 comments

Comments

@rmsamitha
Member

rmsamitha commented May 3, 2023

Problem

Distributed backend throttling is not accurate because throttle counters are only synced periodically between the Redis server and the Gateway nodes. This needs to be improved.
The slippage (requests allowed beyond the configured limit) is considerable: in certain load tests it was consistently higher than ~30%, and the percentage grows with the number of GW nodes and the length of the periodic sync interval between the Redis server and the Gateway nodes.

This slippage needs to be reduced to provide better accuracy in high-load use cases, where conforming to the defined backend throttling limits is critical.
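To make the failure mode concrete, the following is a minimal sketch (not the actual APIM code; class names, field names, and the sync interval are assumptions) of a gateway that throttles against a global count refreshed only once per sync interval. Between syncs each node decides on stale data, so with N gateways the overshoot grows with N and with the sync interval.

```java
// Illustrative sketch only: each gateway admits requests against a global
// count that is refreshed every SYNC_INTERVAL_MS, so requests admitted on
// other nodes in between are invisible and can exceed the limit.
public class PeriodicSyncThrottler {
    private static final long SYNC_INTERVAL_MS = 500;   // assumed sync period
    private final long limit;                           // backend throttle limit per window
    private long lastKnownGlobalCount = 0;              // stale between syncs
    private long localDelta = 0;                        // requests admitted since last sync

    public PeriodicSyncThrottler(long limit) {
        this.limit = limit;
    }

    public synchronized boolean allowRequest() {
        // Decision uses a count that may be SYNC_INTERVAL_MS old; with N gateways,
        // up to N * (requests per node per interval) can slip past the limit.
        if (lastKnownGlobalCount + localDelta >= limit) {
            return false;
        }
        localDelta++;
        return true;
    }

    // Called by a scheduled task every SYNC_INTERVAL_MS.
    public synchronized void syncWithRedis(RedisCounter counter) {
        lastKnownGlobalCount = counter.incrementBy(localDelta); // returns new global total
        localDelta = 0;
    }

    /** Hypothetical wrapper over a Redis INCRBY on the shared window counter. */
    public interface RedisCounter {
        long incrementBy(long delta);
    }
}
```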

Sub tasks:

Affected Component

APIM

@rmsamitha
Member Author

Meeting 3/05/2023

  • Discussed the analysis of the current design, the design suggested by the CS team, the Choreo-APIM design, and other possible solution approaches.
  • Decided to proceed with the Sync-Async Hybrid approach.

@rmsamitha
Member Author

rmsamitha commented May 4, 2023

Meeting 04/05/2023

  • Discussed the analysis of the current design, the design suggested by the CS team, the Choreo-APIM design, and other possible solution approaches with the customer success team, considering customer use cases as well.
  • Decided and confirmed to proceed with the Sync-Async Hybrid approach.

@rmsamitha
Member Author

rmsamitha commented May 22, 2023

Meeting 19/05/2023

Participants: TharinduA, ThiliniS, Samitha, Nandika, IsuruU

Summary:
Decided to use an interface/abstraction or separate class/method based implementation, rather than if/else blocks, for the new implementation code. This keeps the code more readable and easier to understand and update later, and also makes it easier to drop the old behaviour in future releases if needed (a minimal sketch of this structure follows).
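As a rough illustration of that decision (interface and class names here are hypothetical, not the actual APIM code), the old and new behaviours could sit behind a common interface instead of being selected via if/else branches:

```java
// Illustrative sketch: one abstraction, two implementations, so the legacy
// behaviour can be removed cleanly in a future release.
public interface BackendThrottleCounter {
    boolean isThrottled(String throttleKey, long limit, long windowMillis);
}

// Existing periodic-sync behaviour (placeholder body).
class PeriodicSyncThrottleCounter implements BackendThrottleCounter {
    @Override
    public boolean isThrottled(String throttleKey, long limit, long windowMillis) {
        // ... existing periodic-sync logic ...
        return false;
    }
}

// New Sync-Async Hybrid behaviour (placeholder body).
class SyncAsyncHybridThrottleCounter implements BackendThrottleCounter {
    @Override
    public boolean isThrottled(String throttleKey, long limit, long windowMillis) {
        // ... new hybrid logic ...
        return false;
    }
}
```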

@rmsamitha
Member Author

rmsamitha commented May 29, 2023

Meeting: 26/05/2023

Participants: TharinduA, ThiliniS, Samitha, IsuruU

Summary:
Discussed which approach to use to get the gateway (GW) count and decided to use the channel subscription count based approach.

@rmsamitha
Member Author

Meeting: 14/06/2023

Participants: TharinduA, ThiliniS, Samitha, IsuruU, Nandika

Summary:
Discussed the current progress. Planned how to test in a K8s setup to simulate and test real-time use cases.
Discussed and got confirmation to use a separate channel to count the number of gateways, instead of the sync_mode_change channel. In this approach, when a new GW node is spawned, it sends a message to this specific channel at server startup conveying that a new node has joined. Via the pub/sub model, all other GW nodes are triggered and request the channel subscription count from Redis. This solution does not address server shutdown cases, where the number of GWs should be reduced. (A minimal sketch of this counting approach follows.)
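A minimal sketch of that subscription-count approach, assuming a Jedis 4.x client (where pubsubNumSub() wraps the Redis PUBSUB NUMSUB command) and a hypothetical channel name and host; illustrative only, not the actual APIM code:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.util.Map;

public class GatewayCounter {

    private static final String GW_COUNT_CHANNEL = "gw_count_channel"; // assumed channel name
    private volatile long gatewayCount = 1;

    // Each gateway subscribes on startup, so the subscriber count on this channel
    // equals the number of live gateways. subscribe() blocks, so it runs on its
    // own thread and dedicated connection.
    public void startSubscriber(Jedis subscriberConnection) {
        new Thread(() -> subscriberConnection.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                // Another node announced that it joined; refresh our count.
                refreshGatewayCount();
            }
        }, GW_COUNT_CHANNEL)).start();
    }

    // Called once at server startup, after subscribing, to announce this node.
    public void announceJoin(Jedis jedis) {
        jedis.publish(GW_COUNT_CHANNEL, "node-joined");
    }

    public void refreshGatewayCount() {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // assumed Redis host/port
            Map<String, Long> counts = jedis.pubsubNumSub(GW_COUNT_CHANNEL);
            gatewayCount = counts.getOrDefault(GW_COUNT_CHANNEL, 1L);
        }
    }

    public long getGatewayCount() {
        return gatewayCount;
    }
}
```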

@rmsamitha
Member Author

Meeting: 29/06/2023

Participants: TharinduA, Samitha, IsuruU, Nandika

Summary:

  • Decided to use a request context object to carry the current time rather than passing the current time in method arguments. It is fine if currentTime is the only field in that context object for now.
  • Implement the locking mechanism on the Redis and client sides (a minimal sketch follows this list).
  • Discussed further the Redis call timeout and the importance of setting it.
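A minimal sketch of those two points, with assumed names and timeout values (not the actual APIM code): a request context carrying only currentTime, and a Redis-backed lock acquired via SET NX PX with a client-side acquire timeout, using the Jedis client:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Per-request context: currently only carries the request's current time.
public class ThrottleRequestContext {
    private final long currentTime;

    public ThrottleRequestContext(long currentTime) {
        this.currentTime = currentTime;
    }

    public long getCurrentTime() {
        return currentTime;
    }
}

class RedisLock {
    private static final long LOCK_TTL_MS = 1000;        // assumed lock expiry
    private static final long ACQUIRE_TIMEOUT_MS = 200;  // assumed client-side wait limit

    // Tries to acquire lockKey for up to ACQUIRE_TIMEOUT_MS; the lock itself
    // expires after LOCK_TTL_MS so a crashed node cannot hold it forever.
    static boolean tryAcquire(Jedis jedis, String lockKey, String ownerId) throws InterruptedException {
        long deadline = System.currentTimeMillis() + ACQUIRE_TIMEOUT_MS;
        while (System.currentTimeMillis() < deadline) {
            String result = jedis.set(lockKey, ownerId, SetParams.setParams().nx().px(LOCK_TTL_MS));
            if ("OK".equals(result)) {
                return true;
            }
            Thread.sleep(10); // brief back-off before retrying
        }
        return false; // lock-acquiring timeout exceeded
    }
}
```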

@rmsamitha
Member Author

Meeting: 28/07/2023

Participants: TharinduA, Samitha, IsuruU, Nandika, ThiliniS

Summary:

  • Presented the current progress.
  • Presented the load test results from the AKS setup.
  • Highlighted the high latency issue and decided to investigate it further (i.e. measure the latency of each specific step and figure out the impact).
  • Suggested implementing a clock-time-window based window (instead of the current window, which is anchored to a recorded start timestamp rather than to clock time) as a later POC; the sketch after this list contrasts the two.
  • Decided to have a meeting with project stakeholders to communicate the state.
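For illustration only (assumed names, not the actual APIM code), the difference between the two window styles mentioned above:

```java
// A clock-aligned window derives its identifier purely from the current time,
// so every gateway agrees on the same window for the same instant; a
// start-timestamp window runs from the recorded start until start + window.
public final class ThrottleWindows {

    private ThrottleWindows() { }

    // Clock-time window: all nodes compute the same window id for the same instant.
    static long clockWindowId(long currentTimeMillis, long windowMillis) {
        return currentTimeMillis / windowMillis;
    }

    // Start-timestamp window: valid until windowMillis after the recorded start,
    // after which a new start timestamp is recorded.
    static boolean inStartTimestampWindow(long windowStartMillis, long currentTimeMillis, long windowMillis) {
        return currentTimeMillis < windowStartMillis + windowMillis;
    }
}
```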

@rmsamitha
Member Author

Meeting: 07/08/2023

Participants: Nandika, TharinduA, ThiliniS, SanjeewaM, Dinusha, Ruwan, Janith, Miyuru, Samitha

Summary:

  • Briefly described the design of the solution.
  • Presented the implementation and testing outcomes so far.
  • Avoid making requests beyond the maximum (throttle) quota wait to acquire the lock; those additional requests should be rejected directly without waiting.
  • Add a configuration for whether to reject or allow API requests that exceed the Redis lock_acquiring_timeout (the suggested default behavior was to reject).
  • Investigate the premature throttling scenario seen in the test results.
  • Communicate an ETA from the product team once the currently identified obvious issues are fixed.
  • Implement the clock-time-window based implementation only once the current implementation is finalized.
  • The CS team will communicate the current implementation progress to the customer.

@rmsamitha
Member Author

Meeting Date: 15/08/2023
Participants: SanjeewaM, Nandika, TharinduA, ThiliniS, Dinusha, IsuruU, Samitha

Summary:

  • Samitha clarified in detail the two root causes of the premature throttling scenario (as described in the previous email) and explained why the issue is not high impact and need not be a critical concern.
  • Sanjeewa described similar concerns that were encountered and addressed in the older Hazelcast based implementations.
  • After further discussion, decided that this premature throttling scenario is not a major concern and is not a blocker to proceed.
  • The feature docs may mention that some amount of slippage is possible and acceptable.
  • Discussed further what should be done with requests that exceed the Redis lock-acquiring timeout (a minimal sketch of the decided handling follows this list).
    • The main concern was that customers may report intermittent request failures even in non-throttle-out situations.
    • That would be a critical concern to defend, and since the root cause is a limitation in our implementation, it is not correct to fail a request because of it.
    • So the decision was to allow such requests rather than rejecting them.
    • Also decided not to add a configuration to change this behavior, since the implementation should be consistent and additional configurations are not ideal at this level, as suggested by Sanjeewa. A configuration may be introduced in the future if a client requests it.
  • Decided to proceed to complete the testing effort with load tests on an auto-scaling deployment (with up to 10 gateways). After that, the U2 update creation will be started and the delivery will be done.
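A minimal sketch of the decided handling, with hypothetical names (not the actual APIM code): requests already over the quota are rejected immediately without waiting on the lock, while requests that hit the lock-acquiring timeout are allowed through rather than failed:

```java
public class ThrottleDecision {

    static final boolean ALLOW = true;
    static final boolean THROTTLE = false;

    static boolean decide(long observedCount, long quota, LockClient lock, String lockKey)
            throws InterruptedException {
        if (observedCount >= quota) {
            // Already over the throttle quota: reject directly, do not wait for the lock.
            return THROTTLE;
        }
        if (!lock.tryAcquire(lockKey)) {
            // lock_acquiring_timeout exceeded: allow the request (decided default, not configurable).
            return ALLOW;
        }
        try {
            // Holding the lock: update the shared counter and decide accurately.
            long updatedCount = lock.incrementSharedCounter(lockKey);
            return updatedCount <= quota ? ALLOW : THROTTLE;
        } finally {
            lock.release(lockKey);
        }
    }

    /** Hypothetical abstraction over the Redis lock and counter operations. */
    interface LockClient {
        boolean tryAcquire(String lockKey) throws InterruptedException;
        long incrementSharedCounter(String lockKey);
        void release(String lockKey);
    }
}
```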

@rmsamitha
Member Author

Closing as completed

@npamudika added this to the 4.3.0-M1 milestone Jan 3, 2024
@npamudika added the 4.3.0 and 4.3.0-M1 labels Jan 3, 2024