Conversation

@singhpk234
Contributor

@singhpk234 singhpk234 commented May 3, 2025

About the change

This change adds retries with jitter to the JDBC persistence layer. The retries are tunable in the following ways:
a. max_retries : the total number of retries the persistence layer attempts before giving up.
b. max_duration_in_ms : the time budget in ms, measured from the first attempt, within which retries may be made. For example, with 500 ms configured, the total time spent retrying should not (optimistically) exceed 500 ms.
c. initial_delay_in_ms : the initial delay before the first retry attempt.
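The three knobs above can be sketched as a simple retry loop with jittered, doubling delays. This is a hypothetical illustration of the described behavior, not the actual Polaris code; the method name withRetries and all parameters are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a jittered retry loop honoring the three knobs
// described above (max_retries, max_duration_in_ms, initial_delay_in_ms).
public class RetryDemo {
  static <T> T withRetries(Callable<T> action, int maxRetries,
                           long maxDurationMs, long initialDelayMs) throws Exception {
    long start = System.currentTimeMillis();
    long delay = initialDelayMs;
    for (int attempt = 0; ; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        long elapsed = System.currentTimeMillis() - start;
        // Give up once the retry count or the overall time budget is exhausted.
        if (attempt >= maxRetries || elapsed + delay > maxDurationMs) {
          throw e;
        }
        // Full jitter: sleep a random duration in [0, delay].
        Thread.sleep(ThreadLocalRandom.current().nextLong(delay + 1));
        delay *= 2; // exponential backoff between attempts
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    String result = withRetries(() -> {
      if (++calls[0] < 3) throw new RuntimeException("transient");
      return "ok";
    }, 5, 500, 1);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

The duration check is applied before sleeping, so a retry is only scheduled if it can plausibly complete within the budget; this matches the "optimistic" wording above.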


As a result, we see the following improvements with the getting-started guide setup:

100% of the calls pass


========================================================================================================================
---- Global Information -------------------------------------------------------------|---Total---|-----OK----|----KO----
> request count                                                                      |    30,222 |    30,222 |         0
> min response time (ms)                                                             |         1 |         1 |         -
> max response time (ms)                                                             |       786 |       786 |         -
> mean response time (ms)                                                            |        17 |        17 |         -
> response time std deviation (ms)                                                   |        24 |        24 |         -
> response time 25th percentile (ms)                                                 |         5 |         5 |         -
> response time 50th percentile (ms)                                                 |        11 |        11 |         -
> response time 75th percentile (ms)                                                 |        23 |        23 |         -
> response time 99th percentile (ms)                                                 |        96 |        96 |         -
> mean throughput (rps)                                                              |     83.72 |     83.72 |         -
---- Response Time Distribution ----------------------------------------------------------------------------------------
> OK: t < 800 ms                                                                                         30,222   (100%)
> OK: 800 ms <= t < 1200 ms                                                                                   0     (0%)
> OK: t >= 1200 ms                                                                                            0     (0%)
> KO                                                                                                          0     (0%)
========================================================================================================================

Increased the TPS to 125


========================================================================================================================
2025-05-05 21:01:49 GMT                                                                             360s elapsed
---- Requests -----------------------------------------------------------------------|---Total---|-----OK----|----KO----
> Global                                                                             |    37,305 |    37,305 |         0
> Authenticate                                                                       |         6 |         6 |         0
> Read / Check Namespace Exists                                                      |     2,059 |     2,059 |         0
> Read / Fetch all Views under parent namespace                                      |     2,095 |     2,095 |         0
> Write / Update Namespace Properties                                                |     6,199 |     6,199 |         0
> Read / Fetch all Tables under parent namespace                                     |     2,098 |     2,098 |         0
> Read / Check View Exists                                                           |     2,036 |     2,036 |         0
> Read / Fetch Namespace                                                             |     2,082 |     2,082 |         0
> Read / Check Table Exists                                                          |     2,111 |     2,111 |         0
> Read / Fetch Table                                                                 |     2,085 |     2,085 |         0
> Write / Update View metadata                                                       |     6,182 |     6,182 |         0
> Write / Update table metadata                                                      |     6,166 |     6,166 |         0
> Read / Fetch all Namespaces under specific parent                                  |     2,097 |     2,097 |         0
> Read / Fetch View                                                                  |     2,089 |     2,089 |         0

---- Read and write entities using the Iceberg REST API ----------------------------------------------------------------
[################################################################################################################]  100%
          waiting:         0 / active:         0  / done:    37,299
---- Stop refreshing the authentication token --------------------------------------------------------------------------
[################################################################################################################]  100%
          waiting:         0 / active:         0  / done:         1
---- Authenticate every minute using the Iceberg REST API --------------------------------------------------------------
[################################################################################################################]  100%
          waiting:         0 / active:         0  / done:         1
---- Wait for the authentication token to be available -----------------------------------------------------------------
[################################################################################################################]  100%
          waiting:         0 / active:         0  / done:         1
========================================================================================================================

Parsing log file(s)...
Parsing log file(s) done in 0s.
Generating reports...

========================================================================================================================
---- Global Information -------------------------------------------------------------|---Total---|-----OK----|----KO----
> request count                                                                      |    37,305 |    37,305 |         0
> min response time (ms)                                                             |         1 |         1 |         -
> max response time (ms)                                                             |     1,166 |     1,166 |         -
> mean response time (ms)                                                            |        24 |        24 |         -
> response time std deviation (ms)                                                   |        44 |        44 |         -
> response time 25th percentile (ms)                                                 |         7 |         7 |         -
> response time 50th percentile (ms)                                                 |        14 |        14 |         -
> response time 75th percentile (ms)                                                 |        27 |        27 |         -
> response time 99th percentile (ms)                                                 |       197 |       197 |         -
> mean throughput (rps)                                                              |    103.34 |    103.34 |         -
---- Response Time Distribution ----------------------------------------------------------------------------------------
> OK: t < 800 ms                                                                                         37,296 (99.98%)
> OK: 800 ms <= t < 1200 ms                                                                                   9  (0.02%)
> OK: t >= 1200 ms                                                                                            0     (0%)
> KO                                                                                                          0     (0%)
========================================================================================================================

@singhpk234
Contributor Author

cc @dimas-b @flyrain

@singhpk234 singhpk234 changed the title [JDBC] [DO NOT REVIEW] Add retries with delay [JDBC] Add retries with delay May 5, 2025
@pingtimeout
Contributor

@singhpk234 I think the code should be implemented using the SmallRye @Retry annotation instead of custom logic. This is a very common feature and I don't think we should reimplement it. You could benefit from the randomized delay strategy or even an exponential backoff (which could be better).

That being said, it would require some logic around the driver execute(...) method. Something like:

  • Create a new exception class called RetryableSQLException
  • if the query results in a SQLException, and the error code is considered retryable, wrap it in RetryableSQLException
  • otherwise, re-throw the original exception
  • Use the @Retry annotation with the retryOn attribute set to RetryableSQLException

Benefits from doing so include less code to maintain, but also metrics on the amount of retries.

Note that I am only -0.9 here. If you feel strongly against using @Retry, feel free to ignore.
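The classification step of the suggestion above can be sketched in plain JDK code. This is a hypothetical illustration (class and method names, and the specific SQLState codes, are assumptions, not Polaris code); the declarative retry itself would come from SmallRye Fault Tolerance via @Retry(retryOn = RetryableSQLException.class), which is only referenced in comments here:

```java
import java.sql.SQLException;

// Hypothetical sketch: wrap transient SQL errors in RetryableSQLException so a
// declarative retry annotation (e.g. SmallRye's
// @Retry(retryOn = RetryableSQLException.class)) can target only those.
public class RetryClassifier {
  public static class RetryableSQLException extends SQLException {
    public RetryableSQLException(SQLException cause) {
      super(cause.getMessage(), cause.getSQLState(), cause.getErrorCode(), cause);
    }
  }

  // Illustrative retryability check: SQLState 40001 is a serialization
  // failure, and the 08xxx class covers connection errors.
  static boolean isRetryable(SQLException e) {
    String state = e.getSQLState();
    return state != null && ("40001".equals(state) || state.startsWith("08"));
  }

  interface SqlCallable<T> { T call() throws SQLException; }

  // Wrapper around the driver execute(...) call: retryable errors are wrapped,
  // everything else is re-thrown unchanged.
  static <T> T execute(SqlCallable<T> query) throws SQLException {
    try {
      return query.call();
    } catch (SQLException e) {
      if (isRetryable(e)) throw new RetryableSQLException(e);
      throw e;
    }
  }
}
```

A retry layer sitting above execute(...) would then re-invoke only on RetryableSQLException, leaving non-transient failures to propagate immediately.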

@singhpk234
Contributor Author

Thank you @pingtimeout, I really appreciate your input.

Are you suggesting using this module from Quarkus: https://quarkus.io/guides/smallrye-fault-tolerance ?
If that's the case, the JDBC module was supposed to be independent of any Quarkus code, and honestly I didn't know about this functionality in the first place either (thank you for pointing it out), but IMHO we should not pull it in only for this.

This is a very common feature and I don't think we should reimplement it

I agree with this in principle. I did evaluate options like the Failsafe library as well, but we don't need very sophisticated retry logic, just a simple one with simple knobs. If we wanted something like adaptive retries or callbacks I would have considered using a library, but for our use case that seemed like overkill, and bringing in a new library also brings license/notice and CVE overhead, hence I refrained from it for this use case.

That being said, if you feel strongly about this I am happy to re-evaluate and find the best path forward!

@pingtimeout
Contributor

@singhpk234 no problem, that was just a suggestion, not a requirement :-). That being said, could you add a javadoc or a TODO around the withRetries method and mention the library? That way, in the future, if somebody needs to improve the retry logic, they know about the library and can consider refactoring the code.

@snazy
Member

snazy commented May 7, 2025

About the change

[1] Add retries (make this configurable) [2] add random delays

Make 100% of the call pass

...

Could you add more details about what the description represents, and a summary of the actual changes? Two Gatling summaries without context do not tell us anything.


@snazy snazy left a comment


Retry-loops, especially fair retries, are very tricky. What I'm missing in this PR are concurrency-tests and specific tests vetting the retry loop itself.


@snazy snazy left a comment


The tests look appropriate now from a brief look.

Comment on lines 31 to 32

There is

  id("polaris-quarkus")

for this?

dimas-b
dimas-b previously approved these changes May 8, 2025
Contributor

@dimas-b dimas-b left a comment


LGTM 👍 Please allow one more day for other people to comment before merging.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 8, 2025
@singhpk234 singhpk234 force-pushed the enhancement/add-retries branch 2 times, most recently from 738a4ac to 51e1cde Compare May 12, 2025 16:06
@singhpk234 singhpk234 force-pushed the enhancement/add-retries branch from 51e1cde to 0f5cc33 Compare May 12, 2025 16:44
@singhpk234 singhpk234 requested a review from dimas-b May 12, 2025 20:54
@singhpk234
Contributor Author

singhpk234 commented May 12, 2025

I re-ran polaris benchmarks (with the setting @pingtimeout used https://docs.google.com/document/d/1RLYaAtNUkgNW3Ef7-BWfF_8RkSK7B7oR/edit#bookmark=id.von5ayuoga6), happy to share that we now have 100% success with this.

Comment on lines +35 to +37
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES=5
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS=100
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DELAY_IN_MS=5000
Contributor


nit: Don't defaults work for "getting started"?

Contributor Author


They do! These are just for demo purposes!

@singhpk234 singhpk234 merged commit 2a69415 into apache:main May 13, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 13, 2025
snazy added a commit to snazy/polaris that referenced this pull request May 22, 2025
Last merged commit f2d4b55

The NoSQL parts contain changes to align the nosql-specific configuration keys and values as follows. No immediate action needed for Dremio, as `org.apache.polaris.persistence.nosql.quarkus.config.ConfigBackwardsCompat` takes care of relocating old config keys and values.
```
polaris.persistence.type=persistence
polaris.backend.name=...
```
to
```
polaris.persistence.type=nosql
polaris.persistence.backend.type=...
```


* main: Update dependency io.micrometer:micrometer-bom to v1.14.7 (apache#1570)

* JDBC: Simplify JDBC entity conversion (apache#1564)

* fix(Catalog): Add List PolarisStorageAction for all metadata read operations (apache#1391)

* fix(Catalog): Add List PolarisStorageAction for all metadata read operations

* Site : Update cloud providers quickstart to use  (apache#1554)

* [JDBC] Add retries with delay (apache#1517)

[JDBC] Add retries with delay

This change adds retries with jitter to the JDBC persistence layer. The retries are tunable in the following ways:
a. max_retries : Total number of retries the persistence layer attempts on connection-reset and serialization-error exceptions before giving up.
b. max_duration_in_ms : Time budget in ms, measured from the first attempt, within which retries may be made. For example, with 500 ms configured, the total time spent retrying should not (optimistically) exceed 500 ms.
c. initial_delay_in_ms : initial delay before the first retry attempt

* [Docs] Add JDBC retry properties (apache#1550)

* Use env var in spark container (apache#1522)

* added

Signed-off-by: owenowenisme <[email protected]>

* fix

Signed-off-by: owenowenisme <[email protected]>

* add export

Signed-off-by: owenowenisme <[email protected]>

* update docs using .env

Signed-off-by: owenowenisme <[email protected]>

* update docs

Signed-off-by: owenowenisme <[email protected]>

* change back from using .env to export

Signed-off-by: owenowenisme <[email protected]>

* Apply suggestions from code review

Co-authored-by: Adnan Hemani <[email protected]>

---------

Signed-off-by: owenowenisme <[email protected]>
Co-authored-by: Adnan Hemani <[email protected]>

* Migrate catalog configs to the new reserved prefix (apache#1557)

* rewrite

* rewrite

* stable

* changes per comments

* Remove unused javadoc parameter in BasePersistence (apache#1580)

* Site: Publish table maintenance policies (apache#1581)

* Add schema symlinks to static site directory
Co-authored-by: Yufei Gu <yufei.apache.org>

* Remove `defaults` / `overrides` from feature configurations (apache#1572)

* double WithParentName

* autolint

* Revert some

* autolint

* add to BCconfigs

* autolint

* yank

* copy yuns test

* autolint

* remove defaults

* repair test

* autolint

* stablize test

* stable

* autolint

* configmap change

* copypaste

* regen helm docs

* autolint

* no dots in props

* remove accidental file

* small changes per review

* clean out defaults

* BCC fix

* autolint

* typefix

* autolint

* main: Update dependency io.prometheus:prometheus-metrics-exporter-servlet-jakarta to v1.3.7 (apache#1578)

* main: Update dependency io.micrometer:micrometer-bom to v1.15.0 (apache#1575)

* main: Update dependency io.projectreactor.netty:reactor-netty-http to v1.2.6 (apache#1577)

* main: Update dependency boto3 to v1.38.15 (apache#1574)

* NoSQL: BREAKING: Change NoSQL configuration options

Requires changing the Quarkus configuration(s) from
```
polaris.persistence.type=persistence
polaris.backend.name=...
```
to
```
polaris.persistence.type=nosql
polaris.persistence.backend.type=...
```

* DREMIO: Use 'nosql' persistence by default

* DREMIO: backwards-compatible configuations

* Revert "DREMIO: Use 'nosql' persistence by default"

This reverts commit ccba4976b5398511c1c987a242ac11517278c700.

Causes test failures :(

* damn defaults refactoring 😡

* try fix helm

---------

Signed-off-by: owenowenisme <[email protected]>
Co-authored-by: Mend Renovate <[email protected]>
Co-authored-by: Yufei Gu <[email protected]>
Co-authored-by: fivetran-ashokborra <[email protected]>
Co-authored-by: Prashant Singh <[email protected]>
Co-authored-by: Owen Lin (You-Cheng Lin) <[email protected]>
Co-authored-by: Adnan Hemani <[email protected]>
Co-authored-by: Eric Maynard <[email protected]>
Co-authored-by: Dmitri Bourlatchkov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants