
[Bug]: Shutdown hook in RyukResourceReaper prevents a graceful shutdown with spring framework #8558

Open
jobayle opened this issue Apr 22, 2024 · 19 comments

jobayle commented Apr 22, 2024

Module

Core

Testcontainers version

1.19.7

Using the latest Testcontainers version?

Yes

Host OS

Linux, Windows

Host Arch

x86_64

Docker version

Client:
 Cloud integration: v1.0.35+desktop.13
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:18:56 2024
 OS/Arch:           windows/amd64
 Context:           default

Server: Docker Desktop 4.29.0 (145265)
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:18:01 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

What happened?

Dear maintainers,

With the introduction of #7717, the Ryuk instance is terminated by a JVM shutdown hook, so all containers are stopped very early during shutdown.

This prevents other shutdown hooks (e.g. those registered by the Spring framework) from performing actions needed for a graceful shutdown that still require the containers to be up.
Due to a retry mechanism in Spring's shutdown hook, it greatly increases the overall time to run tests!

Could you please revert this change or introduce a means to disable this shutdown hook? Thanks!
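
For context, a minimal JDK-only sketch (assumption: plain Runtime hooks, nothing Testcontainers-specific) of why the ordering goes wrong: the JVM starts all registered shutdown hooks in an unspecified order and runs them concurrently, so the hook that terminates Ryuk can finish before an application-level hook such as Spring's has completed its work.

    // Sketch only: JVM shutdown hooks start in an unspecified order and run concurrently.
    // "reaper" stands in for the hook that terminates Ryuk (and with it the containers);
    // "graceful-shutdown" stands in for SpringApplicationShutdownHook.
    public class ShutdownHookOrdering {

        public static void main(String[] args) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                System.out.println("reaper: stopping containers");
            }, "reaper"));

            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                // may still need the containers, but they can already be gone
                System.out.println("graceful-shutdown: flushing state to the database");
            }, "graceful-shutdown"));

            // when main exits, both hooks start; neither their order nor their interleaving is guaranteed
        }
    }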

Relevant log output

No response

Additional Information

No response


s-jepsen commented May 6, 2024

Will this be fixed any time soon?


EAlf91 commented May 7, 2024

This is also relevant for #7871; a solution was proposed in that issue as well.

Author

jobayle commented May 13, 2024

Hi @EAlf91, I cannot find the solution in the linked issue. Could you please add a link to the specific comment? Thanks!


EAlf91 commented May 14, 2024

@jobayle It's mentioned here: #7871 (comment), but it would require a change, I guess. The author made the hook configurable via Spring properties: alex-arana@a0fcbdf

I didn't try it myself and didn't see any way to configure this without using that fork. I'd highly appreciate a revert of #7717, or maybe adopting that approach of configuring it via Spring properties.

Author

jobayle commented May 14, 2024

Thanks @EAlf91, but IMHO this is not a viable solution if we have to maintain a patched version of Testcontainers.

mmaeller added a commit to mmaeller/testcontainers-java that referenced this issue May 28, 2024
@s-jepsen

Any progress on this issue?


jpalaz commented Oct 14, 2024

@eddumelendez @bsideup @kiview Mentioning you all because you were working on the PR which led to this issue: #7717

Could you please take a brief look at this? 🙏

Member

kiview commented Oct 14, 2024

Hey @jobayle,

Due to a retry mechanism in Spring's shutdown hook, it greatly increases the overall time to run tests!

This is of course a situation we would like to solve. What absolute amount of time are we talking about here?

I am not sure #8732 is the right approach though, since this would still require all affected users to apply this config option (not ideal). We could make configuration more programmatically accessible, but this would also require downstream integration changes by the Spring project.

Since a considerable number of Testcontainers users are also Spring users, I think reverting the change and accepting that containers linger for some time after process termination may be the better DX. WDYT @eddumelendez?

Author

jobayle commented Oct 14, 2024

Hi @kiview, sure.
Here are the first and last log lines of the SpringApplicationShutdownHook:

16:15:25.372 [SpringApplicationShutdownHook] INFO - UPDATING MACHINE CONFIGURATION STATUS=SHUTDOWN
[...] lots of connection errors with DB/redis/...
16:16:27.222 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-13 - Shutdown completed.

So roughly a whole minute!

By enabling container reuse to simulate how things worked before #7717, the shutdown becomes almost instant:

16:38:34.259 [SpringApplicationShutdownHook] INFO  - UPDATING MACHINE CONFIGURATION STATUS=SHUTDOWN
16:38:34.749 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-3 - Shutdown initiated...
16:38:34.751 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-3 - Shutdown completed.
16:38:34.753 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-2 - Shutdown initiated...
16:38:34.755 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-2 - Shutdown completed.
16:38:34.755 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Shutdown initiated...
16:38:34.758 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Shutdown completed.
16:38:34.806 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-9 - Shutdown initiated...
16:38:34.807 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-9 - Shutdown completed.
16:38:34.810 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-8 - Shutdown initiated...
16:38:34.813 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-8 - Shutdown completed.
16:38:34.813 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-7 - Shutdown initiated...
16:38:34.816 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-7 - Shutdown completed.
16:38:34.856 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-15 - Shutdown initiated...
16:38:34.857 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-15 - Shutdown completed.
16:38:34.860 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-14 - Shutdown initiated...
16:38:34.862 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-14 - Shutdown completed.
16:38:34.862 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-13 - Shutdown initiated...
16:38:34.864 [SpringApplicationShutdownHook] INFO  com.zaxxer.hikari.HikariDataSource - HikariPool-13 - Shutdown completed.
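
For reference, the reuse workaround amounts to roughly the following sketch (illustrative only, not the actual reproducer code; the image and class names are assumptions). It also needs testcontainers.reuse.enable=true in ~/.testcontainers.properties, and reused containers are not tracked by Ryuk, which is why they survive the shutdown hook:

    import org.testcontainers.containers.PostgreSQLContainer;
    import org.testcontainers.utility.DockerImageName;

    // Sketch of the container-reuse workaround (illustrative, not the reproducer's code).
    public class ReusableDatabase {

        static final PostgreSQLContainer<?> POSTGRES =
                new PostgreSQLContainer<>(DockerImageName.parse("postgres:16-alpine"))
                        // reused containers are not reaped by Ryuk, so they outlive
                        // the JVM and its shutdown hooks
                        .withReuse(true);

        public static void main(String[] args) {
            POSTGRES.start(); // subsequent runs reattach to the same container
            System.out.println(POSTGRES.getJdbcUrl());
        }
    }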

Yes, I'd really love a fix or a revert, thanks 🙂

Member

kiview commented Oct 16, 2024

Thanks for sharing, @jobayle. One minute is really not acceptable; I will sync with @eddumelendez on how best to proceed and resolve the situation 👍.


xak2000 commented Oct 21, 2024

Registering the shutdown hook also has other problems, such as classloader issues in some specific use cases. Considering this, I vote for reverting the change that introduced the shutdown hook, or at least making it configurable (and off by default).

Solving this problem would also solve the problems with using Testcontainers from Maven (e.g. #7923, #1454, testcontainers/testcontainers-jooq-codegen-maven-plugin#37), if the shutdown hook is not registered by default or is at least configurable.

@eddumelendez
Member

Hi, I have created a branch, revert-shutdown-hook, which is available via JitPack. I'm also using it in one of the projects available here. However, in the case of Artemis the behavior is the same with or without this change.

I would like to add a friendly reminder about sharing a project that reproduces the issue to help maintainers.

You can also use this commit in your projects in order to help with the reproducer.

I have a few more examples:

https://github.com/eddumelendez/testcontainers-r2dbc
https://github.com/eddumelendez/testcontainers-localstack

We have agreed to revert the change if there is a real impact, but we have not seen any so far. Again, a reproducer would help.

Author

jobayle commented Oct 31, 2024

Hi,

I've spent a few hours investigating the issue and making a reproducer: https://github.com/jobayle/repro-tc-shutdownhook

See the shutdown time when the integration tests are run in a GitHub Action:
https://github.com/jobayle/repro-tc-shutdownhook/commit/fa7dc3b95b7450808ad63f1223c3baf6a8dc9717/checks/32354065819/logs

first line:

2024-10-31T18:40:07.050Z INFO 2089 --- [ionShutdownHook] bug.machine.ApplicationStatus : UPDATING MACHINE CONFIGURATION STATUS=SHUTDOWN

last line:

2024-10-31T18:41:12.293Z INFO 2089 --- [ionShutdownHook] com.zaxxer.hikari.HikariDataSource : HikariPool-5 - Shutdown completed.

More than one minute.

@eddumelendez
Member

Thanks, @jobayle! I really appreciate you sharing a reproducer and the context.

@eddumelendez
Member

Just sharing some updates. Thanks to the reproducer, I can see that the Gradle project takes 2m 40s on my machine and the shutdown takes a lot of time. For testing purposes I migrated the project to Maven and it ran in under 1m 20s. I hadn't caught this because the projects I shared before were Maven-only. It's interesting!

This is just an update, not a statement that we won't fix it or that you should move to Maven. I'll dive a little deeper, but probably the easiest and quickest solution would be reverting the change, as has been suggested.

Thanks again for the reproducer; it helps move things faster!


xak2000 commented Oct 31, 2024

@eddumelendez

For testing purposes I migrated the project to Maven and it ran in under 1m 20s.

Are you saying that running the same tests with Maven instead of Gradle reduces the test running time from 2m 40s to 1m 20s, or is it just the whole build time that is reduced from 2m 40s to 1m 20s (for reasons unrelated to running the integration tests, e.g. downloading artifacts)?

P.S. It's interesting that I cannot reproduce the slow-running tests with this project. They usually finish in 23s. Of course, I see many unexpected errors in the logs due to the premature container shutdown, such as:

org.springframework.dao.DataAccessResourceFailureException: PreparedStatementCallback; SQL [UPDATE machine SET running_status = ?, last_update = now() WHERE name = ?]; FATAL: terminating connection due to unexpected postmaster exit

But I cannot reproduce the long-running tests. Maybe the default timeout differs depending on the OS (I tried on Windows), or something else...

In other (real) projects, though, I can still reproduce the slow tests. The root cause (premature container shutdown) is still easily reproducible; it is just the effect that differs depending on the specific technologies and configuration (such as timeouts) used.

For example, in one of the projects I "fixed" the long-running tests problem by reducing the connection pool timeout:

# Set small timeout (default: 30000 ms) as a workaround for https://github.com/testcontainers/testcontainers-java/issues/7871
# org.springframework.integration.jdbc.lock.DefaultLockRepository.close() method tries to get JDBC connection from
# HikariCP to execute an SQL query on Spring Context shutdown, but Mysql container is already stopped at this
# point, so HikariCP waits up to `connection-timeout` for it to be available, which will never happen.
# Setting it to a smaller value just makes delay smaller on stopping of each cached Spring Context after all tests
# are finished.
spring.datasource.hikari.connection-timeout = 1000

(the default is 30000).

Author

jobayle commented Oct 31, 2024

@xak2000 Well, I'm working on Windows and the GitHub Action runs on Ubuntu, so it's probably not OS-dependent.

Maybe it depends on the JVM version, or on the vendor (Oracle, Eclipse, Red Hat...)?

Maybe there's a difference in shutdown hook processing somewhere 🤔

Or maybe it's because of hardware: if your machine is powerful enough, it manages to run most of the shutdown hooks before Ryuk terminates the containers?


xak2000 commented Oct 31, 2024

@jobayle I'm not sure what the reason is. The containers are definitely terminated by the time the shutdown listener executes, because I see errors like this in the logs:

Logs
2024-11-01T00:08:43.749+01:00 ERROR 28940 --- [ionShutdownHook] o.s.t.i.TransactionInterceptor           : Application exception overridden by rollback exception

org.springframework.dao.DataAccessResourceFailureException: PreparedStatementCallback; SQL [UPDATE machine SET running_status = ?, last_update = now() WHERE name = ?]; An I/O error occurred while sending to the backend.
        at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:121) ~[spring-jdbc-6.1.3.jar:6.1.3]

...

        at bug.machine.MachineRepository.updateRunningStatus(MachineRepository.kt:73) ~[main/:na]
 
...

        at bug.machine.MachineRepository$$SpringCGLIB$$0.updateRunningStatus(<generated>) ~[main/:na]
        at bug.machine.ApplicationStatus.handleShutdown(ApplicationStatus.kt:44) ~[main/:na]

...

        at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1127) ~[spring-context-6.1.3.jar:6.1.3]

...

Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:398) ~[postgresql-42.7.3.jar:42.7.3]

...

        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java) ~[HikariCP-5.0.1.jar:na]
        at org.springframework.jdbc.core.JdbcTemplate.lambda$update$2(JdbcTemplate.java:975) ~[spring-jdbc-6.1.3.jar:6.1.3]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:658) ~[spring-jdbc-6.1.3.jar:6.1.3]
        ... 42 common frames omitted
Caused by: java.io.EOFException: null
        at org.postgresql.core.PGStream.receiveChar(PGStream.java:469) ~[postgresql-42.7.3.jar:42.7.3]
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2166) ~[postgresql-42.7.3.jar:42.7.3]
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:371) ~[postgresql-42.7.3.jar:42.7.3]
        ... 50 common frames omitted

I think it could be an issue of ordering and speed of shutdown of the different systems, yes. As we can see, it is a java.io.EOFException in my case. It looks like Hikari had already tested the connection before returning it from the pool, and it was valid at that point; but then, when the context's listener tried to execute the UPDATE query, the Postgres container was finally stopped right in the middle of the query. That is why Hikari doesn't try to reconnect and the timeout doesn't apply: the exception happens when execution is already in the user code. For the Hikari timeout to kick in, the exception would have to happen earlier, when Hikari tests the connection before returning it from the pool to the user code. So, probably, my machine is just too slow at stopping Docker containers. :)
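
As a sketch of that scope (the class, table, and connection details below are illustrative, not taken from the reproducer): Hikari's connectionTimeout only bounds how long getConnection() may wait for a connection from the pool, so a container that stops under an already-executing statement surfaces as an I/O error in user code rather than a pool timeout.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    // Illustrative only: shows which step connectionTimeout actually protects.
    public class ConnectionTimeoutScope {

        public static void main(String[] args) throws SQLException {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:5432/test"); // hypothetical database
            config.setUsername("test");
            config.setPassword("test");
            config.setConnectionTimeout(1000); // bounds getConnection() below, nothing else

            try (HikariDataSource dataSource = new HikariDataSource(config);
                 Connection connection = dataSource.getConnection(); // validated here, still alive
                 PreparedStatement statement = connection.prepareStatement(
                         "UPDATE machine SET running_status = ? WHERE name = ?")) {
                statement.setString(1, "OFFLINE");
                statement.setString(2, "machine-1");
                // if the container is reaped between validation and this call, the update
                // fails with an I/O error (EOFException); connectionTimeout never applies
                statement.executeUpdate();
            }
        }
    }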

Update:
I actually just made your example reproducible by adding a 1 s delay to the ContextClosedEvent listener:

    @EventListener
    fun handleShutdown(ctxStatus: ContextClosedEvent) {
        // artificial delay so that Ryuk's shutdown hook stops the containers first
        Thread.sleep(1000)
        logger.info("UPDATING MACHINE CONFIGURATION STATUS=SHUTDOWN")
        machineRepository.updateRunningStatus(machineConfiguration.name, RunningStatus.OFFLINE)
    }

This way, the tests run in a solid 1m 57s. Yay! 🎉😂

With many

HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@3eeeae1f (This connection has been closed.). Possibly consider using a shorter maxLifetime value.

and

reconnecting RedisConnection@618341304 [redisClient=[addr=redis://localhost:51852], channel=[id: 0x145b0193, L:/127.0.0.1:51949 ! R:localhost/127.0.0.1:51852], currentCommand=null, usage=0] to localhost/127.0.0.1:51852

@eddumelendez
Member

Another update: I ran ./gradlew test --no-daemon and it ran in 1m 57s.

My focus has been on test execution time; dependencies were already downloaded.
