Ruby uplift 2.5 -> 3.3 makes Fluentd startup 60x slower #4545
Comments
Is there a publicly available Dockerfile recipe somewhere to reproduce this with Ruby 3.3? |
@kenhys In the meantime, I have compared the "good" and the "bad" strace outputs. The differences begin mostly with the large number of In the "bad" strace:
versus the "good" strace:
In the good case each gemspec file is examined immediately after each other. In the bad case a lot of |
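For anyone who wants to compare two traces like this without reading them line by line, here is a rough sketch that tallies syscalls by name. It assumes the usual strace text format (optional PID and timestamp prefix, then `name(`). The `count_syscalls` helper and the `SYSCALL_LINE` pattern are my own names, and the file names in the usage note are hypothetical.

```ruby
# Matches the syscall name at the start of an strace line. Optional prefixes
# covered: a PID ("[pid 123] " or "123 ", as with -f) and a timestamp
# ("12:00:01.123456 " or "0.000123 ", as with -t/-tt/-ttt/-r).
SYSCALL_LINE = /\A\s*(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+|\d+\.\d+\s+)?(\w+)\(/

# Tally syscalls by name and return [name, count] pairs, most frequent first.
# Lines that are not syscall entries (signals, exit markers, "<... resumed>")
# are skipped.
def count_syscalls(lines)
  counts = Hash.new(0)
  lines.each do |line|
    match = SYSCALL_LINE.match(line)
    counts[match[1]] += 1 if match
  end
  counts.sort_by { |name, count| [-count, name] }
end
```

Something like `count_syscalls(File.foreach("bad.strace")).first(10)` next to the same call on `good.strace` should make it obvious which calls dominate in each case.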
I managed to reproduce the issue with a Dockerfile: There are two Dockerfiles, the "good" which is ruby 2.5.9 and fluentd loads immediately, and the "bad" which is ruby 3.3.1 and fluentd startup takes a long time. I managed to reproduce the issue in multiple independent environments, so hopefully it will work. |
This is just a comment to avoid flagging this issue as stale. @kenhys I would be happy to try to investigate further, but I don't have ruby experience, do you perhaps have some tool suggestions or tips for debugging / what to look for? |
Thanks @dinatamas , I can reproduce it. I'm not sure why, but it seems that
|
I checked a few more environments to see how much delay is observed:
As for recent version of Ruby, it seems that |
For example, RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 causes launching time penalty:
* ruby-3.3.4(YJIT) 0.9 ~40 secs
* ruby-3.2.5(YJIT) 0.9 ~3 secs
* ruby-3.1.6(YJIT) 0.9 ~1 secs
* ruby-3.0.7 0.9 ~1 secs
See fluent/fluentd#4545
Signed-off-by: Kentaro Hayashi <[email protected]>
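To check whether this environment variable is what slows down a given setup, a small timing harness along these lines might help. This is only a sketch: `time_command` and `compare_gc_factor` are hypothetical helper names, and it measures wall-clock time for one run each, so repeat it a few times before drawing conclusions.

```ruby
GC_FACTOR_VAR = "RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR"

# Run a command with the given environment overrides, discard its output,
# and return [elapsed_seconds, success]. A nil value in env unsets that
# variable for the child process.
def time_command(env, *cmd)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ok = system(env, *cmd, out: File::NULL, err: File::NULL)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [elapsed, ok]
end

# Time the same command once with the variable unset and once with it set.
def compare_gc_factor(*cmd, factor: "0.9")
  unset_time, = time_command({ GC_FACTOR_VAR => nil }, *cmd)
  set_time, = time_command({ GC_FACTOR_VAR => factor }, *cmd)
  { unset: unset_time, set: set_time }
end
```

For example, `compare_gc_factor("fluentd", "--version")` returns both timings in seconds. A large gap between `:unset` and `:set` would point at the GC setting rather than at fluentd itself.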
I've added this issue to the official documentation. |
Describe the bug
I am running fluentd in a docker container, and noticed that when I uplift ruby from version 2.5 (provided by the OS) to version 3.3 (built by me), then the startup time of fluentd increases significantly (from <1 second to multiple minutes).
The main bottleneck is the CPU:
and then later:
The high CPU load only stops once the "fluentd worker is now running worker" messages appear.
I ran an strace, and found that calls like the following happen hundreds of times each second:
To Reproduce
I was unable to reproduce the issue outside my environment, even when I tried to build ruby the same way and install the same plugins. But for me it happens very reliably, each time I start the container. I have the full strace output, but it's 260'000 lines long, and 240'000 of them are just those clock_gettime() calls.
Your Environment
Note: Uplifting fluentd to v1.17.0 does not help.
Gemfile:
Your Configuration
I am deliberately running a simple - basically empty - config, and even with this the startup takes a very long time. It's dumped in the following error log. It also takes ~1 minute to simply execute "fluentd --version".
Find the fluentd trace logs of the startup below. There is almost an entire minute between the first line (the fluentd command being issued) and the first log message from fluentd:
Your Error Log
Usually I don't get an error log, because things eventually start working correctly; it just takes a very long time...
When I hit Ctrl+C during fluentd startup (when it's taking 1 minute to load), I usually get a stacktrace that's very similar to this:
Additional context
Interestingly, the first log line from fluentd always appears almost exactly 1 minute after the command was issued, which might not be a coincidence.
But even after that, the load still takes very long and the CPU usage is ~100%. For example it takes 5-10 minutes to load my actual configuration, which has a lot of rules and uses multiple plugins like Prometheus, Kafka, OpenSearch. This took only 1-2 seconds before the ruby uplift.
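To put a number on the delay between issuing the command and the first log line, something like the following could be used. It is a minimal sketch: `time_to_first_line` is a hypothetical helper, it treats stdout and stderr as one stream, and it sends SIGTERM to the child as soon as the first line arrives.

```ruby
require "open3"

# Spawn cmd, wait for its first line of output (stdout or stderr), and return
# [seconds_until_first_line, line]. The child is then asked to terminate.
def time_to_first_line(*cmd)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Open3.popen2e(*cmd) do |stdin, output, wait_thr|
    stdin.close
    line = output.gets # blocks until the first line, or returns nil at EOF
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    begin
      Process.kill("TERM", wait_thr.pid)
    rescue Errno::ESRCH
      # The child has already exited and been reaped; nothing to stop.
    end
    [elapsed, line]
  end
end
```

For example, `time_to_first_line("fluentd", "-c", "fluent.conf")`, where `fluent.conf` stands in for whatever config file is being tested. Running it under both Ruby versions gives directly comparable numbers.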