
JMX Exporter not scraping metrics from Kafka 7.7.1 version #755

Closed
Aravindangit003 opened this issue Dec 6, 2022 · 24 comments

Aravindangit003 commented Dec 6, 2022

Confluent Kafka Version: 7.7.1
Java - 1.8.0_202
Tried it on Broker
JMX exporter - 0.17.2 (not scraping metrics)
JMX Exporter - 0.17.0 (not scraping metrics)
JMX Exporter - 0.16.1 (working)
JMX exporter - 0.13.0 (working)

Note: We have another Kafka broker running Confluent Kafka 5.4, and JMX Exporter 0.17.2 is working for that one.

Variable used for Kafka 7.7.1 in start.broker.sh
export KAFKA_OPTS="-javaagent:/appl/itka/jmx_exporter/jmx_prometheus_javaagent-0.17.2.jar=7071:/appl/itka/jmx_exporter/kafka-2_0_0.yml"

Can you please let me know why JMX Exporter 0.17.2 is not working for Kafka 7.7.1? Do I need to add any other arguments to the environment variable?
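
For reference, a quick way to check whether the agent is serving anything at all, assuming port 7071 from the line above and running on the broker host itself:

# Check that the agent endpoint answers at all (port 7071 is taken from the KAFKA_OPTS line above).
curl -s http://localhost:7071/metrics | head -n 20

# Confirm the -javaagent argument actually made it onto the broker's command line.
ps -ef | grep jmx_prometheus_javaagent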

dhoard (Collaborator) commented Feb 24, 2023

@Aravindangit003 there is no "Kafka" version 7.7.1, nor is there a Confluent Platform version 7.7.1... so I'm confused by "Confluent Kafka Version: 7.7.1".

dhoard (Collaborator) commented Apr 14, 2023

@Aravindangit003 Any updates/clarification? The JMX Exporter is well-tested and used on supported Confluent Platform versions.

Have you resolved this issue?

If there are no updates within 1 week, this will be closed as inactive.

db3f commented May 4, 2023

We are seeing the same problems, for Confluent Platform 7.3.x as well as for Apache Kafka 3.3.x.

dhoard (Collaborator) commented May 4, 2023

@db3f please provide your startup configuration, the output of ps -ef | grep java, and your exporter YAML.

db3f commented May 4, 2023

Here is our full Java command line:

java -Xmx6G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/:/usr/bin/../share/java/confluent-telemetry/ -javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.14.0.jar=9999:/etc/confluent/docker/kafka-rules.yaml kafka.Kafka /etc/kafka/kafka.properties

Had to rename the file to .txt for upload. We know that the rules contain some redundancies, but they work flawlessly and quite fast with JMX Exporter 0.16.x.

Edit: we tried an empty config file, with the same result. The behaviour seems to depend non-linearly on the number of metrics: with 10 topics it works OK, with 20 topics we don't see any result. 0.16.1 works flawlessly with 600 topics (around 5 seconds scrape time).

kafka-rules.txt
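
For reference, the "empty config" test above can be reproduced with a minimal catch-all configuration, which exposes every MBean with default naming and helps separate rule-evaluation cost from the JMX scrape itself (the file path below is illustrative):

# Minimal catch-all exporter config; point the javaagent at it instead of the
# full rules file to rule out rule-evaluation cost.
cat > /tmp/catchall-rules.yaml <<'EOF'
rules:
  - pattern: ".*"
EOF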

dhoard (Collaborator) commented May 5, 2023

@db3f Good to know that a later version resolved the issue for you.

I would recommend moving to the latest version and retesting.

db3f commented May 6, 2023

Sorry if there was a misunderstanding. It's the later version that doesn't work. Versions up to 0.16.1 work, versions 0.17.1 and 0.17.2 don't. We are currently using 0.14.0 in production because that's what we used before.

dhoard (Collaborator) commented May 7, 2023

@db3f I just tested with Confluent Platform 7.3.3 + version 0.18.0 + 500 topics and don't see any issues.

You might need to enable Java trace logging to see if there is a failure.
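
A rough sketch of what that could look like, assuming you want the exporter's java.util.logging output at FINE level. Logger names and paths below are illustrative; the exact (possibly shaded) package name depends on the exporter version:

# Write a java.util.logging config that raises the level to FINE (very verbose on a busy
# broker, so only use it while reproducing the problem). The global ".level" is used here
# because the exporter's exact package name varies between versions.
cat > /tmp/exporter-logging.properties <<'EOF'
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
.level=FINE
EOF

# Pass the logging config alongside the existing -javaagent flag (paths are illustrative).
export KAFKA_OPTS="$KAFKA_OPTS -Djava.util.logging.config.file=/tmp/exporter-logging.properties"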

db3f commented May 8, 2023

Just a question: did you send messages through your Kafka topics? No MBeans/metrics will be created before you do. Ah, and maybe just giving the number of topics is misleading: most metrics are per partition, and we have 18,000 partitions with 3 replicas each.

dhoard (Collaborator) commented May 8, 2023

My test scenario:

  • single broker (so RF = 1)
  • using ZK
  • 500 topics (topic1 ... topic500) 1 partition each
  • Java producer sending 1 JSON message per topic every 100 ms.
  • using 0.18.0
  • using your rules

db3f commented May 13, 2023

Sorry, it took some time to set this up. The initial setup is similar to our production environment, just without the topics (except for a few like __consumer_offsets and topics for Cruise Control and Kafka exporter). Brokers are scraped by Prometheus via the JMX Exporter library. Time between scrapes is 30 seconds, the scrape timeout is 25 seconds. The graphs show the average scrape time across all brokers and the number of partitions. About every 5 minutes, 500 partitions were added. The first image shows measurements for JMX Exporter version 0.16.1, the second shows 0.18.0.
[IMG_0010: average scrape time vs. partition count, JMX Exporter 0.16.1]
[IMG_0011: average scrape time vs. partition count, JMX Exporter 0.18.0]

It appears that version 0.16.1 behaves roughly linearly in the number of partitions (which in turn is linearly related to the number of metrics), whereas 0.18.0 seems to display behaviour that looks at least quadratic.

dhoard (Collaborator) commented May 16, 2023

To clarify... you are getting metrics, but there is a performance degradation with the later versions? (This issue was reported as not getting any metrics.)

Using curl or wget, do you get metrics?
If so, what is the value for jmx_scrape_duration_seconds?
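
For example, both questions can be answered in one go from the broker host (assuming the exporter port 9999 from the command line posted earlier):

# Time the full download, then compare it with the exporter's self-reported scrape duration.
curl -s -o /tmp/metrics.txt -w 'total download time: %{time_total}s\n' http://localhost:9999/metrics
grep '^jmx_scrape_duration_seconds' /tmp/metrics.txt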

db3f commented May 16, 2023

We are seeing metrics up to a certain number of partitions, if we are willing to wait a few minutes. At about 12,000 partitions we gave up waiting after 20 minutes. When we did get metrics, jmx_scrape_duration_seconds was within a few hundred milliseconds of what curl showed as the download time.

dhoard (Collaborator) commented May 16, 2023

Sounds like Kafka is slow.

I would suggest using the JmxScraper class directly to diagnose. This bypasses most of the code used during a scrape.

private static final Logger logger = Logger.getLogger(JmxScraper.class.getName());

Due to your cluster size, I'm not sure I can create a large enough deployment to reproduce the behavior. I'll see what I can do.
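
A rough sketch of what a standalone run could look like, assuming a remote JMX port has been opened on the broker (e.g. via -Dcom.sun.management.jmxremote.port=9010, which the command line above does not set) and that the jar and class name match this exporter version; both are assumptions, not a documented invocation:

# Hypothetical standalone scrape against the broker's remote JMX port (9010 is an assumed
# port; the exact jar and the (possibly shaded) JmxScraper class name differ between
# exporter versions).
java -cp jmx_prometheus_javaagent-0.18.0.jar \
  io.prometheus.jmx.JmxScraper \
  service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi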

db3f commented May 16, 2023

Well, if it were Kafka (or solely Kafka), versions below 0.17 should show the same behavior. But as the Grafana dashboards I posted a few days ago show, 0.16.1 (and for that matter all earlier versions; we have been using the library since at least 0.12) does work. It seems that the scrape time was linear in the number of metrics before and is now at least quadratic, probably even exponential.

dhoard (Collaborator) commented May 16, 2023

Totally agree and it sounds like a regression, just trying to narrow the scope.

I'll write an integration test with some fake MBeans to try to simulate the scenario.

dhoard (Collaborator) commented May 17, 2023

Looking at the code, there are two major changes between 0.16.0 and 0.17.1.

  1. Code was introduced to process bean attributes one by one if a bulk load of the attributes fails.

JmxScraper.java:

        try {
            // bulk load all attributes
            attributes = beanConn.getAttributes(mbeanName, name2AttrInfo.keySet().toArray(new String[0]));
            if (attributes == null) {
                logScrape(mbeanName.toString(), "getAttributes Fail: attributes are null");
                return;
            }
        } catch (Exception e) {
            // couldn't get them all in one go, try them 1 by 1
            processAttributesOneByOne(beanConn, mbeanName, name2AttrInfo);
            return;
        }
        for (Object attributeObj : attributes.asList()) {
            if (Attribute.class.isInstance(attributeObj)) {
                Attribute attribute = (Attribute)(attributeObj);
                MBeanAttributeInfo attr = name2AttrInfo.get(attribute.getName());
                logScrape(mbeanName, attr, "process");
                processBeanValue(
                        mbeanName.getDomain(),
                        jmxMBeanPropertyCache.getKeyPropertyList(mbeanName),
                        new LinkedList<String>(),
                        attr.getName(),
                        attr.getType(),
                        attr.getDescription(),
                        attribute.getValue()
                );
            }
        }
    }
  2. The client_java dependency was upgraded from 0.11.0 to 0.16.0. Looking at the code that's actually being used in the project, I didn't see any change that should result in what you're seeing.

In my limited testing (single broker, 3000 partitions), I'm not hitting the scenario.

I can create a 0.18.0 build without the one-by-one attribute processing on Exception, if you would like to test it.

db3f commented May 19, 2023

I’m currently on a short leave but can offer to do some testing next week.

dhoard (Collaborator) commented Jun 14, 2023

@db3f any update on testing? One suspicion is that there is a failure getting all MBean attributes in a single call, so the attributes are getting processed one at a time.

        } catch (Exception e) {
            // couldn't get them all in one go, try them 1 by 1
            processAttributesOneByOne(beanConn, mbeanName, name2AttrInfo);
            return;
        }

db3f commented Jun 29, 2023

Sorry, we had some unexpected workload recently. We think we found the solution today: we increased Kafka's max heap space by 3 GB and suddenly scrapes worked again. This is preliminary, but we will do some more testing in this direction and report back here.

earnil commented Jan 17, 2024

Hello @dhoard, bumping this issue because I'm facing the same problem on some big Kafka clusters (~7400 partitions per broker). We had no issue with metrics when using JMX Exporter 0.16.0; it began with a Kafka upgrade that updated JMX Exporter to 0.17.2. At the time, we downgraded it to 0.16.0 to fix the issue. A new Kafka upgrade (right now we're on Kafka 3.5.1) came with JMX Exporter 0.18.0, and the issue appeared again.

I noticed that while jmx_scrape_duration_seconds is stable, between 13 and 16 seconds across brokers, the actual request can take a lot longer, from 30 s to 1 min 30 s. This behavior is exacerbated when I manually run curl against /metrics several times in quick succession.

I tried increasing the max heap space but didn't notice any change. If you're still willing to create a 0.18.0 build without the one-by-one attribute processing on Exception, I could test it.

dhoard (Collaborator) commented Jan 23, 2024

@earnil Please upgrade to the latest version and test. Version 0.20.0 added some performance improvements that decrease scrape time, but ultimately it takes time for a large cluster/many metrics.

If you are capturing a lot of metrics, you will need to increase your heap size due to how the underlying client_java library allocates/caches a per-thread buffer that is the size of the largest request.
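
As a rough illustration (sizes are examples, not recommendations): Kafka's standard start scripts read JVM heap settings from KAFKA_HEAP_OPTS, so the kind of adjustment db3f described earlier (+3 GB of max heap) would look roughly like this:

# Example only: raise the broker's max heap to leave headroom for the exporter's
# per-thread response buffers (1G/6G were the values in the command line posted earlier;
# 9G mirrors the +3 GB increase that worked for db3f).
export KAFKA_HEAP_OPTS="-Xms1g -Xmx9g"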

earnil commented Jan 30, 2024

Version 0.20.0, paired with a heap size adjustment, did improve performance on the biggest clusters. It still takes more than 30 seconds on some brokers, but these are very overloaded (> 15,000 partitions), so that's on us.

dhoard (Collaborator) commented Feb 2, 2024

Closing this issue.

@dhoard dhoard closed this as completed Feb 2, 2024
@dhoard dhoard self-assigned this May 15, 2024