
JMX Exporter not scraping metrics from Kafka 7.7.1 version #755

Closed
Aravindangit003 opened this issue Dec 6, 2022 · 24 comments

Aravindangit003 commented Dec 6, 2022

Confluent Kafka Version: 7.7.1
Java - 1.8.0_202
Tried it on Broker
JMX exporter - 0.17.2 (not scraping metrics)
JMX Exporter - 0.17.0 (not scraping metrics)
JMX Exporter - 0.16.1 (working)
JMX exporter - 0.13.0 (working)

Note: We have another Kafka broker running Confluent Kafka 5.4, and JMX Exporter 0.17.2 is working for that one.

Variable used for Kafka 7.7.1 in start.broker.sh
export KAFKA_OPTS="-javaagent:/appl/itka/jmx_exporter/jmx_prometheus_javaagent-0.17.2.jar=7071:/appl/itka/jmx_exporter/kafka-2_0_0.yml"

Can you please let me know why JMX Exporter 0.17.2 is not working for Kafka 7.7.1? Do I need to add any other arguments to the environment variable?
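
For reference, a quick way to check whether the agent is serving anything at all, assuming port 7071 from the line above and running on the broker host itself:

# Check that the agent endpoint answers at all (port 7071 is taken from the KAFKA_OPTS line above).
curl -s http://localhost:7071/metrics | head -n 20

# Confirm the -javaagent argument actually made it onto the broker's command line.
ps -ef | grep jmx_prometheus_javaagent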

dhoard (Collaborator) commented Feb 24, 2023

@Aravindangit003 there is no "Kafka" version 7.7.1, nor is there a Confluent Platform version 7.7.1... so I'm confused by "Confluent Kafka Version: 7.7.1".

dhoard (Collaborator) commented Apr 14, 2023

@Aravindangit003 Any updates/clarification? The JMX Exporter is well-tested and used on supported Confluent Platform versions.

Have you resolved this issue?

If there are no updates within 1 week, this will be closed as inactive.

db3f commented May 4, 2023

We are seeing the same problems, for Confluent Platform 7.3.x as well as for Apache Kafka 3.3.x.

dhoard (Collaborator) commented May 4, 2023

@db3f please provide your startup configuration, the output of ps -ef | grep java, and your exporter YAML.

db3f commented May 4, 2023

Here is our full Java command line:

java -Xmx6G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/:/usr/bin/../share/java/confluent-telemetry/ -javaagent:/usr/share/java/cp-base-new/jmx_prometheus_javaagent-0.14.0.jar=9999:/etc/confluent/docker/kafka-rules.yaml kafka.Kafka /etc/kafka/kafka.properties

Had to rename the file to .txt for upload. We know that the rules contain some redundancies, but they work flawlessly and quite fast with JMX Exporter 0.16.x.

Edit: we tried an empty config file, with the same result. The behaviour seems to depend non-linearly on the number of metrics: with 10 topics it works OK, with 20 topics we don't see any result. 0.16.1 works flawlessly with 600 topics (around 5 seconds scrape time).

kafka-rules.txt
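
For reference, the "empty config" test above can be reproduced with a minimal catch-all configuration, which exposes every MBean with default naming and helps separate rule-evaluation cost from the JMX scrape itself (the file path below is illustrative):

# Minimal catch-all exporter config; point the javaagent at it instead of the
# full rules file to rule out rule-evaluation cost.
cat > /tmp/catchall-rules.yaml <<'EOF'
rules:
  - pattern: ".*"
EOF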

dhoard (Collaborator) commented May 5, 2023

@db3f Good to know that a later version resolved the issue for you.

I would recommend moving to the latest version and retesting.

db3f commented May 6, 2023

Sorry if there was a misunderstanding. It's the later version that doesn't work. Versions up to 0.16.1 work, versions 0.17.1 and 0.17.2 don't. We are currently using 0.14.0 in production because that's what we used before.

dhoard (Collaborator) commented May 7, 2023

@db3f I just tested with Confluent Platform 7.3.3 + version 0.18.0 + 500 topics and don't see any issues.

You might need to enable Java trace logging to see if there is a failure.
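
A rough sketch of what that could look like, assuming you want the exporter's java.util.logging output at FINE level. Logger names and paths below are illustrative; the exact (possibly shaded) package name depends on the exporter version:

# Write a java.util.logging config that raises the level to FINE (very verbose on a busy
# broker, so only use it while reproducing the problem). The global ".level" is used here
# because the exporter's exact package name varies between versions.
cat > /tmp/exporter-logging.properties <<'EOF'
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
.level=FINE
EOF

# Pass the logging config alongside the existing -javaagent flag (paths are illustrative).
export KAFKA_OPTS="$KAFKA_OPTS -Djava.util.logging.config.file=/tmp/exporter-logging.properties"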

db3f commented May 8, 2023

Just a question: did you send messages through your Kafka topics? No MBeans/metrics will be created before you do. Ah, and maybe just giving the number of topics is misleading: most metrics are per partition, and we have 18,000 partitions with 3 replicas each.

dhoard (Collaborator) commented May 8, 2023

My test scenario:

  • single broker (so RF = 1)
  • using ZK
  • 500 topics (topic1 ... topic500) 1 partition each
  • Java producer sending 1 JSON message per topic every 100 ms.
  • using 0.18.0
  • using your rules

db3f commented May 13, 2023

Sorry, it took some time to set this up. The initial setup is similar to our production environment, just without the topics (except for a few like __consumer_offsets and topics for Cruise Control and Kafka exporter). Brokers are scraped by Prometheus via the JMX Exporter library. Time between scrapes is 30 seconds, the scrape timeout is 25 seconds. The graphs show the average scrape time across all brokers and the number of partitions. About every 5 minutes, 500 partitions were added. The first image shows measurements for JMX Exporter version 0.16.1, the second shows 0.18.0.
[IMG_0010: average scrape time vs. partition count, JMX Exporter 0.16.1]
[IMG_0011: average scrape time vs. partition count, JMX Exporter 0.18.0]

It appears that version 0.16.1 behaves roughly linearly in the number of partitions (which in turn is linearly related to the number of metrics), whereas 0.18.0 seems to display behaviour that looks at least quadratic.

dhoard (Collaborator) commented May 16, 2023

To clarify... you are getting metrics, but there is a performance degradation with the later versions? (This issue was reported as not getting any metrics.)

Using curl or wget, do you get metrics?
If so, what is the value for jmx_scrape_duration_seconds?
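
For example, both questions can be answered in one go from the broker host (assuming the exporter port 9999 from the command line posted earlier):

# Time the full download, then compare it with the exporter's self-reported scrape duration.
curl -s -o /tmp/metrics.txt -w 'total download time: %{time_total}s\n' http://localhost:9999/metrics
grep '^jmx_scrape_duration_seconds' /tmp/metrics.txt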

db3f commented May 16, 2023

We are seeing metrics up to a certain number of partitions, if we are willing to wait a few minutes. At about 12,000 partitions we gave up waiting after 20 minutes. When we did get metrics, jmx_scrape_duration_seconds was within a few hundred milliseconds of what curl showed as the download time.

dhoard (Collaborator) commented May 16, 2023

Sounds like Kafka is slow.

I would suggest using the JmxScraper class directly to diagnose. This bypasses most of the code used during a scrape.

private static final Logger logger = Logger.getLogger(JmxScraper.class.getName());

Due to your cluster size, I'm not sure I can create a large enough deployment to reproduce the behavior. I'll see what I can do.
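
A rough sketch of what a standalone run could look like, assuming a remote JMX port has been opened on the broker (e.g. via -Dcom.sun.management.jmxremote.port=9010, which the command line above does not set) and that the jar and class name match this exporter version; both are assumptions, not a documented invocation:

# Hypothetical standalone scrape against the broker's remote JMX port (9010 is an assumed
# port; the exact jar and the (possibly shaded) JmxScraper class name differ between
# exporter versions).
java -cp jmx_prometheus_javaagent-0.18.0.jar \
  io.prometheus.jmx.JmxScraper \
  service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi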

db3f commented May 16, 2023

Well, if it were Kafka (or solely Kafka), versions below 0.17 should show the same behavior. But as the Grafana dashboards I posted a few days ago show, 0.16.1 (and for that matter all earlier versions; we have been using the library since at least 0.12) does work. It seems that the scrape time was linear in the number of metrics before and is now at least quadratic, probably even exponential.

dhoard (Collaborator) commented May 16, 2023

Totally agree and it sounds like a regression, just trying to narrow the scope.

I'll write an integration test with some fake MBeans to try to simulate the scenario.

dhoard (Collaborator) commented May 17, 2023

Looking at the code, there are two major changes between 0.16.0 and 0.17.1.

  1. Code was introduced to process bean attributes one by one if a bulk load of the attributes fails.

JmxScraper.java:

        try {
            // bulk load all attributes
            attributes = beanConn.getAttributes(mbeanName, name2AttrInfo.keySet().toArray(new String[0]));
            if (attributes == null) {
                logScrape(mbeanName.toString(), "getAttributes Fail: attributes are null");
                return;
            }
        } catch (Exception e) {
            // couldn't get them all in one go, try them 1 by 1
            processAttributesOneByOne(beanConn, mbeanName, name2AttrInfo);
            return;
        }
        for (Object attributeObj : attributes.asList()) {
            if (Attribute.class.isInstance(attributeObj)) {
                Attribute attribute = (Attribute)(attributeObj);
                MBeanAttributeInfo attr = name2AttrInfo.get(attribute.getName());
                logScrape(mbeanName, attr, "process");
                processBeanValue(
                        mbeanName.getDomain(),
                        jmxMBeanPropertyCache.getKeyPropertyList(mbeanName),
                        new LinkedList<String>(),
                        attr.getName(),
                        attr.getType(),
                        attr.getDescription(),
                        attribute.getValue()
                );
            }
        }
    }
  2. The client_java dependency was upgraded from 0.11.0 to 0.16.0. Looking at the code that's actually being used in the project, I didn't see any change that should result in what you're seeing.

In my limited testing (single broker, 3000 partitions), I'm not hitting the scenario.

I can create a 0.18.0 build without the one-by-one attribute processing on Exception, if you would like to test it.

db3f commented May 19, 2023

I’m currently on a short leave but can offer to do some testing next week.

dhoard (Collaborator) commented Jun 14, 2023

@db3f any update on testing? One suspicion is that there is a failure getting all MBean attributes in a single call, so the attributes are getting processed one at a time.

        } catch (Exception e) {
            // couldn't get them all in one go, try them 1 by 1
            processAttributesOneByOne(beanConn, mbeanName, name2AttrInfo);
            return;
        }

db3f commented Jun 29, 2023

Sorry, we had some unexpected workload recently. We think we found the solution today: we increased Kafka's max heap space by 3 GB and suddenly scrapes worked again. This is preliminary, but we will do some more testing in this direction and report back here.

earnil commented Jan 17, 2024

Hello @dhoard, bumping this issue because I'm facing the same problem on some big Kafka clusters (~7400 partitions per broker). We had no issue with metrics when using JMX Exporter 0.16.0; it began with a Kafka upgrade that updated JMX Exporter to 0.17.2. At the time, we downgraded it to 0.16.0 to fix the issue. A new Kafka upgrade (right now we're on Kafka 3.5.1) came with JMX Exporter 0.18.0, and the issue appeared again.

I noticed that while jmx_scrape_duration_seconds is stable, between 13 and 16 seconds across brokers, the actual request can take a lot longer, from 30 s to 1 min 30 s. This behavior is exacerbated when I manually run curl against /metrics several times in quick succession.

I tried increasing the max heap space but didn't notice any change. If you're still willing to create a 0.18.0 build without the one-by-one attribute processing on Exception, I could test it.

dhoard (Collaborator) commented Jan 23, 2024

@earnil Please upgrade to the latest version and test. Version 0.20.0 added some performance improvements that decrease scrape time, but ultimately it takes time for a large cluster/many metrics.

If you are capturing a lot of metrics, you will need to increase your heap size due to how the underlying client_java library allocates/caches a per-thread buffer that is the size of the largest request.
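
As a rough illustration (sizes are examples, not recommendations): Kafka's standard start scripts read JVM heap settings from KAFKA_HEAP_OPTS, so the kind of adjustment db3f described earlier (+3 GB of max heap) would look roughly like this:

# Example only: raise the broker's max heap to leave headroom for the exporter's
# per-thread response buffers (1G/6G were the values in the command line posted earlier;
# 9G mirrors the +3 GB increase that worked for db3f).
export KAFKA_HEAP_OPTS="-Xms1g -Xmx9g"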

earnil commented Jan 30, 2024

Version 0.20.0, paired with a heap size adjustment, did improve performance on the biggest clusters. It still takes more than 30 seconds on some brokers, but these are very overloaded (> 15,000 partitions), so that's on us.

dhoard (Collaborator) commented Feb 2, 2024

Closing this issue.

@dhoard dhoard closed this as completed Feb 2, 2024
@dhoard dhoard self-assigned this May 15, 2024