
Conversation

@srowen
Member

@srowen srowen commented Oct 30, 2019

What changes were proposed in this pull request?

Update the Dropwizard Metrics library that Spark uses for metrics from 3.2.x to 4.1.x.

Why are the changes needed?

This helps JDK 9+ support; see, for example, dropwizard/metrics#1236

Does this PR introduce any user-facing change?

No, although downstream users with custom metrics may be affected.

How was this patch tested?

Existing tests.

@srowen srowen self-assigned this Oct 30, 2019
```xml
<exclusions>
  <exclusion>
    <groupId>com.rabbitmq</groupId>
    <artifactId>amqp-client</artifactId>
  </exclusion>
</exclusions>
```
Member Author

This would otherwise be a new dependency on RabbitMQ, but my hunch is that Spark does not need it.

@dongjoon-hyun
Member

Wow. Nice!

@dongjoon-hyun
Member

BTW, what happens to ganglia? Do we still have Ganglia support after this PR?

@SparkQA

SparkQA commented Oct 31, 2019

Test build #112974 has finished for PR 26332 at commit aefde48.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member Author

srowen commented Oct 31, 2019

I don't know of a reason it would change Ganglia support. metrics-ganglia is still included, updated, and built here.

@srowen
Member Author

srowen commented Oct 31, 2019

Oh, wait no I see the issue. Ganglia support is in a different module that I didn't enable. Indeed it isn't present in 4.x: https://search.maven.org/search?q=g:io.dropwizard.metrics%20AND%20a:metrics-ganglia&core=gav

Hm, OK, that's a problem, as we have historically maintained some Ganglia support.
But there are actually no tests for this integration, so we wouldn't have noticed any issues with it anyway.

Maybe I need to take this to dev@ ... if the support doesn't necessarily work on JDK 11 (unproven) then do we need to consider removing it?

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 31, 2019

+1 for removing it on JDK 11. Since we dropped kafka-0.8 in 2.4.x (Scala 2.12), I believe we can disable Ganglia support on JDK 11. This is because the main distribution of Apache Spark 3.0.0 still targets JDK 8.

@dongjoon-hyun
Member

Oh... it would be difficult to maintain it separately for JDK 8 and JDK 11.

@LucaCanali
Contributor

LucaCanali commented Oct 31, 2019

Adding a link to a similar discussion in #26212.
Getting rid of the Ganglia reporter is probably a good solution. One thought: if some people still need it in Spark 3.x, maybe the Ganglia reporter could be maintained by Apache Spark itself? It does not look like a lot of code. There is the annoying issue of its LGPL licensing, of course; that is why the Dropwizard metrics library dropped it, as I understand.
PS: I see now that the discussion has moved to the dev mailing list, with a similar comment and more.

@srowen
Member Author

srowen commented Oct 31, 2019

Yep, that is what I am going to try next: just inlining the single, simple file that metrics-ganglia contained. Dropwizard itself is ALv2. Ganglia is not, but Spark doesn't publish this module directly, so that doesn't change.
Honestly, most of the work is updating the license files!

@srowen
Member Author

srowen commented Oct 31, 2019

If this file gets hard to maintain, we can consider removal. If it's pretty easy to keep it for now, we can just keep it.
Honestly the only problem here is lack of tests.

```xml
<dependency>
  <groupId>io.dropwizard.metrics</groupId>
  <artifactId>metrics-ganglia</artifactId>
</dependency>
<dependency>
  <groupId>info.ganglia.gmetric4j</groupId>
  <artifactId>gmetric4j</artifactId>
</dependency>
```
Member Author

This was previously a transitive dependency anyway, and should have been referenced directly.

```
The Apache Software Foundation (http://www.apache.org/).

Metrics
```
Member Author

I think this should have already been there, but now would also need to be in NOTICE

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113036 has finished for PR 26332 at commit b842ad4.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class GangliaReporter extends ScheduledReporter
  • public static class Builder

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113037 has finished for PR 26332 at commit 79bd378.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

> Spark doesn't publish this module directly, so that doesn't change.

IIUC, technically we have been publishing our Ganglia module to Maven directly from Apache Spark 1.2.0 through 2.4.4.

@srowen
Member Author

srowen commented Oct 31, 2019

Ah sorry, you are correct. What I really meant is that it isn't bundled in the binary release, because that would entail distributing the Ganglia dependency. At least this was true last time I checked!

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Nov 3, 2019

Test build #113178 has finished for PR 26332 at commit 79bd378.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Nov 3, 2019

Test build #113182 has finished for PR 26332 at commit 79bd378.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you so much, @srowen . Merged to master.
Also, thank you, @LucaCanali .

@srowen srowen deleted the SPARK-29674 branch November 6, 2019 14:56
@srowen
Member Author

srowen commented Nov 7, 2019

@LucaCanali @jerryshao maybe I can get your advice on something related here.

I was investigating back-porting this to 2.4.x. I find that the tests run out of memory with this change. A heap dump suggests that a lot of memory is consumed by MetricRegistry objects, with transitive references to all sorts of big objects. (Let's put aside why those references exist for now.)

As part of my hacking, I found that manually clearing all the Metrics in the registry when Spark's MetricSystem stops seems to resolve it. As in, in MetricSystem:

```scala
def stop(): Unit = {
  ...
  sinks.foreach(_.stop)
  registry.removeMatching(new MetricFilter {
    override def matches(name: String, metric: Metric): Boolean = true
  })
  ...
}
```

Is this crazy? Like, is clearing out all of the metrics going to break something, or is it a sensible cleanup step?

I'm still investigating, but wanted to reach out to anyone who may be familiar with this code.
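
Dropwizard does ship a built-in match-everything filter, `MetricFilter.ALL`, which avoids the anonymous class. To make the cleanup idea concrete without depending on Dropwizard itself, here is a self-contained Java sketch with a plain map standing in for `MetricRegistry`; all names here are illustrative, not the real API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;

// Stand-in for Dropwizard's MetricRegistry/MetricFilter, to illustrate the
// cleanup in MetricsSystem.stop(). The real API lives in com.codahale.metrics.
public class RegistrySketch {
    static final Map<String, Object> registry = new ConcurrentHashMap<>();

    // Analogous to MetricRegistry.removeMatching(MetricFilter): drop every
    // metric whose (name, metric) pair the filter accepts.
    static void removeMatching(BiPredicate<String, Object> filter) {
        registry.entrySet().removeIf(e -> filter.test(e.getKey(), e.getValue()));
    }

    public static void main(String[] args) {
        registry.put("executor.jvmGCTime", new Object());
        registry.put("driver.blockManager.memUsed", new Object());
        // The match-everything filter from the snippet above: clears the
        // registry, so nothing retains references through it after stop().
        removeMatching((name, metric) -> true);
        System.out.println(registry.size()); // prints 0
    }
}
```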

@LucaCanali
Contributor

What you write sounds reasonable to me. As an additional comment, I'd say that while stopping the metrics system makes sense for a test suite, it is not something that happens often in real usage: in most cases I can think of, metrics are started/registered at component startup and stay up for the component's whole life.

@srowen
Member Author

srowen commented Nov 7, 2019

Thanks @LucaCanali. Yeah, that's my worry. This means that after the Spark cluster is stopped, the metrics all disappear. Is that expected, or a problem? I am not sure how metrics are consumed. Of course, at that point the components' lifecycle has ended anyway. Would external services rely on 'pulling' metrics after it's done but before the JVM stops, or do metrics implementations typically 'push' as they go? Sorry for the dumb question.

@LucaCanali
Contributor

Metrics are written to sinks at regular, configurable time intervals (say, 10 seconds). I mostly use the Graphite sink to write to an InfluxDB instance; the metrics are persisted there and consumed from InfluxDB by a Grafana dashboard.
The metrics system also has the MetricsServlet and, more recently, the PrometheusServlet sink, which publish metrics to the Spark UI, and I believe the JmxSink is also quite popular. In those cases the idea is for the metrics to be pulled by the end consumer, and it is the consumer's job to keep a history of the gathered metrics.
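
The push-style pattern described above (a scheduler snapshotting the registry every interval and writing to a sink, the way the Graphite sink does) can be sketched in self-contained Java; the names here are illustrative stand-ins, not Spark's actual classes:

```java
import java.util.Map;
import java.util.concurrent.*;

// Minimal sketch of push-style metric reporting: a background scheduler
// periodically dumps the registry to a "sink" (stdout here).
public class ReporterSketch {
    public static void main(String[] args) throws Exception {
        Map<String, Long> registry = new ConcurrentHashMap<>();
        registry.put("executor.threadpool.activeTasks", 3L);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Real configs use an interval of seconds; milliseconds here so the
        // sketch finishes quickly.
        ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(
            () -> registry.forEach((name, v) -> System.out.println(name + "=" + v)),
            0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(35);   // let a few reports happen
        task.cancel(false); // stopping the sink stops the pushes
        scheduler.shutdown();
    }
}
```

In the pull-style case (MetricsServlet, JmxSink) nothing pushes; the consumer reads the registry on demand, which is why clearing it at stop only matters if someone still polls afterwards.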

@srowen
Member Author

srowen commented Nov 7, 2019

OK, it sounds reasonable to explicitly clear the registry then, as it's going away almost immediately anyway and the sinks are shut down. I will propose it.

That helps the tests, which do start/stop contexts; I think we otherwise somehow 'leak' all the previous metrics registries in that case. However, it's kind of specific to the tests.

The other question remains: why is this an issue only in 2.4 (as far as I can tell), and only with metrics 4.x? Since you've looked at a lot of metrics related changes @LucaCanali , do you recall anything we fixed in 3.0 that might help avoid using a lot of memory in the metrics system? I don't, and didn't see anything in skimming JIRAs.

I am still not clear whether the big closures captured by the gauges Spark registers are necessary, or an accident that will otherwise cause long-running apps to 'leak' somehow. I don't think so: what they retain references to is the current Spark machinery, like BlockManager, which is not a problem as long as there is only one live BlockManager anyway. Still, we need to keep an eye on it in 3.0.
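
The retention mechanism in question can be shown in a few lines of self-contained Java: a gauge lambda captures a large object, so the registry transitively pins it until the metric is removed. The names are illustrative, not Spark's real metrics:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of how a registered gauge can pin a large object: the lambda
// captures `blockManagerBuffers`, so the registry keeps it reachable even
// after the owning component is stopped.
public class GaugeRetentionSketch {
    public static void main(String[] args) {
        byte[] blockManagerBuffers = new byte[1 << 20]; // stand-in for BlockManager state
        Map<String, Supplier<Long>> registry = new HashMap<>();

        // The gauge closure captures the 1 MB array; as long as the registry
        // lives, so does the array.
        registry.put("blockManager.memUsed", () -> (long) blockManagerBuffers.length);
        System.out.println(registry.get("blockManager.memUsed").get()); // prints 1048576

        // Clearing the registry at stop releases the closure, and with it
        // the captured reference.
        registry.clear();
        System.out.println(registry.size()); // prints 0
    }
}
```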

@LucaCanali
Contributor

Most of the PRs for the metrics system targeting 3.0 so far have been about adding new metrics.
It's good to know that the memory usage issue does not seem to affect 3.0, though.
BTW, we could work on adding a few configuration parameters to allow turning some of the more "chatty" metrics on and off. For example: #26320

FMX pushed a commit to apache/celeborn that referenced this pull request Jun 4, 2024
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard Metrics has released v4.2.25, which includes several bugfixes and improvements:

* [JVM] Fix maximum/total memory calculation: dropwizard/metrics#3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: dropwizard/metrics#1601

Meanwhile, the Ratis version has been upgraded to 3.0.1, which has no compatibility problem with Dropwizard 4.2.25.

Backport:

- apache/spark#26332
- apache/spark#29426
- apache/spark#37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: mingji <[email protected]>