Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Open Telemetry the default instead of StatsD for monitoring #40800

Open
kaxil opened this issue Jul 15, 2024 · 22 comments
Open

Make Open Telemetry the default instead of StatsD for monitoring #40800

kaxil opened this issue Jul 15, 2024 · 22 comments
Assignees
Labels
airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes airflow3.0:candidate Potential candidates for Airflow 3.0 telemetry Telemetry-related issues

Comments

@kaxil
Copy link
Member

kaxil commented Jul 15, 2024

Currently, Apache Airflow uses StatsD for metrics collection and monitoring. To modernize our observability stack and align with industry standards, we should adopt OpenTelemetry as the primary metrics collection and monitoring tool for Airflow 3 while keeping Statsd as an option. This is made easier with AIP-49 Open Telemetry Support.

Backward Compatibility / Migration

Most of the enterprise Airflow deployments rely on StatsD for monitoring & alerting so we should make sure that there is a smooth migration path. Ideally, all the StatsD metrics mentioned in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions should have a like-by-like replacement.

High-level Objectives:

  1. Use OpenTelemetry as the default metrics collection tool. This includes setting up configurations, dependencies, and code integration.
  2. Documentation Update: Update the official documentation to reflect the change from StatsD to OpenTelemetry as the default, including setup, configuration, and usage instructions. Include migration guide for users transitioning from StatsD to OpenTelemetry if there are breaking changes. This includes the Helm chart.
@kaxil kaxil mentioned this issue Jul 15, 2024
10 tasks
@dosubot dosubot bot added the telemetry Telemetry-related issues label Jul 15, 2024
@kaxil kaxil added involves core breaking change airflow3.0:candidate Potential candidates for Airflow 3.0 airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes labels Jul 15, 2024
@josix
Copy link
Contributor

josix commented Jul 16, 2024

Hi @kaxil I'm interested in this issue, may I have a try for this, thanks!

@kaxil
Copy link
Member Author

kaxil commented Jul 17, 2024

Awesome, assigned it to you

@dirrao
Copy link
Contributor

dirrao commented Jul 18, 2024

Hi @josix
Let me know if you need any help.

@raphaelauv
Copy link
Contributor

also provide a simple grafana dashboard in the example stack ( cause the main airflow grafana dashboard open source is based on statsd ) would help a lot

@kaxil
Copy link
Member Author

kaxil commented Jul 25, 2024

@howardyoo is also interested in working on this. He led AIP-49, so @josix if you need help please let Howard no.

@howardyoo is equally interested in leading this effort too

@josix
Copy link
Contributor

josix commented Jul 25, 2024

Yeah, I just checked out the AIP and the sharing in the Airflow Submit these days, and currently worked on studying the codebase around StatsD and OTel. I believe it would be a better choice for @howardyoo to lead this topic 🙂 Please feel free to unassign me and assign sub-items to me if possible, thanks for the coordination.

@dirrao
Copy link
Contributor

dirrao commented Jul 25, 2024

I am interested too in this. Please assign me any tasks needed to be done.
I am working on similar lines on this in this PR #39908

@kaxil kaxil assigned dirrao and unassigned josix Jul 25, 2024
@kaxil
Copy link
Member Author

kaxil commented Jul 25, 2024

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

@howardyoo
Copy link
Contributor

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

Ok, done!

@kaxil
Copy link
Member Author

kaxil commented Jul 26, 2024

Awesome, assigned it to you.

@howardyoo
Copy link
Contributor

howardyoo commented Aug 28, 2024

Oh, I guess the above mentioned PR draft got released, which is nice - but looks like it may be on hold for now.

@chomipi88
Copy link

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

@howardyoo
Copy link
Contributor

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

@raphaelauv
Copy link
Contributor

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

@howardyoo
Copy link
Contributor

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

My favor to this policy, is that

  • Opentelemetry has become quite stable and mature, and is vendor neural way of instrumenting softwares
  • StatsD only covers metrics, while Opentelemetry covers metrics, traces, and logs
  • StatsD has limitations in terms of number of attributes, or metric names, while Opentelemetry is less limited
  • Opentelemetry also provides opentelemetry collector which can further be used to control how the telemetry could be transferred.

@raphaelauv
Copy link
Contributor

raphaelauv commented Sep 5, 2024

Yes otel is better ,but why keep statsd ?

@potiuk
Copy link
Member

potiuk commented Sep 5, 2024

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

@howardyoo
Copy link
Contributor

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

@potiuk beat me into responding!
but yes, the overhead is little, statsd had been the default choice for years, so maintaining backward compatibility is still important, yet now users who are using statsd could experiment and adopt otel as their next telemetry while they can still use what they're used to would be my opinion.

@ferruzzi
Copy link
Contributor

ferruzzi commented Sep 5, 2024

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:

def get_name(name: str, tags:dict) -> str:
    ...

Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way

def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"

Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT]
shorter but nowhere near as clear statsd implementation

def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

@kaxil
Copy link
Member Author

kaxil commented Sep 5, 2024

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Yup, I agree with this

@kaxil kaxil changed the title Remove StatsD and replace it with Open Telemetry as first-class citizen Make Open Telemetry the default instead of StatsD for monitoring Sep 5, 2024
@dthauvin
Copy link

dthauvin commented Nov 18, 2024

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:

def get_name(name: str, tags:dict) -> str:
    ...

Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way

def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"

Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT] shorter but nowhere near as clear statsd implementation

def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

Hi @ferruzzi ,i don't know if you see this , but opentelemetry specification goes from 63 characters to 255 open-telemetry/opentelemetry-specification#3648 .
With OpenTelemetry configuration i got a lot of truncate metrics . Specially those with the dag name inside . it's very difficult to predict the metrics name and so make observability dashboard automation

exceeds OpenTelemetry's name length limit of 63 characters and will be truncated to

@ferruzzi
Copy link
Contributor

Thanks for pointing it out. We have a fix for it coming in #43340

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes airflow3.0:candidate Potential candidates for Airflow 3.0 telemetry Telemetry-related issues
Projects
None yet
Development

No branches or pull requests

10 participants