Skip to content

Conversation

@alexanderbilk
Copy link

@alexanderbilk alexanderbilk commented Oct 29, 2025

What?

Adds infrastructure for dynamic load of telemetry exporters
Beta implementation of prometheus exporter

Why?

Required capability to expose NIXL telemetry in common telemetry formats (e.g OTLP, Prometheus etc.)

How?

Solution works as dynamically loaded user defined plugins

@alexanderbilk alexanderbilk requested a review from a team as a code owner October 29, 2025 09:51
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

👋 Hi alexanderbilk! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@alexanderbilk alexanderbilk changed the title Telemetry exporter plug-in infra DRAFT: Telemetry exporter plug-in infra Oct 29, 2025
@alexanderbilk alexanderbilk marked this pull request as draft October 29, 2025 09:59

# NIXL Prometheus Telemetry Exporter Plugin

This telemetry exporter plugin exports NIXL telemetry events to Prometheus format via an HTTP endpoint that can be scraped by Prometheus servers.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A typo and suggestion - ... in Prometheus format, by_exposing an HTTP endpoint ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. What exactly here is a typo?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly a typo, it was about "to => in".


## Configuration

To enable the Prometheus exporter, set the following environment variable:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add the word "plug-in" and explain that this variable activates a selected plug-in, of a (future) plug-in choices

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be considered to put an explicit .so name or path, to allow for custom plug-ins (e.g., by a third party)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more info here


gauges_["agent_xfer_time"] = &xfer_time_gauge.Add({{"category", "NIXL_TELEMETRY_PERFORMANCE"}});
gauges_["agent_xfer_post_time"] = &xfer_post_time_gauge.Add({{"category", "NIXL_TELEMETRY_PERFORMANCE"}});
gauges_["agent_memory_registered"] = &memory_registered_gauge.Add({{"category", "NIXL_TELEMETRY_MEMORY"}});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these two should be implemented as counters, such that their difference gives an amount of currently registered memory?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It stated in HLD what this metric operates as Gauge. Also what I can see from memory event implementation confirms it

@@ -0,0 +1,65 @@
/*
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general note, please add some .md file that explains how to write a custom telemetry plug-in. What is the API and what templates can be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@alexanderbilk alexanderbilk force-pushed the telemetry-exporter branch 5 times, most recently from d0b114f to 280b472 Compare October 30, 2025 15:05
@alexanderbilk alexanderbilk marked this pull request as ready for review October 30, 2025 15:49
uint32_t eventLimit_;

public:
explicit nixlTelemetryExporter(const nixlTelemetryExporterInitParams *init_params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest passing by const-reference; if the params are optional they can be wrapped with std::optional (or a default-c'tor can be added).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to const reference

@alexanderbilk alexanderbilk force-pushed the telemetry-exporter branch 3 times, most recently from b575c20 to e412bd1 Compare November 3, 2025 14:31
*/
class nixlTelemetryExporter {
protected:
const uint32_t eventLimit_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a uint64_t in the init params, should be consistent (and perhaps size_t?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, changed to size_t

const nixlTelemetryExporterInitParams *init_params);

// Constructor
nixlTelemetryPlugin(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the two getters can be marked noexcept. The comment is superfluous.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied

NixlTelemetryPluginApiVersion api_version;

// Function pointer for creating a new exporter instance (returns smart pointer for RAII)
std::unique_ptr<nixlTelemetryExporter> (*create_exporter)(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A type alias for the function pointer could make this and the c'tor argument more readable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied

}
}

dlclose(handle_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should handle_ not also be the unique_ptr with the appropriate deleter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

plugin_(plugin) {}

nixlTelemetryPluginHandle::~nixlTelemetryPluginHandle() {
if (handle_) {
Copy link
Contributor

@ColinNV ColinNV Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace this with assert(handle) in the c'tor (or // assert(handle) just to make the intention clear)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

nixl_status_t
nixlTelemetryPrometheusExporter::exportEvent(const nixlTelemetryEvent &event) {
try {
std::string event_name(event.eventName_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a const std::string_view if the map uses std::less<void>.

Copy link
Contributor

@ColinNV ColinNV Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or not, since it's only coming to unordered containers in C++20.

Could still be const :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added const

// Try to load the plugin from all registered directories
for (const auto &dir : plugin_dirs_) {
if (dir.empty()) {
continue;
Copy link
Contributor

@ColinNV ColinNV Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should better be caught at time of insertion into plugin_dirs_.

(Which would also make it easier to not hold the lock the whole time, but that's probably not important for this particular function.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omitted for now then.
All this functional is more of a "just in case" placeholder and to make it consistent with backend plugin manager

if (init_params && !init_params->outputPath.empty()) {
// Validate format: ip_addr:port_num
std::string path = init_params->outputPath;
size_t colon_pos = path.find(':');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables should be declared const when they are not changed (mostly to let the reader know they won't change).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

* @param dirpath Directory path to search
*/
void
discoverPluginsFromDir(const std::string &dirpath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use std::filesystem::path for paths and filenames?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, replaced

enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 };

inline constexpr NixlTelemetryPluginApiVersion NIXL_TELEMETRY_PLUGIN_API_VERSION =
NixlTelemetryPluginApiVersion::V1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to put v1 into the name of all types (and global functions, if any) that are under versioning to make it clearer which parts exactly are in fact versioned?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another "just in case" placeholder actually for beta version. I'd consider removing API versioning at all

@alexanderbilk alexanderbilk force-pushed the telemetry-exporter branch 2 times, most recently from 8419b52 to 5fd32d2 Compare November 11, 2025 16:46
DRAFT: Prometheus exporter implemetation
Signed-off-by: Aleksandr Bilkovskii <[email protected]>
Comment on lines +28 to +30
inline constexpr char TELEMETRY_EXPORTER_VAR[] = "NIXL_TELEMETRY_EXPORTER";
inline constexpr char TELEMETRY_EXPORTER_PLUGIN_DIR_VAR[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR";
inline constexpr char TELEMETRY_EXPORTER_OUTPUT_PATH_VAR[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline constexpr char TELEMETRY_EXPORTER_VAR[] = "NIXL_TELEMETRY_EXPORTER";
inline constexpr char TELEMETRY_EXPORTER_PLUGIN_DIR_VAR[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR";
inline constexpr char TELEMETRY_EXPORTER_OUTPUT_PATH_VAR[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH";
inline constexpr char telemetry_exporter_var[] = "NIXL_TELEMETRY_EXPORTER";
inline constexpr char telemetry_exporter_plugin_dir_var[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR";
inline constexpr char telemetry_exporter_output_path_var[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH";

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es9-avoid-all_caps-names

* them to various destinations.
*/
class nixlTelemetryExporter {
protected:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +20 to +26
#include "nixl_types.h"
#include "telemetry_event.h"
#include "common/cyclic_buffer.h"

#include <string>
#include <vector>
#include <fstream>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete the unused header files.

#include <string_view>
#include <memory>

enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 };

inline constexpr NixlTelemetryPluginApiVersion NIXL_TELEMETRY_PLUGIN_API_VERSION =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid ALL_CAPS names.

std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params);

class nixlTelemetryPlugin {
private:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the public before private order.

NixlTelemetryPluginApiVersion::V1;

// Type alias for exporter creation function
using ExporterCreatorFn =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using ExporterCreatorFn =
using exporter_creator_fn_t =

https://github.com/ai-dynamo/nixl/blob/main/docs/CodeStyle.md

std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params);

class nixlTelemetryPlugin {
private:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the public before private order.

std::lock_guard<std::mutex> lock(mutex_);
events_.swap(next_queue);
}
for (auto &event : next_queue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

[](const nixlTelemetryEvent &a, const nixlTelemetryEvent &b) {
return a.timestampUs_ < b.timestampUs_;
});
for (auto &event : all_events) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params);

class nixlTelemetryPlugin {
private:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the public before private order.

NixlTelemetryPluginApiVersion::V1;

// Type alias for exporter creation function
using ExporterCreatorFn =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using ExporterCreatorFn =
using exporter_creator_fn_t =

https://github.com/ai-dynamo/nixl/blob/main/docs/CodeStyle.md

std::string telemetry_file = std::string(telemetry_env_dir) + "/" + name;
telemetry_ = std::make_unique<nixlTelemetry>(telemetry_file, backendEngines);
NIXL_DEBUG << "NIXL telemetry is enabled with output file: " << telemetry_file;
} else if (telemetry_env_exporter_plugin_dir != nullptr &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nixlAgentData ctor is overloaded. Refactoring is necessary. Please move telemetry initialization into a separate private function.

}
};

// Plugin must implement these functions for dynamic loading
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of code duplication. telemetry_plugin.h looks like backend_plugin.h, and telemetry_plugin_manager.h looks like nixl_plugin_manager.h.

You can use the existing infrastructure by reorganizing the class structure (inheritance, composition, ...).
To better understand which path to choose, it is useful to have an example of using similar telemetry plugins. In other words, this PR should contain tests that use all the new functionality.

At the moment, I'm not entirely sure if telemetry needs to replicate the infrastructure that is used for NIXL plugins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants