Complete Databricks exporter and mixin package by aalhour · Pull Request #14 · grafana/databricks-prometheus-exporter

aalhour · 2025-11-11T16:29:22Z

Fixes: https://github.com/grafana/cloud-onboarding/issues/9896
Fixes: https://github.com/grafana/cloud-onboarding/issues/9914
Part of: https://github.com/grafana/cloud-onboarding/issues/9716

Dashboards

Overview Dashboard

Jobs & Pipelines Dashboard

Warehouses & Queries Dashboard

github-advanced-security · 2025-11-11T16:30:21Z

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

Dasomeone

Not going to leave too many comments here as we already discussed a bunch in the call this morning and yesterday so I know changes are pending, but here's a few more mixin and metric-specific ideas/questions/things to consider. Will give the exporter a closer look when I'm back 1st Dec.

Dasomeone · 2025-11-21T15:45:14Z

You could consider breaking this panels file out into multiple panel files for ease of use, a la kafka-observ-lib? https://github.com/grafana/jsonnet-libs/tree/master/kafka-observ-lib

Dasomeone

Ended up taking longer than I expected to review, and I've not been able to go into as much detail of the tests as I'd like to (nor the mixin) but here's my comments so far :)

Dasomeone · 2025-12-03T12:57:02Z

Please make sure you've checked this with build tools and golangci-lint, ideally we add that to CI as well :)
(currently it's failing due to a syntax error)

…inality controls BREAKING CHANGE: Sliding window metrics changed from Counter to Gauge type. Configuration: - Add --query-timeout flag (default 5m) for SQL query timeouts - Add --billing-lookback, --jobs-lookback, --pipelines-lookback, --queries-lookback flags - Add --sla-threshold flag for job SLA miss detection - Add --collect-task-retries flag (opt-in, high cardinality) Performance: - Implement persistent connection pooling with health checks - Parallelize billing queries (DBUs, Cost, Price Changes) - Optimize billing cost query using CTE (100s → 67s) Code quality: - Add Makefile with check, lint, test, build targets - Add .golangci.yml configuration - Use stretchr/testify for assertions - Remove unused metrics and labels - Add component=databricks-exporter log prefix Dependencies: - Upgrade databricks-sql-go v1.6.0 → v1.9.0 - Upgrade stretchr/testify v1.9.0 → v1.11.1 - Pin prometheus/common to v0.60.1

Signal type corrections (raw → gauge): - jobs_and_pipelines: jobRunP95Aggregate, pipelineRunP95Aggregate, jobDurationByName, pipelineDurationByName, topJobsByRunsTableSignal, taskRetriesByJobTableSignal, topPipelinesByRunsTableSignal - warehouses_and_queries: queryDuration, queryP95Latency, queryP50Latency, concurrencyCurrent, topWarehousesByQueries, topWarehousesByErrors, queryLatencyByWarehouse Query fixes: - Add workspace_id to warehouse aggregations - Add queryP50Latency7dBaseline signal for baseline comparison - Remove duplicate queryRate signal (use queryLoad) - Rename queryErrorRate1h → queryErrorRateAggregate Style: - Fix sentence case in panel title Dependencies: - Remove unused status-panels-lib dependency - Update vendor (xtd library)

Dasomeone

Layout and content wise I'm pretty happy with the mixin part of things!

The exporter is vastly improved, but still some open questions from previous review that you didn't respond to, would be good to just have a clean look and make sure it's all addressed/discussed

Otherwise I've left a couple new comments, but it all mostly looks good! Getting pretty close to releasing this :D

Co-authored-by: Emily <1282515+Dasomeone@users.noreply.github.com>

- Fix durationToSQLInterval to use correct singular/plural grammar (1 HOUR vs 2 HOURS, 1 DAY vs 2 DAYS) - Remove deprecated time window constants (PricesWindow, BillingWindow, etc.) - Make tableCheckInterval configurable via --table-check-interval flag - Add comprehensive tests for durationToSQLInterval and Build* functions - Update docs: fix metric types (Counter→Gauge for sliding windows), correct labels, add all CLI flags and env vars - Apply sentence case to all documentation per Grafana style guide - Clarify test-integration is a software integration test, not Grafana integration

…trics Logging: - Replace promlog/go-kit with slog/promslog (Go stdlib) - Refactor function signatures: context.Context first, logger last - Remove github.com/go-kit/log dependency Metrics: - Rename databricks_up → databricks_exporter_up - Add databricks_scrape_status{query,status} for per-query health - Add databricks_exporter_info{version,billing_window,...} for config visibility - Update all HELP texts with lookback window info Dependencies: - prometheus/common: v0.60.1 → v0.67.4 - prometheus/client_golang: v1.20.5 → v1.23.2 - prometheus/exporter-toolkit: v0.13.1 → v0.15.0

Add databricks_scrape_status{query} metric to all collectors for granular monitoring of individual system table query success/failure. Changes: - Emit scrape_status (1=success, 0=failure) for billing, jobs, pipelines, and queries collectors - Add ScrapeStatus to Describe() methods in all sub-collectors - Use atomic.Bool instead of sync.Mutex for error tracking in billing - Update metric counts in tests (19 → 21) - Document new metric in docs and mixin README

… query execution - Remove ~380 lines of unused legacy query constants from queries.go - Update queries_test.go to use Build* functions instead of deleted constants - Delete zero-value tests (TestLabelConstants, TestConstants) that only tested string literals - Fix task retries config check: move CollectTaskRetries check before query execution to avoid wasted database calls when feature is disabled - Fix BillingExportErrorsTotal counter semantics: - Rename to BillingScrapeErrors (remove misleading _total suffix) - Change from CounterValue to GaugeValue (value doesn't accumulate across scrapes) - Update description to clarify it's a per-scrape error indicator - Update test mocks to reflect CollectTaskRetries=false default behavior

aalhour · 2025-12-17T15:01:10Z

Summary of updates:

Logging
- Migrated from promlog/go-kit to slog/promslog (Go stdlib)
- Removed github.com/go-kit/log dependency
Metrics
- Renamed databricks_up → databricks_exporter_up
- Renamed databricks_billing_export_errors_total → databricks_billing_scrape_errors (now a gauge)
- Added databricks_scrape_status{query} per-collector health metric
- Added databricks_exporter_info{version,...} for config visibility
- Updated all HELP texts with lookback window info
Query builders
- Removed unused legacy query constants
- Added --table-check-interval flag for optional table availability checks
- Fixed durationToSQLInterval singular/plural grammar (1 HOUR vs 2 HOURS) + tests
Performance
- Moved CollectTaskRetries check before query execution to avoid wasted DB calls
Dependencies
- Pinned all renovate-bot dependencies now that the exporter uses promslog
Tests
- Added tests for durationToSQLInterval and Build* functions
- Removed zero-value tests (TestLabelConstants, TestConstants)
- Updated mocks for CollectTaskRetries=false default
Docs
- Fixed metric types (Counter → Gauge for sliding windows)
- Added all CLI flags and environment variables
- Applied sentence case per Grafana style guide

Dasomeone

Couple minor comments and some existing threads had separate comments added. I'm 99.9% happy with the exporter, fantastic work @aalhour !

Dasomeone · 2025-12-18T11:40:31Z

-		{"queryErrorsQuery", queryErrorsQuery},
-		{"queryDurationQuery", queryDurationQuery},
-		{"queriesRunningQuery", queriesRunningQuery},
+		// Days - singular and plural


You mention in the function description on the actual implementation that:
24 = 1 DAY
48h = 2 DAYS

Could you add tests for this too?

Similarly are we also needing to account for Months/years, etc?

These are already done on Lines 19-20, maybe the file wasn't loaded properly for you? 🤔

Might be, I'll double check!

Metric renames: - databricks_billing_dbus → databricks_billing_dbus_sliding - databricks_billing_cost_estimate_usd → databricks_billing_cost_estimate_usd_sliding - databricks_job_runs → databricks_job_runs_sliding - databricks_job_run_status → databricks_job_run_status_sliding - databricks_job_run_duration_seconds → databricks_job_run_duration_seconds_sliding - databricks_task_retries → databricks_task_retries_sliding - databricks_pipeline_runs → databricks_pipeline_runs_sliding - databricks_pipeline_run_status → databricks_pipeline_run_status_sliding - databricks_pipeline_run_duration_seconds → databricks_pipeline_run_duration_seconds_sliding - databricks_pipeline_freshness_lag_seconds → databricks_pipeline_freshness_lag_seconds_sliding - databricks_pipeline_retry_events → databricks_pipeline_retry_events_sliding - databricks_queries → databricks_queries_sliding - databricks_query_duration_seconds → databricks_query_duration_seconds_sliding - databricks_query_errors → databricks_query_errors_sliding - databricks_queries_running → databricks_queries_running_sliding Exporter changes: - Increase default lookback windows: jobs 2h→3h, pipelines 2h→3h, queries 1h→2h - Update HELP text for billing metrics to document 24-48h data lag - Update COALESCE fallbacks from 'unknown' to CONCAT('job-', id) / CONCAT('pipeline-', id) - Refactor Makefile: split test targets into test, test-exporter-unit, test-exporter-e2e Mixin changes: - Wrap all gauge signals with last_over_time(...[30m:]) to handle Prometheus staleness with 10-30 minute scrape intervals - Fix metric name typo: databricks_queries_sliding_running → databricks_queries_running_sliding - Fix missing _sliding suffix in alerts.libsonnet for duration metrics - Update signal descriptions to document billing data lag Documentation: - Add Prometheus configuration section (scrape interval 10-30m) - Add lookback windows documentation - Create docs/metrics-reference.md consolidating all metric definitions - Update READMEs to reference centralized metrics documentation - Normalize all titles to sentence case

aalhour · 2025-12-18T13:35:38Z

Updates

Metric renames:

databricks_billing_dbus → databricks_billing_dbus_sliding
databricks_billing_cost_estimate_usd → databricks_billing_cost_estimate_usd_sliding
databricks_job_runs → databricks_job_runs_sliding
databricks_job_run_status → databricks_job_run_status_sliding
databricks_job_run_duration_seconds → databricks_job_run_duration_seconds_sliding
databricks_task_retries → databricks_task_retries_sliding
databricks_pipeline_runs → databricks_pipeline_runs_sliding
databricks_pipeline_run_status → databricks_pipeline_run_status_sliding
databricks_pipeline_run_duration_seconds → databricks_pipeline_run_duration_seconds_sliding
databricks_pipeline_freshness_lag_seconds → databricks_pipeline_freshness_lag_seconds_sliding
databricks_pipeline_retry_events → databricks_pipeline_retry_events_sliding
databricks_queries → databricks_queries_sliding
databricks_query_duration_seconds → databricks_query_duration_seconds_sliding
databricks_query_errors → databricks_query_errors_sliding
databricks_queries_running → databricks_queries_running_sliding

Exporter changes:

Increase default lookback windows: jobs 2h→3h, pipelines 2h→3h, queries 1h→2h
Update HELP text for billing metrics to document 24-48h data lag
Update COALESCE fallbacks from 'unknown' to CONCAT('job-', id) / CONCAT('pipeline-', id)
Refactor Makefile: split test targets into test, test-exporter-unit, test-exporter-e2e

Mixin changes:

Wrap all gauge signals with last_over_time(...[30m:]) to handle Prometheus staleness with 10-30 minute scrape intervals
Fix metric name typo: databricks_queries_sliding_running → databricks_queries_running_sliding
Fix missing _sliding suffix in alerts.libsonnet for duration metrics
Update signal descriptions to document billing data lag

Documentation:

Add Prometheus configuration section (scrape interval 10-30m)
Add lookback windows documentation
Create docs/metrics-reference.md consolidating all metric definitions
Update READMEs to reference centralized metrics documentation
Normalize all titles to sentence case

Dasomeone

Fantastic work @aalhour, thank you so much for all the interation and improvements on this one!

Really happy with the current state of both exporter and mixin, so I'm happy to approve here. Remember in the follow-up PRs to update the alloy signatures to expose the CLI flags as alloy config options, but that's a minor thing.

Let's get this wrapped up, can't wait to see it launched as an integration

aalhour changed the title ~~WIP Databricks mixin package~~ Complete Databricks exporter and mixin package Nov 17, 2025

aalhour marked this pull request as ready for review November 18, 2025 14:27

Dasomeone reviewed Nov 21, 2025

View reviewed changes

Dasomeone requested changes Dec 2, 2025

View reviewed changes

Define signals for the 3 main dashboards

cd3f8b5

aalhour force-pushed the databricks-mixin-package branch from dc4f079 to 1c03029 Compare December 8, 2025 12:06

aalhour requested a review from a team as a code owner December 8, 2025 12:06

Databricks exporter & mixin before Dec 8 review

c3db323

aalhour force-pushed the databricks-mixin-package branch from 10e4b46 to c3db323 Compare December 8, 2025 12:24

aalhour added 2 commits December 9, 2025 14:50

aalhour self-assigned this Dec 9, 2025

Forgot to update default timeout to 5 mins.

548d23d

This was referenced Dec 10, 2025

feat(prometheus.exporter.databricks): Add Databricks exporter component grafana/alloy#5054

Merged

Add a runbook for the Databricks integration grafana/integration-sample-apps#115

Merged

Dasomeone reviewed Dec 12, 2025

View reviewed changes

aalhour and others added 6 commits December 15, 2025 15:20

Update cmd/databricks-exporter/main.go

c9de8af

Co-authored-by: Emily <1282515+Dasomeone@users.noreply.github.com>

Removed license blocks and simplified READMEs

6aedd39

aalhour commented Dec 17, 2025

View reviewed changes

Comment thread collector/billing.go

aalhour commented Dec 17, 2025

View reviewed changes

Comment thread collector/collector.go

Dasomeone reviewed Dec 18, 2025

View reviewed changes

Dasomeone approved these changes Dec 19, 2025

View reviewed changes

aalhour merged commit 5730cb3 into main Dec 19, 2025
5 checks passed

aalhour deleted the databricks-mixin-package branch December 19, 2025 15:03

Conversation

aalhour commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dashboards

Overview Dashboard

Jobs & Pipelines Dashboard

Warehouses & Queries Dashboard

Uh oh!

github-advanced-security AI commented Nov 11, 2025

Uh oh!

Dasomeone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Dasomeone Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Dasomeone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Dasomeone commented Dec 3, 2025

Uh oh!

Dasomeone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aalhour commented Dec 17, 2025

Uh oh!

Dasomeone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Dasomeone Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

aalhour Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dasomeone Dec 19, 2025

Choose a reason for hiding this comment

Uh oh!

aalhour commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Updates

Uh oh!

Dasomeone left a comment

Choose a reason for hiding this comment

Uh oh!

aalhour commented Nov 11, 2025 •

edited

Loading

aalhour Dec 18, 2025 •

edited

Loading

aalhour commented Dec 18, 2025 •

edited

Loading