Skip to content

Complete Databricks exporter and mixin package#14

Merged
aalhour merged 12 commits into
mainfrom
databricks-mixin-package
Dec 19, 2025
Merged

Complete Databricks exporter and mixin package#14
aalhour merged 12 commits into
mainfrom
databricks-mixin-package

Conversation

@aalhour
Copy link
Copy Markdown
Contributor

@aalhour aalhour commented Nov 11, 2025

@github-advanced-security
Copy link
Copy Markdown

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@aalhour aalhour changed the title WIP Databricks mixin package Complete Databricks exporter and mixin package Nov 17, 2025
@aalhour aalhour marked this pull request as ready for review November 18, 2025 14:27
Copy link
Copy Markdown
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not going to leave too many comments here as we already discussed a bunch in the call this morning and yesterday so I know changes are pending, but here's a few more mixin and metric-specific ideas/questions/things to consider. Will give the exporter a closer look when I'm back 1st Dec.

Comment thread README.md
Comment thread README.md
Comment thread mixin/README.md Outdated
Comment thread mixin/README.md Outdated
Comment thread mixin/README.md Outdated
Comment thread mixin/panels.libsonnet Outdated
Comment thread mixin/panels.libsonnet
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider breaking this panels file out into multiple panel files for ease of use, a la kafka-observ-lib? https://github.com/grafana/jsonnet-libs/tree/master/kafka-observ-lib

Comment thread mixin/panels.libsonnet
Comment thread mixin/panels.libsonnet Outdated
Comment thread mixin/signals/overview.libsonnet Outdated
Copy link
Copy Markdown
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up taking longer than I expected to review, and I've not been able to go into as much detail of the tests as I'd like to (nor the mixin) but here's my comments so far :)

Comment thread cmd/databricks-exporter/main.go Outdated
Comment thread cmd/databricks-exporter/main.go Outdated
Comment thread cmd/databricks-exporter/main.go Outdated
Comment thread collector/billing.go Outdated
Comment thread collector/billing.go Outdated
Comment thread collector/queries.go Outdated
Comment thread collector/pipelines.go Outdated
Comment thread collector/pipelines.go Outdated
Comment thread collector/jobs.go
Comment thread collector/jobs.go
@Dasomeone
Copy link
Copy Markdown
Member

Please make sure you've checked this with build tools and golangci-lint, ideally we add that to CI as well :)
(currently it's failing due to a syntax error)

@aalhour aalhour force-pushed the databricks-mixin-package branch from dc4f079 to 1c03029 Compare December 8, 2025 12:06
@aalhour aalhour requested a review from a team as a code owner December 8, 2025 12:06
@aalhour aalhour force-pushed the databricks-mixin-package branch from 10e4b46 to c3db323 Compare December 8, 2025 12:24
…inality controls

BREAKING CHANGE: Sliding window metrics changed from Counter to Gauge type.

Configuration:
- Add --query-timeout flag (default 5m) for SQL query timeouts
- Add --billing-lookback, --jobs-lookback, --pipelines-lookback, --queries-lookback flags
- Add --sla-threshold flag for job SLA miss detection
- Add --collect-task-retries flag (opt-in, high cardinality)

Performance:
- Implement persistent connection pooling with health checks
- Parallelize billing queries (DBUs, Cost, Price Changes)
- Optimize billing cost query using CTE (100s → 67s)

Code quality:
- Add Makefile with check, lint, test, build targets
- Add .golangci.yml configuration
- Use stretchr/testify for assertions
- Remove unused metrics and labels
- Add component=databricks-exporter log prefix

Dependencies:
- Upgrade databricks-sql-go v1.6.0 → v1.9.0
- Upgrade stretchr/testify v1.9.0 → v1.11.1
- Pin prometheus/common to v0.60.1
Signal type corrections (raw → gauge):
- jobs_and_pipelines: jobRunP95Aggregate, pipelineRunP95Aggregate,
  jobDurationByName, pipelineDurationByName, topJobsByRunsTableSignal,
  taskRetriesByJobTableSignal, topPipelinesByRunsTableSignal
- warehouses_and_queries: queryDuration, queryP95Latency, queryP50Latency,
  concurrencyCurrent, topWarehousesByQueries, topWarehousesByErrors,
  queryLatencyByWarehouse

Query fixes:
- Add workspace_id to warehouse aggregations
- Add queryP50Latency7dBaseline signal for baseline comparison
- Remove duplicate queryRate signal (use queryLoad)
- Rename queryErrorRate1h → queryErrorRateAggregate

Style:
- Fix sentence case in panel title

Dependencies:
- Remove unused status-panels-lib dependency
- Update vendor (xtd library)
@aalhour aalhour self-assigned this Dec 9, 2025
Copy link
Copy Markdown
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout and content wise I'm pretty happy with the mixin part of things!

The exporter is vastly improved, but still some open questions from previous review that you didn't respond to, would be good to just have a clean look and make sure it's all addressed/discussed

Otherwise I've left a couple new comments, but it all mostly looks good! Getting pretty close to releasing this :D

Comment thread Makefile Outdated
Comment thread Makefile Outdated
Comment thread collector/billing_test.go Outdated
Comment thread Makefile Outdated
Comment thread README.md Outdated
Comment thread collector/queries.go Outdated
Comment thread mixin/signals/overview.libsonnet Outdated
Comment thread mixin/panels.libsonnet Outdated
Comment thread mixin/README.md Outdated
Comment thread mixin/README.md Outdated
aalhour and others added 6 commits December 15, 2025 15:20
Co-authored-by: Emily <1282515+Dasomeone@users.noreply.github.com>
- Fix durationToSQLInterval to use correct singular/plural grammar
  (1 HOUR vs 2 HOURS, 1 DAY vs 2 DAYS)
- Remove deprecated time window constants (PricesWindow, BillingWindow, etc.)
- Make tableCheckInterval configurable via --table-check-interval flag
- Add comprehensive tests for durationToSQLInterval and Build* functions
- Update docs: fix metric types (Counter→Gauge for sliding windows),
  correct labels, add all CLI flags and env vars
- Apply sentence case to all documentation per Grafana style guide
- Clarify test-integration is a software integration test, not Grafana integration
…trics

Logging:
- Replace promlog/go-kit with slog/promslog (Go stdlib)
- Refactor function signatures: context.Context first, logger last
- Remove github.com/go-kit/log dependency

Metrics:
- Rename databricks_up → databricks_exporter_up
- Add databricks_scrape_status{query,status} for per-query health
- Add databricks_exporter_info{version,billing_window,...} for config visibility
- Update all HELP texts with lookback window info

Dependencies:
- prometheus/common: v0.60.1 → v0.67.4
- prometheus/client_golang: v1.20.5 → v1.23.2
- prometheus/exporter-toolkit: v0.13.1 → v0.15.0
Add databricks_scrape_status{query} metric to all collectors for
granular monitoring of individual system table query success/failure.

Changes:
- Emit scrape_status (1=success, 0=failure) for billing, jobs,
  pipelines, and queries collectors
- Add ScrapeStatus to Describe() methods in all sub-collectors
- Use atomic.Bool instead of sync.Mutex for error tracking in billing
- Update metric counts in tests (19 → 21)
- Document new metric in docs and mixin README
… query execution

- Remove ~380 lines of unused legacy query constants from queries.go
- Update queries_test.go to use Build* functions instead of deleted constants
- Delete zero-value tests (TestLabelConstants, TestConstants) that only tested string literals

- Fix task retries config check: move CollectTaskRetries check before query
  execution to avoid wasted database calls when feature is disabled

- Fix BillingExportErrorsTotal counter semantics:
  - Rename to BillingScrapeErrors (remove misleading _total suffix)
  - Change from CounterValue to GaugeValue (value doesn't accumulate across scrapes)
  - Update description to clarify it's a per-scrape error indicator

- Update test mocks to reflect CollectTaskRetries=false default behavior
Comment thread collector/billing.go
Comment thread collector/collector.go
@aalhour
Copy link
Copy Markdown
Contributor Author

aalhour commented Dec 17, 2025

Summary of updates:

  • Logging

    • Migrated from promlog/go-kit to slog/promslog (Go stdlib)
    • Removed github.com/go-kit/log dependency
  • Metrics

    • Renamed databricks_updatabricks_exporter_up
    • Renamed databricks_billing_export_errors_totaldatabricks_billing_scrape_errors (now a gauge)
    • Added databricks_scrape_status{query} per-collector health metric
    • Added databricks_exporter_info{version,...} for config visibility
    • Updated all HELP texts with lookback window info
  • Query builders

    • Removed unused legacy query constants
    • Added --table-check-interval flag for optional table availability checks
    • Fixed durationToSQLInterval singular/plural grammar (1 HOUR vs 2 HOURS) + tests
  • Performance

    • Moved CollectTaskRetries check before query execution to avoid wasted DB calls
  • Dependencies

    • Pinned all renovate-bot dependencies now that the exporter uses promslog
  • Tests

    • Added tests for durationToSQLInterval and Build* functions
    • Removed zero-value tests (TestLabelConstants, TestConstants)
    • Updated mocks for CollectTaskRetries=false default
  • Docs

    • Fixed metric types (Counter → Gauge for sliding windows)
    • Added all CLI flags and environment variables
    • Applied sentence case per Grafana style guide

Copy link
Copy Markdown
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor comments and some existing threads had separate comments added. I'm 99.9% happy with the exporter, fantastic work @aalhour !

Comment thread Makefile Outdated
Comment thread mixin/README.md
Comment thread collector/billing.go
Comment thread collector/queries_test.go
{"queryErrorsQuery", queryErrorsQuery},
{"queryDurationQuery", queryDurationQuery},
{"queriesRunningQuery", queriesRunningQuery},
// Days - singular and plural
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mention in the function description on the actual implementation that:
24 = 1 DAY
48h = 2 DAYS

Could you add tests for this too?

Similarly are we also needing to account for Months/years, etc?

Copy link
Copy Markdown
Contributor Author

@aalhour aalhour Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are already done on Lines 19-20, maybe the file wasn't loaded properly for you? 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be, I'll double check!

Metric renames:
- databricks_billing_dbus → databricks_billing_dbus_sliding
- databricks_billing_cost_estimate_usd → databricks_billing_cost_estimate_usd_sliding
- databricks_job_runs → databricks_job_runs_sliding
- databricks_job_run_status → databricks_job_run_status_sliding
- databricks_job_run_duration_seconds → databricks_job_run_duration_seconds_sliding
- databricks_task_retries → databricks_task_retries_sliding
- databricks_pipeline_runs → databricks_pipeline_runs_sliding
- databricks_pipeline_run_status → databricks_pipeline_run_status_sliding
- databricks_pipeline_run_duration_seconds → databricks_pipeline_run_duration_seconds_sliding
- databricks_pipeline_freshness_lag_seconds → databricks_pipeline_freshness_lag_seconds_sliding
- databricks_pipeline_retry_events → databricks_pipeline_retry_events_sliding
- databricks_queries → databricks_queries_sliding
- databricks_query_duration_seconds → databricks_query_duration_seconds_sliding
- databricks_query_errors → databricks_query_errors_sliding
- databricks_queries_running → databricks_queries_running_sliding

Exporter changes:
- Increase default lookback windows: jobs 2h→3h, pipelines 2h→3h, queries 1h→2h
- Update HELP text for billing metrics to document 24-48h data lag
- Update COALESCE fallbacks from 'unknown' to CONCAT('job-', id) / CONCAT('pipeline-', id)
- Refactor Makefile: split test targets into test, test-exporter-unit, test-exporter-e2e

Mixin changes:
- Wrap all gauge signals with last_over_time(...[30m:]) to handle Prometheus
  staleness with 10-30 minute scrape intervals
- Fix metric name typo: databricks_queries_sliding_running → databricks_queries_running_sliding
- Fix missing _sliding suffix in alerts.libsonnet for duration metrics
- Update signal descriptions to document billing data lag

Documentation:
- Add Prometheus configuration section (scrape interval 10-30m)
- Add lookback windows documentation
- Create docs/metrics-reference.md consolidating all metric definitions
- Update READMEs to reference centralized metrics documentation
- Normalize all titles to sentence case
@aalhour
Copy link
Copy Markdown
Contributor Author

aalhour commented Dec 18, 2025

Updates

Metric renames:

  • databricks_billing_dbus → databricks_billing_dbus_sliding
  • databricks_billing_cost_estimate_usd → databricks_billing_cost_estimate_usd_sliding
  • databricks_job_runs → databricks_job_runs_sliding
  • databricks_job_run_status → databricks_job_run_status_sliding
  • databricks_job_run_duration_seconds → databricks_job_run_duration_seconds_sliding
  • databricks_task_retries → databricks_task_retries_sliding
  • databricks_pipeline_runs → databricks_pipeline_runs_sliding
  • databricks_pipeline_run_status → databricks_pipeline_run_status_sliding
  • databricks_pipeline_run_duration_seconds → databricks_pipeline_run_duration_seconds_sliding
  • databricks_pipeline_freshness_lag_seconds → databricks_pipeline_freshness_lag_seconds_sliding
  • databricks_pipeline_retry_events → databricks_pipeline_retry_events_sliding
  • databricks_queries → databricks_queries_sliding
  • databricks_query_duration_seconds → databricks_query_duration_seconds_sliding
  • databricks_query_errors → databricks_query_errors_sliding
  • databricks_queries_running → databricks_queries_running_sliding

Exporter changes:

  • Increase default lookback windows: jobs 2h→3h, pipelines 2h→3h, queries 1h→2h
  • Update HELP text for billing metrics to document 24-48h data lag
  • Update COALESCE fallbacks from 'unknown' to CONCAT('job-', id) / CONCAT('pipeline-', id)
  • Refactor Makefile: split test targets into test, test-exporter-unit, test-exporter-e2e

Mixin changes:

  • Wrap all gauge signals with last_over_time(...[30m:]) to handle Prometheus staleness with 10-30 minute scrape intervals
  • Fix metric name typo: databricks_queries_sliding_running → databricks_queries_running_sliding
  • Fix missing _sliding suffix in alerts.libsonnet for duration metrics
  • Update signal descriptions to document billing data lag

Documentation:

  • Add Prometheus configuration section (scrape interval 10-30m)
  • Add lookback windows documentation
  • Create docs/metrics-reference.md consolidating all metric definitions
  • Update READMEs to reference centralized metrics documentation
  • Normalize all titles to sentence case

Copy link
Copy Markdown
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work @aalhour, thank you so much for all the interation and improvements on this one!

Really happy with the current state of both exporter and mixin, so I'm happy to approve here. Remember in the follow-up PRs to update the alloy signatures to expose the CLI flags as alloy config options, but that's a minor thing.

Let's get this wrapped up, can't wait to see it launched as an integration

@aalhour aalhour merged commit 5730cb3 into main Dec 19, 2025
5 checks passed
@aalhour aalhour deleted the databricks-mixin-package branch December 19, 2025 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants