Complete Databricks exporter and mixin package#14
Conversation
|
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
Dasomeone
left a comment
There was a problem hiding this comment.
Not going to leave too many comments here as we already discussed a bunch in the call this morning and yesterday so I know changes are pending, but here's a few more mixin and metric-specific ideas/questions/things to consider. Will give the exporter a closer look when I'm back 1st Dec.
There was a problem hiding this comment.
You could consider breaking this panels file out into multiple panel files for ease of use, a la kafka-observ-lib? https://github.com/grafana/jsonnet-libs/tree/master/kafka-observ-lib
Dasomeone
left a comment
There was a problem hiding this comment.
Ended up taking longer than I expected to review, and I've not been able to go into as much detail of the tests as I'd like to (nor the mixin) but here's my comments so far :)
|
Please make sure you've checked this with build tools and golangci-lint, ideally we add that to CI as well :) |
dc4f079 to
1c03029
Compare
10e4b46 to
c3db323
Compare
…inality controls BREAKING CHANGE: Sliding window metrics changed from Counter to Gauge type. Configuration: - Add --query-timeout flag (default 5m) for SQL query timeouts - Add --billing-lookback, --jobs-lookback, --pipelines-lookback, --queries-lookback flags - Add --sla-threshold flag for job SLA miss detection - Add --collect-task-retries flag (opt-in, high cardinality) Performance: - Implement persistent connection pooling with health checks - Parallelize billing queries (DBUs, Cost, Price Changes) - Optimize billing cost query using CTE (100s → 67s) Code quality: - Add Makefile with check, lint, test, build targets - Add .golangci.yml configuration - Use stretchr/testify for assertions - Remove unused metrics and labels - Add component=databricks-exporter log prefix Dependencies: - Upgrade databricks-sql-go v1.6.0 → v1.9.0 - Upgrade stretchr/testify v1.9.0 → v1.11.1 - Pin prometheus/common to v0.60.1
Signal type corrections (raw → gauge): - jobs_and_pipelines: jobRunP95Aggregate, pipelineRunP95Aggregate, jobDurationByName, pipelineDurationByName, topJobsByRunsTableSignal, taskRetriesByJobTableSignal, topPipelinesByRunsTableSignal - warehouses_and_queries: queryDuration, queryP95Latency, queryP50Latency, concurrencyCurrent, topWarehousesByQueries, topWarehousesByErrors, queryLatencyByWarehouse Query fixes: - Add workspace_id to warehouse aggregations - Add queryP50Latency7dBaseline signal for baseline comparison - Remove duplicate queryRate signal (use queryLoad) - Rename queryErrorRate1h → queryErrorRateAggregate Style: - Fix sentence case in panel title Dependencies: - Remove unused status-panels-lib dependency - Update vendor (xtd library)
Dasomeone
left a comment
There was a problem hiding this comment.
Layout and content wise I'm pretty happy with the mixin part of things!
The exporter is vastly improved, but still some open questions from previous review that you didn't respond to, would be good to just have a clean look and make sure it's all addressed/discussed
Otherwise I've left a couple new comments, but it all mostly looks good! Getting pretty close to releasing this :D
Co-authored-by: Emily <1282515+Dasomeone@users.noreply.github.com>
- Fix durationToSQLInterval to use correct singular/plural grammar (1 HOUR vs 2 HOURS, 1 DAY vs 2 DAYS) - Remove deprecated time window constants (PricesWindow, BillingWindow, etc.) - Make tableCheckInterval configurable via --table-check-interval flag - Add comprehensive tests for durationToSQLInterval and Build* functions - Update docs: fix metric types (Counter→Gauge for sliding windows), correct labels, add all CLI flags and env vars - Apply sentence case to all documentation per Grafana style guide - Clarify test-integration is a software integration test, not Grafana integration
…trics
Logging:
- Replace promlog/go-kit with slog/promslog (Go stdlib)
- Refactor function signatures: context.Context first, logger last
- Remove github.com/go-kit/log dependency
Metrics:
- Rename databricks_up → databricks_exporter_up
- Add databricks_scrape_status{query,status} for per-query health
- Add databricks_exporter_info{version,billing_window,...} for config visibility
- Update all HELP texts with lookback window info
Dependencies:
- prometheus/common: v0.60.1 → v0.67.4
- prometheus/client_golang: v1.20.5 → v1.23.2
- prometheus/exporter-toolkit: v0.13.1 → v0.15.0
Add databricks_scrape_status{query} metric to all collectors for
granular monitoring of individual system table query success/failure.
Changes:
- Emit scrape_status (1=success, 0=failure) for billing, jobs,
pipelines, and queries collectors
- Add ScrapeStatus to Describe() methods in all sub-collectors
- Use atomic.Bool instead of sync.Mutex for error tracking in billing
- Update metric counts in tests (19 → 21)
- Document new metric in docs and mixin README
… query execution - Remove ~380 lines of unused legacy query constants from queries.go - Update queries_test.go to use Build* functions instead of deleted constants - Delete zero-value tests (TestLabelConstants, TestConstants) that only tested string literals - Fix task retries config check: move CollectTaskRetries check before query execution to avoid wasted database calls when feature is disabled - Fix BillingExportErrorsTotal counter semantics: - Rename to BillingScrapeErrors (remove misleading _total suffix) - Change from CounterValue to GaugeValue (value doesn't accumulate across scrapes) - Update description to clarify it's a per-scrape error indicator - Update test mocks to reflect CollectTaskRetries=false default behavior
|
Summary of updates:
|
| {"queryErrorsQuery", queryErrorsQuery}, | ||
| {"queryDurationQuery", queryDurationQuery}, | ||
| {"queriesRunningQuery", queriesRunningQuery}, | ||
| // Days - singular and plural |
There was a problem hiding this comment.
You mention in the function description on the actual implementation that:
24 = 1 DAY
48h = 2 DAYS
Could you add tests for this too?
Similarly are we also needing to account for Months/years, etc?
There was a problem hiding this comment.
These are already done on Lines 19-20, maybe the file wasn't loaded properly for you? 🤔
Metric renames:
- databricks_billing_dbus → databricks_billing_dbus_sliding
- databricks_billing_cost_estimate_usd → databricks_billing_cost_estimate_usd_sliding
- databricks_job_runs → databricks_job_runs_sliding
- databricks_job_run_status → databricks_job_run_status_sliding
- databricks_job_run_duration_seconds → databricks_job_run_duration_seconds_sliding
- databricks_task_retries → databricks_task_retries_sliding
- databricks_pipeline_runs → databricks_pipeline_runs_sliding
- databricks_pipeline_run_status → databricks_pipeline_run_status_sliding
- databricks_pipeline_run_duration_seconds → databricks_pipeline_run_duration_seconds_sliding
- databricks_pipeline_freshness_lag_seconds → databricks_pipeline_freshness_lag_seconds_sliding
- databricks_pipeline_retry_events → databricks_pipeline_retry_events_sliding
- databricks_queries → databricks_queries_sliding
- databricks_query_duration_seconds → databricks_query_duration_seconds_sliding
- databricks_query_errors → databricks_query_errors_sliding
- databricks_queries_running → databricks_queries_running_sliding
Exporter changes:
- Increase default lookback windows: jobs 2h→3h, pipelines 2h→3h, queries 1h→2h
- Update HELP text for billing metrics to document 24-48h data lag
- Update COALESCE fallbacks from 'unknown' to CONCAT('job-', id) / CONCAT('pipeline-', id)
- Refactor Makefile: split test targets into test, test-exporter-unit, test-exporter-e2e
Mixin changes:
- Wrap all gauge signals with last_over_time(...[30m:]) to handle Prometheus
staleness with 10-30 minute scrape intervals
- Fix metric name typo: databricks_queries_sliding_running → databricks_queries_running_sliding
- Fix missing _sliding suffix in alerts.libsonnet for duration metrics
- Update signal descriptions to document billing data lag
Documentation:
- Add Prometheus configuration section (scrape interval 10-30m)
- Add lookback windows documentation
- Create docs/metrics-reference.md consolidating all metric definitions
- Update READMEs to reference centralized metrics documentation
- Normalize all titles to sentence case
UpdatesMetric renames:
Exporter changes:
Mixin changes:
Documentation:
|
Dasomeone
left a comment
There was a problem hiding this comment.
Fantastic work @aalhour, thank you so much for all the interation and improvements on this one!
Really happy with the current state of both exporter and mixin, so I'm happy to approve here. Remember in the follow-up PRs to update the alloy signatures to expose the CLI flags as alloy config options, but that's a minor thing.
Let's get this wrapped up, can't wait to see it launched as an integration
Fixes: https://github.com/grafana/cloud-onboarding/issues/9896
Fixes: https://github.com/grafana/cloud-onboarding/issues/9914
Part of: https://github.com/grafana/cloud-onboarding/issues/9716
Dashboards
Overview Dashboard
Jobs & Pipelines Dashboard
Warehouses & Queries Dashboard