# feat: track credits and latency in clickhouse #3854
**New file** (+8 lines) — Dockerfile for the ClickHouse image:

```dockerfile
FROM bitnami/clickhouse:25.6.4

# Copy ClickHouse schemas
COPY go/pkg/clickhouse/schema/databases/ /opt/clickhouse-schemas/

# Create initialization script that will execute our SQL files on first run
# (script is already made executable on host)
COPY deployment/init-clickhouse.sh /docker-entrypoint-initdb.d/init-clickhouse.sh
```
**New file — `deployment/init-clickhouse.sh`** (+25 lines):

```bash
#!/bin/bash
set -e

echo "Initializing ClickHouse schemas..."

# Execute SQL files in order from our schemas directory
for db_dir in /opt/clickhouse-schemas/*/; do
  if [ -d "$db_dir" ]; then
    echo "Processing database directory: $db_dir"

    # Execute SQL files in numerical order
    for sql_file in "$db_dir"*.sql; do
      if [ -f "$sql_file" ] && [[ "$sql_file" == *.sql ]]; then
        echo "Executing: $sql_file"

        if ! clickhouse-client --host localhost --user "$CLICKHOUSE_ADMIN_USER" --password "$CLICKHOUSE_ADMIN_PASSWORD" --queries-file "$sql_file"; then
          echo "Error executing $sql_file - stopping initialization"
          exit 1
        fi
      fi
    done
  fi
done

echo "ClickHouse schema initialization complete!"
```

> 🧹 **Nitpick (assertive), lines 16–19:** Surface clearer errors and avoid leaking secrets. Fail fast if the admin credentials are missing:
>
> ```bash
> if [[ -z "${CLICKHOUSE_ADMIN_USER:-}" || -z "${CLICKHOUSE_ADMIN_PASSWORD:-}" ]]; then
>   echo "CLICKHOUSE_ADMIN_USER/CLICKHOUSE_ADMIN_PASSWORD must be set" >&2
>   exit 1
> fi
> ```
>
> I can wire this near the top if you prefer.
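One subtlety in the loop above: "numerical order" is really glob (lexicographic) order, which only matches numeric order because the schema files are zero-padded (`001_`, `002_`, …). The sketch below, using hypothetical file names, shows why unpadded names would misorder and how `sort -V` gives true version order:

```shell
# Lexicographically, "10_users.sql" < "2_roles.sql", so a bare glob would run
# 10_ before 2_. sort -V compares the numeric components instead.
printf '%s\n' '10_users.sql' '2_roles.sql' '1_db.sql' | sort -V
# → 1_db.sql
# → 2_roles.sql
# → 10_users.sql
```

With the repo's zero-padded prefixes the plain glob is already safe; `sort -V` is only needed if that convention ever slips.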
**New file** (+45 lines) — raw verifications table:

```sql
CREATE TABLE IF NOT EXISTS verifications.raw_key_verifications_v2(
    -- the api request id, so we can correlate the verification with traces and logs
    request_id String,

    -- unix milli
    time Int64 CODEC(Delta, LZ4),

    workspace_id String,
    key_space_id String,
    -- Empty string if the key has no identity
    identity_id String,
    key_id String,

    -- Right now this is a 3 character airport code, but when we move to aws,
    -- this will be the region code such as `us-east-1`
    region LowCardinality(String),

    -- Examples:
    -- - "VALID"
    -- - "RATE_LIMITED"
    -- - "EXPIRED"
    -- - "DISABLED"
    outcome LowCardinality(String),

    tags Array(String) DEFAULT [],

    -- The number of credits spent on this verification
    -- 0 means no credits were spent
    spent_credits Int64,

    -- Latency in milliseconds for this verification
    latency Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(fromUnixTimestamp64Milli(time))
ORDER BY (workspace_id, time, key_space_id, identity_id, key_id)
TTL fromUnixTimestamp64Milli(time) + INTERVAL 100 DAY
SETTINGS non_replicated_deduplication_window = 10000
;

ALTER TABLE verifications.raw_key_verifications_v2
ADD INDEX IF NOT EXISTS idx_request_id (request_id) TYPE minmax GRANULARITY 1;
```

> 🧹 **Nitpick (assertive):** Consider `DateTime64` for `time` to avoid repeated conversions. Storing `time` as `DateTime64(3)` removes the repeated `fromUnixTimestamp64Milli()` calls in `PARTITION BY`/`TTL` and in downstream MVs:
>
> ```diff
> -    time Int64 CODEC(Delta, LZ4),
> +    time DateTime64(3) CODEC(Delta, LZ4),
> ```
>
> This implies adjusting ingestion to write `DateTime64(3)` and updating `PARTITION BY`/`TTL` accordingly.

> 🧹 **Nitpick (assertive), lines 26–27:** Canonicalize the `tags` array. Same concern as the aggregated tables: ensure a stable order (e.g., sort at ingestion) to avoid spurious distinctions in downstream aggregates that include `tags` in `GROUP BY`.

> 🧹 **Nitpick (assertive), lines 32–33:** Latency unit and numeric type. Latency is documented as milliseconds; `Float64` is fine. If you plan to store integer milliseconds only, consider `UInt32` to save space and use `quantileTDigest` on `toFloat64(latency)` in the MVs. Otherwise, keep `Float64` but document whether sub-millisecond precision is expected.

> 🧹 **Nitpick (assertive):** Validate `ORDER BY` against query patterns. The primary key is `(workspace_id, time, key_space_id, identity_id, key_id)`. If common queries filter by `workspace_id` plus a time range, this is good. If `region`/`outcome` are often used in filters before `identity_id`/`key_id`, consider including them in the PK suffix or adding appropriate skip indexes. Share typical `WHERE` clauses; I can propose a PK and secondary-index layout aligned to real workloads.

> 🧹 **Nitpick (assertive), lines 38–41:** The deduplication setting may not be effective without enabling non-replicated dedup. `non_replicated_deduplication_window` only works when non-replicated dedup is enabled; confirm the server config sets `use_non_replicated_mergetree_deduplication = 1`, or set it per table if permissible. If dedup of repeated inserts by `request_id` is required, consider `ReplacingMergeTree` with a version column, or enforce dedup at ingestion.

> 🛠️ **Refactor suggestion, lines 44–45:** Use a Bloom filter skip index for `request_id` instead of `minmax`. A `minmax` index over `String` provides little selectivity for equality lookups; Bloom filter indexes are the standard choice for tracing-style IDs:
>
> ```diff
> -ALTER TABLE verifications.raw_key_verifications_v2
> -ADD INDEX IF NOT EXISTS idx_request_id (request_id) TYPE minmax GRANULARITY 1;
> +ALTER TABLE verifications.raw_key_verifications_v2
> +ADD INDEX IF NOT EXISTS idx_request_id (request_id) TYPE bloom_filter GRANULARITY 4;
> ```
>
> Optionally add a secondary index on `outcome` for common filters:
>
> ```sql
> ALTER TABLE verifications.raw_key_verifications_v2
> ADD INDEX IF NOT EXISTS idx_outcome (outcome) TYPE set(100) GRANULARITY 4;
> ```
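For context on that index discussion, the access pattern a `request_id` skip index serves is an equality lookup correlating a verification with traces and logs. A sketch (the literal id is a placeholder, not from this PR):

```sql
-- Equality lookup by tracing id: this is the pattern where a bloom_filter
-- skip index can prune granules, while a minmax index over random-looking
-- string ids prunes almost nothing.
SELECT time, workspace_id, key_id, outcome, latency
FROM verifications.raw_key_verifications_v2
WHERE request_id = 'req_example'  -- placeholder value
ORDER BY time DESC;
```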
**New file** — per-minute aggregate table:

```sql
CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_minute_v2
(
    time DateTime,
    workspace_id String,
    key_space_id String,
    identity_id String,
    key_id String,
    outcome LowCardinality(String),
    tags Array(String),
    count Int64,
    spent_credits Int64,
    latency_avg Float64,
    latency_p75 Float64,
    latency_p99 Float64
)
ENGINE = SummingMergeTree()
ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
;
```

> 🛠️ **Refactor suggestion, lines 3–17:** Use AggregatingMergeTree with stateful metrics; avoid summing finalized averages/quantiles. Same issue as ratelimits: SummingMergeTree will corrupt the `latency_*` columns in merges. Also switch counters to unsigned and consider a normalized key for tags. Proposed DDL:
>
> ```diff
>  CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_minute_v2
>  (
>      time DateTime,
>      workspace_id String,
>      key_space_id String,
>      identity_id String,
>      key_id String,
>      outcome LowCardinality(String),
> -    tags Array(String),
> -    count Int64,
> -    spent_credits Int64,
> -    latency_avg Float64,
> -    latency_p75 Float64,
> -    latency_p99 Float64
> +    tags Array(String),
> +    tags_hash UInt64 MATERIALIZED cityHash64(arrayStringConcat(arraySort(tags))),
> +    count SimpleAggregateFunction(sum, UInt64),
> +    spent_credits SimpleAggregateFunction(sum, UInt64),
> +    latency_avg_state AggregateFunction(avg, Float64),
> +    latency_p75_state AggregateFunction(quantileTDigest(0.75), Float64),
> +    latency_p99_state AggregateFunction(quantileTDigest(0.99), Float64)
>  )
> -ENGINE = SummingMergeTree()
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
> +ENGINE = AggregatingMergeTree()
> +PARTITION BY toYYYYMM(time)
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags_hash, outcome)
>  ;
> ```
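If the table is switched to `AggregateFunction` state columns as proposed above, readers must finalize the states with `-Merge` combinators rather than selecting raw values. A sketch against the reviewer's proposed column names (which are suggestions from this review, not the merged schema):

```sql
-- Roll per-minute state rows up to hourly figures at read time.
-- latency_*_state columns exist only in the reviewer's proposed DDL.
SELECT
    workspace_id,
    toStartOfHour(time) AS hour,
    sum(count)                                    AS total,
    avgMerge(latency_avg_state)                   AS latency_avg,
    quantileTDigestMerge(0.75)(latency_p75_state) AS latency_p75,
    quantileTDigestMerge(0.99)(latency_p99_state) AS latency_p99
FROM verifications.key_verifications_per_minute_v2
GROUP BY workspace_id, hour;
```

`finalizeAggregation()` works too when no further re-aggregation is needed; `-Merge` is the general form because it stays correct across any grouping.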
**Modified** — per-minute materialized view (renamed `_v1` → `_v2`, goose markers dropped):

```diff
@@ -1,6 +1,5 @@
--- +goose up
-CREATE MATERIALIZED VIEW IF NOT EXISTS verifications.key_verifications_per_minute_mv_v1
-TO verifications.key_verifications_per_minute_v1
+CREATE MATERIALIZED VIEW IF NOT EXISTS verifications.key_verifications_per_minute_mv_v2
+TO verifications.key_verifications_per_minute_v2
 AS
 SELECT
     workspace_id,
@@ -9,9 +8,13 @@ SELECT
     key_id,
     outcome,
     count(*) as count,
+    sum(spent_credits) as spent_credits,
+    avg(latency) as latency_avg,
+    quantileTDigest(0.75)(latency) as latency_p75,
+    quantileTDigest(0.99)(latency) as latency_p99,
     toStartOfMinute(fromUnixTimestamp64Milli(time)) AS time,
     tags
-FROM verifications.raw_key_verifications_v1
+FROM verifications.raw_key_verifications_v2
 GROUP BY
     workspace_id,
     key_space_id,
@@ -20,7 +23,4 @@ GROUP BY
     outcome,
     time,
     tags
 ;
-
--- +goose down
-DROP VIEW verifications.key_verifications_per_minute_mv_v1;
```

> 🛠️ **Refactor suggestion (verified), lines 11–14 and 17:** Emit aggregate states instead of finalized metrics; avoid `count` as a column name. Apply:
>
> ```diff
> -    count(*) as count,
> -    sum(spent_credits) as spent_credits,
> -    avg(latency) as latency_avg,
> -    quantileTDigest(0.75)(latency) as latency_p75,
> -    quantileTDigest(0.99)(latency) as latency_p99,
> +    count(*) AS total,
> +    sum(spent_credits) AS spent_credits,
> +    avgState(latency) AS latency_avg_state,
> +    quantileTDigestState(latency) AS latency_tdigest_state,
> ```
>
> Ensure `verifications.key_verifications_per_minute_v2` uses AggregatingMergeTree with matching state columns. Verification against the repo confirmed the target table (`go/pkg/clickhouse/schema/databases/001_verifications/003_key_verifications_per_minute_v2.sql`) still declares finalized `Float64` columns on SummingMergeTree, so the materialized view and the table must be updated together.
**New file** — per-hour aggregate table:

```sql
CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_hour_v2
(
    time DateTime,
    workspace_id String,
    key_space_id String,
    identity_id String,
    key_id String,
    outcome LowCardinality(String),
    tags Array(String),
    count Int64,
    spent_credits Int64,
    latency_avg Float64,
    latency_p75 Float64,
    latency_p99 Float64
)
ENGINE = SummingMergeTree()
ORDER BY (workspace_id, time, key_space_id, identity_id, key_id)
;
```

> 🧹 **Nitpick (assertive):** Consider `DateTime('UTC')` for stored bucket timestamps. If all bucketing is meant to be UTC, declare `time` as `DateTime('UTC')` for clarity and to prevent accidental timezone conversion:
>
> ```diff
> -    time DateTime,
> +    time DateTime('UTC'),
> ```

> **Review comment, lines 12–14:** `latency_avg`/`p75`/`p99` are not additive — SummingMergeTree will corrupt these over time. With SummingMergeTree, background merges sum numeric columns for identical primary keys. That is correct for `count`/`spent_credits`, but wrong for averages and quantiles, producing inflated values as more parts merge. Preferred table DDL change:
>
> ```diff
>  CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_hour_v2
>  (
>      time DateTime,
>      workspace_id String,
>      key_space_id String,
>      identity_id String,
>      key_id String,
>      outcome LowCardinality(String),
>      tags Array(String),
> -    count Int64,
> -    spent_credits Int64,
> -    latency_avg Float64,
> -    latency_p75 Float64,
> -    latency_p99 Float64
> +    count SimpleAggregateFunction(sum, UInt64),
> +    spent_credits SimpleAggregateFunction(sum, Int64),
> +    latency_avg AggregateFunction(avg, Float64),
> +    latency_p75 AggregateFunction(quantileTDigest(0.75), Float64),
> +    latency_p99 AggregateFunction(quantileTDigest(0.99), Float64)
>  )
> -ENGINE = SummingMergeTree()
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id)
> +ENGINE = AggregatingMergeTree()
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, outcome)
>  ;
> ```
>
> Follow-up: update the hour/day/month MVs to emit `avgState()`/`quantileTDigestState()` as suggested in those review comments.

> **Review comment:** Missing `outcome` (and likely `tags`) in `ORDER BY` merges dissimilar rows. MergeTree-family merges collapse rows per primary key; omitting these dimensions will merge different outcomes (and different tag sets) into single rows, corrupting metrics. Minimum fix (if you keep SummingMergeTree and finalized values, not recommended):
>
> ```diff
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id)
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, outcome, tags)
> ```
>
> Note: including `Array(String)` in `ORDER BY` increases index cardinality and may be undesirable or unsupported in your target ClickHouse version. A safer alternative is a stable `tags_hash`:
>
> ```diff
> +    tags_hash UInt64 MATERIALIZED cityHash64(arrayStringConcat(arraySort(tags), ',')),
>  ...
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id)
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, outcome, tags_hash)
> ```
>
> Choose one approach and make sure all corresponding MVs group by the same key set.
**Modified** — per-hour materialized view (goose markers dropped, source switched to `_v2`):

```diff
@@ -1,5 +1,4 @@
--- +goose up
-CREATE MATERIALIZED VIEW verifications.key_verifications_per_hour_mv_v2
+CREATE MATERIALIZED VIEW IF NOT EXISTS verifications.key_verifications_per_hour_mv_v2
 TO verifications.key_verifications_per_hour_v2
 AS
 SELECT
@@ -9,9 +8,13 @@ SELECT
     key_id,
     outcome,
     count(*) as count,
+    sum(spent_credits) as spent_credits,
+    avg(latency) as latency_avg,
+    quantileTDigest(0.75)(latency) as latency_p75,
+    quantileTDigest(0.99)(latency) as latency_p99,
     toStartOfHour(fromUnixTimestamp64Milli(time)) AS time,
     tags
-FROM verifications.raw_key_verifications_v1
+FROM verifications.raw_key_verifications_v2
 GROUP BY
     workspace_id,
     key_space_id,
@@ -21,7 +24,3 @@ GROUP BY
     time,
     tags
 ;
-
-
--- +goose down
-DROP VIEW verifications.key_verifications_per_hour_mv_v2;
```

> 🧹 **Nitpick (assertive):** Timezone determinism for hourly buckets. Normalize to UTC to avoid environment-specific hour boundaries:
>
> ```diff
> -    toStartOfHour(fromUnixTimestamp64Milli(time)) AS time,
> +    toStartOfHour(fromUnixTimestamp64Milli(time, 'UTC')) AS time,
> ```

> 🛠️ **Refactor suggestion, lines 11–16:** Emit aggregate states to match an AggregatingMergeTree target. These finalized aggregates will not survive merges; emit states so the table can merge them correctly:
>
> ```diff
> -    count(*) as count,
> -    sum(spent_credits) as spent_credits,
> -    avg(latency) as latency_avg,
> -    quantileTDigest(0.75)(latency) as latency_p75,
> -    quantileTDigest(0.99)(latency) as latency_p99,
> +    countState() AS count,
> +    sumState(spent_credits) AS spent_credits,
> +    avgState(latency) AS latency_avg,
> +    quantileTDigestState(0.75)(latency) AS latency_p75,
> +    quantileTDigestState(0.99)(latency) AS latency_p99,
> ```
>
> Ensure the target table schema is updated accordingly (see comments on the table file).
**New file** — per-day aggregate table:

```sql
CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_day_v4
(
    time DateTime,
    workspace_id String,
    key_space_id String,
    identity_id String,
    key_id String,
    outcome LowCardinality(String),
    tags Array(String),
    count Int64,
    spent_credits Int64,
    latency_avg Float64,
    latency_p75 Float64,
    latency_p99 Float64
)
ENGINE = SummingMergeTree()
ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
;
```

> 🧹 **Nitpick (assertive), lines 3–9:** The primary key includes `Array(String)` `tags`; canonicalize and consider the footprint. Using `Array(String)` in `ORDER BY` inflates the primary key and splits rows when tag order differs. Canonicalize tags in the MV (`arraySort(tags)`) and consider hashing tags for the PK if cardinality is high. I can draft MV changes to apply `arraySort(tags)` at write time.

> 🧹 **Nitpick (assertive), lines 16–17:** Add partitioning for data pruning and retention. No `PARTITION BY` is set; add `PARTITION BY toYYYYMM(time)` for the day-level table to enable pruning and targeted retention policies:
>
> ```diff
> -ENGINE = SummingMergeTree()
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
> +ENGINE = AggregatingMergeTree()
> +PARTITION BY toYYYYMM(time)
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
> ```

> 🛠️ **Refactor suggestion, lines 1–18:** Averages and quantiles will be corrupted with SummingMergeTree, which will sum `latency_avg`/`p75`/`p99` across parts, yielding nonsensical values. Use AggregatingMergeTree with aggregate states, or store additive fields only (`latency_sum`, `count`) and compute on read; quantiles require state columns.
>
> Option A (preferred) — AggregatingMergeTree with states:
>
> ```diff
>  CREATE TABLE IF NOT EXISTS verifications.key_verifications_per_day_v4
>  (
>      time DateTime,
>      workspace_id String,
>      key_space_id String,
>      identity_id String,
>      key_id String,
>      outcome LowCardinality(String),
> -    tags Array(String),
> -    count Int64,
> -    spent_credits Int64,
> -    latency_avg Float64,
> -    latency_p75 Float64,
> -    latency_p99 Float64
> +    tags Array(String),
> +    count SimpleAggregateFunction(sum, UInt64),
> +    spent_credits SimpleAggregateFunction(sum, UInt64),
> +    latency_avg_state AggregateFunction(avg, Float64),
> +    latency_p75_state AggregateFunction(quantileTDigest(0.75), Float64),
> +    latency_p99_state AggregateFunction(quantileTDigest(0.99), Float64)
>  )
> -ENGINE = SummingMergeTree()
> -ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
> +ENGINE = AggregatingMergeTree()
> +ORDER BY (workspace_id, time, key_space_id, identity_id, key_id, tags, outcome)
> ```
>
> This requires the MV to emit `avgState`/`quantileTDigestState` and queries to use `finalizeAggregation()` or the `-Merge` combinators.
>
> Option B — keep SummingMergeTree but change the schema to additive columns only (`latency_sum` plus `count`) and compute averages on read; quantiles would still need state columns.
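To make Option B concrete: a sum of sums divided by a sum of counts is merge-safe, because both inputs stay additive under SummingMergeTree. A read-time sketch, assuming a hypothetical `latency_sum` column in place of the finalized `latency_avg` (neither column rename is part of this PR):

```sql
-- Sketch only: latency_sum is a hypothetical additive column.
-- Summing finalized averages would weight every part equally regardless of
-- row count; summing raw totals and dividing once avoids that.
SELECT
    workspace_id,
    time,
    sum(count)                    AS total,
    sum(latency_sum) / sum(count) AS latency_avg
FROM verifications.key_verifications_per_day_v4
GROUP BY workspace_id, time;
```

Percentiles cannot be recovered this way, which is why the reviewer's Option A keeps `quantileTDigest` state columns for `p75`/`p99`.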