
feat(metrics): create generic sets aggregate table + indices #2782

Merged: 15 commits into master on Jun 6, 2022

Conversation

@onewland (Contributor) commented Jun 3, 2022

create generic_metrics_sets in ClickHouse

generic_metrics_sets is the first aggregate table I'm creating for the metrics-enhanced performance project (internal doc: https://www.notion.so/sentry/Metrics-Enhanced-Performance-Q2-SNS-Plan-79247be0ab134aae90269de0523c35f5). The goal with this approach is to create end-to-end functionality for sets (starting from the indexer ingestion in sentry, to writing to ClickHouse, and returning results via the query API) before implementing the other metric types (counters, distributions) and their processors.

Changes in this PR:

  • new storage set key (for sets type only)
  • new storage set (for sets type only)
  • new migration group (will be shared with other "generic" metrics clusters)
  • a migration to create the table

Next steps:

  • create a raw table and materialized view for writing to generic_metrics_sets.
  • create a processor and consumer in snuba for writing to the raw input table for generic_metrics_sets

Differences from old sets aggregate table:

  • we are going to allow two modes of tag storage: (indexed key) -> (indexed value) and (indexed key) -> (raw value (string)). This means separate columns and indices.
  • granularity is no longer stored in seconds; it becomes an enum-like constant so we can use a smaller storage format (e.g. 0=10s, 1=60s, 2=3600s, 3=86400s, instead of having to store 86400 as the day value)
  • a new timeseries_id column that lets us control sharding from within the python codebase.
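The granularity encoding above can be illustrated with a small sketch (the enum values 0 through 3 are the ones given as examples in this description; the actual mapping would live in the snuba/sentry codebase):

```python
# Hypothetical mapping from the UInt8 granularity code stored in ClickHouse
# back to a window size in seconds. Values follow the example in the PR
# description: 0=10s, 1=60s, 2=3600s, 3=86400s.
GRANULARITY_SECONDS = {
    0: 10,
    1: 60,
    2: 3600,
    3: 86400,
}

def granularity_to_seconds(granularity: int) -> int:
    """Translate the stored UInt8 granularity code into seconds."""
    return GRANULARITY_SECONDS[granularity]
```

Storing a one-byte code instead of the raw seconds value keeps the sort key compact while leaving room for more granularities later.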

Testing:

  • ran local/dist migrations locally

@github-actions bot commented Jun 3, 2022

This PR has a migration; here is the generated SQL

-- start migrations

-- migration generic_metrics : 0001_sets_aggregate_table
Local operations:
CREATE TABLE IF NOT EXISTS generic_metric_sets_local (org_id UInt64, project_id UInt64, metric_id UInt64, granularity UInt8, timestamp DateTime CODEC (DoubleDelta), retention_days UInt16, tags Nested(key UInt64, indexed_value UInt64, raw_value String), value AggregateFunction(uniqCombined64, UInt64), use_case_id LowCardinality(String)) ENGINE ReplicatedAggregatingMergeTree('/clickhouse/tables/generic_metrics_sets/{shard}/default/generic_metric_sets_local', '{replica}') PRIMARY KEY (org_id, project_id, metric_id, granularity, timestamp) ORDER BY (org_id, project_id, metric_id, granularity, timestamp, tags.key, tags.indexed_value, tags.raw_value, retention_days, use_case_id) PARTITION BY (retention_days, toMonday(timestamp)) TTL timestamp + toIntervalDay(retention_days) SETTINGS index_granularity=2048;
ALTER TABLE generic_metric_sets_local ADD COLUMN IF NOT EXISTS _indexed_tags_hash Array(UInt64) MATERIALIZED arrayMap((k, v) -> cityHash64(concat(toString(k), '=', toString(v))), tags.key, tags.indexed_value);
ALTER TABLE generic_metric_sets_local ADD COLUMN IF NOT EXISTS _raw_tags_hash Array(UInt64) MATERIALIZED arrayMap((k, v) -> cityHash64(concat(toString(k), '=', v)), tags.key, tags.raw_value);
ALTER TABLE generic_metric_sets_local ADD INDEX IF NOT EXISTS bf_indexed_tags_hash _indexed_tags_hash TYPE bloom_filter() GRANULARITY 1;
ALTER TABLE generic_metric_sets_local ADD INDEX IF NOT EXISTS bf_raw_tags_hash _raw_tags_hash TYPE bloom_filter() GRANULARITY 1;
ALTER TABLE generic_metric_sets_local ADD INDEX IF NOT EXISTS bf_tags_key_hash tags.key TYPE bloom_filter() GRANULARITY 1;


Dist operations:
CREATE TABLE IF NOT EXISTS generic_metric_sets_aggregated_dist (org_id UInt64, project_id UInt64, metric_id UInt64, granularity UInt8, timestamp DateTime CODEC (DoubleDelta), retention_days UInt16, tags Nested(key UInt64, indexed_value UInt64, raw_value String), value AggregateFunction(uniqCombined64, UInt64), use_case_id LowCardinality(String)) ENGINE Distributed(cluster_one_sh, default, generic_metric_sets_local);
-- end migration generic_metrics : 0001_sets_aggregate_table
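The `_indexed_tags_hash` and `_raw_tags_hash` materialized columns above reduce each tag pair to a single UInt64, so queries can filter through the bloom filter indices instead of array-joining the Nested tag columns. A minimal Python sketch of the same `key=value` hashing scheme (ClickHouse's cityHash64 is not in the Python stdlib, so a truncated SHA-1 stands in for it here):

```python
import hashlib

def tag_pair_hash(key: int, raw_value: str) -> int:
    # ClickHouse materializes cityHash64(concat(toString(k), '=', v)) for
    # each (key, value) pair. A truncated SHA-1 digest is used below purely
    # as a stand-in 64-bit hash; the exact hash function is not the point.
    digest = hashlib.sha1(f"{key}={raw_value}".encode()).digest()
    return int.from_bytes(digest[:8], "little")
```

A filter such as `has(_raw_tags_hash, <hash>)` can then be served by the `bf_raw_tags_hash` bloom filter index, skipping granules that cannot contain the tag pair.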

Column("indexed_tags", Nested([("key", UInt(64)), ("value", UInt(64))])),
Column("raw_tags", Nested([("key", UInt(64)), ("value", String())])),
Column("value", AggregateFunction("uniqCombined64", [UInt(64)])),
Column("timeseries_id", UInt(64)),
@onewland (Contributor, Author) commented:
timeseries_id will be a hash generated by the consumer for sharding so we can isolate that logic from ClickHouse itself
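A sketch of how a consumer might derive such a sharding hash (hypothetical; the real function would live in the consumer codebase, and crc32 merely stands in for whatever hash is chosen):

```python
import zlib

def timeseries_id(org_id: int, project_id: int, metric_id: int, tags: dict) -> int:
    # Hash the fields that identify one timeseries so that all rows for the
    # same series route to the same shard, keeping sharding logic in Python
    # rather than in ClickHouse. crc32 yields a 32-bit value, which lines up
    # with the review suggestion that 32 bits may suffice for short-retention
    # data.
    parts = [str(org_id), str(project_id), str(metric_id)]
    parts += [f"{k}={v}" for k, v in sorted(tags.items())]
    return zlib.crc32("|".join(parts).encode())
```

Sorting the tags before hashing makes the id stable regardless of the order in which tags arrive.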

@fpacifici (Contributor) replied:
You should not need this in the aggregated merge tree. Only in the raw table.
Sharding happens at the raw table level.

  • you write on the raw table
  • the distributed raw table partitions data across shards according to the sharding key
  • each storage node writes data on the raw table locally
  • the materialized view generates the aggregated data and stores them in the aggregatingMergeTree locally (from the local raw table). There is no sharding happening at this step.

(also I think we will be able to get away with 32 bit for the time being, since it is data retained for a short period of time it will not be a problem.)
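The write path described above can be sketched with a toy in-memory model (purely illustrative, not snuba code; `NUM_SHARDS` and the modulo routing are assumptions):

```python
from collections import defaultdict

NUM_SHARDS = 3  # illustrative shard count

def route_to_shards(rows):
    # Steps 1-2: writes go to the distributed raw table, which partitions
    # rows across shards according to the sharding key (timeseries_id).
    shards = defaultdict(list)
    for row in rows:
        shards[row["timeseries_id"] % NUM_SHARDS].append(row)
    return shards

def aggregate_locally(shard_rows):
    # Steps 3-4: each storage node writes its raw rows locally, then the
    # materialized view folds them into the local AggregatingMergeTree.
    # A plain set-cardinality stands in for the uniqCombined64 state here;
    # no cross-shard sharding happens at this step.
    sets = defaultdict(set)
    for row in shard_rows:
        sets[row["metric_id"]].add(row["value"])
    return {metric: len(values) for metric, values in sets.items()}
```

Because aggregation is purely local, the aggregate table itself never needs a sharding column, which is the point being made about `timeseries_id`.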

@onewland onewland marked this pull request as ready for review June 3, 2022 23:06
@onewland onewland requested a review from a team as a code owner June 3, 2022 23:06
@fpacifici (Contributor) left a comment:

Please see comments inline

snuba/datasets/storages/generic_metrics.py (outdated, resolved)

@onewland (author) commented Jun 6, 2022

I think I've addressed all the issues you've brought up.

@onewland onewland requested a review from fpacifici June 6, 2022 15:20
@codecov-commenter commented Jun 6, 2022

Codecov Report

Merging #2782 (7087c38) into master (cba5b76) will increase coverage by 0.01%.
The diff coverage is 78.04%.

@@            Coverage Diff             @@
##           master    #2782      +/-   ##
==========================================
+ Coverage   92.83%   92.84%   +0.01%     
==========================================
  Files         610      612       +2     
  Lines       28680    28723      +43     
==========================================
+ Hits        26626    26669      +43     
  Misses       2054     2054              
Impacted Files Coverage Δ
snuba/datasets/storages/generic_metrics.py 0.00% <0.00%> (ø)
snuba/datasets/storages/metrics.py 96.29% <ø> (ø)
snuba/settings/__init__.py 94.59% <ø> (+0.04%) ⬆️
snuba/settings/settings_distributed.py 100.00% <ø> (ø)
snuba/clusters/storage_sets.py 100.00% <100.00%> (ø)
snuba/datasets/storages/__init__.py 100.00% <100.00%> (ø)
snuba/datasets/storages/tags_hash_map.py 100.00% <100.00%> (ø)
snuba/migrations/groups.py 94.44% <100.00%> (+0.39%) ⬆️
...tions/generic_metrics/0001_sets_aggregate_table.py 100.00% <100.00%> (ø)
snuba/cli/devserver.py 14.63% <0.00%> (-2.04%) ⬇️
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cba5b76...7087c38.

@onewland onewland requested a review from nikhars June 6, 2022 19:50
@nikhars (Member) left a comment:
LGTM

Comment on lines +33 to +40
"tags",
Nested(
[
("key", UInt(64)),
("indexed_value", UInt(64)),
("raw_value", String()),
]
),
A reviewer (Contributor) commented:
You do not need to do anything right now, but watch out when you introduce the storage. I think some of the storage query processors that optimize tags may rely on the value column being called value, and they will likely have to be updated.

@onewland (author) replied:
Yeah, I noticed that might be an issue with ArrayJoinKeyValueOptimizer.

@onewland onewland merged commit 06a8ed4 into master Jun 6, 2022
@onewland onewland deleted the new-generic-metrics-sets-table branch June 6, 2022 22:19
onewland added a commit that referenced this pull request Jun 6, 2022