
feat(issue-platform): process extra fields from event #3571

Merged

Conversation

barkbarkimashark (Contributor)

This PR extracts additional field values from the event payload so they can be used for querying.

@barkbarkimashark barkbarkimashark requested a review from a team January 5, 2023 17:46
@barkbarkimashark barkbarkimashark requested a review from a team as a code owner January 5, 2023 17:46

codecov-commenter commented Jan 5, 2023

Codecov Report

Base: 92.23% // Head: 92.44% // Increases project coverage by +0.20% 🎉

Coverage data is based on head (85d4e64) compared to base (66197e5).
Patch coverage: 99.32% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3571      +/-   ##
==========================================
+ Coverage   92.23%   92.44%   +0.20%     
==========================================
  Files         724      725       +1     
  Lines       33772    33779       +7     
==========================================
+ Hits        31151    31228      +77     
+ Misses       2621     2551      -70     
Impacted Files Coverage Δ
tests/test_search_issues_api.py 100.00% <ø> (ø)
tests/datasets/test_search_issues_processor.py 99.34% <99.11%> (+1.06%) ⬆️
...uba/datasets/processors/search_issues_processor.py 93.38% <100.00%> (+3.03%) ⬆️
snuba/consumers/strategy_factory.py 79.59% <0.00%> (-11.15%) ⬇️
snuba/web/views.py 69.64% <0.00%> (-0.25%) ⬇️
snuba/clickhouse/formatter/expression.py 94.57% <0.00%> (-0.25%) ⬇️
snuba/cli/consumer.py 94.44% <0.00%> (-0.11%) ⬇️
snuba/web/query.py 98.00% <0.00%> (-0.06%) ⬇️
snuba/clickhouse/formatter/query.py 98.76% <0.00%> (-0.05%) ⬇️
snuba/datasets/table_storage.py 94.65% <0.00%> (-0.05%) ⬇️
... and 14 more


@@ -61,10 +82,10 @@ class SearchIssueEvent(TypedDict, total=False):
organization_id: int
project_id: int
event_id: str
group_id: int # backwards compatibility
group_ids: Sequence[int]
group_id: int
Member

Why is this changing to singular group_id now? I thought we'd always have multiple ids for non-error issue stuff.

Contributor Author

Yeah, I had the same discussion with Dan. An occurrence event will be sent for each group produced when processing that event. So if multiple issues are detected for a single event, then multiple occurrence events will be sent to the eventstream.
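
To illustrate (the values below are made up; field names follow the SearchIssueEvent shape in the diff above):

    # One ingested event matched by two detected issues (groups 3 and 4) yields two
    # occurrence events on the eventstream, each carrying a single group_id.
    event_id = "9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4c"  # made-up example id, shared by both occurrences
    occurrence_events = [
        {"organization_id": 2, "project_id": 1, "group_id": 3, "event_id": event_id},
        {"organization_id": 2, "project_id": 1, "group_id": 4, "event_id": event_id},
    ]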

@@ -77,6 +98,13 @@ def ensure_uuid(value: str) -> str:
class SearchIssuesMessageProcessor(DatasetMessageProcessor):
FINGERPRINTS_HARD_LIMIT_SIZE = 100

PROMOTED_TAGS = {
Member

What does promoted mean in this context?

@barkbarkimashark (Contributor Author) Jan 6, 2023

These are 'special' tag values we lift directly out of tags and place in top-level columns. I basically took this pattern from how transactions was written. I presume this was done for performance: instead of indexing into the tags array at query time, we do some special processing when ingesting events and normalize each promoted tag into its own dedicated column.
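
As a rough sketch of that pattern (a simplification, not the exact code in this PR; the contents of PROMOTED_TAGS and the column names here are assumptions):

    from typing import Any, Mapping, MutableMapping

    # Assumed set of promoted tags; the real constant lives on SearchIssuesMessageProcessor.
    PROMOTED_TAGS = {"environment", "sentry:release", "sentry:user", "transaction"}

    def promote_tags(processed: MutableMapping[str, Any], tags: Mapping[str, str]) -> None:
        # Copy known tags into dedicated top-level columns at ingest time, so queries
        # can filter on e.g. `environment` directly instead of scanning tags.key/tags.value.
        if "environment" in tags:
            processed["environment"] = tags["environment"]
        if "sentry:release" in tags:
            processed["release"] = tags["sentry:release"]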

) -> None:
tags_maybe = event_data.get("tags", None)
if not tags_maybe:
processed["tags.key"], processed["tags.value"] = [], []
Member

I think you should return here.

Contributor Author

I actually had it return early before, but then the extra params won't get processed:

processed["release"] = promoted_tags.get(
            "sentry:release",
            event_data.get("release"),
        )

So events like:

{
    "project_id": 1,
    "organization_id": 2,
    "group_id": 3,
    "event_id": str(uuid.uuid4()),
    "retention_days": 90,
    "primary_hash": str(uuid.uuid4()),
    "datetime": datetime.utcnow().isoformat() + "Z",
    "platform": "other",
    "data": {
        "tags": {},
        "release": "[email protected]",
    },
    ...
}

would get skipped entirely.

Member

OK but you're processing the tags twice then. Why call extract_extra_tags if there are no tags?

Contributor Author

Ahh, right. I rearranged process_tags to be a bit more readable and to avoid doing dupe work.
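
Roughly the shape that rearrangement takes (a sketch, not the merged code; extract_extra_tags is named in the discussion above, everything else here is an assumption):

    from typing import Any, Mapping, MutableMapping, Sequence, Tuple

    PROMOTED_TAGS = {"environment", "sentry:release", "sentry:user", "transaction"}  # assumed contents

    def extract_extra_tags(tags: Mapping[str, Any]) -> Tuple[Sequence[str], Sequence[str]]:
        # Stand-in for the real helper: flatten tags into parallel key/value arrays.
        keys = sorted(str(k) for k in tags)
        return keys, [str(tags[k]) for k in keys]

    def process_tags(processed: MutableMapping[str, Any], event_data: Mapping[str, Any]) -> None:
        tags = event_data.get("tags") or {}
        promoted_tags = {k: v for k, v in tags.items() if k in PROMOTED_TAGS}

        # Walk the tags map at most once, and only when it is non-empty.
        if tags:
            processed["tags.key"], processed["tags.value"] = extract_extra_tags(tags)
        else:
            processed["tags.key"], processed["tags.value"] = [], []

        # Promoted fields are still filled in when tags are empty, so events that carry
        # e.g. a top-level release are no longer skipped by an early return.
        processed["release"] = promoted_tags.get("sentry:release", event_data.get("release"))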

processed["http_method"] = http_data["http_method"]
processed["http_referer"] = http_data["http_referer"]

def _process_sdk_data(
Member

Do you need this data?

Contributor Author

Not totally sure, but I think they're needed to support searching like this?
[Screenshot: example search query using sdk fields]

Member

Let me rephrase that question: do you have an sdk_name and sdk_version in your table? If not, then this function is adding columns that are just being ignored when writing to ClickHouse.

Contributor Author

Yeah we do:

{ name: sdk_name, type: String, args: { schema_modifiers: [ low_cardinality, nullable ] } },
{ name: sdk_version, type: String, args: { schema_modifiers: [ low_cardinality, nullable ] } },
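
For context, a minimal sketch of what a hook like _process_sdk_data might do to fill those columns (the shape of the sdk block in the event payload is an assumption):

    from typing import Any, Mapping, MutableMapping

    def process_sdk_data(processed: MutableMapping[str, Any], event_data: Mapping[str, Any]) -> None:
        # Assumes the event carries an "sdk" block like {"name": "sentry.python", "version": "1.12.0"}.
        sdk = event_data.get("sdk") or {}
        processed["sdk_name"] = sdk.get("name")
        processed["sdk_version"] = sdk.get("version")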

Member

OK cool, just making sure.

@@ -65,6 +65,14 @@ schema:
readable_storage: search_issues
writable_storage: search_issues
query_processors:
- processor: TimeSeriesProcessor
Member

I am trying to deprecate this processor, so I would challenge whether it is necessary on new datasets. It provides two functions:

  1. Syntactic sugar for grouping by a particular time bucket (toStartOfDay(...)). This is easily doable with the Snuba SDK (see the sketch after this list), so it isn't necessary for Snuba to do it. This functionality also tends to cause confusion because users don't understand why one date-based column works differently from another; this has come up many times with the time vs. timestamp columns in discover.
  2. Converting conditions on the time_parse_columns to use datetimes instead of strings. This is a legacy problem: before SnQL, everything was strings, and there was no way to tell Snuba that a certain string was supposed to be a datetime (timestamp >= '2011...'). SnQL allows conditions on datetimes, so you should never be sending conditions as strings anyway.
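
For illustration, both of those things can be done client-side with the Snuba SDK rather than relying on TimeSeriesProcessor. A minimal sketch (exact Query construction varies by snuba-sdk version, and the entity and column names here are assumptions):

    from datetime import datetime, timedelta
    from snuba_sdk import Column, Condition, Entity, Function, Op, Query

    # Group into day buckets client-side instead of relying on server-side sugar.
    bucketed_time = Function("toStartOfDay", [Column("timestamp")], "time")

    query = Query(
        match=Entity("search_issues"),
        select=[bucketed_time, Function("count", [], "count")],
        groupby=[bucketed_time],
        where=[
            # Conditions use real datetimes, not strings, so no server-side coercion is needed.
            Condition(Column("timestamp"), Op.GTE, datetime.utcnow() - timedelta(days=1)),
            Condition(Column("project_id"), Op.EQ, 1),
        ],
    )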

Contributor Author

Fair enough. I recall you mentioning TimeSeriesProcessor before. One of the legacy queries I was trying to get working appears to group on time, so let me take a crack at porting that bit over. Appreciate the clarification.

@evanh evanh (Member) left a comment

This looks fine to me.

@barkbarkimashark barkbarkimashark merged commit 6bc04b9 into master Jan 6, 2023
@barkbarkimashark barkbarkimashark deleted the sharky/search-issues-processing-extra-fields branch January 6, 2023 20:03
barkbarkimashark added a commit to getsentry/sentry that referenced this pull request Jan 10, 2023
…or Issue stats (#42865)

Tests will fail until the following PRs are merged in:
* getsentry/snuba#3571
* getsentry/snuba#3575

Resolves #42038