-
Notifications
You must be signed in to change notification settings - Fork 84
fix: remove topic from semantic_hash and handle migration #801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } else { | ||
| false | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: moved to SemanticHashUtils.scala
63348b8 to
ad90dae
Compare
|
Hi @hzding621 my understanding on this is: we want to provide a way for users to update topic without triggering the archive. Could you clarify what is the new workflow if users want to update the topic without triggering an archive? Do they need to set the exclude_topic flag in hive table and update the topic at the same time? Also when we load the old semantic hash how do we avoid the diff with new hash when the topic is updated and exclude_topic flag is set? I am confused of this line. Should we perform an archive or not here? |
| val TopicInvalidSuffix = "_invalid" | ||
| val lineTab = "\n " | ||
| val SemanticHashKey = "semantic_hash" | ||
| val SemanticHashExcludeTopicKey = "semantic_hash_exclude_topic" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the boolean flag a map so that we can reuse this for future semantic hash change, like exclude the backfill start date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion. Will do
ad90dae to
783b032
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| confTableProps ++ Map(Constants.SemanticHashKey -> gson.toJson(joinConf.semanticHash.asJava)) | ||
| confTableProps ++ Map( | ||
| Constants.SemanticHashKey -> gson.toJson(joinConf.semanticHash(excludeTopic = true).asJava), | ||
| Constants.SemanticHashOptionsKey -> gson.toJson( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: with SemanticHashOptionsKey, we might not need to pass something like excludeTopic to .semanticHash(). All arguments can be packaged into SemanticHashOptionsKey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donghanz i think we will still keep SemanticHashKey and SemanticHashOptionsKey as two separate properties in hive tables.
SemanticHashKey: will store the semantic hashes (a map from string to hash string)SemanticHashOptionsKey: will store a few flags (a map from string to boolean)
I pass excludeTopic to .semanticHash(), only because excludeTopic will determine the logic I use to compute the hash, the flag itself is not stored in the hash.
Lmk if that makes sense
Summary
This PR removes topic (including both EventSource.topic, and EntitySource.mutationTopic) from the calculation logic of semantic_hash during offline join backfill logic.
detailed logic
migration plan
for production join confs without active airflow runs, manually synchronize the semantic_hash.
Why / Goal
Test Plan
Integration Test Plan
✅ test case 1 (also in UT):
✅ test case 2:
✅ test case 3 (also in UT):
✅ test case 4:
Reviewers
@donghanz @yuli-han @pengyu-hou