-
Notifications
You must be signed in to change notification settings - Fork 135
Implements DataChain.hash()
#1327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
24a82a5
eedd2b8
1c8dcce
b051b9a
72dc833
8cc6582
172e677
781e935
dfcf0a9
7862515
f17ea92
46051be
7032af4
01cbd9c
26e64a3
579ab94
119316c
e60ff36
590acb1
214eb6f
f8e2f40
c4a493d
caba039
5c26a8c
1d6d64c
94d7258
279dc06
93b4f7c
bf03730
46abe90
3e260b3
d75b483
2d83c18
7cbef9c
0d53b4a
125cbd1
b36fd66
b5cfaea
bae876f
041e8e8
99ca59a
3def3bd
1cb1216
6c631e5
4d7fbf0
d188adf
fd1099d
64a9988
32c987c
17e5e19
718c515
731636b
6b77576
43670d1
9af27b4
29ba477
e6576d2
d9458e1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| import hashlib | ||
| import inspect | ||
| import json | ||
| import textwrap | ||
| from collections.abc import Sequence | ||
| from typing import TypeVar, Union | ||
|
|
||
| from sqlalchemy.sql.elements import ( | ||
| BinaryExpression, | ||
| BindParameter, | ||
| ColumnElement, | ||
| Label, | ||
| Over, | ||
| UnaryExpression, | ||
| ) | ||
| from sqlalchemy.sql.functions import Function | ||
|
|
||
| T = TypeVar("T", bound=ColumnElement) | ||
| ColumnLike = Union[str, T] | ||
|
|
||
|
|
||
| def serialize_column_element(expr: Union[str, ColumnElement]) -> dict: # noqa: PLR0911 | ||
| """ | ||
| Recursively serialize a SQLAlchemy ColumnElement into a deterministic structure. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we use str(query) or something like this? seems too complicated atm
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not possible because:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
do we care about this for hash?
same - is it important? it seems not tbh ...
what are the cases?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should care about all 3 of those items. About the unstable param names - I think that param names depend on ordering which should not affect the hash (e,g orders of elements in >>> from datachain import C
>>> expr = C("age") == 20
>>> str(expr)
'age = :age_1'
>>> expr = C("age") == 30
>>> str(expr)
'age = :age_1'
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
can we / don't we take all values separately in the Step? it should be simple I guess? I don't think we should care about order - again, we are complicating this ahead of time I feel
could you elaborate please?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Inputs to the BTW
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I would say - Let's simplify unless we really have a blocker for a basic requirement.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the detailed response!
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My very first thought was to use compile with
|
||
| """ | ||
|
|
||
| # Binary operations: col > 5, col1 + col2, etc. | ||
| if isinstance(expr, BinaryExpression): | ||
| op = ( | ||
| expr.operator.__name__ | ||
| if hasattr(expr.operator, "__name__") | ||
| else str(expr.operator) | ||
| ) | ||
| return { | ||
| "type": "binary", | ||
| "op": op, | ||
| "left": serialize_column_element(expr.left), | ||
| "right": serialize_column_element(expr.right), | ||
| } | ||
|
|
||
| # Unary operations: -col, NOT col, etc. | ||
| if isinstance(expr, UnaryExpression): | ||
| op = ( | ||
| expr.operator.__name__ | ||
| if expr.operator is not None and hasattr(expr.operator, "__name__") | ||
| else str(expr.operator) | ||
| ) | ||
|
|
||
| return { | ||
| "type": "unary", | ||
| "op": op, | ||
| "element": serialize_column_element(expr.element), # type: ignore[arg-type] | ||
| } | ||
|
|
||
| # Function calls: func.lower(col), func.count(col), etc. | ||
| if isinstance(expr, Function): | ||
| return { | ||
| "type": "function", | ||
| "name": expr.name, | ||
| "clauses": [serialize_column_element(c) for c in expr.clauses], | ||
| } | ||
|
|
||
| # Window functions: func.row_number().over(partition_by=..., order_by=...) | ||
| if isinstance(expr, Over): | ||
| return { | ||
| "type": "window", | ||
| "function": serialize_column_element(expr.element), | ||
| "partition_by": [ | ||
| serialize_column_element(p) for p in getattr(expr, "partition_by", []) | ||
| ], | ||
| "order_by": [ | ||
| serialize_column_element(o) for o in getattr(expr, "order_by", []) | ||
| ], | ||
| } | ||
|
|
||
| # Labeled expressions: col.label("alias") | ||
| if isinstance(expr, Label): | ||
| return { | ||
| "type": "label", | ||
| "name": expr.name, | ||
| "element": serialize_column_element(expr.element), | ||
| } | ||
|
|
||
| # Bound values (constants) | ||
| if isinstance(expr, BindParameter): | ||
| return {"type": "bind", "value": expr.value} | ||
|
|
||
| # Plain columns | ||
| if hasattr(expr, "name"): | ||
| return {"type": "column", "name": expr.name} | ||
|
|
||
| # Fallback: stringify unknown nodes | ||
| return {"type": "other", "repr": str(expr)} | ||
|
|
||
|
|
||
| def hash_column_elements(columns: Sequence[ColumnLike]) -> str: | ||
| """ | ||
| Hash a list of ColumnElements deterministically, dialect agnostic. | ||
| Only accepts ordered iterables (like list or tuple). | ||
| """ | ||
| serialized = [serialize_column_element(c) for c in columns] | ||
| json_str = json.dumps(serialized, sort_keys=True) # stable JSON | ||
| return hashlib.sha256(json_str.encode("utf-8")).hexdigest() | ||
|
|
||
|
|
||
| def hash_callable(func): | ||
| """ | ||
| Calculate a hash from a callable. | ||
| Rules: | ||
| - Named functions (def) → use source code for stable, cross-version hashing | ||
| - Lambdas → use bytecode (deterministic in same Python runtime) | ||
| """ | ||
| if not callable(func): | ||
| raise TypeError("Expected a callable") | ||
|
|
||
| # Determine if it is a lambda | ||
| is_lambda = func.__name__ == "<lambda>" | ||
|
|
||
| if not is_lambda: | ||
| # Try to get exact source of named function | ||
| try: | ||
| lines, _ = inspect.getsourcelines(func) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we extract source code here? do we include it into hash? (this is not what is expected)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, we use UDF function source to calculate hash. I know we spoke how user should be able to change UDF function code and continue on that UDF and this doesn't block that. This are implementation details / my plans about UDF checkpoints:
Let me know if you see some issues in the approach.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yep, my main concern is complexity while it doesn't fully solve the issue - anything outside in the code can change that leads to change in the UDF - e.g. it is using some helper that changed? or some package? We won't be able to detect this - so unless I'm missing something I would keep it simple, skip recalculation no matter what and have some mechanism to trigger full recall (or partial on the UDF side - e.g. by introducing some version of the function as a param).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True, this hash will only catch function code changes itself, not the helper method code changes or lib changes. But my feeling was that it's good enough for the users (we can clearly communicate this to them) and it's def better than to only give option to recalculate the whole job. ... idk BTW it's not that hard to add those helper methods changes affect the hash as well, the only problematic part are the external libs (for them we can only include name + type in the hash so hash will change if user stops using one or adds another one etc.)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
100%
the worst thing here - is that it will be unpredictable - sometimes it works, sometimes doesn't. I don't think we want that level of magic let's please simplify this, at least for now |
||
| payload = textwrap.dedent("".join(lines)).strip() | ||
| except (OSError, TypeError): | ||
| # Fallback: bytecode if source not available | ||
| payload = func.__code__.co_code | ||
| else: | ||
| # For lambdas, fall back directly to bytecode | ||
| payload = func.__code__.co_code | ||
|
|
||
| # Normalize annotations | ||
| annotations = { | ||
| k: getattr(v, "__name__", str(v)) for k, v in func.__annotations__.items() | ||
| } | ||
|
|
||
| # Extras to distinguish functions with same code but different metadata | ||
| extras = { | ||
| "name": func.__name__, | ||
| "defaults": func.__defaults__, | ||
| "annotations": annotations, | ||
| } | ||
|
|
||
| # Compute SHA256 | ||
| h = hashlib.sha256() | ||
| h.update(str(payload).encode() if isinstance(payload, str) else payload) | ||
| h.update(str(extras).encode()) | ||
| return h.hexdigest() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,6 @@ | ||
| import copy | ||
| import hashlib | ||
| import json | ||
| import warnings | ||
| from collections.abc import Iterator, Sequence | ||
| from dataclasses import dataclass | ||
|
|
@@ -257,6 +259,11 @@ def serialize(self) -> dict[str, Any]: | |
| signals["_custom_types"] = custom_types | ||
| return signals | ||
|
|
||
| def hash(self) -> str: | ||
| """Create SHA hash of this schema""" | ||
| json_str = json.dumps(self.serialize(), sort_keys=True, separators=(",", ":")) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how often / when do we call it? it is a very heavy operation AFAIU ... what if implementation becomes unstable (e.g. some model names are different each time, etc) -will we catch it with tests? how can we guarantee this?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We call it for each UDF in the chain when we calculate the whole chain hash. Chain hash will be calculated:
IMO we should not worry about this being bottleneck. I will test with some big schema to be 100% sure, but since this is only converting column and their types to string implementation (doesn't do any IO or similar) I would guess this is not even on the same page with UDF calculation, or applying all the steps in chain. Regarding instability, can you give example of something that could happen? I'm trying to come up with scenario but don't have much ideas ..
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't tbh ... I remember we have things like random names for generated DataModels in some places ... SignalSchema overall and serialization is quite complex - I think it is worth reviewing it high level
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Once? or per each input?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For each UDF we calculate it twice, for |
||
| return hashlib.sha256(json_str.encode("utf-8")).hexdigest() | ||
|
|
||
| @staticmethod | ||
| def _split_subtypes(type_name: str) -> list[str]: | ||
| """This splits a list of subtypes, including proper square bracket handling.""" | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed to make hash function consistent. Before we were generating random names for these temporary columns that are added during diff process and removed before the end but that was causing
.hash()function of chain having.diff()to be inconsistent since column names added with mutate are part of hash calculation.Workaround is not to always generate full random column name, but just set some constant that is already random enough not to cause collision with user's columns and since it's constant, hash will be consistent as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, but I can't resist to put it here 😅

Although the question is: how will it works with multiple diff operations, applied to the same chain? 🤔 Are there any cases when it might overlaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just didn't have any other ideas how to fix this but it shouldn't affect our diff logic. Good point about multiple diffs in the single chain - I've added the test for that and it works fine.
Regarding randomization, those randoms were used for column names to not collide with user columns and when you think about it, it's not very different than just setting large "random" uuid. It's the same chance to collision which is ~ 0 . ... but it does look strange in code, I do agree with that :D
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know 😅 This image is just a perfect fit for this case, I could not resist 🤣
Awesome! Thank you! 🙏