-
Notifications
You must be signed in to change notification settings - Fork 134
Fix UDF hash calculation for class-based UDFs #1378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Modified UDFBase.hash() to hash self.process when self._func is None - This ensures class-based UDFs (Mapper/Generator with overridden process()) are properly hashed based on their implementation - Added test cases for class-based DoubleMapper and TripleGenerator - Verified hash changes when class-based UDF code is modified
Reviewer's GuideThis PR updates the UDF hash calculation to correctly include class-based UDFs by hashing their process method when no function is provided, and adds corresponding test cases to validate the new behavior. Class diagram for updated UDFBase hash calculationclassDiagram
class UDFBase {
+_func
+process()
+params
+output
+hash()
}
UDFBase : hash() now hashes process() if _func is None
Flow diagram for UDFBase.hash() decision logicflowchart TD
A["UDFBase.hash() called"] --> B{_func is None?}
B -- Yes --> C["Hash process method"]
B -- No --> D["Hash _func"]
C --> E["Combine with params and output hashes"]
D --> E["Combine with params and output hashes"]
E --> F["Return SHA hash"]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Consider including the UDF class’s name and module path in the hash to avoid collisions when multiple classes share identical process implementations.
- For stateful class-based UDFs, you may want to incorporate instance attributes (e.g. dict) into the hash so different constructor parameters yield distinct hashes.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider including the UDF class’s name and module path in the hash to avoid collisions when multiple classes share identical process implementations.
- For stateful class-based UDFs, you may want to incorporate instance attributes (e.g. __dict__) into the hash so different constructor parameters yield distinct hashes.
## Individual Comments
### Comment 1
<location> `src/datachain/lib/udf.py:168` </location>
<code_context>
func_to_hash = self._func if self._func else self.process
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace if-expression with `or` ([`or-if-exp-identity`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/or-if-exp-identity))
```suggestion
func_to_hash = self._func or self.process
```
<br/><details><summary>Explanation</summary>Here we find ourselves setting a value if it evaluates to `True`, and otherwise
using a default.
The 'After' case is a bit easier to read and avoids the duplication of
`input_currency`.
It works because the left-hand side is evaluated first. If it evaluates to
true then `currency` will be set to this and the right-hand side will not be
evaluated. If it evaluates to false the right-hand side will be evaluated and
`currency` will be set to `DEFAULT_CURRENCY`.
</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| For class-based UDFs, hashes the process method. | ||
| """ | ||
| # Hash user code: either _func (function-based) or process method (class-based) | ||
| func_to_hash = self._func if self._func else self.process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Replace if-expression with or (or-if-exp-identity)
| func_to_hash = self._func if self._func else self.process | |
| func_to_hash = self._func or self.process |
Explanation
Here we find ourselves setting a value if it evaluates toTrue, and otherwiseusing a default.
The 'After' case is a bit easier to read and avoids the duplication of
input_currency.
It works because the left-hand side is evaluated first. If it evaluates to
true then currency will be set to this and the right-hand side will not be
evaluated. If it evaluates to false the right-hand side will be evaluated and
currency will be set to DEFAULT_CURRENCY.
Deploying datachain-documentation with
|
| Latest commit: |
1d37165
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e3d960f3.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1377-fix-udf-functio.datachain-documentation.pages.dev |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1378 +/- ##
=======================================
Coverage 87.79% 87.79%
=======================================
Files 160 160
Lines 14990 14991 +1
Branches 2148 2148
=======================================
+ Hits 13160 13161 +1
Misses 1344 1344
Partials 486 486
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Summary by Sourcery
Fix UDF hash calculation for class-based UDFs by hashing the process method when no function is provided
Bug Fixes:
Documentation:
Tests: