@ilongin (Contributor) commented Oct 3, 2025

  • Modified UDFBase.hash() to hash self.process when self._func is None
  • This ensures class-based UDFs (Mapper/Generator with overridden process()) are properly hashed based on their implementation
  • Added test cases for class-based UDF

Summary by Sourcery

Fix UDF hash calculation for class-based UDFs by hashing the process method when no function is provided

Bug Fixes:

  • Correct UDFBase.hash to use the process method for class-based UDFs instead of always hashing _func

Documentation:

  • Update UDFBase.hash docstring to clarify hashing logic for function-based and class-based UDFs

Tests:

  • Add unit tests with expected hash values for DoubleMapper and TripleGenerator class-based UDFs

- Modified UDFBase.hash() to hash self.process when self._func is None
- This ensures class-based UDFs (Mapper/Generator with overridden process())
  are properly hashed based on their implementation
- Added test cases for class-based DoubleMapper and TripleGenerator
- Verified hash changes when class-based UDF code is modified
@sourcery-ai bot (Contributor) commented Oct 3, 2025

Reviewer's Guide

This PR updates the UDF hash calculation to correctly include class-based UDFs by hashing their process method when no function is provided, and adds corresponding test cases to validate the new behavior.

Class diagram for updated UDFBase hash calculation

classDiagram
    class UDFBase {
        +_func
        +process()
        +params
        +output
        +hash()
    }
    UDFBase : hash() now hashes process() if _func is None

Flow diagram for UDFBase.hash() decision logic

flowchart TD
    A["UDFBase.hash() called"] --> B{_func is None?}
    B -- Yes --> C["Hash process method"]
    B -- No --> D["Hash _func"]
    C --> E["Combine with params and output hashes"]
    D --> E["Combine with params and output hashes"]
    E --> F["Return SHA hash"]
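
The decision logic in the flowchart can be sketched in plain Python as follows. This is a minimal, self-contained illustration, not the code in src/datachain/lib/udf.py: `UDFSketch`, `hash_callable_sketch`, and the way `params` and `output` are folded in are all assumptions made for the example, and the real `hash_callable` may well hash something other than bytecode.

```python
import hashlib


def hash_callable_sketch(func):
    """Hypothetical stand-in for hash_callable: fingerprint a callable's bytecode."""
    code = func.__code__  # bound methods forward this to the underlying function
    payload = code.co_code + repr((code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()


class UDFSketch:
    """Illustrative base class; not the real UDFBase."""

    def __init__(self, func=None, params=(), output=()):
        self._func = func
        self.params = tuple(params)
        self.output = tuple(output)

    def process(self, *args):
        raise NotImplementedError

    def hash(self):
        # Function-based UDF: hash _func.
        # Class-based UDF: _func is None, so hash the overridden process().
        func_to_hash = self._func if self._func else self.process
        parts = [
            hash_callable_sketch(func_to_hash),
            repr(self.params),
            repr(self.output),
        ]
        return hashlib.sha256("".join(parts).encode()).hexdigest()


class DoubleMapper(UDFSketch):
    def process(self, value):
        return value * 2


function_based = UDFSketch(func=lambda value: value + 1)
class_based = DoubleMapper()

print(function_based.hash())
print(class_based.hash())
```

In this sketch, a class-based UDF that overrides `process()` gets a hash that depends on its own implementation, which is the behavior the PR description asks for.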

File-Level Changes

Change: Adjust UDFBase.hash to support class-based UDF hashing
File: src/datachain/lib/udf.py
Details:
  • Introduce func_to_hash variable that selects self._func or self.process
  • Replace direct hash_callable(self._func) call with hash_callable(func_to_hash)
  • Update method docstring to describe function-based vs class-based behavior

Change: Add tests for class-based UDF hash calculation
File: tests/unit/test_query_steps_hash.py
Details:
  • Define DoubleMapper and TripleGenerator classes overriding process()
  • Add expected hash entry for DoubleMapper in test_udf_mapper_hash
  • Add expected hash entry for TripleGenerator in test_udf_generator_hash
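
A test along these lines might look like the sketch below. It does not reproduce the real tests in tests/unit/test_query_steps_hash.py, which compare against fixed expected hash values. Here `fingerprint` is a hypothetical helper that hashes a method's bytecode, the class bodies are guesses based only on the names DoubleMapper and TripleGenerator, and `EditedDoubleMapper` is invented to stand in for "the same UDF after its code was modified".

```python
import hashlib


def fingerprint(method):
    """Hypothetical helper: SHA-256 over a method's bytecode and constants."""
    code = method.__code__
    payload = code.co_code + repr((code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()


class DoubleMapper:
    def process(self, value):
        return value * 2


class TripleGenerator:
    def process(self, value):
        # A generator-style UDF yields several outputs per input.
        for _ in range(3):
            yield value


class EditedDoubleMapper:
    # Same shape as DoubleMapper, but with a changed body.
    def process(self, value):
        return value * 2 + 1


def test_class_based_hash_is_stable():
    first = fingerprint(DoubleMapper().process)
    second = fingerprint(DoubleMapper().process)
    assert first == second


def test_class_based_hash_tracks_implementation():
    original = fingerprint(DoubleMapper().process)
    assert original != fingerprint(EditedDoubleMapper().process)
    assert original != fingerprint(TripleGenerator().process)


test_class_based_hash_is_stable()
test_class_based_hash_tracks_implementation()
```

Comparing hashes of two variants, as done here, checks that the hash reacts to code changes; pinning a literal expected value, as the PR's tests do, additionally catches accidental changes to the hashing scheme itself.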

@sourcery-ai bot (Contributor) left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider including the UDF class’s name and module path in the hash to avoid collisions when multiple classes share identical process implementations.
  • For stateful class-based UDFs, you may want to incorporate instance attributes (e.g. `__dict__`) into the hash so different constructor parameters yield distinct hashes.
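
Both suggestions can be illustrated with a small sketch. It is not part of the PR and makes several assumptions: `hash_with_identity`, `ScaleMapper`, and `ScaleMapperCopy` are hypothetical names, and instance state is captured with `repr`, which is only deterministic for attributes whose `repr` is stable (numbers, strings, tuples of those, and so on).

```python
import hashlib


def hash_with_identity(udf):
    """Hypothetical: combine class identity, instance state and process() bytecode."""
    cls = type(udf)
    code = udf.process.__code__
    parts = [
        cls.__module__,                   # module path of the UDF class
        cls.__qualname__,                 # class name
        repr(sorted(vars(udf).items())),  # instance attributes set by the constructor
        code.co_code.hex(),               # bytecode of process()
        repr(code.co_consts),
    ]
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()


class ScaleMapper:
    def __init__(self, factor):
        self.factor = factor

    def process(self, value):
        return value * self.factor


class ScaleMapperCopy:
    # Identical body to ScaleMapper, different class name.
    def __init__(self, factor):
        self.factor = factor

    def process(self, value):
        return value * self.factor


print(hash_with_identity(ScaleMapper(2)) == hash_with_identity(ScaleMapper(2)))      # True
print(hash_with_identity(ScaleMapper(2)) == hash_with_identity(ScaleMapper(3)))      # False
print(hash_with_identity(ScaleMapper(2)) == hash_with_identity(ScaleMapperCopy(2)))  # False
```

One trade-off worth weighing: once the class name and module are part of the hash, renaming or moving a class changes the hash even when its behavior is unchanged.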

For class-based UDFs, hashes the process method.
"""
# Hash user code: either _func (function-based) or process method (class-based)
func_to_hash = self._func if self._func else self.process

suggestion (code-quality): Replace if-expression with `or` (or-if-exp-identity)

Suggested change
- func_to_hash = self._func if self._func else self.process
+ func_to_hash = self._func or self.process


Explanation: Here we find ourselves setting a value if it evaluates to True, and otherwise
using a default.

The `or` form is a bit easier to read and avoids repeating `self._func`.

It works because the left-hand side is evaluated first. If it evaluates to
true, then `func_to_hash` is set to it and the right-hand side is not
evaluated. If it evaluates to false, the right-hand side is evaluated and
`func_to_hash` is set to `self.process`.
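
The equivalence behind this suggestion can be checked with a short sketch. The helper names below (`pick_conditional`, `pick_or`, `pick_is_none`, `FalsyCallable`) are invented for the example. It also shows one edge case: both forms test truthiness, whereas the flow diagram above phrases the condition as "_func is None". For ordinary functions, which are always truthy, the three agree; they would only diverge for a callable object that defines `__bool__` or `__len__` to return a falsy value.

```python
def pick_conditional(func, fallback):
    # Form used in the PR.
    return func if func else fallback


def pick_or(func, fallback):
    # Form suggested by the reviewer.
    return func or fallback


def pick_is_none(func, fallback):
    # Explicit None check, shown for comparison.
    return fallback if func is None else func


class FalsyCallable:
    """A callable that evaluates as False, to expose the edge case."""

    def __bool__(self):
        return False

    def __call__(self, value):
        return value


def fallback(value):
    return value


def plain(value):
    return value + 1


falsy = FalsyCallable()

# The conditional expression and `or` always select the same object.
for candidate in (None, plain, falsy):
    assert pick_conditional(candidate, fallback) is pick_or(candidate, fallback)

print(pick_or(falsy, fallback) is fallback)    # True: a falsy callable is skipped
print(pick_is_none(falsy, fallback) is falsy)  # True: only None triggers the fallback
```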

@cloudflare-workers-and-pages bot commented Oct 3, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 1d37165
Status: ✅ Deploy successful!
Preview URL: https://e3d960f3.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1377-fix-udf-functio.datachain-documentation.pages.dev

@ilongin linked an issue Oct 3, 2025 that may be closed by this pull request

@codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.79%. Comparing base (f59d6cd) to head (1d37165).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1378   +/-   ##
=======================================
  Coverage   87.79%   87.79%           
=======================================
  Files         160      160           
  Lines       14990    14991    +1     
  Branches     2148     2148           
=======================================
+ Hits        13160    13161    +1     
  Misses       1344     1344           
  Partials      486      486           
Flag Coverage Δ
datachain 87.71% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/lib/udf.py 94.80% <100.00%> (+0.02%) ⬆️

@ilongin merged commit 7fb8905 into main Oct 3, 2025
83 of 87 checks passed
@ilongin deleted the ilongin/1377-fix-udf-function-hash branch October 3, 2025 21:35
Development

Successfully merging this pull request may close these issues.

Fix hashing of UDF functions
