-
Notifications
You must be signed in to change notification settings - Fork 405
Feat cache residual evaluator #2695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat cache residual evaluator #2695
Conversation
Fixes apache#2147 - Implement ResidualEvaluatorCache with LRU eviction and thread safety - Cache evaluators by partition spec, expression, case sensitivity, and schema - Fix mypy type annotations and add type ignore for cachetools decorator
55e53d0 to
0ec381e
Compare
| cached = _residual_evaluator_cache.get(spec, expr, case_sensitive, schema) | ||
| if cached is not None: | ||
| return cached |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the build in LRU cache? https://docs.python.org/3/library/functools.html#functools.lru_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Fokko 👋 ! We certainly could try to simplify this — my main motivation for the current approach was to build something that felt more extensible, more readable (so contributors don’t have to dig too deep to understand what’s going on), and that gives users a bit more control.
That said, when I considered switching to functools.lru_cache for residual evaluator creation, it looked like it would actually require more work to avoid a custom cache. Specifically, making the cache keys hashable while still passing the spec, expr, and schema objects — which aren’t inherently hashable — seemed tricky. I would have needed helper functions to produce stable key representations, and potentially to check the hashability of PartitionSpec, just to ensure caching didn’t break evaluator construction.
I’m definitely open to simplifying this if you think that’s the right direction. Let me know your thoughts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we do something similar to this? https://github.com/apache/iceberg-python/blob/main/pyiceberg/manifest.py#L877 and use a small helper function to produce the key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayceslesar thanks for the the review!! Glad your back in the saddle! @Fokko what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's more or less what I suggested as well. I don't think we should implement this ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko are we good to merge this then?
Closes #2147
Rationale for this change
For queries, the same combinations of (partition spec, expression, case sensitivity, schema) are evaluated repeatedly, causing unnecessary computational overhead and increased query latency.
Are these changes tested?
Yes, using the existing test cases as they seemed sufficient for the changes.
Are there any user-facing changes?
No.