@mtshiba
Collaborator

@mtshiba mtshiba commented Dec 18, 2025

Summary

Extracted from #17371.

See the comments in #17371 for the motivation for this change.

Test Plan

N/A

@mtshiba mtshiba added the `ty` (Multi-file analysis & type inference) and `internal` (An internal refactor or improvement) labels Dec 18, 2025
@astral-sh-bot

astral-sh-bot bot commented Dec 18, 2025

Diagnostic diff on typing conformance tests

No changes detected when running ty on typing conformance tests ✅

@astral-sh-bot

astral-sh-bot bot commented Dec 18, 2025

mypy_primer results

Changes were detected when running on open source projects
Tanjun (https://github.com/FasterSpeeding/Tanjun)
- tanjun/dependencies/data.py:347:12: error[invalid-return-type] Return type does not match returned value: expected `_T@cached_inject`, found `Coroutine[Any, Any, _T@cached_inject | Coroutine[Any, Any, _T@cached_inject]] | _T@cached_inject`
+ tanjun/dependencies/data.py:347:12: error[invalid-return-type] Return type does not match returned value: expected `_T@cached_inject`, found `_T@cached_inject | Coroutine[Any, Any, _T@cached_inject | Coroutine[Any, Any, _T@cached_inject]]`

static-frame (https://github.com/static-frame/static-frame)
- static_frame/core/bus.py:671:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemLocReduces[Bus[Any], object_]`, found `InterGetItemLocReduces[Bus[Any] | TypeBlocks | Batch | ... omitted 7 union elements, object_]`
+ static_frame/core/bus.py:671:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemLocReduces[Bus[Any], object_]`, found `InterGetItemLocReduces[Bus[Any] | Top[Index[Any]] | Top[Series[Any, Any]] | ... omitted 7 union elements, object_]`
- static_frame/core/series.py:772:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[Series[Any, Any], TVDtype@Series]`, found `InterGetItemILocReduces[Series[Any, Any] | Top[Index[Any]] | TypeBlocks | ... omitted 6 union elements, generic[object]]`
+ static_frame/core/series.py:772:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[Series[Any, Any], TVDtype@Series]`, found `InterGetItemILocReduces[Series[Any, Any] | TypeBlocks | Batch | ... omitted 6 union elements, generic[object]]`
- static_frame/core/series.py:4072:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[SeriesHE[Any, Any], TVDtype@SeriesHE]`, found `InterGetItemILocReduces[SeriesHE[Any, Any] | Top[Index[Any]] | TypeBlocks | ... omitted 7 union elements, generic[object]]`
+ static_frame/core/series.py:4072:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[SeriesHE[Any, Any], TVDtype@SeriesHE]`, found `InterGetItemILocReduces[SeriesHE[Any, Any] | TypeBlocks | Batch | ... omitted 7 union elements, generic[object]]`
- static_frame/core/yarn.py:418:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[Yarn[Any], object_]`, found `InterGetItemILocReduces[Yarn[Any] | Top[Index[Any]] | TypeBlocks | ... omitted 7 union elements, generic[object]]`
+ static_frame/core/yarn.py:418:16: error[invalid-return-type] Return type does not match returned value: expected `InterGetItemILocReduces[Yarn[Any], object_]`, found `InterGetItemILocReduces[Yarn[Any] | TypeBlocks | Batch | ... omitted 7 union elements, generic[object]]`

pandas-stubs (https://github.com/pandas-dev/pandas-stubs)
- pandas-stubs/_typing.pyi:1223:16: warning[unused-ignore-comment] Unused blanket `type: ignore` directive
+ tests/frame/test_groupby.py:228:15: error[type-assertion-failure] Type `Series[Any]` does not match asserted type `Series[str | bytes | int | ... omitted 12 union elements]`
+ tests/frame/test_groupby.py:624:15: error[type-assertion-failure] Type `Series[Any]` does not match asserted type `Series[str | bytes | int | ... omitted 12 union elements]`
- Found 5084 diagnostics
+ Found 5085 diagnostics

No memory usage changes detected ✅

@astral-sh-bot

astral-sh-bot bot commented Dec 18, 2025

ecosystem-analyzer results

| Lint rule | Added | Removed | Changed |
|---|---|---|---|
| invalid-argument-type | 0 | 2 | 2 |
| invalid-return-type | 0 | 0 | 3 |
| Total | 0 | 2 | 5 |

@codspeed-hq

codspeed-hq bot commented Dec 18, 2025

CodSpeed Performance Report

Merging #22055 will not alter performance

Comparing mtshiba:intersection-hash (d529737) with main (76854fd)

Summary

✅ 22 untouched
⏩ 30 skipped [1]

Footnotes

  1. 30 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

@mtshiba mtshiba marked this pull request as ready for review December 18, 2025 18:15
// create a union of intersections.
intersections: Vec<InnerIntersectionBuilder<'db>>,
/// Stores hash values of `intersections` to prevent adding identical `InnerIntersectionBuilder`s.
intersection_hashes: FxHashSet<u64>,
Member

It seems risky to rely solely on the hash value, given the possibility of hash collisions.

Contributor

Especially since my understanding is that FxHasher is optimized for speed, not for collision-resistance?

Collaborator Author

If the hash function were ideal, the probability of collisions would be almost negligible in this case (where the number of elements is only about 100).
The collision probability is calculated using the same formula as the birthday paradox. If the hash space is $H$ and the number of elements is $N$, then:

$$ P(N) \simeq 1-\exp(-\frac{N(N-1)}{2H}) \simeq \frac{N^2}{2H} $$

Since $H = 2^{64} \simeq 1.8 \times 10^{19}$ and $N^2 \simeq 10^4$, $P(N) \simeq 3 \times 10^{-16}$, which is negligible.

The problem is that fxhash is not an ideal hash function. In this case, the actual effective hash space may be much smaller.
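To put numbers on the estimate above, here is a minimal sketch of the birthday bound in Rust. The function name `collision_probability` is made up for illustration, and the 16-bit case is an arbitrary example of a reduced effective hash space, not a measurement of fxhash.

```rust
/// Birthday-bound estimate: probability that at least two of `n` uniformly
/// distributed hash values collide in a space of `2^bits` values.
/// (Hypothetical helper, not part of ty.)
fn collision_probability(n: u64, bits: u32) -> f64 {
    let h = 2f64.powi(bits as i32);
    let pairs = (n as f64) * (n as f64 - 1.0) / 2.0;
    // 1 - exp(-x), computed via exp_m1 so that tiny results are not
    // rounded away by the subtraction from 1.0.
    -(-pairs / h).exp_m1()
}
```

With `n = 100` the result is on the order of 1e-16 for a full 64-bit space, but it rises to several percent if only 16 bits are effectively in use, which is why the quality of the hash function matters here.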

Collaborator Author

Using a cryptographic hash function seems too expensive, so it's better to simply use an equality check to remove duplicates.
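A minimal sketch of what equality-based deduplication could look like. `Candidate` is a hypothetical stand-in for `InnerIntersectionBuilder`, and `push_unique` is an illustrative helper, not the actual change in this PR.

```rust
/// Hypothetical stand-in for `InnerIntersectionBuilder`; any type that
/// implements `PartialEq` works the same way.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    positive: Vec<u32>,
    negative: Vec<u32>,
}

/// Pushes `item` only if no equal element is already stored.
/// Returns `true` when the item was added.
fn push_unique<T: PartialEq>(items: &mut Vec<T>, item: T) -> bool {
    // Linear scan: O(len) comparisons per insert, but never a false
    // positive, unlike comparing hash values alone.
    if items.contains(&item) {
        return false;
    }
    items.push(item);
    true
}
```

Each insert costs a linear scan, which should be acceptable if the number of elements stays around 100 as estimated above.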

Member

Could `intersections` be an `FxIndexMap`, or is the issue that we keep updating the entries?

Collaborator Author

We can't use a hash map because we iterate over `intersections` mutably.
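One way to read this constraint: the standard `HashSet` has no `iter_mut`, and `HashMap::iter_mut` hands out keys only by shared reference, because changing a key in place would invalidate its bucket. A `Vec` has no such restriction. The sketch below uses `Vec<Vec<u32>>` as a hypothetical stand-in for `intersections` and an illustrative function name; it mutates every element in place and then removes duplicates by equality.

```rust
/// Adds `extra` to every element in place, then drops later duplicates by
/// equality while keeping first-occurrence order.
/// (Illustrative only; the element type is a stand-in.)
fn update_and_dedup(items: &mut Vec<Vec<u32>>, extra: u32) {
    // In-place mutation through `iter_mut`, which hash-based sets do not offer.
    for item in items.iter_mut() {
        if !item.contains(&extra) {
            item.push(extra);
            item.sort_unstable();
        }
    }
    // Elements that were distinct before the update may now be equal,
    // so deduplicate afterwards with an equality check.
    let mut unique: Vec<Vec<u32>> = Vec::with_capacity(items.len());
    for item in items.drain(..) {
        if !unique.contains(&item) {
            unique.push(item);
        }
    }
    *items = unique;
}
```

Starting from `[[1], [1, 2], [3]]` and adding `2`, the first two elements become equal after the update, so only one of them is kept.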

@AlexWaygood
Member

It looks like pydantic has a huge performance improvement here, but most other benchmarks regress: https://codspeed.io/astral-sh/ruff/branches/mtshiba%3Aintersection-hash?utm_source=github&utm_medium=comment&utm_content=header. I'd be interested in seeing if we still see the huge performance improvement on pydantic after #22044 has been merged.

@AlexWaygood
Member

Can you rebase on main now that #22044 has landed?

@mtshiba mtshiba added the `ecosystem-analyzer` label and removed the `ecosystem-analyzer` and `internal` (An internal refactor or improvement) labels Dec 19, 2025
@mtshiba
Collaborator Author

mtshiba commented Dec 19, 2025

> I'd be interested in seeing if we still see the huge performance improvement on pydantic after #22044 has been merged.

Hmm, it seems that the performance improvement on pydantic has disappeared after merging #22044.
However, this PR still seems necessary for two reasons:

  • The mypy_primer results show that this change improves boundness analysis (narrowing).
  • The original goal of this PR, reducing the abnormal memory consumption in #17371 ("[ty] infer function's return type"), seems impossible to achieve without this change.

@AlexWaygood
Member

This change no longer appears to have significant standalone benefits, so I'm sort of inclined to say it should again just be part of #17371 rather than a standalone PR.

@mtshiba mtshiba closed this Dec 19, 2025
@mtshiba mtshiba deleted the intersection-hash branch December 19, 2025 15:59
Labels

ecosystem-analyzer, ty (Multi-file analysis & type inference)

4 participants