Treat arg-cache length mismatch as a cache miss in ChunkSizeTuner by GMNGeoffrey · Pull Request #224 · aqlaboratory/openfold-3

GMNGeoffrey · 2026-05-18T18:56:30Z

Summary
The chunk size tuner currently throws if the function it's tuning for receives arguments with different ranks in different calls. Instead, treat this as a cache miss. I don't think the tuner should be forcing invariants on the functions it's tuning for.

I encountered an issue with this when doing #213. That enables batch > 1 and chunking to happen together with Triton kernels (you hit the same bug with kernels off without that change). When you run inference with samples both above and below per_sample_token_cutoff, the small inputs use the batched/normal pairformer embedding path which flattens the batch dimensions (i.e. batch and num_samples), but the larger inputs follow the per-sample path, which doesn't flatten their batch dimensions and results in higher-rank inputs (#223 aims to fix this mismatch).
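To make the rank mismatch concrete, here is a minimal sketch using plain shape tuples. The helper `flatten_batch_dims` and the example shapes are hypothetical and only illustrate the description above; they are not taken from the OpenFold code.

```python
from math import prod


def flatten_batch_dims(shape, n_batch_dims):
    """Collapse the leading n_batch_dims dimensions of a shape into one."""
    return (prod(shape[:n_batch_dims]), *shape[n_batch_dims:])


# Hypothetical pair-representation shapes: (batch, num_samples, N_token, N_token, C).
small_input = (1, 5, 64, 64, 128)
large_input = (1, 5, 2048, 2048, 128)

# Batched/normal path: batch and num_samples are flattened into one dimension.
batched_path_shape = flatten_batch_dims(small_input, 2)
# Per-sample path: the batch dimensions are left as they are.
per_sample_path_shape = large_input

print(len(batched_path_shape), batched_path_shape)
print(len(per_sample_path_shape), per_sample_path_shape)
```

The two paths hand the tuned function shapes of different lengths, which is the case the tuner's cache comparison has to tolerate.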

_compare_arg_caches recurses on tensor shapes (torch.Size is a tuple subclass), so when the same ChunkSizeTuner instance was invoked with tensors of different ranks across calls, the inner zip(..., strict=True) raised. This change treats any length mismatch as a cache miss, so the caller re-tunes. The top-level length assert in tune_chunk_size is removed, and a mismatch in the number of arguments is handled the same way.

Also adds dtype element size to the cache key for a tensor argument. We could use the entire dtype, but I think in terms of what matters for chunking, the element size is the key factor.
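As an illustration of keying on element size rather than the full dtype, the sketch below uses a stand-in `FakeTensor` class instead of real tensors. `FakeTensor` and `tensor_cache_key` are hypothetical names, and the real cache key layout in the tuner may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor: only the fields the cache key needs."""

    shape: tuple
    itemsize: int  # bytes per element, e.g. 4 for float32, 2 for bfloat16


def tensor_cache_key(t):
    """Build a cache key from the shape and the element size in bytes."""
    return (tuple(t.shape), t.itemsize)


shape = (8, 256, 256)
fp32 = FakeTensor(shape, 4)
int32 = FakeTensor(shape, 4)
bf16 = FakeTensor(shape, 2)

# Same shape and same element size: the keys are equal, so no re-tune.
print(tensor_cache_key(fp32) == tensor_cache_key(int32))
# Same shape but half the bytes per element: the keys differ, so re-tune.
print(tensor_cache_key(fp32) == tensor_cache_key(bf16))
```

The trade-off is that dtypes with equal element size share a cache entry, which matches the reasoning that memory footprint is what drives the viable chunk size.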

Changes

Remove the assert that the tuned function is always called with the same number of args
Add dtype.itemsize to tensor cache key along with shape
Record a cache miss when cache key lengths differ rather than raising from zip(...,strict=True)
Unit tests for the same

Related Issues

Related to "[BUG] per_sample_pairformer_emb passes wrong-shaped x_pred to embed_zij" #199, in that it also concerns the rank changes in the pairformer per-sample path.

Testing

Unit tests verifying cache miss on arg count, arg rank, and arg dtype size changes.
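A rough sketch of what such tests could look like, written against a toy tuner rather than the real ChunkSizeTuner. `ToyTuner`, its `tune_count` counter, and the test names are all hypothetical; the actual tests in the PR are not shown here.

```python
import unittest


class ToyTuner:
    """Toy stand-in that re-tunes whenever the argument key changes."""

    def __init__(self):
        self._cached_key = None
        self.tune_count = 0

    def tune(self, *arg_keys):
        # Each arg key is a (shape, itemsize) pair; tuple comparison
        # treats any difference in length as inequality, never an error.
        key = tuple(arg_keys)
        if key != self._cached_key:
            self.tune_count += 1
            self._cached_key = key
        return self.tune_count


class ToyTunerCacheTest(unittest.TestCase):
    def setUp(self):
        self.tuner = ToyTuner()
        self.tuner.tune(((4, 16, 16), 4))  # first call always tunes

    def test_same_args_hit_cache(self):
        self.tuner.tune(((4, 16, 16), 4))
        self.assertEqual(self.tuner.tune_count, 1)

    def test_arg_count_change_misses(self):
        self.tuner.tune(((4, 16, 16), 4), ((4, 16), 4))
        self.assertEqual(self.tuner.tune_count, 2)

    def test_rank_change_misses(self):
        self.tuner.tune(((1, 4, 16, 16), 4))
        self.assertEqual(self.tuner.tune_count, 2)

    def test_itemsize_change_misses(self):
        self.tuner.tune(((4, 16, 16), 2))
        self.assertEqual(self.tuner.tune_count, 2)
```

Saved to a file, the tests can be run with `python -m unittest <module>`. Counting tuning passes, rather than inspecting the cache directly, covers both the cache-hit and cache-miss cases with the same assertion style.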

GMNGeoffrey · 2026-05-18T18:56:40Z

@christinaflo @jnwei PTAL


jandom

Those are all good – all 3 test for the negative case, would it make sense to test for the positive as well, or do we think that's overkill?

Meta-comment, and outside of scope, we should really migrate all these to pytest just to be consistent

GMNGeoffrey · 2026-06-01T18:00:06Z

> Those are all good – all 3 test for the negative case, would it make sense to test for the positive as well

Done

christinaflo

LGTM!

GMNGeoffrey · 2026-06-03T00:21:42Z

> LGTM!

Can you mark safe to test and merge? Or were you waiting for @jandom?

christinaflo · 2026-06-03T00:56:20Z

> > LGTM!
>
> Can you mark safe to test and merge? Or were you waiting for @jandom?

Yeah I was going to see if @jandom had other comments but I think it's fine to merge after the tests finish running.

GMNGeoffrey force-pushed the chunk-tune-rank-change branch from 9da1180 to 6d8e73b Compare May 18, 2026 19:00

GMNGeoffrey force-pushed the chunk-tune-rank-change branch from 6d8e73b to 0baf5ad Compare May 28, 2026 00:54

GMNGeoffrey mentioned this pull request May 28, 2026

Avoid retesting non-viable chunk sizes in tuner #212

Merged

GMNGeoffrey force-pushed the chunk-tune-rank-change branch from 0baf5ad to f943d7c Compare May 28, 2026 16:53

christinaflo self-requested a review May 29, 2026 01:18

christinaflo added the safe-to-test label May 29, 2026

jandom reviewed Jun 1, 2026


Add chunk size tuner positive caching test

61b4ec0

christinaflo approved these changes Jun 2, 2026


GMNGeoffrey mentioned this pull request Jun 2, 2026

Make config chunk_size maximum rather than minimum #227

Open

GMNGeoffrey requested a review from jandom June 3, 2026 00:21

christinaflo added safe-to-test and removed safe-to-test labels Jun 3, 2026

christinaflo merged commit cf2c00c into aqlaboratory:main Jun 3, 2026
11 checks passed

christinaflo merged 2 commits into aqlaboratory:main from GMNGeoffrey:chunk-tune-rank-change
