Make config chunk_size maximum rather than minimum by GMNGeoffrey · Pull Request #227 · aqlaboratory/openfold-3

GMNGeoffrey · 2026-05-20T23:49:46Z

Summary
The chunk_size specified in the config is currently the minimum value that the tuner will go down to. I don't think there's a clear use case for this to be the minimum, and users were confused. Instead, we can make this the maximum. This also allows us to remove the hardcoded max values for the default and flash-style kernels: these instead are just config settings. I removed the minimum chunk size entirely, so the tuner will search all powers of 2 up to the maximum. I'm struggling to think of a use case where someone would want to limit chunk size to a minimum of 4 (I think this number just came from the original suggested chunk size in the AlphaFold 3 paper).

Note that this changes the separately hardcoded chunk size for diffusion conditioning, which now uses the global maximum. I'm not sure how important this is to preserve and whether we saw actual latency improvements from raising the chunk size here vs just observing that it was possible to do so. If we want different chunk sizes for different modules, I think we should probably thread through a config setting per module, because having a hardcoded value here is a bit surprising IMO (and note that it only gets applied when tuning is turned on). Probably the best analog in the config would be offload_inference as a per-module setting that isn't an init arg. If more things got added here, IDK whether we'd want to keep this structure or perhaps mirror the structure of architecture.

Changes

Make config chunk_size indicate the maximum chunk size when tuning is active
Remove hardcoded min and max chunk size constants
Update Triton and Cueq example configs to specify chunk size 1024
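
To make the new search space concrete, here is a minimal sketch of "all powers of 2 up to the maximum". The function names and the `fits` callback are hypothetical, and the linear scan is a stand-in chosen for clarity. This is not the repository's tuner, and it assumes that if a chunk size fits in memory then every smaller one does too.

```python
from typing import Callable, List


def candidate_chunk_sizes(max_chunk_size: int) -> List[int]:
    """Return the powers of two from 1 up to max_chunk_size (inclusive)."""
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be >= 1")
    sizes = []
    size = 1
    while size <= max_chunk_size:
        sizes.append(size)
        size *= 2
    return sizes


def pick_chunk_size(max_chunk_size: int, fits: Callable[[int], bool]) -> int:
    """Return the largest candidate for which fits(size) is True.

    Hypothetical policy: fall back to 1 if nothing fits, since there is
    no configured minimum any more.
    """
    best = 1
    for size in candidate_chunk_sizes(max_chunk_size):
        if not fits(size):
            break  # assumes larger sizes will not fit either
        best = size
    return best
```

With a configured maximum of 1024, the candidates are 1, 2, 4, and so on up to 1024, and the configured value caps the search rather than flooring it.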

Testing

Added unit tests for the chunk tuner. I stole some from my other PR, "Avoid retesting non-viable chunk sizes in tuner" (#212); I'll resolve merge conflicts with whichever is ready to go first. If you think these are overkill, I can also pare them back.

GMNGeoffrey · 2026-05-20T23:50:00Z

@christinaflo PTAL


christinaflo · 2026-05-28T01:03:16Z

+        )
+        if chunk_size is not None:
+            config.settings.memory.eval.chunk_size = chunk_size
+            config.settings.memory.train.chunk_size = chunk_size


We never really run chunking with training because the activations stack up anyway in the backward pass, so it doesn't save you much. I haven't actually run this, but I think it may fail some assert. Diffusion conditioning at least has an `assert not self.training` for its chunking function.

Ah okay, I see now you only have a test for eval mode anyway.

Can we just delete this line anyway, since it can't run?

In that case, shouldn't we remove chunk_size from the train settings entirely? I set it here because if chunk size is overridden for the test it would be surprising if it weren't overridden for training as well IMO. I can delete it if you prefer though

I had it there originally because I needed chunking enabled during validation for some large samples but not for training, so it'll pick what to use in the model like: chunk_size=mode_mem_settings.chunk_size depending on the stage it's in. I guess there's nothing stopping you from doing it during training, it's just not really worth it ever.

We could fix the assert in diffusion conditioning to match the other modules so it runs:

if chunk_size is not None and self.chunk_size_tuner is not None: assert not self.training

or change model.py to always set it to None for training and not reference the config. I thought it was easier to distinguish in the config, though.
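
For readers following the thread, the per-stage selection and the training guard being discussed could look roughly like the sketch below. The class and function names are hypothetical; only the `memory.train.chunk_size` / `memory.eval.chunk_size` layout comes from the diff above, and the 1024 default is borrowed from this PR's example configs. It is an illustration of the idea, not the code in model.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModeMemorySettings:
    # None means "do not chunk" in this sketch.
    chunk_size: Optional[int] = None


@dataclass
class MemorySettings:
    train: ModeMemorySettings = field(default_factory=ModeMemorySettings)
    eval: ModeMemorySettings = field(
        default_factory=lambda: ModeMemorySettings(chunk_size=1024)
    )


def select_chunk_size(memory: MemorySettings, training: bool) -> Optional[int]:
    """Pick the chunk size for the current stage.

    Hypothetical guard: refuse a training chunk size instead of silently
    ignoring it, so an override cannot behave surprisingly.
    """
    mode_mem_settings = memory.train if training else memory.eval
    if training:
        assert mode_mem_settings.chunk_size is None, (
            "chunking is not supported during training"
        )
    return mode_mem_settings.chunk_size
```

Keeping the value in the config per stage, as suggested above, means the distinction stays visible to users, while the guard covers the case where someone sets it for training anyway.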

Oh yeah, that's a lot of memory. On 80 GB GPUs with no kernels I can see the chunking take effect with sequence lengths > 1500 tokens. I have a really old and messy script for benchmarking inference speed + memory using random_of3_features. I can clean it up and share it, but I have a bit of a backlog this week, so I can just send it for testing #213 so I don't block this PR. As you said, this is really just a config change. Btw, realistic values are n_msa=16384 (this will get subsampled to 1024 each recycle) and n_templ=4.

Perfect, thanks! I think that gives me enough to do testing without being limited by what I can find in the pdb. A standardized script in-repo would be nice, but you don't need to rush to clean up yours 🙂

So those don't vary much with input or scale with n_tok? I found that homomers used significantly less memory at very large n_tok, I think due to MSA reuse for the shared sequences, but I didn't dig in.

I can test for smaller memory caps and the chunk tuner behavior by limiting torch mem_fraction. The only problem is the combinatorial explosion of options. If 80 GB is of particular interest (H100?) I can test it as well. The other option is to test fixed chunk sizes and just track memory. I've got a memory snapshotting callback (happy to clean up and upstream if it would be generally useful). The only issues there are that tuning itself can affect peak memory due to some clones, and that diffusion conditioning getting chunk size 2048 is guarded by the tuner being on (is the difference there worth it?).
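
As a rough illustration of the "cap memory, fix the chunk size, track the peak" approach, the sketch below uses PyTorch's CUDA memory utilities. The helper name `peak_memory_gib` and the `run_once` callback are hypothetical, and this is not the snapshotting callback mentioned above. The import is wrapped so the snippet still loads on a machine without PyTorch or a GPU, in which case it returns None.

```python
from typing import Callable, Optional

try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def peak_memory_gib(
    run_once: Callable[[], None],
    mem_fraction: Optional[float] = None,
) -> Optional[float]:
    """Run `run_once` and return peak allocated GPU memory in GiB.

    Returns None when PyTorch or CUDA is unavailable. `mem_fraction`
    optionally caps this process to a fraction of device memory, e.g.
    to emulate a smaller GPU.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    if mem_fraction is not None:
        torch.cuda.set_per_process_memory_fraction(mem_fraction)
    torch.cuda.reset_peak_memory_stats()
    run_once()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30
```

Calling this once per fixed chunk size, with the tuner off, sidesteps the concern that tuning itself perturbs the peak.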

It's just the max allowable input n_msa and n_templ; there could be less, but it'll get capped at those numbers. I'm not sure why large homomers use less memory either, off the top of my head. I'd have to look into that as well: since MSAs should still be capped at 1024 per recycle, I'd expect they would still reach that threshold.
Yeah, 80 GB H100 is generally what we have access to, so that is of particular interest. Even on MI300A I try to limit model GPU memory quite a bit anyway since our data loader processes take up a lot of memory, so I kind of treat it like an H100.
For the memory snapshotting, I also have a callback for this internally that will get merged in at some point. I assume they're probably the same.
For the diffusion conditioning chunk size, this PR changes it to the global max, but I think it's fine. I highly doubt it's that much slower, and I eyeballed that number to begin with :).

Ah I looked closer and found the issue. One of the homomers I had Claude dig up for me (6R7M-1) is a 40-chain homomer with an MSA depth of only 122. This was throwing things off and I thought it would be an issue with all homomers to a lesser extent, but it's really just this one that is weird. If n_msa gets subsampled to 1024 every recycle, does it actually make a difference if the random input has n_msa=1024 or n_msa=16384?

> For the diffusion conditioning chunk size, this PR changes it to the global max but I think it's fine, I highly doubt it's that much slower and I eyeballed that number to begin with :).

Oh right, I forgot that I did that already :-D

No, actually it doesn't make a difference. It's only relevant if you want to exercise the subsampling logic, which doesn't matter here.
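
The arithmetic behind this exchange can be written down in a few lines. The function below is a hypothetical illustration using only the numbers quoted in the thread (an input cap of 16384 and subsampling to 1024 per recycle); it does not reproduce the repository's subsampling code.

```python
def effective_msa_depth(
    n_msa: int,
    max_input: int = 16384,
    per_recycle: int = 1024,
) -> int:
    """MSA depth the model sees in one recycle, given the caps quoted above."""
    return min(n_msa, max_input, per_recycle)


# Random inputs at either depth look the same to the model per recycle.
same = effective_msa_depth(1024) == effective_msa_depth(16384)

# A shallow MSA like the 122-deep one mentioned above stays below the cap.
shallow = effective_msa_depth(122)
```

This is why the random input depth does not matter for memory testing, while an unusually shallow real MSA does.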


GMNGeoffrey force-pushed the chunk-config branch from 228cbd2 to 9ba1377 Compare May 21, 2026 00:58

christinaflo reviewed May 21, 2026

Comment thread examples/example_runner_yamls/cuequivariance.yml Outdated

christinaflo reviewed May 21, 2026

Comment thread openfold3/core/model/latent/base_stacks.py

GMNGeoffrey requested a review from christinaflo May 21, 2026 04:45

GMNGeoffrey added 4 commits May 27, 2026 17:50

Add tests for chunk size tuner

52236b7

Add e2e test of chunk_size=1

476eafc

This would have caught the issue with the chunk size for attention getting divided by 4. I think for only that bug, it's not really carrying its weight, but testing this boundary case still seems useful.

Fix chunk size // 4 leading to attention chunk 0

d89aeae
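
The boundary case behind these two commits is plain integer division: a chunk size below 4 divided by 4 floors to 0. The sketch below shows the failure and one possible guard. The helper name and the clamp to 1 are assumptions for illustration; the actual change in d89aeae is not shown on this page.

```python
def attention_chunk_size(chunk_size: int, divisor: int = 4) -> int:
    """Derive a smaller chunk size for attention without ever returning 0.

    Hypothetical helper: clamps the floor-divided value to at least 1.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return max(chunk_size // divisor, 1)


# Unclamped, the smallest chunk sizes collapse to zero.
unclamped = [size // 4 for size in (1, 2, 4, 1024)]  # [0, 0, 1, 256]
clamped = [attention_chunk_size(size) for size in (1, 2, 4, 1024)]
```

An end-to-end run with chunk_size=1 exercises exactly this path, which is why it works as a regression test.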

GMNGeoffrey force-pushed the chunk-config branch from 0459025 to d89aeae Compare May 28, 2026 00:52

christinaflo reviewed May 28, 2026

Comment thread openfold3/tests/test_of3_model.py

christinaflo added the safe-to-test label (internal-only label used to indicate PRs that are ready for automated CI testing) May 28, 2026

GMNGeoffrey mentioned this pull request May 28, 2026

Avoid retesting non-viable chunk sizes in tuner #212

Merged

GMNGeoffrey added 2 commits May 28, 2026 09:45

Set chunk size to 1024 in model presets

8bbcb5e

Don't set chunk size for training

bf1f80e

I added an assert to guard against this so there isn't some surprising behavior.

GMNGeoffrey force-pushed the chunk-config branch 2 times, most recently from b9d20a1 to 22edd67 Compare June 4, 2026 18:21

Merge branch 'main' into chunk-config

de89051

GMNGeoffrey force-pushed the chunk-config branch from 22edd67 to de89051 Compare June 4, 2026 18:22

Rename tests for consistency

40ae604

GMNGeoffrey mentioned this pull request Jun 9, 2026

Add callback for profiling GPU memory usage #249

Closed

GMNGeoffrey wants to merge 8 commits into aqlaboratory:main from GMNGeoffrey:chunk-config
