Enable seed randomization in dynamic samplers #1278

pzelasko · 2024-01-31T14:29:57Z

This PR enables specifying seed="randomized" and seed="trng" for DynamicCutSampler and DynamicBucketingSampler.

Both options are intended for use with IterableDatasetWrapper and make the samplers iterate with a different random seed in each node and dataloading worker. Note that for bucketing this de-synchronizes batch sizes across GPUs from the start of iteration (before this change, that happened anyway after a number of training steps, as observed in #857).
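To illustrate what "different random seeds in each node and dataloading worker" means in practice, here is a minimal standalone sketch. It is not Lhotse's implementation: the helper `resolve_seed_sketch`, its `base_seed` argument, and the arithmetic used to combine rank and worker ID are all hypothetical, and reading "trng" as "draw from the OS entropy source" is an assumption.

```python
import random
import secrets
from typing import Union


def resolve_seed_sketch(
    seed: Union[int, str], base_seed: int, rank: int, worker_id: int
) -> int:
    """Hypothetical helper that turns a seed option into a concrete integer seed."""
    if isinstance(seed, int):
        # A fixed integer seed is shared by every rank and worker.
        return seed
    if seed == "randomized":
        # Assumption: derive a distinct seed per (rank, worker_id) pair.
        # The offsets stay distinct as long as there are fewer than 1000 workers per rank.
        return (base_seed + 1_000 * rank + worker_id) % 2**32
    if seed == "trng":
        # Assumption: "trng" means non-reproducible entropy from the OS.
        return secrets.randbelow(2**32)
    raise ValueError(f"Unsupported seed option: {seed!r}")


def shuffled_order(items, seed: int):
    """Shuffle a copy of `items` with a private RNG, as one worker would."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


if __name__ == "__main__":
    cut_ids = [f"cut-{i}" for i in range(8)]
    for rank in range(2):
        for worker_id in range(2):
            s = resolve_seed_sketch("randomized", 0, rank, worker_id)
            print(rank, worker_id, s, shuffled_order(cut_ids, s))
```

With a fixed integer seed every worker would shuffle identically. With per-worker seeds each worker is expected to visit the data in its own order, which is also why bucketed batch sizes stop lining up across GPUs.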

From now on, the sampler also attaches a custom field called dataloading_info to each cut. It is a dict with the keys rank, world_size, and worker_id that helps diagnose dataloading.
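One way such a field could be used for diagnostics is to count how many sampled cuts came from each (rank, worker_id) pair. The sketch below uses a hypothetical `FakeCut` stand-in rather than a real Lhotse cut, and `count_by_origin` is an illustrative helper, not part of the library. Only the key names rank, world_size, and worker_id come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass
class FakeCut:
    """Hypothetical stand-in for a cut that carries a dataloading_info dict."""
    id: str
    dataloading_info: Dict[str, int] = field(default_factory=dict)


def count_by_origin(cuts: Iterable[FakeCut]) -> Counter:
    """Count cuts per (rank, worker_id) pair found in dataloading_info."""
    origins: Counter = Counter()
    for cut in cuts:
        info = cut.dataloading_info
        key: Tuple[int, int] = (info["rank"], info["worker_id"])
        origins[key] += 1
    return origins


if __name__ == "__main__":
    cuts = [
        FakeCut("a", {"rank": 0, "world_size": 2, "worker_id": 0}),
        FakeCut("b", {"rank": 0, "world_size": 2, "worker_id": 1}),
        FakeCut("c", {"rank": 1, "world_size": 2, "worker_id": 0}),
        FakeCut("d", {"rank": 0, "world_size": 2, "worker_id": 0}),
    ]
    for origin, n in sorted(count_by_origin(cuts).items()):
        print(origin, n)
```

A heavily skewed count, or a (rank, worker_id) pair that never appears, would be a hint that some worker is not contributing data as expected.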


pzelasko · 2024-01-31T18:17:16Z

The failing test is flaky - merging

Enable seed randomization in dynamic samplers

4ce3685

pzelasko added this to the v1.20.0 milestone Jan 31, 2024

Extend unit test. Attach dataloading_info to sampled cuts to enable extended diagnostics.

8da56c7

pzelasko marked this pull request as ready for review January 31, 2024 16:05

pzelasko merged commit e043228 into master Jan 31, 2024
7 of 10 checks passed

pzelasko deleted the feature/enable-randomized-sampler-seed branch January 31, 2024 18:17
