CutSet.from_files constructor for random order multi-file cutsets #1085

Merged 2 commits on Jun 2, 2023

@pzelasko pzelasko commented Jun 1, 2023

An example of typical usage is:

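from pathlib import Path

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# We start with a big cut set that refers to all of the data.
big_cutset = CutSet.from_file("100k_hour_cuts.jsonl.gz")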

# Split it lazily into shards of 1000 cuts each, written to the "shards" directory.
big_cutset.split_lazy("shards", chunk_size=1000)

# Now we can iterate over the shards, randomizing their order on each iteration.
# Randomization is enabled by default.
sharded_cutset = CutSet.from_files(Path("shards").glob("*.jsonl.gz"))

# Order of shards will be randomized
for cut in sharded_cutset:
    pass

# This time the order will be different
for cut in sharded_cutset:
    pass

# We can randomize the cuts across the shards as well in a streaming fashion
# (note: this happens automatically in samplers when shuffle=True)
cuts = sharded_cutset.shuffle(buffer_size=50000)

# Works as usual with samplers
sampler = DynamicBucketingSampler(sharded_cutset, ...)
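
To illustrate what "randomizing in a streaming fashion" with a bounded buffer means, here is a minimal pure-Python sketch of a buffered shuffle. It is not Lhotse's implementation; the function name `streaming_shuffle` and its parameters are hypothetical and chosen only for illustration. The idea is that only `buffer_size` items are held in memory at a time, so an item can move forward in the stream by at most roughly one buffer length.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def streaming_shuffle(
    items: Iterable[T],
    buffer_size: int,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items.

    Hypothetical helper for illustration only (not part of Lhotse).
    """
    if rng is None:
        rng = random.Random()
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    for x in streaming_shuffle(range(20), buffer_size=5, rng=random.Random(0)):
        print(x, end=" ")
```

A larger buffer gives a shuffle that is closer to uniform at the cost of more memory, which is presumably why the example above uses a large `buffer_size`.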

pzelasko commented Jun 1, 2023

I added one more improvement: by default we'll use Python's RNG here, so the shard order is automatically different on each run of the script. Users can specify a seed if they want reproducibility.
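
The behaviour described here (a new shard order on every iteration, unseeded by default, reproducible when a seed is given) can be sketched in plain Python. This is only an illustration under assumptions, not the code merged in this PR: the class `ShuffledShards` is hypothetical, shards are modelled as in-memory lists rather than files, and using `random.Random(seed)` with `seed=None` is my assumption for how "Python RNG by default" could be realised.

```python
import random
from typing import Iterator, List, Optional, Sequence


class ShuffledShards:
    """Iterate over all items of all shards, reshuffling shard order each time.

    Hypothetical illustration only (not the Lhotse implementation).
    """

    def __init__(self, shards: Sequence[Sequence[str]], seed: Optional[int] = None):
        self.shards: List[Sequence[str]] = list(shards)
        # With seed=None, random.Random() is seeded from OS entropy or the
        # clock, so every script run produces a different order.
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[str]:
        # Draw a fresh permutation of shard indices on every iteration.
        order = list(range(len(self.shards)))
        self.rng.shuffle(order)
        for i in order:
            # Items inside a shard keep their original order.
            yield from self.shards[i]


if __name__ == "__main__":
    shards = [[f"shard{i}-cut{j}" for j in range(2)] for i in range(4)]
    data = ShuffledShards(shards, seed=1234)
    print(list(data))  # one shard order
    print(list(data))  # the RNG state has advanced, so the order is redrawn
```

Two instances built with the same seed replay the same sequence of shard orders, which is the reproducibility property mentioned above; leaving the seed unset trades that for variety across runs.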

@pzelasko pzelasko merged commit 427cabd into master Jun 2, 2023
@pzelasko pzelasko deleted the feature/cutset-from-files branch June 2, 2023 14:55