@fabianlim
Collaborator

The best way to test the data loading is to run the actual script (e.g., finetune) and then pull out the data samples that it creates.

  • Doing this guarantees that the samples are created in the same way as in a real run.
  • We can test this without torch.distributed.
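One way such an extraction could work is sketched below with `unittest.mock`. This is only an illustration of the idea, not the code in this PR: `ToyTrainer`, `training_step`, and `capture_batches` are hypothetical names, and the toy batches stand in for what the real dataloader would yield.

```python
from unittest import mock


class ToyTrainer:
    """Hypothetical stand-in for a training script (not the PR's code)."""

    def training_step(self, batch):
        # In a real script this would need a model and possibly torch.distributed.
        raise RuntimeError("expensive step; should be mocked in tests")

    def run(self, batches, max_train_steps):
        # Feed batches to the step function until the step limit is reached.
        for step, batch in enumerate(batches):
            if step >= max_train_steps:
                break
            self.training_step(batch)


def capture_batches(trainer, batches, max_train_steps):
    """Run the loop with the training step replaced by a recorder."""
    captured = []
    # The mock records each batch instead of running the real step.
    with mock.patch.object(trainer, "training_step", side_effect=captured.append):
        trainer.run(batches, max_train_steps)
    return captured


# Toy batches; a real run would yield dicts of tensors instead.
batches = [{"input_ids": [[1, 2]]}, {"input_ids": [[3, 4]]}, {"input_ids": [[5, 6]]}]
data = capture_batches(ToyTrainer(), batches, max_train_steps=2)
```

Because only the step is mocked, the data path still runs as the script runs it, which is the property the approach above relies on.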

In this PR we provide:

  • utilities for extracting data samples from the training script
  • a test_finetune function (and script) that extracts the data

See the usage example below:

from scripts.test_dataloading import test_finetune

data = test_finetune(
    model_name_or_path='/path/to/model/',
    train_file='/path/to/train_file', 
    max_train_steps=2, # number of steps to run
)

We can then get the samples from data[0], data[1], and so on:

# data[0]
{'input_ids': tensor([[100256, 100256, 100256,  ...,  74477,     13, 100257],
         [100256, 100256, 100256,  ...,   8082,     13, 100257],
         [100256, 100256, 100256,  ...,   5687,     13, 100257],
         ...,
         [100256, 100256, 100256,  ...,   3157,     13, 100257],
         [100256, 100256, 100256,  ...,    279,   8712, 100257],
         [100256, 100256, 100256,  ...,  67092,   1210, 100257]]),
 'labels': tensor([[  -100,   -100,   -100,  ...,  74477,     13, 100257],
         [  -100,   -100,   -100,  ...,   8082,     13, 100257],
         [  -100,   -100,   -100,  ...,   5687,     13, 100257],
         ...,
         [  -100,   -100,   -100,  ...,   3157,     13, 100257],
         [  -100,   -100,   -100,  ...,    279,   8712, 100257],
         [  -100,   -100,   -100,  ...,  67092,   1210, 100257]])}
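With the samples in hand, a simple consistency check can be run on them. The sketch below assumes, based only on the output above, that 100256 is the padding token and that -100 marks ignored label positions. The helper name `find_label_mismatches` is hypothetical. It works on nested lists, so tensors would first be converted with `.tolist()`.

```python
IGNORE_INDEX = -100     # label value at masked positions in the output above
PAD_TOKEN_ID = 100256   # assumption: inferred from the left-padding shown above


def find_label_mismatches(input_ids, labels):
    """Return (row, col) positions where labels disagree with input_ids.

    A position is flagged when a padding token is not masked, or when an
    unmasked label differs from the corresponding input token.
    """
    mismatches = []
    for r, (ids_row, label_row) in enumerate(zip(input_ids, labels)):
        for c, (tok, lab) in enumerate(zip(ids_row, label_row)):
            if tok == PAD_TOKEN_ID and lab != IGNORE_INDEX:
                mismatches.append((r, c))
            elif lab != IGNORE_INDEX and lab != tok:
                mismatches.append((r, c))
    return mismatches


# Small hand-made batches in the same layout as the sample above.
good = find_label_mismatches(
    [[100256, 100256, 13, 100257]],
    [[-100, -100, 13, 100257]],
)
bad = find_label_mismatches(
    [[100256, 13, 100257]],
    [[100256, 13, 100257]],  # the padding position is not masked
)
```

For an extracted sample, this could be called as `find_label_mismatches(data[0]['input_ids'].tolist(), data[0]['labels'].tolist())`, where an empty list means no inconsistencies were found.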

@fabianlim fabianlim requested a review from dangxuanhong June 21, 2025 20:53
@garrett361
Owner

Looks good to me. I haven't used mocks before; they look super useful. Are we going to use LoRA?

@fabianlim
Collaborator Author

fabianlim commented Jun 23, 2025

Looks good to me. I haven't used mocks before; they look super useful. Are we going to use LoRA?

Not sure yet.

@fabianlim fabianlim merged commit a40871f into padding-free Jun 23, 2025
@fabianlim fabianlim deleted the data-tester branch June 23, 2025 19:57
@fabianlim fabianlim mentioned this pull request Jun 25, 2025
3 participants