remove_non_tensor_columns #831
Closed
In #765, additional string metadata was added to the dataset, which conflicts with the `DataCollatorForSeq2Seq` collator function, which expects only tensor data. This makes `finetune.py` fail (when `--packing False`) with an error that arises when trying to create a tensor from a list of strings, e.g. `torch.tensor(["hello"])`.

This PR adds a utility for filtering non-tensor columns out of the dataset before use and applies the filter in both the SFT and DPO scripts.
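For readers without the diff at hand, the idea of the filter can be sketched roughly as below. This is a minimal, dependency-free illustration, not the code in this PR: the helper names `is_tensor_like` and `drop_non_tensor_columns` and the `"dataset"` key are hypothetical, and the sketch assumes each example is a plain dict whose tensor-bound values are numbers or nested lists of numbers.

```python
from numbers import Number


def is_tensor_like(value):
    """Return True for numbers and (possibly nested) lists/tuples of numbers.

    Hypothetical helper: strings and other objects return False, since a
    collator cannot turn them into a numeric tensor.
    """
    if isinstance(value, Number):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_tensor_like(item) for item in value)
    return False


def drop_non_tensor_columns(features):
    """Return a copy of each example keeping only tensor-like fields.

    Hypothetical helper: `features` is a list of dicts, as a data collator
    would receive it. The input list is not modified.
    """
    return [
        {key: value for key, value in example.items() if is_tensor_like(value)}
        for example in features
    ]


# Example: the string metadata field (key name assumed) is filtered out
# before the batch would reach the collator.
batch = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3], "dataset": "origin_a"},
]
cleaned = drop_non_tensor_columns(batch)
```

Filtering per example like this leaves the token fields untouched, so a collator such as `DataCollatorForSeq2Seq` would only see values it can pad and convert.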
CC @hamishivi @jacob-morrison
The issue specifically is the addition of the `DATASET_ORIGIN_KEY` metadata here.

I believe this wrapper is only needed for `finetune.py` and `dpo_cache_tune.py`, and not needed for `reward_modeling.py` or `reward_modeling_eval.py`, but I have only tested `finetune.py` with these fixes end to end.