Conversation

@garrett361 (Contributor)

In #765, additional string metadata was added to the dataset. This conflicts with the DataCollatorForSeq2Seq collator, which expects only tensor-convertible data, and makes finetune.py fail (when --packing False) with errors like:

Traceback (most recent call last):
  File "/proj/data-eng/swanand/sft_dpo/venv_sft_dpo/venv/venv_sft_dpo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 767, in convert_to_tensors
    tensor = as_tensor(value)
  File "/proj/data-eng/swanand/sft_dpo/venv_sft_dpo/venv/venv_sft_dpo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 729, in as_tensor
    return torch.tensor(value)
ValueError: too many dimensions 'str'

This error arises when trying to create a tensor from a list of strings, e.g. torch.tensor(["hello"]).
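A minimal reproduction of that conversion step, using only torch (the values are illustrative):

```python
import torch

# Numeric columns convert fine.
torch.tensor([[1, 2, 3], [4, 5, 6]])

# A string-valued column cannot be converted; this is effectively what the
# collator's padding path attempts for every column in the batch.
try:
    torch.tensor(["hello"])
except ValueError as err:
    print(err)  # too many dimensions 'str'
```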

This PR adds a utility for filtering non-tensor columns out of the dataset before collation, and uses the filter in both the SFT and DPO scripts.
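For reference, here is a minimal sketch of the kind of filter described (the function name, allowlist, and example columns below are illustrative assumptions, not the PR's actual code), using the Hugging Face datasets API:

```python
from datasets import Dataset

# Illustrative allowlist: columns DataCollatorForSeq2Seq can pad and convert to tensors.
TENSORABLE_COLUMNS = {"input_ids", "attention_mask", "labels"}

def remove_non_tensor_columns(dataset: Dataset) -> Dataset:
    """Drop columns (e.g. string metadata) that cannot be converted to tensors."""
    extra = [col for col in dataset.column_names if col not in TENSORABLE_COLUMNS]
    return dataset.remove_columns(extra)

# Tiny example with a string metadata column like the one that triggers the error.
ds = Dataset.from_dict(
    {
        "input_ids": [[1, 2, 3], [4, 5]],
        "labels": [[1, 2, 3], [4, 5]],
        "dataset_origin": ["tulu", "tulu"],  # string column; torch.tensor would fail on it
    }
)
print(remove_non_tensor_columns(ds).column_names)  # ['input_ids', 'labels']
```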

CC @hamishivi @jacob-morrison

Specifically, the issue is the addition of the DATASET_ORIGIN_KEY metadata here.

I believe this wrapper is only needed for finetune.py and dpo_cache_tune.py, and not for reward_modeling.py or reward_modeling_eval.py, but I have only tested finetune.py end-to-end with these fixes.

@hamishivi (Collaborator)

Thanks for noticing this and for the PR; I believe it is fixed by #825 (which I just merged).

@hamishivi closed this on Jul 28, 2025
@garrett361 (Contributor, Author)

Yes, hadn't seen that one! Thanks.
