Fix JSON Serialization Error in TrainerState due to np.float32 #3250 #3251
Conversation
Fix float32 serialization in TrainerState & improve KD training script
Hi @tomaarsen – just following up on this PR, which fixes a JSON serialization issue in TrainerState. The PR has workflows awaiting approval. I'd really appreciate your review or any feedback to help move this forward. Let me know if any changes are needed; I'm happy to revise the PR if there's a better approach. Thanks for your time!
…rmers in model_distillation.py"
Hi @tomaarsen – just a quick update on this PR: I've fixed the missing imports that caused the earlier test failures. All checks should now pass once the workflow is approved and re-run. Would appreciate your time to approve the workflow when you get a chance! Let me know if any further changes are needed. Thanks again!
Hello! Thank you for opening this, and apologies for my radio silence so far. PRs like these are always easiest to test when my hardware is freed up to run some of the related training scripts, but I've been running models for #3222 all day lately. When that PR is ready, I'll definitely switch focus to the new open PRs before I release v4.0.
Hi Tom, thanks again!
I ran the model_distillation.py script and wasn't able to reproduce the error on my end. Could you perhaps check if you can run the script with the latest master branch? The fix from #3096 for the evaluator outputs may already cover this.
Hey Tom, thanks for checking this! I'm currently re-running model_distillation.py on the latest master and will confirm shortly whether the issue persists. Quick question though: the error I saw came from TrainerState.log_history, where np.float32 metrics fail JSON serialization. Since #3096 focuses on evaluator outputs, I just wanted to double-check: could that fix also have impacted TrainerState.log_history? I'll also share my transformers version in case it's relevant. Thanks again!
Yes, it seems like the evaluator outputs also affect log_history. I looked into the trainer_state.json that was saved during training:
{
"best_metric": null,
"best_model_checkpoint": null,
"epoch": 1.3888888888888888,
"eval_steps": 100,
"global_step": 500,
"is_hyper_param_search": false,
"is_local_process_zero": true,
"is_world_process_zero": true,
"log_history": [
{
"epoch": 0.2777777777777778,
"eval_runtime": 1.2965,
"eval_samples_per_second": 0.0,
"eval_steps_per_second": 0.0,
"eval_sts-dev_pearson_cosine": 0.7796666906730896,
"eval_sts-dev_spearman_cosine": 0.7730397171620476,
"step": 100
},
{
"epoch": 0.5555555555555556,
"eval_runtime": 1.2716,
"eval_samples_per_second": 0.0,
"eval_steps_per_second": 0.0,
"eval_sts-dev_pearson_cosine": 0.8386008826071591,
"eval_sts-dev_spearman_cosine": 0.8350261474348337,
"step": 200
},
{
"epoch": 0.8333333333333334,
"eval_runtime": 1.2818,
"eval_samples_per_second": 0.0,
"eval_steps_per_second": 0.0,
"eval_sts-dev_pearson_cosine": 0.8555351042472681,
"eval_sts-dev_spearman_cosine": 0.8517600639199182,
"step": 300
},
{
"epoch": 1.1111111111111112,
"eval_runtime": 1.2651,
"eval_samples_per_second": 0.0,
"eval_steps_per_second": 0.0,
"eval_sts-dev_pearson_cosine": 0.8610353558300488,
"eval_sts-dev_spearman_cosine": 0.8579782016575516,
"step": 400
},
{
"epoch": 1.3888888888888888,
"grad_norm": 0.6396923661231995,
"learning_rate": 1.4489164086687308e-05,
"loss": 0.0442,
"step": 500
},
{
"epoch": 1.3888888888888888,
"eval_runtime": 1.2579,
"eval_samples_per_second": 0.0,
"eval_steps_per_second": 0.0,
"eval_sts-dev_pearson_cosine": 0.862790555751594,
"eval_sts-dev_spearman_cosine": 0.8611115838346327,
"step": 500
}
],
"logging_steps": 500,
"max_steps": 1440,
"num_input_tokens_seen": 0,
"num_train_epochs": 4,
"save_steps": 100,
"stateful_callbacks": {
"TrainerControl": {
"args": {
"should_epoch_stop": false,
"should_evaluate": false,
"should_log": false,
"should_save": true,
"should_training_stop": false
},
"attributes": {}
}
},
"total_flos": 0.0,
"train_batch_size": 16,
"trial_name": null,
"trial_params": null
}
So, it looks like the evaluation results (e.g. eval_sts-dev_pearson_cosine and eval_sts-dev_spearman_cosine) do end up in log_history.
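For background on why only some values cause trouble: np.float64 subclasses Python's float and is accepted by the standard json module, while np.float32 is not, which is why metrics computed in single precision can break json.dump. A minimal sketch (not from the thread) showing the difference:

```python
import json

import numpy as np

# np.float64 subclasses Python's float, so json accepts it.
print(json.dumps({"metric": np.float64(0.78)}))  # {"metric": 0.78}

# np.float32 is NOT a float subclass, so json raises the error discussed here.
try:
    json.dumps({"metric": np.float32(0.78)})
except TypeError as err:
    print(err)  # Object of type float32 is not JSON serializable
```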
Hey @tomaarsen, apologies for the delay, and thanks for the detailed insight on trainer_state.json and log_history. I dug into it a bit more, and it turns out the root of the issue is how the evaluator results are handled in Sentence Transformers.

Specifically, in sentence_transformers/training/losses/MultipleNegativesRankingLoss.py, the call to trainer.evaluate(eval_dataset=evaluator) returns a dict of results, but unlike the Trainer.evaluate() call in Hugging Face transformers (which internally logs to log_history), Sentence Transformers doesn't seem to manage logging for these evaluator results in the same way. As a result, when the trainer.log(output) line runs, it appends the full dict, including all evaluator metrics, directly into log_history. The eventual json.dump then crashes on non-serializable types (e.g. a tensor or a NumPy scalar).

So essentially, the evaluator output is appended without being sanitized or routed through the standard Trainer logging mechanisms, which leads to the JSON dumping error. That's why we see extra entries in log_history with all the eval_sts-* metrics from the evaluator. Let me know what you think!
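As a rough sketch of the kind of sanitization being described here (the helper name and structure are illustrative, not the actual Sentence Transformers internals), converting NumPy scalars to native Python types before the metrics reach log_history would avoid the crash:

```python
import numpy as np


def sanitize_metrics(metrics: dict) -> dict:
    """Convert NumPy scalar values to plain Python types so that
    anything appended to TrainerState.log_history stays JSON serializable."""
    clean = {}
    for key, value in metrics.items():
        if isinstance(value, np.generic):  # np.float32, np.int64, ...
            clean[key] = value.item()
        else:
            clean[key] = value
    return clean


# Example: evaluator-style output containing an np.float32 score.
output = {"eval_sts-dev_pearson_cosine": np.float32(0.7797), "step": 100}
print(sanitize_metrics(output))  # {'eval_sts-dev_pearson_cosine': 0.7796999..., 'step': 100}
```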
I'm a bit confused: I'm also on transformers v4.49.0 and I haven't had issues yet. Do you still get errors if you run model_distillation.py on the latest master branch?
To clarify, I did run model_distillation.py on the latest master branch of Sentence Transformers using transformers v4.49.0, and I still hit the JSON serialization error. Here's the relevant part of the traceback (full traceback trimmed for brevity):
TypeError: Object of type float32 is not JSON serializable
From what I can tell, the evaluator output still seems to be getting appended to log_history with non-serializable types (like float32), which triggers the crash during the JSON save. Let me know if you'd like any more details from my end!
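A small debugging helper (hypothetical, not part of transformers or Sentence Transformers) can pinpoint exactly which log_history entries and keys would trip json.dump:

```python
import json


def find_unserializable(log_history: list[dict]) -> list[tuple[int, str, type]]:
    """Return (entry_index, key, type) for every value json cannot encode."""
    offenders = []
    for i, entry in enumerate(log_history):
        for key, value in entry.items():
            try:
                json.dumps(value)
            except TypeError:
                offenders.append((i, key, type(value)))
    return offenders


# Usage after training, assuming a transformers/SentenceTransformers trainer object:
# print(find_unserializable(trainer.state.log_history))
```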
Hmm, that's tricky. I'm still not able to reproduce it. I just ran the script and it uploaded https://huggingface.co/tomaarsen/TinyBERT_L-4_H-312_v2-distilled-from-stsb-roberta-base-v2-new with 3.5.0.dev0 (i.e. the current master branch).
Thanks again for double-checking and running the script; it's super helpful to know that it worked on your end and that you even uploaded the result, really appreciate that! I was wondering whether this could be an environment-specific edge case. I'm running this on Python 3.12.1 (inside a VSCode Dev Container workspace), and I've seen that json.dumps can sometimes be stricter about types in newer Python versions. Do you think it's possible that some evaluator outputs might still return float32 under certain conditions (maybe depending on dataset or hardware)? I'm happy to dig into this further or test with an older Python version if you think it might help narrow it down.
I was able to reproduce it in Python 3.12, but only once for some reason. I think I was able to find the underlying issue, though. I'll open a PR in a bit.
Hey @tomaarsen, really appreciate you taking the time to look into this and track it down! It makes sense now, especially with Python 3.12 being a bit stricter on JSON serialization. The fix in your PR looks clean and definitely more robust than my initial idea of patching save_to_json directly. Thanks again for following up and opening the PR!
Thanks for the update; I've merged the alternative fix, which should hopefully work on all example scripts instead of just the distillation one. I'm closing this one under the expectation that your issue is now solved, but please do let me know if the issue persists even with the merged fix.
TrainerState.save_to_json fails due to np.float32 values not being JSON serializable, causing crashes when saving training state.
Problem:
The TrainerState.save_to_json method in the transformers library fails when trying to save training state because np.float32 values are not natively serializable in JSON. This issue causes the training process to crash when saving checkpoints.
Steps to Reproduce:
Run the model distillation script using SentenceTransformerTrainer.
The training proceeds normally but fails when saving state due to an np.float32 serialization error.
The error message typically looks like:
TypeError: Object of type float32 is not JSON serializable
Expected Behavior:
The training state should save without errors.
The script should complete training and store checkpoints correctly.
Proposed Fix:
Convert np.float32 values to Python float before saving JSON.
Patch TrainerState.save_to_json to handle this conversion (one possible shape of such a patch is sketched below).
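One possible shape for such a patch, sketched under the assumption that TrainerState is importable from transformers.trainer_callback and that save_to_json serializes the dataclass with the standard json module (the fix actually merged upstream took a different route, sanitizing the evaluator outputs instead):

```python
import dataclasses
import json

import numpy as np
from transformers.trainer_callback import TrainerState


def _save_to_json(self, json_path: str) -> None:
    """Drop-in replacement that converts NumPy scalars to plain Python types."""

    def _default(obj):
        # json calls this for anything it cannot serialize natively.
        if isinstance(obj, np.generic):  # covers np.float32, np.int64, ...
            return obj.item()
        raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

    json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True, default=_default)
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json_string + "\n")


# Monkey-patch before training so checkpoint saves use the tolerant encoder.
TrainerState.save_to_json = _save_to_json
```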