[fix] Make gather_for_metrics usage more strict #315
Conversation
Nice! I left one comment; please address it when you're free.
trlx/utils/modeling.py (outdated)

```diff
@@ -280,6 +281,24 @@ def update(self, xs: torch.Tensor) -> Tuple[float, float]:
         return xs_mean, (xs_var * xs_count / (xs_count - 1)).sqrt()
 
 
+def gather_for_metrics(tensor, expected_number, batch_size, length):
```
This seems to work in our case because the tensor has expected_number == dataset_size, but if you loop through the eval dataloader and collect metrics with this, the last batch might contain duplicate entries from exhausted ranks when the world size, dataset size, and batch size aren't aligned. Maybe we should leave a warning comment so that in the future we know to manually truncate them in such loops (see the sketch below)? One might expect it to have the same behavior as accelerate's function here.
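A minimal sketch of the manual truncation this refers to (not from the PR; it assumes an accelerate-prepared eval dataloader and a hypothetical `model_step` helper):

```python
import torch

all_outputs = []
for batch in eval_dataloader:
    outputs = model_step(batch)                      # hypothetical per-batch computation
    all_outputs.append(accelerator.gather(outputs))  # plain gather keeps the padded tail
gathered = torch.cat(all_outputs)
# The last batch may contain duplicate entries from exhausted ranks,
# so truncate to the true dataset length before computing metrics.
gathered = gathered[: len(eval_dataloader.dataset)]
```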
What you describe is exactly what this PR amends (or at least intends to 🥲).
Let's use very unaligned hyperparameters: batch_size=12, world_size=3, len(eval_prompts)=109. Running on main with

```
accelerate launch --num_processes 3 --config_file configs/accelerate/zero2-bf16.yaml examples/ppo_sentiments.py
```

gives only 1 eval sample – https://wandb.ai/sorry/trlx/runs/vh3ayv6b
(to reproduce, you can pull main...1-eval-sample or go to main and set those values manually, as in the sketch below)
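A rough sketch of those manual overrides (the field names are assumptions and may not match the exact config schema in examples/ppo_sentiments.py):

```python
# Assumed names, for illustration only
config.train.batch_size = 12
eval_prompts = [str(i) for i in range(109)]  # 109 prompts, deliberately not divisible by 12 * 3
```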
But with the new gather_for_metrics it's back to the expected 109 – https://wandb.ai/sorry/trlx/runs/c6agq89q/
Let's also check that the prompts weren't perturbed (because it's ambiguous in this example) by setting eval_prompts = list(map(str, range(109))): https://wandb.ai/sorry/trlx/runs/c6agq89q
(On a similar note: can one add tests that rely on multiprocessing but would still be run on CI?)
Oh yeah, this definitely fixes our specific eval code, but if you were to use this to collect metrics in a dataloader loop (for example, when one big tensor doesn't fit in RAM) it might still duplicate entries. You can repro the behavior with this gist: https://gist.github.com/jon-tow/304efadc9fce470d7b4f7212d5cfcf18 (sorry, I couldn't figure out how to simulate multi-process on Google Colab lol). I could be misunderstanding things, so let me know :)
Re CI with tests on multi-process functions: I'm not sure 😅 We could always write the test and have it around to at least run locally, then skip the test case in CI. I'll get back to you on that!
That's a great code snippet Jon, thanks! I see now that you are correct: accelerate.gather_for_metrics and this gather_for_metrics are two slightly different functions with different usages. To avoid introducing any additional code, it's possible to refactor the existing code to resolve the issue instead: accelerate.gather_for_metrics now happens for each batch. Additionally, elements of the batch are now padded to the batch elements' max_length instead of the global seq_length, and they are collected as lists, to avoid one big allocation (possibly bigger than RAM), since they don't have to be tensors afterwards. A rough sketch of the new flow is below.
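Something along these lines (assumed names, not the exact diff), using accelerate's pad_across_processes and gather_for_metrics per batch:

```python
all_str_samples = []
for prompts in eval_dataloader:
    samples = generate(prompts)  # hypothetical generation step
    # Pad only to this batch's longest sample across ranks, not to a global seq_length
    samples = accelerator.pad_across_processes(
        samples, dim=1, pad_index=tokenizer.pad_token_id
    )
    # gather_for_metrics drops the duplicated tail entries of the last batch
    samples = accelerator.gather_for_metrics(samples)
    # Keep results as plain Python lists so there is never one huge tensor allocation
    all_str_samples.extend(tokenizer.batch_decode(samples, skip_special_tokens=True))
```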
109 example: https://wandb.ai/sorry/trlx/runs/t8nc849j
Comparison with main: https://wandb.ai/sorry/trlx/reports/Make-gather_for_metrics-more-strict-315--VmlldzozNTkyMTUy
Re CI with tests on multi-process functions: I guess there is no rush for it as of now 😅
Oh, that's a very clean refactor! For some reason, this breaks seq2seq evaluation with a RuntimeError from non-contiguous tensors in the underlying gather call 🤔 Can you reproduce it on your end? Full traceback and config run here: https://gist.github.com/jon-tow/8154845bd05cea3946e35b4a7f89a88c
I think once this is cleared up we'll be good 🤞
Thanks for spotting it in my stead! The fix is easy enough (see the sketch below): https://wandb.ai/sorry/trlx/reports/Make-gather_for_metrics-usage-more-strict-315--VmlldzozNTk2ODEz
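A sketch of the kind of one-line fix this implies (distributed gather ops require contiguous tensors, so slices produced during seq2seq decoding need to be made contiguous first); not necessarily the exact diff:

```python
# Make the tensor contiguous before the distributed gather to avoid the RuntimeError
samples = accelerator.gather_for_metrics(samples.contiguous())
```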
Changed the title: "Make gather_for_metrics more strict" → "Make gather_for_metrics usage more strict"
Awesome! The gather-hanging issue I brought up at the meeting seems related to the num_rollouts config as you mentioned, so ignore it :)
This PR makes accelerate.gather_for_metrics happen for each batch. Additionally, elements of the batch are now padded to the batch elements' max_length instead of the global seq_length, and they are collected as lists, to avoid one big allocation (possibly bigger than RAM), since they don't have to be tensors afterwards.
https://wandb.ai/sorry/trlx/reports/Make-gather_for_metrics-more-strict-315--VmlldzozNTkyMTUy