[feat] Gather experience samples #305
Conversation
re the side-note: Maybe we need to call
Oh, I forgot to mention that scores from the sentiment pipeline remain the same, as the following passes:

scores = torch.tensor(
    self.reward_fn(
        samples=str_samples,
        prompts=str_prompts,
        outputs=str_outputs,
    ),
    dtype=torch.float,
).to(device)
scores2 = torch.tensor(
    self.reward_fn(
        samples=str_samples,
        prompts=str_prompts,
        outputs=str_outputs,
    ),
    dtype=torch.float,
).to(device)
assert torch.all(scores == scores2)

However, the number of calls to reward_fn still influences the rng state, and thus the run (see the side-note).
Awesome! Glad to see you push us to the CW side hehe. Left one comment, if you could address it before merging 🙏
scores = torch.empty(len(samples), device=device)
torch.distributed.scatter(scores, all_scores)

str_samples, str_prompts, str_outputs = self.decode(prompt_tensors, samples)
Looks like we're decoding tokens twice now, which doesn't seem to slow anything down judging from the system plots (if anything, the difference is probably from the gather overhead). Does decode mutate the inputs in any way, such that the line below will be different from a single call to decode?
trlx/trlx/trainer/accelerate_ppo_trainer.py, lines 325 to 329 in 2a45c08:

str_samples, str_prompts, str_outputs = self.decode(prompt_tensors, samples)
# Pad the sample outputs
outputs = self.tokenizer(str_outputs).input_ids
outputs = list(map(torch.LongTensor, outputs))
To avoid repetition I could scatter samples as well, but I don't think it's worth it, since the second decode (which would be second only on the main rank) is needed just to strip stop_sequences, so it's basically just an allocation. Runtime in make_experience is still dominated by generate and reward_fn.
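For illustration only, a hedged sketch (not trlx's actual decode) of what that second pass on the main rank boils down to, truncating each decoded output at the first stop sequence:

from typing import List

def strip_stop_sequences(text: str, stop_sequences: List[str]) -> str:
    # Cut the decoded string at the earliest occurrence of any stop sequence.
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text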
Yeah, that's fine! (I actually meant to approve this last night but forgot to 😅)
The sentiments pipeline is in eval-mode, so at least it's not any stochasticity from dropout in that distilbert RM. Also, the returns/values stats sort of explode for sentiments on main - it's interesting that the repeat call smooths things out.
This is great! Happy that we're finally merging something like this.
Thanks Max!
@jon-tow Even if the pipeline is in eval-mode, rng state is still accessed somewhere from within it. A simple test:

torch.manual_seed(1000)
print(f'{torch.rand(1)=}')
print(f'{torch.rand(1)=}')
print(f'{torch.rand(1)=}')
>>> torch.rand(1)=tensor([0.3189])
>>> torch.rand(1)=tensor([0.6136])
>>> torch.rand(1)=tensor([0.4418])

torch.manual_seed(1000)
print(f'{torch.rand(1)=}')
print(f'{torch.rand(1)=}')
reward_fn(['1'])
print(f'{torch.rand(1)=}')
>>> torch.rand(1)=tensor([0.3189])
>>> torch.rand(1)=tensor([0.6136])
>>> torch.rand(1)=tensor([0.2724])

Since none of the ranks except the main now access the rng from within the pipeline, the runs differ slightly from the reference. Here's the reward_fn used:

def get_positive_score(scores):
    "Extract value associated with a positive sentiment from pipeline's output"
    return dict(map(lambda x: tuple(x.values()), scores))["POSITIVE"]

sentiment_fn = pipeline(
    "sentiment-analysis",
    top_k=2,
    truncation=True,
    batch_size=256,
    device=device,
)

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    sentiments = list(map(get_positive_score, sentiment_fn(samples)))
    return sentiments

https://wandb.ai/sorry/trlx/reports/Gather-experience-samples-305--VmlldzozNTM1OTUz

Here are two additional reports: a single-process run, and a run with a changed reward_fn:

def reward_fn(samples, **kwargs):
    return [len(s) for s in samples]
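If the goal is only to keep the training rng stream identical regardless of how often the pipeline runs, one option is to sandbox the reward call. A minimal sketch (an assumption, not part of this PR) using torch.random.fork_rng:

import torch

def rng_isolated_reward_fn(reward_fn, samples, device_index=0):
    # fork_rng snapshots the CPU rng state (and the rng of the listed CUDA
    # devices) and restores it on exit, so whatever the sentiment pipeline
    # touches internally is invisible to the rest of training.
    devices = [device_index] if torch.cuda.is_available() else []
    with torch.random.fork_rng(devices=devices):
        return reward_fn(samples)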
@reciprocated yeah that's weird. If you call the underlying distilbert with some arbitrary inputs, the RNG doesn't change: https://colab.research.google.com/drive/1FCmeWEJGl5GAhikeUXXR5VrdHOt4S6WN?usp=sharing
We should ping the Hugging Face folks.
This PR lets the PPO trainer gather all experience samples on the main rank and make a single joint reward_fn call per rollout.
This enables hosting a single reward model on the same machine as the main rank, whereas previously every process had to have access to the reward model. It also enables deliberate micro-batching for the reward model, unlike the case where each process queries the reward model (for example, one deployed on a Triton server) with its own small chunk_size batch of samples, which usually bottlenecks the whole training.
The main goal here is to drop the current dependency on the Triton server and to enable simple, self-contained 7+1 (RM) or 15+1 (RM) setups (we can finally move to CW).
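For illustration, a rough, hedged sketch of the gather → single reward_fn call → scatter flow described above (function and variable names are illustrative, not the PR's code; it assumes an initialized process group and equal per-rank sample counts):

import torch
import torch.distributed as dist

def gathered_rewards(reward_fn, str_samples, device):
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # 1) Collect every rank's decoded samples on the main rank.
    all_samples = [None] * world_size if rank == 0 else None
    dist.gather_object(str_samples, all_samples, dst=0)

    # 2) Only the main rank queries the reward model, once, over the joint
    #    batch, so a single RM instance and its micro-batch size can be chosen
    #    freely instead of each process sending its own small chunk.
    if rank == 0:
        flat = [s for per_rank in all_samples for s in per_rank]
        all_scores = torch.tensor(reward_fn(samples=flat), dtype=torch.float, device=device)
        chunks = [c.contiguous() for c in all_scores.split(len(str_samples))]
    else:
        chunks = None

    # 3) Hand each rank back its own slice of the scores.
    scores = torch.empty(len(str_samples), device=device)
    dist.scatter(scores, chunks, src=0)
    return scores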
https://wandb.ai/sorry/trlx/reports/Gather-experience-samples-305--VmlldzozNTQ0OTkz
https://wandb.ai/sorry/trlx/reports/Gather-experience-samples-305---VmlldzozNTMxMzc3
Side-note: every reference run remains the same except for sentiments. After some debugging I've noticed that even doing multiple redundant passes of the sentiment pipeline over the same data apparently changes the rng state, or otherwise slightly influences the run.
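A minimal sketch of this effect (an illustration, not the snippet from the original description): any extra consumption of the global rng stream, such as a redundant pipeline pass, shifts every subsequently sampled token.

import torch

logits = torch.ones(8)

torch.manual_seed(0)
reference = torch.multinomial(logits, num_samples=4, replacement=True)

torch.manual_seed(0)
_ = torch.rand(1)  # stand-in for the rng access observed inside the sentiment pipeline
perturbed = torch.multinomial(logits, num_samples=4, replacement=True)

print(reference, perturbed)  # the two draws generally differ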