Update ILQL details #156
Conversation
as it already is in the ppo_randomwalks example
```diff
@@ -36,7 +36,6 @@ def metric_fn(samples: List[str]) -> Dict[str, List[float]]:
     imdb = load_dataset("imdb", split="train+test")

     trlx.train(
-        "gpt2",
```
Why are we changing this in the example?
I like this change. Previously, if your ILQL config set the `model_path` option, it would be overridden by this `"gpt2"` arg. That led to unexpected behavior on my end, where editing the example ILQL config's model path didn't actually update the model.
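A minimal before/after sketch of the call, assuming the tuple-style `(samples, rewards)` dataset used elsewhere in this thread; the config path and dataset contents are illustrative:

```python
import trlx
from trlx.data.configs import TRLConfig

# Hypothetical toy dataset in the (samples, rewards) tuple form
dataset = (["a sample completion"], [1.0])

config = TRLConfig.load_yaml("configs/ilql_config.yml")  # sets model.model_path

# Before this PR: the positional "gpt2" overrode config.model.model_path
# trlx.train("gpt2", dataset=dataset, config=config)

# After: the model is taken from the config
trlx.train(dataset=dataset, config=config)
```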
Yes, it's for consistency with other examples
Agreed
```python
import torch

from trlx.orchestrator import Orchestrator, register_orchestrator
from trlx.pipeline.offline_pipeline import ILQLRolloutStorage


def tokenize_dialogue(
```
Why are we adding dialogue-related functionality to the core API?
I think having this function in trlx makes sense; it makes the completion-split-based preprocessing for ILQL a bit clearer, although we should probably pair it with a multi-turn example to show it working.
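A hypothetical usage sketch; the import path is an assumption based on where this PR appears to define the function, and the dialogue strings are made up:

```python
from transformers import AutoTokenizer

# Assumed import location for tokenize_dialogue from this PR
from trlx.orchestrator.offline_orchestrator import tokenize_dialogue

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Interleaved (question_1, answer_1, question_2, answer_2) form
dialogue = ["How are you?", " Fine, thanks.", " And the weather?", " Sunny."]
phrases = tokenize_dialogue(dialogue, tokenizer, max_length=32)
# phrases: one list of token ids per phrase, truncated from the left
```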
That makes sense.
Perhaps including it is fine but I think it should be optional (and not the default)
Where exactly does this fix the memory issue?
Implicitly it fixes the issue by removing it (see https://wandb.ai/sorry/public/runs/17v1c8st):

```python
import yaml

import trlx
from trlx.data.configs import TRLConfig
from datasets import load_dataset

default_config = yaml.safe_load(open("configs/ilql_dalio.yml"))


def main(hparams={}):
    config = TRLConfig.update(default_config, hparams)

    # hold out the first val_split rows for evaluation
    val_split = 16
    dataset = load_dataset("ChaiML/dalio_scored_responses_v1")["train"]
    valid_ds = dataset[:val_split]
    train_ds = dataset[val_split:]

    eval_prompts = valid_ds["input_text"]
    # (input, output) pairs paired with their scores
    dataset = (
        list(zip(
            train_ds["input_text"],
            train_ds["output_text"],
        )),
        train_ds["score"],
    )

    trlx.train(
        dataset=dataset,
        config=config,
        eval_prompts=eval_prompts,
    )


if __name__ == "__main__":
    main()
```
```diff
@@ -48,18 +51,20 @@ def heads(self, hidden_size: int, vocab_size: int):

     def loss(self, outputs, labels: ILQLBatch):
         logits, (qs, target_qs, vs) = outputs
         terminal_mask = labels.dones[:, :-1]
```
Can you remind me why we need this?
It's the same as `attention_mask`, except it accounts only for `states_ixs` (the subset of tokens from which continuations were sampled), plus the last token, which is also masked. It lets V(terminal) = 0 in `V[:, 1:] * terminal_mask[:, 1:]`, and it also masks padding in the other losses.
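A toy illustration of the masking, with made-up shapes (not the PR's exact code):

```python
import torch

# dones marks live steps per trajectory; trailing zeros are padding
dones = torch.tensor([[1, 1, 1, 1, 0, 0]])
terminal_mask = dones[:, :-1]          # shape (1, 5), drops the last column

V = torch.randn(1, 5)                  # per-state value estimates
# zeroes V at the terminal step (and padding) in the TD target
masked_next_V = V[:, 1:] * terminal_mask[:, 1:]
```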
```python
for phrase in sample:
    if isoutput:
        actions_ixs.append(
            torch.arange(length - 1, length + len(phrase) - 1)
```
Why do states and actions append exactly the same thing?
It can be refactored; the difference between the two is a few lines below.
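A hedged sketch of the intended difference, with toy numbers (the exact PR code differs):

```python
import torch

# Assumed toy layout: prompt occupies tokens [0..3], output phrase [4..6]
length, phrase_len = 4, 3

# actions: positions from which each output token is predicted
actions_ixs = torch.arange(length - 1, length + phrase_len - 1)  # tensor([3, 4, 5])

# states: the same positions plus the final token, so V also covers the
# terminal state
states_ixs = torch.hstack((actions_ixs, torch.tensor([length + phrase_len - 1])))
# tensor([3, 4, 5, 6])
```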
```python
all_input_ids.append(torch.tensor(sum(sample, [])))
isoutput = False
actions_ixs, states_ixs = [], []
for phrase in sample:
```
So a sample can contain multiple "phrases" which we can reward?
Is there a reason we can't just separate each phrase into its own datapoint, with the "output" at the end (and all the prior dialogue as context)? Do you see some advantage to treating things this way?
Yes, both are supported and opt-in.
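A hedged sketch of the two layouts in the tuple-style `(samples, rewards)` form used above; the strings and scores are made up:

```python
# (a) one action per datapoint: all prior dialogue as context, a single
#     output phrase at the end as the graded action
split_per_turn = (
    [["Q1", " A1"], ["Q1 A1 Q2", " A2"]],
    [0.3, 0.9],
)

# (b) one multi-turn trajectory: several output phrases in a single sample,
#     graded with one score; useful when per-turn rewards are unavailable
multi_turn = (
    [["Q1", " A1", " Q2", " A2"]],
    [0.9],
)
```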
```python
    dialogue: Union[str, List[str]], tokenizer, max_length=2048, truncation_side="left"
) -> List[int]:
    """
    Tokenize sample with the interleaved form of (question_1, answer_1, question_2, answer_2...)
```
Do we intend for answer_2 to depend on question_1, answer_1?
Yes, ILQL was designed to do this
```python
            break

    # in case of odd number of phrases (possibly due to truncation)
    # since the first phrase always has to be a prompt, force it to be <bos>
```
Shouldn't the last phrase always be a prompt? Why must the first always be a prompt?
Counting from the left, the last phrase should be an action; otherwise there would be nothing to prompt, and it would be redundant, since the trajectory (more specifically, the actions) might as well be graded without it. At least the datasets we are interested in are, to my knowledge, of this form.
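A toy illustration of the odd-phrase fix-up from the hunk below, with made-up token ids:

```python
bos_token_id = 50256                  # gpt2's bos/eos id (illustrative)

# After left truncation an output phrase may end up first, breaking the
# prompt/answer alternation
out = [[523, 11, 42], [7, 8]]         # assumed: first phrase is an output here

out[0].pop(0)                         # drop one token to stay within max_length
out.insert(0, [bos_token_id])         # force the first phrase to be a lone <bos>
assert out == [[50256], [11, 42], [7, 8]]
```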
```python
        out[0].pop(0)
        out.insert(0, [tokenizer.bos_token_id])

    elif truncation_side == "right":
```
When would we want this? If we truncate on the right, the agent is missing the most recent dialogue.
This function accepts both dialogues and monologues and it's in anticipation of
Left some comments. In general I'm kind of skeptical about the need to support multiple "actions" (responses) in one trajectory. I don't see why we can't handle this just by splitting a dialogue with 10 interactions into 10 samples, each of which only has one action (at the very end of the dialogue). Perhaps including it is fine, but I think it should be optional. (Sorry if I misunderstood something.)
No Alex, this is a great question. The difference would be when you don't have intermediate rewards for each cumulative interaction.
This PR removes `split_token`, instead delegating preprocessing per example:

https://api.wandb.ai/report/sorry/t4w326pt
https://api.wandb.ai/report/sorry/93xza5j0

The difference in randomwalks is due to removing `<bos>` from samples and specifying starting nodes directly instead.