In ShareGPT, why is the conversation from the human accumulated? #11

Open
qmpham opened this issue Apr 5, 2023 · 5 comments

Comments

@qmpham

qmpham commented Apr 5, 2023

[Screenshot: line 329 of data_loading.py]

@chiayewken
Collaborator

Hi, to train the model to generate GPT-like responses, we set the target sequence to the GPT response and the input/source sequence to the previous dialog history.
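
For illustration, here is a minimal sketch of that pairing, assuming ShareGPT-style from/value fields (this is not the repository's exact code):

def make_examples(conversation: list) -> list:
    # Hypothetical helper: each GPT turn becomes one training example whose
    # source is the accumulated dialog so far and whose target is the GPT reply.
    examples = []
    history = ""
    for turn in conversation:
        if turn["from"] == "human":
            history += "Human: " + turn["value"] + "\n"
        else:
            # The reply is conditioned on all previous turns, hence the accumulation.
            examples.append((history, turn["value"]))
            history += "Assistant: " + turn["value"] + "\n"
    return examples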

@qmpham
Author

qmpham commented Apr 6, 2023

But LLaMA has a maximum input length of only 2048 tokens.

@chiayewken
Collaborator

This can be handled by the data loader/tokenizer. For example, we truncate the input on the left side if it exceeds the max length:

def __getitem__(self, i: int) -> dict:
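
A rough sketch of what left-side truncation inside __getitem__ could look like; the attribute names (self.examples, self.tokenizer, self.max_length) and the -100 label masking are assumptions for illustration, not the actual data_loading.py code:

def __getitem__(self, i: int) -> dict:
    source, target = self.examples[i]
    source_ids = self.tokenizer(source, add_special_tokens=False)["input_ids"]
    target_ids = self.tokenizer(target, add_special_tokens=False)["input_ids"]
    # Reserve room for the target, then drop the oldest source tokens (left side).
    budget = max(self.max_length - len(target_ids), 0)
    if len(source_ids) > budget:
        source_ids = source_ids[-budget:]
    input_ids = source_ids + target_ids
    # Mask the source so only the target tokens contribute to the loss.
    labels = [-100] * len(source_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}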

@qmpham
Author

qmpham commented Apr 6, 2023

Yes, I understand. But why keep such a long input when the model's capacity is only 2048 tokens? You risk truncating the question that the target response addresses.

@chiayewken
Collaborator

That's true, the dialog commonly exceeds the maximum sequence length during training. However, we can mitigate this by truncating inputs on the left side, so that the most recent dialog history on the right is preserved:
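
For example (illustrative only; the checkpoint name and setup are assumptions, not this repository's configuration), a Hugging Face tokenizer can be told to truncate from the left:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.truncation_side = "left"  # drop the oldest tokens first

dialog_history = "Human: ...\nAssistant: ...\nHuman: latest question"
encoded = tokenizer(dialog_history, truncation=True, max_length=2048)
# encoded["input_ids"] keeps only the most recent 2048 tokens of the history.

This way the latest user question stays intact while older turns are dropped.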
