Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prompt tokenization bugfix #4197

Merged
merged 5 commits into from
May 19, 2022
Merged

Prompt tokenization bugfix #4197

merged 5 commits into from
May 19, 2022

Conversation

vadam5
Copy link
Contributor

@vadam5 vadam5 commented May 18, 2022

What does this PR do ?

  1. adds <> around virtual prompt token string
  2. Removes pseudo token configurability from the config file in favor of using a consistent pseudo token set by us
  3. Makes PROMPT lowercase

Collection: BigNLP

Changelog

Updated megatron_gpt_prompt_learning_model
Updated prompt learning dataset unit tests
Added Enum for virtual prompt token string

Usage

Same as before

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Virginia Adams <[email protected]>
@vadam5 vadam5 requested a review from yidong72 May 18, 2022 20:51
Copy link
Collaborator

@yidong72 yidong72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a tiny comment. Good job using the enum type.

@@ -71,8 +71,8 @@ def get_task_templates():
"task_id_num": 0,
}
task_templates['task name B'] = {
"prompt_template": "<|VIRTUAL_PROMPT_0|>{question}<|VIRTUAL_PROMPT_1|>{answer}{extra}",
"prompt_template_fields": ['question', 'answer'],
"prompt_template": "<|VIRTUAL_PROMPT_0|> {question} <|VIRTUAL_PROMPT_1|> {answer}{extra}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why you add the space around the virtual tokens? If space is needed, virtual prompt should learn to be a space embedding?

Copy link
Contributor Author

@vadam5 vadam5 May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was playing around with the unit tests to help diagnose a bug with the huggingface tokenizer. This change isn't needed, I just forgot to put it back. I removed the spaces so they should be back the way they were.

]

return pseudo_tokens

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you have the similar logics in the prompt_lenaring_model file. Put the function there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that

Signed-off-by: Virginia Adams <[email protected]>
Copy link
Collaborator

@yidong72 yidong72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vadam5 vadam5 merged commit 7943fb5 into main May 19, 2022
@vadam5 vadam5 deleted the prompt_tokenization_bugfix branch May 19, 2022 19:51
yaoyu-33 pushed a commit that referenced this pull request May 24, 2022
* Updated default virtual token placeholder

Signed-off-by: Virginia Adams <[email protected]>

* Python style fix

Signed-off-by: Virginia Adams <[email protected]>

* Addressed reviewer comments

Signed-off-by: Virginia Adams <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants