Update benchmark to run openorca dataset #21

Merged: morgandu merged 3 commits into main from mor--golden-openorca-dataset on Mar 30, 2024

morgandu
Contributor

No description provided.

@morgandu morgandu requested review from JoeZijunZhou, patemotter, FanhaiLu1 and vipannalla and removed request for vipannalla March 29, 2024 02:15
@morgandu morgandu force-pushed the mor--golden-openorca-dataset branch 2 times, most recently from 5f5540f to 54194d3 Compare March 29, 2024 02:26
@@ -42,7 +42,7 @@
python -m benchmarks.benchmark_serving \
--request-rate 1

e2e example: python3 benchmark_serving.py --tokenizer /home/rwitten/maxtext/assets/tokenizer --num-prompts 100 --dataset ~/ShareGPT_V3_unfiltered_cleaned_split.json
e2e example: python3 benchmark_serving.py --tokenizer /home/rwitten/maxtext/assets/tokenizer --num-prompts 100 --dataset ~/ShareGPT_V3_unfiltered_cleaned_split.json
Collaborator

@JoeZijunZhou JoeZijunZhou Mar 29, 2024

We could also update the example here to reflect your change, along with the README in /benchmark.

@morgandu morgandu force-pushed the mor--golden-openorca-dataset branch from 54194d3 to 7477e24 Compare March 29, 2024 22:46
@morgandu morgandu force-pushed the mor--golden-openorca-dataset branch from 7477e24 to 71c111c Compare March 29, 2024 22:51
@morgandu morgandu force-pushed the mor--golden-openorca-dataset branch from e9e3ac7 to 4f41058 Compare March 30, 2024 01:13
Collaborator

@JoeZijunZhou JoeZijunZhou left a comment

We can merge this change first, since I need to release a new JetStream Python package. We can do the refactor later, since the current sample filter logic is identical for both datasets.

Comment on lines +206 to +211
# Tokenize the prompts and completions.
prompts = dataset["prompts"]
outputs = dataset["results"]
n = len(prompts)
prompt_token_ids = tokenizer.tokenize(prompts)
output_token_ids = tokenizer.tokenize(outputs)
Collaborator

Could we extract this part into a separate function for each dataset? The rest is identical, so it could stay in the sample_request function.
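
For illustration only, the suggested split might look roughly like the sketch below. It is not the code in this PR. The loader names, the ShareGPT record layout, the length thresholds, and the tokenizer being passed in as a plain callable are all assumptions made for the example; only the "prompts" and "results" keys come from the snippet under review.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pairs = Tuple[List[str], List[str]]
# Assumption: the tokenizer is modelled as a callable mapping a batch of
# strings to one list of token ids per string.
TokenizeFn = Callable[[Sequence[str]], List[List[int]]]


def load_openorca_pairs(dataset: Dict[str, List[str]]) -> Pairs:
    """Dataset-specific step: read the parallel prompt/output lists."""
    return list(dataset["prompts"]), list(dataset["results"])


def load_sharegpt_pairs(dataset: List[dict]) -> Pairs:
    """Dataset-specific step for an assumed ShareGPT-style layout, where each
    record holds a "conversations" list of {"value": ...} turns."""
    prompts, outputs = [], []
    for record in dataset:
        turns = record.get("conversations", [])
        if len(turns) >= 2:  # need a prompt turn and a reply turn
            prompts.append(turns[0]["value"])
            outputs.append(turns[1]["value"])
    return prompts, outputs


def sample_requests(
    pairs: Pairs,
    tokenize: TokenizeFn,
    max_prompt_len: int = 1024,  # hypothetical threshold
    max_output_len: int = 1024,  # hypothetical threshold
) -> List[Tuple[str, int, int]]:
    """Shared step: tokenize, then apply the same filter to every dataset."""
    prompts, outputs = pairs
    prompt_token_ids = tokenize(prompts)
    output_token_ids = tokenize(outputs)
    requests = []
    for prompt, p_ids, o_ids in zip(prompts, prompt_token_ids, output_token_ids):
        if not p_ids or not o_ids:
            continue  # drop empty prompts or completions
        if len(p_ids) > max_prompt_len or len(o_ids) > max_output_len:
            continue  # drop samples that exceed the length limits
        requests.append((prompt, len(p_ids), len(o_ids)))
    return requests
```

With this shape, supporting another dataset would mean adding one small loader that returns (prompts, outputs), while the tokenize-and-filter logic stays in one place.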

Contributor Author

Discussed offline: some of the existing data processing may not be necessary. I will revisit and refactor the data preprocessing part if needed.

@morgandu morgandu merged commit 81beb11 into main Mar 30, 2024
3 checks passed
@morgandu morgandu deleted the mor--golden-openorca-dataset branch March 30, 2024 01:33
3 participants