Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker #915
The AWS SageMaker instructions are auto-generated by Hugging Face on the model cards. Hugging Face assumes that each repo contains PyTorch models, which is why you are getting that error.

For deploying ONNX models to AWS SageMaker, there are some online guides that you can try to follow. Here is an example you can start from at the "Create an Inference Handler" section. Here is an example with Triton Inference Server.

You can also use Azure ML to deploy ONNX models. Here is a guide you can follow. For a more detailed example, you can look at this guide.

Please note that there are multiple ONNX models uploaded in this repo. You can follow this example to pick one of the ONNX models to load using Hugging Face's Optimum. Then you can use Optimum to manage the generation loop.
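As a rough, untested sketch of that Optimum approach (the subfolder and ONNX file name below are assumptions about the repo layout, so verify them against the actual repo before using them):

```python
# Sketch only: load one of the ONNX variants from the repo with Optimum and
# run generation. The subfolder and file_name values are assumptions -- check
# the repo and pick the variant that matches your hardware.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct-onnx"
subfolder = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"  # assumed variant
file_name = "phi3-mini-128k-instruct-cpu-int4-rtn-block-32-acc-level-4.onnx"  # assumed file name

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = ORTModelForCausalLM.from_pretrained(model_id, subfolder=subfolder, file_name=file_name)

prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```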
Thanks for the help @kunal-vaishnavi. I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum like you mentioned. Are there plans to include Optimum or onnxruntime-genai support in the next TGI images within SageMaker?
If you create your own image on top of the existing TGI images, you can install and use ONNX Runtime GenAI directly instead of Optimum for the best performance in your custom inference script. Here is an example inference script that you can modify for AWS SageMaker.

According to this issue, Hugging Face's TGI currently doesn't support ONNX models. We will discuss internally to see if adding ONNX Runtime GenAI is possible.
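As an illustration only, a minimal ONNX Runtime GenAI generation script looks roughly like the sketch below. The model path and prompt template are placeholders, and the onnxruntime-genai Python API has changed across releases, so check the official examples for the version you install.

```python
# Sketch of a minimal ONNX Runtime GenAI generation flow (API style of the
# early onnxruntime-genai releases; verify against your installed version).
import onnxruntime_genai as og

model = og.Model("/opt/ml/model")  # placeholder: directory containing the ONNX model files
tokenizer = og.Tokenizer(model)

prompts = ["<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>"]  # placeholder prompt template

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode_batch(prompts)

output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```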
Thanks once again @kunal-vaishnavi. I've actually used the example inference script you linked. Is there a way, such as an environment variable, to keep the original user prompt from being included in the generated output?

Do let me know if you have any pointers in this regard, appreciate the help!
There isn't an environment variable to set this in ONNX Runtime GenAI. But you can filter out the user prompt with some additional logic in the inference script. The input tokens are set here.

After the output tokens have been generated, you can go through them and remove the first `N_b` tokens from each sequence, where `N_b` is the number of input tokens for that batch entry.
Here is some pseudocode for a naive implementation.

```python
# Strip the prompt tokens from the front of each generated sequence
output_tokens_without_user_prompt = []
for b in range(len(output_tokens)):
    N_b = len(input_tokens[b])  # number of prompt tokens for batch entry b
    without_user_prompt = output_tokens[b][N_b:]
    output_tokens_without_user_prompt.append(without_user_prompt)
```

Then, when you print the generated tokens, the user prompt will not be re-printed.

```python
for i in range(len(prompts)):
    print(f"Prompt #{i}: {prompts[i]}")
    print()
    print(tokenizer.decode(output_tokens_without_user_prompt[i]))
    print()
```

Please note that this logic may need to be modified to handle padding in the input tokens when calculating `N_b`.
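For example, one naive way to account for padding, assuming the pad token id is known, would be to count only the non-pad tokens:

```python
# Sketch only: if the input batch was padded, compute N_b from the number of
# non-pad tokens rather than the full padded length.
pad_token_id = 32000  # placeholder value; use your tokenizer's actual pad token id
N_b = sum(1 for token in input_tokens[b] if token != pad_token_id)
```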
We are able to repro this and we will look into it.
Thanks a lot for the detailed response @kunal-vaishnavi! Appreciate it, closing this discussion for now 👍
Describe the bug
The model card here mentions that the Phi-3-mini-128k-instruct-onnx model is directly deployable to the SageMaker runtime using `get_huggingface_llm_image_uri("huggingface", version="2.2.0")` as the image URI. However, on deploying, SageMaker fails to recognize the ONNX model and attempts to find PyTorch weights, which fails the build. I'm loading the model into the `/opt/ml/model` folder using an S3 URI, and then setting the same path as `HF_MODEL_ID`.
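For reference, a sketch of the kind of deployment code this describes (not the exact code used here; the S3 URI, role, and instance type are placeholders) follows the standard SageMaker Hugging Face flow:

```python
# Sketch of the standard SageMaker Hugging Face (TGI) deployment described above.
# The S3 URI and instance type are placeholders, not the reporter's actual values.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    model_data="s3://<your-bucket>/phi-3-mini-128k-instruct-onnx/model.tar.gz",  # placeholder S3 URI
    env={"HF_MODEL_ID": "/opt/ml/model"},  # point the container at the unpacked model directory
    role=role,
)

# This is where the failure occurs: the TGI container looks for PyTorch weights
# and does not recognize the ONNX files.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```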
.To Reproduce
Steps to reproduce the behavior:
Expected behavior
The model should deploy seamlessly, just like the non-ONNX Phi-3-mini-128k-instruct does.