Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker #915

Closed
nimishbongale opened this issue Sep 24, 2024 · 7 comments


@nimishbongale

Describe the bug

The model card here mentions that the phi3-mini-128k-instruct-onnx model is directly deployable to SageMaker runtime using get_huggingface_llm_image_uri("huggingface", version="2.2.0") as the image URI. However, on deploying, SageMaker fails to recognize the ONNX model and attempts to find PyTorch weights, thus failing the deployment.

I'm loading the model into the /opt/ml/model folder using an S3 URI, and then setting the same path as HF_MODEL_ID.
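For reference, here is a rough sketch of the deployment flow described above (the role, S3 path, and instance type are placeholders, not the exact values from my setup):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data="s3://my-bucket/phi-3-mini-128k-instruct-onnx/model.tar.gz",  # placeholder S3 URI
    env={"HF_MODEL_ID": "/opt/ml/model"},
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")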


To Reproduce
Steps to reproduce the behavior:

  1. Try deploying the ONNX model to AWS SageMaker using the latest SageMaker SDK (sagemaker==2.32.1)

Expected behavior
The model should deploy seamlessly, just as phi-3-mini-128k-instruct (non-ONNX) does.

Screenshots

[screenshot attached]

Desktop (please complete the following information):

  • AWS SageMaker Runtime
@nimishbongale
Author

[screenshot attached]

@nimishbongale nimishbongale changed the title Phi3 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker Sep 24, 2024
@kunal-vaishnavi
Contributor

The AWS SageMaker instructions are auto-generated by Hugging Face on the model cards. Hugging Face assumes that each repo contains PyTorch models, which is why you are getting a FileNotFoundError in this repo containing only ONNX models.

For deploying ONNX models to AWS SageMaker, there are some online guides that you can follow. Here is one example, starting from the "Create an Inference Handler" section. Here is an example with Triton Inference Server.

You can also use Azure ML to deploy ONNX models. Here is a guide you can follow. For a more detailed example, you can look at this guide.

Please note that there are multiple ONNX models uploaded in this repo. You can follow this example to pick one of the ONNX models to load using Hugging Face's Optimum. Then you can use Optimum to manage the generation loop with model.generate(...) in your inference script. Optimum will use ONNX Runtime under the hood and manage preparing the inputs for you.
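To make the inference-handler approach concrete, here is a rough sketch of an inference.py that loads the ONNX model with Optimum. The handler names (model_fn, predict_fn) follow the SageMaker Hugging Face inference toolkit, and the generation settings are placeholders; adapt it to the ONNX variant you pick from the repo.

# inference.py -- sketch of a custom handler that serves the ONNX model via Optimum.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM


def model_fn(model_dir):
    # model_dir is where SageMaker unpacks the model archive (e.g. /opt/ml/model).
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = ORTModelForCausalLM.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=data.get("max_new_tokens", 256))
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}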

@nimishbongale
Author

nimishbongale commented Sep 25, 2024

Thanks for the help @kunal-vaishnavi

I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum, as you mentioned.

Are there plans to include Optimum or onnxruntime-genai support in the next TGI images within SageMaker?

@kunal-vaishnavi
Contributor

I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum, as you mentioned.

If you create your own image on top of existing TGI images, you can install and use ONNX Runtime GenAI directly instead of Optimum for the best performance in your custom inference script. Here is an example inference script that you can modify for AWS SageMaker.
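A minimal sketch of that flow with ONNX Runtime GenAI is below; the model path, prompt template, and search options are placeholders to adapt in your own script.

import onnxruntime_genai as og

# Load the ONNX model folder available inside the container (placeholder path).
model = og.Model("/opt/ml/model")
tokenizer = og.Tokenizer(model)

# Phi-3 style chat template; adjust the prompt to your use case.
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode_batch([prompt])

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
params.input_ids = input_tokens

output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))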

Are there plans to include Optimum or onnxruntime-genai support in the next TGI images within SageMaker?

According to this issue, Hugging Face's TGI currently doesn't support ONNX models. We will discuss internally to see if adding ONNX Runtime GenAI is possible.

@nimishbongale
Author

nimishbongale commented Sep 26, 2024

Thanks once again @kunal-vaishnavi

I've actually used the model-generate.py file for the deployment, and that has gone smoothly. However, a couple of observations:

  1. The phi3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt (a sketch of how I pass this on the non-ONNX deployment follows below). The onnxruntime-genai library, however, does not allow setting this variable later on.
  2. At times the QA type (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.
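For context, this is roughly how the echo is suppressed per request on the non-ONNX (TGI) deployment; predictor here stands for the SageMaker predictor returned by deploy(), and the prompt is a placeholder.

# Sketch: passing "return_full_text": False in the request payload to the TGI endpoint.
response = predictor.predict({
    "inputs": "<|user|>\nWhat is ONNX?<|end|>\n<|assistant|>\n",
    "parameters": {"return_full_text": False, "max_new_tokens": 256},
})
print(response)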

Do let me know if you have any pointers in this regard; appreciate the help!

@kunal-vaishnavi
Contributor

The phi3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt. The onnxruntime-genai library, however, does not allow setting this variable later on.

There isn't an environment variable to set this in ONNX Runtime GenAI. But you can filter out the user prompt with some additional logic in the inference script.

The input tokens are set here.

params.input_ids = input_tokens

After the output tokens have been generated, you can go through them and remove the first $N_b$ tokens per batch entry, where $N_b$ is the length of the input tokens at batch entry $b$.

output_tokens = model.generate(params)

Here is some pseudocode for a naive implementation.

# Strip the prompt tokens from each generated sequence so only the newly generated tokens remain.
output_tokens_without_user_prompt = []
for b in range(len(output_tokens)):
    N_b = len(input_tokens[b])                      # length of the prompt for batch entry b
    without_user_prompt = output_tokens[b][N_b : ]  # keep only tokens generated after the prompt
    output_tokens_without_user_prompt.append(without_user_prompt)

Then, when you print the generated tokens, the user prompt will not be re-printed.

for i in range(len(prompts)): 
    print(f'Prompt #{i}: {prompts[i]}') 
    print() 
    print(tokenizer.decode(output_tokens_without_user_prompt[i])) 
    print() 

Please note that this logic may need to be modified to handle padding in the input tokens when calculating $N_b$.

At times the QA type (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.

We are able to reproduce this, and we will look into it.

@nimishbongale
Author

Thanks a lot for the detailed response @kunal-vaishnavi! Appreciate it, closing this discussion for now 👍
