Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker #915

Closed
nimishbongale opened this issue Sep 24, 2024 · 7 comments


@nimishbongale

Describe the bug

The model card here mentions that the phi3-mini-128k-instruct-onnx model is directly deployable to SageMaker runtime using get_huggingface_llm_image_uri("huggingface", version="2.2.0") as the image URI. However, on deploying, SageMaker fails to recognize the ONNX model and attempts to find PyTorch weights, thus failing the deployment.

I'm loading the model into the /opt/ml/model folder using an S3 URI, and then setting the same path as HF_MODEL_ID.
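For reference, here is a rough sketch of the deployment flow described above (the role, S3 path, and instance type are placeholders, not the exact values from my setup):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data="s3://my-bucket/phi-3-mini-128k-instruct-onnx/model.tar.gz",  # placeholder S3 URI
    env={"HF_MODEL_ID": "/opt/ml/model"},
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")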


To Reproduce
Steps to reproduce the behavior:

  1. Try deploying the ONNX model to AWS SageMaker using the latest SageMaker SDK (sagemaker==2.32.1)

Expected behavior
The model should deploy seamlessly, just as phi-3-mini-128k-instruct (non-ONNX) does.

Screenshots

[screenshot attached]

Desktop (please complete the following information):

  • AWS SageMaker Runtime
@nimishbongale
Author

[screenshot attached]

@nimishbongale nimishbongale changed the title Phi3 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker Sep 24, 2024
@kunal-vaishnavi
Contributor

The AWS SageMaker instructions are auto-generated by Hugging Face on the model cards. Hugging Face assumes that each repo contains PyTorch models, which is why you are getting a FileNotFoundError in this repo containing only ONNX models.

For deploying ONNX models to AWS SageMaker, there are some online guides that you can follow. Here is one example, starting from the "Create an Inference Handler" section. Here is an example with Triton Inference Server.

You can also use Azure ML to deploy ONNX models. Here is a guide you can follow. For a more detailed example, you can look at this guide.

Please note that there are multiple ONNX models uploaded in this repo. You can follow this example to pick one of the ONNX models to load using Hugging Face's Optimum. Then you can use Optimum to manage the generation loop with model.generate(...) in your inference script. Optimum will use ONNX Runtime under the hood and manage preparing the inputs for you.
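To make the inference-handler approach concrete, here is a rough sketch of an inference.py that loads the ONNX model with Optimum. The handler names (model_fn, predict_fn) follow the SageMaker Hugging Face inference toolkit, and the generation settings are placeholders; adapt it to the ONNX variant you pick from the repo.

# inference.py -- sketch of a custom handler that serves the ONNX model via Optimum.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM


def model_fn(model_dir):
    # model_dir is where SageMaker unpacks the model archive (e.g. /opt/ml/model).
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = ORTModelForCausalLM.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=data.get("max_new_tokens", 256))
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}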

@nimishbongale
Author

nimishbongale commented Sep 25, 2024

Thanks for the help @kunal-vaishnavi

I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum, as you mentioned.

Are there plans to include Optimum or onnxruntime-genai support in the next TGI images within SageMaker?

@kunal-vaishnavi
Contributor

I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum, as you mentioned.

If you create your own image on top of existing TGI images, you can install and use ONNX Runtime GenAI directly instead of Optimum for the best performance in your custom inference script. Here is an example inference script that you can modify for AWS SageMaker.
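A minimal sketch of that flow with ONNX Runtime GenAI is below; the model path, prompt template, and search options are placeholders to adapt in your own script.

import onnxruntime_genai as og

# Load the ONNX model folder available inside the container (placeholder path).
model = og.Model("/opt/ml/model")
tokenizer = og.Tokenizer(model)

# Phi-3 style chat template; adjust the prompt to your use case.
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode_batch([prompt])

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
params.input_ids = input_tokens

output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))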

Are there plans to include Optimum or onnxruntime-genai support in the next TGI images within SageMaker?

According to this issue, Hugging Face's TGI currently doesn't support ONNX models. We will discuss internally to see if adding ONNX Runtime GenAI is possible.

@nimishbongale
Author

nimishbongale commented Sep 26, 2024

Thanks once again @kunal-vaishnavi

I've actually used the model-generate.py file for the deployment, and that has gone smoothly. However, a couple of observations:

  1. The phi3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt (a sketch of how I pass this on the non-ONNX deployment follows below). The onnxruntime-genai library, however, does not allow setting this variable later on.
  2. At times the QA type (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.
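For context, this is roughly how the echo is suppressed per request on the non-ONNX (TGI) deployment; predictor here stands for the SageMaker predictor returned by deploy(), and the prompt is a placeholder.

# Sketch: passing "return_full_text": False in the request payload to the TGI endpoint.
response = predictor.predict({
    "inputs": "<|user|>\nWhat is ONNX?<|end|>\n<|assistant|>\n",
    "parameters": {"return_full_text": False, "max_new_tokens": 256},
})
print(response)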

Do let me know if you have any pointers in this regard; appreciate the help!

@kunal-vaishnavi
Contributor

The phi3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt. The onnxruntime-genai library, however, does not allow setting this variable later on.

There isn't an environment variable to set this in ONNX Runtime GenAI. But you can filter out the user prompt with some additional logic in the inference script.

The input tokens are set here.

params.input_ids = input_tokens

After the output tokens have been generated, you can go through them and remove the first $N_b$ tokens per batch entry, where $N_b$ is the length of the input tokens at batch entry $b$.

output_tokens = model.generate(params)

Here is some pseudocode for a naive implementation.

# Strip the prompt tokens from each generated sequence so only the newly generated tokens remain.
output_tokens_without_user_prompt = []
for b in range(len(output_tokens)):
    N_b = len(input_tokens[b])                      # length of the prompt for batch entry b
    without_user_prompt = output_tokens[b][N_b : ]  # keep only tokens generated after the prompt
    output_tokens_without_user_prompt.append(without_user_prompt)

Then, when you print the generated tokens, the user prompt will not be re-printed.

for i in range(len(prompts)): 
    print(f'Prompt #{i}: {prompts[i]}') 
    print() 
    print(tokenizer.decode(output_tokens_without_user_prompt[i])) 
    print() 

Please note that this logic may need to be modified to handle padding in the input tokens when calculating $N_b$.

At times the QA type (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.

We are able to reproduce this, and we will look into it.

@nimishbongale
Author

Thanks a lot for the detailed response @kunal-vaishnavi! Appreciate it, closing this discussion for now 👍
