
random_seed seems to be ignored (or at least inconsistent) for inflight_batcher_llm #468

Open · 2 of 4 tasks
dyoshida-continua opened this issue May 21, 2024 · 4 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

@dyoshida-continua commented May 21, 2024

System Info

I've converted Llama 3 using TensorRT-LLM's convert_checkpoint script, and am serving it with the inflight_batcher_llm template. I'm trying to get diverse samples for a fixed input, but I've found that if I make several requests concurrently, several will have identical outputs.

I'm setting top_p=1, top_k=1024, temperature=1.0, beam_width=1, and generating a unique random seed for each request. The requests are being made over the gRPC API, and I'm using v0.9.0 of TensorRT-LLM and tensorrtllm_backend.
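
This is roughly how I'm building each request over gRPC. It is a minimal sketch using the Python tritonclient; the model name "ensemble" and the tensor names text_input, max_tokens, top_k, top_p, temperature, beam_width, and random_seed are assumptions based on the standard inflight_batcher_llm ensemble config.pbtxt, so adjust them if your setup differs:

```python
import numpy as np
import tritonclient.grpc as grpcclient

def build_inputs(prompt: str, seed: int):
    # Wrap a single-row input tensor for the ensemble model.
    def tensor(name, values, triton_dtype, np_dtype):
        t = grpcclient.InferInput(name, [1, len(values)], triton_dtype)
        t.set_data_from_numpy(np.array([values], dtype=np_dtype))
        return t

    return [
        tensor("text_input", [prompt.encode()], "BYTES", object),
        tensor("max_tokens", [128], "INT32", np.int32),
        tensor("top_k", [1024], "INT32", np.int32),
        tensor("top_p", [1.0], "FP32", np.float32),
        tensor("temperature", [1.0], "FP32", np.float32),
        tensor("beam_width", [1], "INT32", np.int32),
        tensor("random_seed", [seed], "UINT64", np.uint64),  # unique per request
    ]
```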

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Serve a model (essentially following this guide, with some settings changed: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/)
  2. Make 5 gRPC requests concurrently (see the sketch after these steps)
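
A sketch of step 2, reusing the hypothetical build_inputs() helper above; the "ensemble" model name and "text_output" tensor are likewise assumptions from the standard setup:

```python
import random
from concurrent.futures import ThreadPoolExecutor

import tritonclient.grpc as grpcclient

def send_request(seed: int):
    # One client per call so the 5 requests are fully independent.
    client = grpcclient.InferenceServerClient("localhost:8001")
    result = client.infer("ensemble", build_inputs("Hello, my name is", seed))
    client.close()
    return result.as_numpy("text_output").flatten()[0]

seeds = random.sample(range(1, 1_000_000), 5)
with ThreadPoolExecutor(max_workers=5) as pool:
    outputs = list(pool.map(send_request, seeds))

# Expected: 5 distinct completions. Observed: several come back identical.
print(len(set(outputs)), "distinct outputs out of", len(outputs))
```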

Expected behavior

I expect each request with a different seed to yield a different response.

Actual behavior

Several of the 5 responses are consistently identical.

Additional notes

I changed my test script to wait for each response before sending the next request, and with that change all 5 outputs are distinct, so the concurrency/in-flight batching really does seem to be the problem.
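
The serial variant is essentially just the loop below (same hypothetical send_request() as in the reproduction sketch):

```python
# Wait for each response before sending the next request;
# this reliably produces 5 distinct outputs for me.
for seed in random.sample(range(1, 1_000_000), 5):
    print(send_request(seed))
```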

dyoshida-continua added the bug label May 21, 2024
@dyoshida-continua (Author)

Another interesting detail: the identical sequences I observe in the concurrent case are the same from run to run, even though I'm sampling the random seed from 1 to 1,000,000.

For example, with the input <|begin_of_text|>Hello, my name is, I saw the continuation "Ahmed, and I am an experienced Software Engineer with proficiency..." in 3/5 responses, and then in 2/5 responses on the next run. I did not observe this prefix at all when making requests serially.

@dyoshida-continua (Author)

@byshiue I incorrectly typed your name when opening this issue originally. Can you comment on whether there's a workaround for this? It's currently making batch inference effectively useless.

@chiendb97

@dyoshida-continua I applied the solution described in this pull request: NVIDIA/TensorRT-LLM#1742, and it resolved the issue for me.

@byshiue (Collaborator) commented Jun 7, 2024

Thank you for helping to reply, @chiendb97. Since NVIDIA/TensorRT-LLM#1742 is related to a fix for the random seed setting, it might be related to your issue, @dyoshida-continua. Could you give it a try?

byshiue self-assigned this Jun 7, 2024
byshiue added the triaged label Jun 7, 2024