[Bugfix] Adds outlines performance improvement #5053
lynkz-matt-psaltis wants to merge 5 commits into vllm-project:main
Conversation
```python
raise TypeError(f"Unsupported instruction type {type(instruction)}")

# Retrieve allowed tokens from cache using the current state
cacheKey = instruction.id
```
It seems to me that `instruction` has no attribute named `id`.
Yes, with outlines==0.0.46 there doesn't seem to be an `id` attribute. I'm testing with `cacheKey = hash(tuple(allowed_tokens))` instead.
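A minimal sketch of that workaround, assuming the surrounding caching logic: since `instruction.id` is unavailable, the cache is keyed on a hash of the allowed-token list itself. The names `_token_cache` and `get_allowed_tokens_cached` are illustrative, not from the PR, and the cached value here is just the tuple so the sketch stays dependency-free.

```python
# Hypothetical sketch: cache keyed by hash(tuple(allowed_tokens)) rather
# than by an instruction id that outlines==0.0.46 does not expose.
_token_cache = {}

def get_allowed_tokens_cached(allowed_tokens):
    cache_key = hash(tuple(allowed_tokens))
    cached = _token_cache.get(cache_key)
    if cached is not None:
        return cached  # cache hit: reuse the previously built value
    # Cache miss: the PR builds a pinned tensor here; the sketch just
    # stores the tuple so it runs without torch installed.
    value = tuple(allowed_tokens)
    _token_cache[cache_key] = value
    return value
```

Note that keying the dict directly on `tuple(allowed_tokens)` would sidestep any hash-collision concern, at the cost of holding the full tuple as the key.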
```python
# Cache miss, calculate allowed tokens and cache them
np_allowed_tokens = np.array(allowed_tokens, dtype=np.int32)
allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens).pin_memory()
```
I'm doing some of my testing with the CPU backend. I don't know if there is a better way to test for the availability of `pin_memory()`, but I'm running it with:
```python
allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens)
try:
    allowed_tokens_tensor = allowed_tokens_tensor.pin_memory()
except NotImplementedError:
    pass
```
I'm testing this change with this request: I'm using a very small model on purpose so that a potential performance improvement can stand out more against the time it takes to run the model. Excluding the first call and averaging the next three calls, the result is 22 s both with the baseline and with the changes in this PR, so I'm not seeing a big difference. But perhaps this is a case that doesn't benefit from the caching in this PR. FYI, this test was run on an 80 GB A100.
This pull request has merge conflicts that must be resolved before it can be merged.
Closing as this PR is quite old and conflicts with |
Borrows the outlines upgrade from #4109 (against Guide) and detects state resets to clear the cache.
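An illustrative sketch (not the PR's actual code) of the "detects state resets to clear cache" idea: if the guide's state returns to the initial state after having made progress, a new sequence has started, so the per-state cache is flushed before reuse. All names here are hypothetical.

```python
class StateTokenCache:
    """Cache per-state results; clear everything when a state reset
    (a return to the initial state) indicates a new sequence."""

    def __init__(self, initial_state=0):
        self.initial_state = initial_state
        self.last_state = initial_state
        self.cache = {}

    def lookup(self, state, compute):
        # A return to the initial state after progress means a reset.
        if state == self.initial_state and self.last_state != self.initial_state:
            self.cache.clear()
        self.last_state = state
        if state not in self.cache:
            self.cache[state] = compute(state)
        return self.cache[state]
```

This keeps stale allowed-token entries from a previous request from being served once the guide restarts, under the assumption that the initial state is only revisited on a reset.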