[Bugfix] Adds outlines performance improvement#5053

Closed
lynkz-matt-psaltis wants to merge 5 commits into vllm-project:main from lynkz-matt-psaltis:feature/outlines-latest-perf

Conversation

@lynkz-matt-psaltis

Borrows the outlines upgrade from #4109 (targeting the `Guide` interface) and detects state resets to clear the cache.

@lynkz-matt-psaltis lynkz-matt-psaltis force-pushed the feature/outlines-latest-perf branch 2 times, most recently from af2c735 to 93481e2 Compare June 8, 2024 06:55
@lynkz-matt-psaltis lynkz-matt-psaltis force-pushed the feature/outlines-latest-perf branch from 93481e2 to d866130 Compare June 8, 2024 06:59
raise TypeError(f"Unsupported instruction type {type(instruction)}")

# Retrieve allowed tokens from cache using the current state
cacheKey = instruction.id
Contributor


It seems to me that `instruction` has no attribute named `id`.

Contributor


Yes, with outlines==0.0.46 there doesn't seem to be an `id` attribute. I'm testing with

cacheKey = hash(tuple(allowed_tokens))

instead
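The idea can be sketched as a small lookup keyed by the token list itself. This is a hypothetical, simplified version of the caching discussed above (function and cache names are mine, not from the PR); it still saves the repeated array construction even though the allowed tokens must be known to compute the key:

```python
import numpy as np

# Hypothetical module-level cache; the real PR caches per-FSM-state.
_allowed_tokens_cache = {}

def get_allowed_tokens_array(allowed_tokens):
    """Return a cached int32 array for this list of allowed token ids.

    Keyed by hash(tuple(...)) since outlines==0.0.46 no longer exposes
    an `id` on the instruction object.
    """
    cache_key = hash(tuple(allowed_tokens))
    cached = _allowed_tokens_cache.get(cache_key)
    if cached is None:
        cached = np.array(allowed_tokens, dtype=np.int32)
        _allowed_tokens_cache[cache_key] = cached
    return cached
```

On a cache hit the same array object is returned, so the per-step conversion cost is paid only once per distinct allowed-token set.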

# Cache miss, calculate allowed tokens and cache them

np_allowed_tokens = np.array(allowed_tokens, dtype=np.int32)
allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens).pin_memory()
Contributor


I'm doing some of my testing with the cpu backend. I don't know if there is a better way to test for the availability of pin_memory() but I'm running it with:

allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens)
try:
    allowed_tokens_tensor = allowed_tokens_tensor.pin_memory()
except NotImplementedError:
    pass
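An alternative to catching the exception is to gate pinning on whether a CUDA device is present at all, since page-locked host memory only helps host-to-device copies. A minimal sketch (the helper name is mine, not from the PR):

```python
import numpy as np
import torch

def to_tensor_maybe_pinned(allowed_tokens):
    """Convert a token-id list to an int32 tensor, pinning only
    when a CUDA device is available (the CPU backend cannot pin)."""
    t = torch.from_numpy(np.array(allowed_tokens, dtype=np.int32))
    if torch.cuda.is_available():
        t = t.pin_memory()
    return t
```

The try/except above is more defensive (it also covers builds where pinning is compiled out); the explicit check just avoids relying on exceptions for control flow.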

@maxdebayser
Contributor

I'm testing this change with this request:

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "facebook/opt-125m",
    "prompt": ["Here is an example of a JSON document representing a user record: "],
    "max_tokens": 1000,
    "min_tokens": 900,
    "temperature": 0,
    "guided_decoding_backend": "outlines",
    "response_format": {"type":"json_object"},
    "frequency_penalty": 2.0,
    "presence_penalty": 2.0
  }' 

I'm using a very small model on purpose so that a potential performance improvement can stand out more against the time it takes to run the model's forward().

Excluding the first call and averaging the next 3 calls, the result is 22s both with the baseline and with the changes in this PR, so I'm not seeing a big difference. But perhaps this is a case that doesn't benefit from the caching in this PR. FYI, this test was run on an 80GB A100.

@mergify

mergify bot commented Jan 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lynkz-matt-psaltis.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 16, 2025
@hmellor
Member

hmellor commented Feb 18, 2025

Closing as this PR is quite old and conflicts with main. If you want to continue working in this PR (or open a new one) please feel free to do so!

@hmellor hmellor closed this Feb 18, 2025


5 participants