
Truncated output with local backend? #125

Open
remmen-io opened this issue Jul 9, 2024 · 2 comments

Comments

@remmen-io

remmen-io commented Jul 9, 2024

```shell
➜ aiac --version
aiac version 5.2.1
```

We are using local backends provided by Hugging Face TGI:

```toml
[backends.phi3]
type = "openai"
default_model = "Phi-3"
url = "https://phi3.ourcluster/v1"

[backends.deepseek2]
type = "openai"
default_model = "DeepSeekv2"
url = "https://deepseekv2.ourcluster/v1"
```

Invoking aiac results in a truncated response:

➜ aiac dockerfile for nodejs with comments
```Dockerfile
# Use the official Node.js 14 image as a parent image
FROM node:14

# Set the working directory in the container
WORKDIR /usr/src/app

# Copy package.json and package-lock.json (if available) to the working directory
COPY package*.json ./

# Install project dependencies
RUN npm install

# Copy the rest of the application code to the working directory
```

[S/s]: save and exit
[W/w]: save and chat
[C/c]: continue chatting
[R/r]: retry same prompt
[Y/y]: copy to clipboard
[Q/q]: quit
Choice: c
✔ New message: continue
```Dockerfile
# Copy the rest of the application code to the working directory
COPY . .

# Build the application if necessary (e.g., if using a build step)
RUN npm run build

# Expose the port the app runs on
EXPOSE 3000

# Define the command to run the app
CMD ["npm", "start"]
```

This Dockerfile is designed for a Node.js application that listens on

[S/s]: save and exit
[W/w]: save and chat
[C/c]: continue chatting
[R/r]: retry same prompt
[Y/y]: copy to clipboard
[Q/q]: quit
Choice: c
✔ New message: continue
```Dockerfile
# Define the command to run the app
CMD ["npm", "start"]

# Add a health check to ensure the application is running properly
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:3000/ || exit 1

# Optional: Set up a volume for persistent data
VOLUME [
```

[S/s]: save and exit
[W/w]: save and chat
[C/c]: continue chatting
[R/r]: retry same prompt
[Y/y]: copy to clipboard
[Q/q]: quit
✔ Choice: q

The endpoint provides information about the max token limits, but I guess they are not used?

```json
{
  "model_id": "/models-cache/deepseek-coder-v2-lite",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 4096,
  "max_total_tokens": 6124,
  "waiting_served_ratio": 0.3,
  "max_batch_total_tokens": 16000,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "2.1.1",
  "sha": "4dfdb481fb1f9cf31561c056061d693f38ba4168",
  "docker_label": "sha-4dfdb48"
}
```
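For reference, those limits imply a per-request generation budget: the server cannot emit more new tokens than `max_total_tokens` minus the input budget. A minimal sketch (the numbers are taken from the `/info` payload above):

```python
import json

# Relevant fields from the TGI /info response shown above
info = json.loads('{"max_input_tokens": 4096, "max_total_tokens": 6124}')

# Upper bound on newly generated tokens for a request that uses
# the full input budget: total minus input
max_new_tokens_budget = info["max_total_tokens"] - info["max_input_tokens"]
print(max_new_tokens_budget)  # 2028
```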

Is it somehow possible to add/modify the max_new_tokens parameter?

Like:

```shell
curl -N https://deepseekv2.mycluster/generate -X POST \
  -d '{"inputs":"dockerfile for nodejs with comments?","parameters":{"max_new_tokens":200}}' \
  -H 'Content-Type: application/json'
```
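For comparison, TGI's native `/generate` route calls this limit `max_new_tokens`, while the OpenAI-style chat-completions body that an `openai`-type backend sends calls it `max_tokens`. A sketch of the two request bodies side by side (the model name matches the backend config above; this is illustrative, not aiac's actual request code):

```python
import json

# TGI's native /generate parameters (as in the curl example above)
native = {
    "inputs": "dockerfile for nodejs with comments?",
    "parameters": {"max_new_tokens": 200},
}

# The equivalent limit in an OpenAI-compatible chat-completions body
openai_style = {
    "model": "DeepSeekv2",
    "messages": [
        {"role": "user", "content": "dockerfile for nodejs with comments?"}
    ],
    "max_tokens": 200,
}

print(json.dumps(native))
print(json.dumps(openai_style))
```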

@ido50
Collaborator

ido50 commented Jul 10, 2024

I'm not familiar with Hugging Face, but I see that TGI implements the OpenAI API (although your examples seem to use its own API?) with a relatively small default for `max_tokens`. I suppose we can expose other parameters too. Do you think it would make more sense, though, to add something like `max_tokens` to the backend configuration rather than setting it per request (e.g. as a flag in the CLI or a parameter in the library)? Which would work better for your use case?
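If the configuration-level route were taken, it might look something like this — a hypothetical sketch only, since `max_tokens` is not an existing aiac backend option as of this thread:

```toml
[backends.deepseek2]
type = "openai"
default_model = "DeepSeekv2"
url = "https://deepseekv2.ourcluster/v1"
# hypothetical option: cap on generated tokens, sent as "max_tokens"
# with every request to this backend
max_tokens = 2000
```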

@remmen-io
Author

Hi @ido50
Unfortunately not, as TGI currently does not support this on the server side.

There is an open issue: huggingface/text-generation-inference#870
