Merged
6 changes: 6 additions & 0 deletions docker/k8s-sglang-service.yaml
@@ -39,6 +39,8 @@ spec:
             limits:
               nvidia.com/gpu: 1
           volumeMounts:
+            - name: shm
+              mountPath: /dev/shm
             - name: hf-cache
               mountPath: /root/.cache/huggingface
               readOnly: true
@@ -52,6 +54,10 @@ spec:
             initialDelaySeconds: 30
             periodSeconds: 10
       volumes:
+        - name: shm
+          emptyDir:
+            medium: Memory
+            sizeLimit: 10Gi
         - name: hf-cache
           hostPath:
             path: /root/.cache/huggingface
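To confirm that the `shm` volume from the manifest change above is actually visible to the server process, a small check can be run inside the container. The following is a minimal sketch, not part of this PR: the helper name `shm_report` and the `required_gib` threshold are hypothetical, and the default of 10 simply mirrors the `sizeLimit: 10Gi` in the manifest. Whether the reported total matches `sizeLimit` exactly may depend on the cluster version and configuration.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def shm_report(path="/dev/shm", required_gib=10.0):
    """Report the capacity of the filesystem mounted at `path`.

    `required_gib` is a hypothetical threshold chosen for illustration;
    10 mirrors the sizeLimit used in the manifest.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    total_gib = usage.total / GIB
    free_gib = usage.free / GIB
    return {
        "path": path,
        "total_gib": total_gib,
        "free_gib": free_gib,
        # True when the mount is at least as large as the threshold.
        "ok": total_gib >= required_gib,
    }
```

Calling `shm_report()` from inside the running pod (for example through `kubectl exec`) returns a dictionary whose `ok` field is `False` when `/dev/shm` is smaller than the threshold, which would suggest the `emptyDir` mount is missing or undersized.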
1 change: 1 addition & 0 deletions docs/backend/server_arguments.md
@@ -21,6 +21,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```

- See [hyperparameter tuning](hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- For docker and Kubernetes runs, you need to set up shared memory which is used for communication between processes. See `--shm-size` for docker and `/dev/shm` size update for Kubernetes manifests.
Contributor

Good summary, but it seems this doc (server_arguments.md) is mainly for the CLI (python run) use case.
If we are targeting Docker/k8s, maybe this file is not the right place?
https://github.com/sgl-project/sglang/blob/main/docs/start/install.md

Contributor Author

Well, server_arguments.md is the only place that summarizes the parallelism runs, so I believe it should be mentioned there (maybe among other places?).

Contributor Author

Also, as you wrote, shared memory is already mentioned in the Docker part of https://github.com/sgl-project/sglang/blob/main/docs/start/install.md. The Kubernetes section only links to the manifest, so I'm suggesting:

  • modifying the single-node manifest
  • updating the docs in the place where --tp is actually mentioned as a server argument

- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.

```bash