guide : setting up NVIDIA DGX Spark with ggml #16514

ggerganov started this conversation in Show and tell
Overview
In this guide we will configure the NVIDIA DGX™ Spark as a local and private AI assistant using the ggml software stack. The guide is geared towards developers and builders. We are going to set up the following AI capabilities:
These features will run simultaneously on your local network, allowing you to fully utilize the power of your device at home or in the office.
Software
We are going to use the following open-source software:
Setup
Simply run the following command in a terminal on your NVIDIA DGX™ Spark:
bash <(curl -s https://ggml.ai/dgx-spark.sh)
Note

The dgx-spark.sh script above is quite basic and is merely one of the many possible ways you can configure your device for AI use cases. It is provided here mainly for convenience and as an example. Feel free to inspect it and adjust it for your needs.

The command downloads and builds the latest version of the ggml software stack and starts multiple HTTP REST services as shown in the following table:

| Endpoint | Service |
| --- | --- |
| http://localhost:8021 | embeddings (inferred from the coding agent section below) |
| http://localhost:8022 | code completions (FIM) |
| http://localhost:8023 | chat |
| http://localhost:8024 | vision |
| http://localhost:8025 | speech-to-text |
The first time you run the command, it can take a few minutes to download the model weights. If everything goes well, you should see the following output:
At this point, the machine is fully configured and ready to be used. An internet connection is no longer necessary.
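As a quick sanity check, you can poll each service from the terminal. This is a sketch that assumes the services are llama.cpp-based servers, which expose a GET /health endpoint returning HTTP 200 once the model is loaded; adjust the port list to match your setup.

```shell
# Probe each local service's /health endpoint and print the HTTP status.
# (Assumption: llama.cpp-based servers on these ports; 000 means no answer.)
for port in 8021 8022 8023 8024 8025; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:$port/health" || true)
  echo "port $port -> HTTP $code"
done
```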
Here's sample output of nvidia-smi while the ggml services are running:

Use cases
Here is a small fraction of the AI use cases that are possible with this configuration.
Basic chat
Simply point your browser to the chat endpoint
http://localhost:8023:
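The same chat service can also be queried from the command line. This sketch assumes the chat endpoint speaks the OpenAI-compatible chat completions API (the llama-server convention); the model name is a placeholder, which local servers typically ignore.

```shell
# Ask the local chat service a question via the OpenAI-compatible API.
# "ggml" is a placeholder model name.
curl -s http://localhost:8023/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ggml",
        "messages": [
          {"role": "user", "content": "Write a haiku about local AI."}
        ]
      }' || echo "chat service not reachable on port 8023"
```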
Inline code completions (FIM)
Install the llama.vim plugin in your Vim/Neovim editor and configure it to use the FIM endpoint
http://localhost:8022:
In VSCode, install the llama.vscode extension and configure it in a similar way to use the FIM endpoint:
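To verify the FIM endpoint independently of an editor, you can query it directly. This sketch assumes the llama.cpp server's /infill API with its input_prefix/input_suffix request fields; the server predicts the code that belongs between the two fragments.

```shell
# Request a fill-in-the-middle completion from the FIM service.
curl -s http://localhost:8022/infill \
  -H "Content-Type: application/json" \
  -d '{
        "input_prefix": "def add(a, b):\n    return ",
        "input_suffix": "\n\nprint(add(1, 2))\n",
        "n_predict": 16
      }' || echo "FIM service not reachable on port 8022"
```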
Coding agent
In VSCode, configure the llama.vscode extension to use the endpoints for completions, chat, embeddings and tools:
Document and image processing
Submit PDFs and image documents in the WebUI to analyze them with a multimodal LLM. For visuals, use the vision endpoint
http://localhost:8024:
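Outside the WebUI, an image can be submitted from the terminal as well. This is a sketch under two assumptions: the vision endpoint accepts the OpenAI-compatible chat completions API with image content parts, and the image is inlined as a base64 data URI. image.png is a placeholder path.

```shell
# Send an image to the vision endpoint via the OpenAI-compatible API.
# (base64 -w0 is the GNU form; on macOS use: base64 -i image.png)
IMG=$(base64 -w0 image.png 2>/dev/null || true)
curl -s http://localhost:8024/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"messages\": [{
          \"role\": \"user\",
          \"content\": [
            {\"type\": \"text\", \"text\": \"Describe this image.\"},
            {\"type\": \"image_url\",
             \"image_url\": {\"url\": \"data:image/png;base64,$IMG\"}}
          ]
        }]
      }" || echo "vision service not reachable on port 8024"
```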
Audio transcription
Use the speech-to-text endpoint at http://localhost:8025 to quickly transcribe audio files:
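For example, a transcription request can be made with curl. This sketch assumes the service follows the whisper.cpp server example, which accepts a multipart POST on /inference with "file" and "response_format" form fields; audio.wav is a placeholder (a 16 kHz WAV works best).

```shell
# Transcribe a local audio file with the speech-to-text service.
curl -s http://localhost:8025/inference \
  -F file=@audio.wav \
  -F response_format=json || echo "speech-to-text service not reachable on port 8025"
```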
Performance

For performance numbers, see Performance of llama.cpp on NVIDIA DGX Spark.
Conclusion
The new NVIDIA DGX Spark is a great choice for serving the latest AI models locally and privately. With 128 GB of unified system memory, it has the capacity to host multiple AI services simultaneously. And the ggml software stack is the best way to do that.