Version: dev

vLLM

Configure DB-GPT to use vLLM for high-throughput local model inference on NVIDIA GPUs.

Prerequisites​

  • NVIDIA GPU with CUDA 12.1+
  • Sufficient VRAM for your chosen model (8 GB+ for 7B models)
  • DB-GPT installed with the `vllm` extra

Install dependencies​

```bash
uv sync --all-packages \
  --extra "base" \
  --extra "hf" \
  --extra "cuda121" \
  --extra "vllm" \
  --extra "rag" \
  --extra "storage_chromadb" \
  --extra "quant_bnb" \
  --extra "dbgpts"
```

Configuration​

Edit configs/dbgpt-local-vllm.toml:

```toml
[models]
[[models.llms]]
name = "DeepSeek-R1-Distill-Qwen-1.5B"
provider = "vllm"
# Download from HuggingFace automatically, or specify a local path:
# path = "models/DeepSeek-R1-Distill-Qwen-1.5B"

[[models.embeddings]]
name = "BAAI/bge-large-zh-v1.5"
provider = "hf"
# path = "models/bge-large-zh-v1.5"
```
**Model download**

If you don't specify a `path`, the model is downloaded from the HuggingFace Hub automatically. For large models, pre-downloading is recommended:

```bash
# Using huggingface-cli
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir models/DeepSeek-R1-Distill-Qwen-1.5B
```
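
The same pre-download can be scripted with the `huggingface_hub` library that `huggingface-cli` wraps; a sketch (setting `HF_ENDPOINT` to a mirror is optional, and the mirror URL shown is only an example):

```python
# Optional: point huggingface_hub at a mirror. It reads HF_ENDPOINT at import
# time, so the environment variable must be set before the import below.
# import os; os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # example URL

from huggingface_hub import snapshot_download

def predownload(repo_id: str, local_dir: str) -> str:
    """Fetch a full model repo from the Hub into the local path used by the config."""
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# predownload("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
#             "models/DeepSeek-R1-Distill-Qwen-1.5B")  # requires network access
```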

| Model | VRAM required | Notes |
| --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | ~4 GB | Small, good for testing |
| GLM-4-9B-Chat | ~20 GB | Strong Chinese & English |
| Qwen2.5-7B-Instruct | ~16 GB | Good balance |
| Qwen2.5-Coder-7B-Instruct | ~16 GB | Code-focused |
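
The VRAM figures above roughly follow the fp16 weight size (2 bytes per parameter) plus headroom for the KV cache and CUDA context. A back-of-the-envelope sketch — the 20% headroom factor is an assumption for illustration, not a vLLM constant:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough fp16 weight footprint plus ~20% headroom (assumed) for KV cache/overhead."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes ≈ GB
    return weights_gb * 1.2

print(round(estimate_vram_gb(7), 1))    # 16.8 — in line with the ~16 GB figures above
print(round(estimate_vram_gb(1.5), 1))  # 3.6  — in line with ~4 GB
```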

Start the server​

```bash
uv run dbgpt start webserver --config configs/dbgpt-local-vllm.toml
```

**GPU selection**

To pin the server to a specific GPU, set `CUDA_VISIBLE_DEVICES`:

```bash
CUDA_VISIBLE_DEVICES=0 uv run dbgpt start webserver --config configs/dbgpt-local-vllm.toml
```
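
Once the webserver is up, you can exercise the model over HTTP. A stdlib-only sketch, assuming the default web port 5670 and an OpenAI-style `/api/v2/chat/completions` route — verify both against your deployment's API docs before relying on them:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "DeepSeek-R1-Distill-Qwen-1.5B") -> dict:
    """OpenAI-style chat-completions request body for the model configured above."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:5670") -> str:
    # Port and route are assumptions; adjust to match your DB-GPT deployment.
    req = urllib.request.Request(
        f"{base_url}/api/v2/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Say hello in one sentence.")  # requires the server to be running
```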

Troubleshooting​

| Issue | Solution |
| --- | --- |
| CUDA not found | Install CUDA 12.1+ and verify with `nvidia-smi` |
| Out of GPU memory | Use a smaller model or enable quantization (`quant_bnb`) |
| Model download fails | Pre-download the model or configure a HuggingFace mirror |
| Slow first request | vLLM compiles kernels on first run; subsequent requests are fast |

What's next​