
# LlamaServerParameters Configuration

```python
LlamaServerParameters(
    name: str,
    provider: str = 'llama.cpp.server',
    verbose: Optional[bool] = False,
    concurrency: Optional[int] = 20,
    backend: Optional[str] = None,
    prompt_template: Optional[str] = None,
    context_length: Optional[int] = None,
    path: Optional[str] = None,
    model_hf_repo: Optional[str] = None,
    model_hf_file: Optional[str] = None,
    device: Optional[str] = None,
    server_bin_path: Optional[str] = None,
    server_host: str = '127.0.0.1',
    server_port: int = 0,
    temperature: float = 0.8,
    seed: int = 42,
    debug: bool = False,
    model_url: Optional[str] = None,
    model_draft: Optional[str] = None,
    threads: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    batch_size: Optional[int] = None,
    ubatch_size: Optional[int] = None,
    ctx_size: Optional[int] = None,
    grp_attn_n: Optional[int] = None,
    grp_attn_w: Optional[int] = None,
    n_predict: Optional[int] = None,
    slot_save_path: Optional[str] = None,
    n_slots: Optional[int] = None,
    cont_batching: bool = False,
    embedding: bool = False,
    reranking: bool = False,
    metrics: bool = False,
    slots: bool = False,
    draft: Optional[int] = None,
    draft_max: Optional[int] = None,
    draft_min: Optional[int] = None,
    api_key: Optional[str] = None,
    lora_files: List[str] = <factory>,
    no_context_shift: bool = False,
    no_webui: Optional[bool] = None,
    startup_timeout: Optional[int] = None,
)
```
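For orientation, below is a minimal sketch of constructing these parameters for a locally stored GGUF model. Only the field names and defaults come from the signature above; the import path and the model/file names are assumptions, so adjust them to your installation.

```python
# Minimal sketch, assuming a hypothetical import path for this package.
from llama import LlamaServerParameters  # hypothetical import path

params = LlamaServerParameters(
    name="qwen2.5-7b-instruct",                       # required: model name
    path="/models/qwen2.5-7b-instruct-q4_k_m.gguf",   # local model file path
    n_gpu_layers=1000000000,   # 1000000000 = store all layers in VRAM
    ctx_size=8192,             # prompt context size
    temperature=0.7,           # sampling temperature
    server_port=0,             # 0 = bind a random available port
)
```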

## Parameters

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | n/a | The name of the model. |
| `path` | string | no | `None` | Local model file path. |
| `backend` | string | no | `None` | The actual model name passed to the provider. If `None`, `name` is used. |
| `device` | string | no | `None` | Device on which to run the model. If `None`, the device is determined automatically. |
| `provider` | string | no | `llama.cpp.server` | The provider of the model. For a locally deployed model, this is the inference type; for a model deployed on a third-party service, it is the platform name (`proxy/<platform>`). |
| `verbose` | boolean | no | `False` | Show verbose output. |
| `concurrency` | integer | no | `20` | Model concurrency limit. |
| `prompt_template` | string | no | `None` | Prompt template. If `None`, it is determined automatically from the model. Local deployment only. |
| `context_length` | integer | no | `None` | The context length of the model. If `None`, it is determined automatically from the model. |
| `model_hf_repo` | string | no | `None` | Hugging Face repository to download the model from. |
| `model_hf_file` | string | no | `None` | Model file name within the Hugging Face repository. |
| `server_bin_path` | string | no | `None` | Path to the server binary executable. |
| `server_host` | string | no | `127.0.0.1` | Host address to bind the server to. |
| `server_port` | integer | no | `0` | Port to bind the server to; `0` selects a random available port. |
| `temperature` | number | no | `0.8` | Sampling temperature for text generation. |
| `seed` | integer | no | `42` | Random seed for reproducibility. |
| `debug` | boolean | no | `False` | Enable debug mode. |
| `model_url` | string | no | `None` | Model download URL (env: `LLAMA_ARG_MODEL_URL`). |
| `model_draft` | string | no | `None` | Draft model file path. |
| `threads` | integer | no | `None` | Number of threads to use during generation (default: -1) (env: `LLAMA_ARG_THREADS`). |
| `n_gpu_layers` | integer | no | `None` | Number of layers to store in VRAM; set to 1000000000 to use all layers (env: `LLAMA_ARG_N_GPU_LAYERS`). |
| `batch_size` | integer | no | `None` | Logical maximum batch size (default: 2048) (env: `LLAMA_ARG_BATCH`). |
| `ubatch_size` | integer | no | `None` | Physical maximum batch size (default: 512) (env: `LLAMA_ARG_UBATCH`). |
| `ctx_size` | integer | no | `None` | Size of the prompt context (default: 4096; 0 = loaded from model) (env: `LLAMA_ARG_CTX_SIZE`). |
| `grp_attn_n` | integer | no | `None` | Group-attention factor (default: 1). |
| `grp_attn_w` | integer | no | `None` | Group-attention width (default: 512). |
| `n_predict` | integer | no | `None` | Number of tokens to predict (default: -1; -1 = infinity, -2 = until context filled) (env: `LLAMA_ARG_N_PREDICT`). |
| `slot_save_path` | string | no | `None` | Path for saving the slot KV cache (default: disabled). |
| `n_slots` | integer | no | `None` | Number of slots for the KV cache. |
| `cont_batching` | boolean | no | `False` | Enable continuous batching (a.k.a. dynamic batching). |
| `embedding` | boolean | no | `False` | Restrict the server to embedding use cases; use only with dedicated embedding models (env: `LLAMA_ARG_EMBEDDINGS`). |
| `reranking` | boolean | no | `False` | Enable the reranking endpoint on the server (env: `LLAMA_ARG_RERANKING`). |
| `metrics` | boolean | no | `False` | Enable a Prometheus-compatible metrics endpoint (env: `LLAMA_ARG_ENDPOINT_METRICS`). |
| `slots` | boolean | no | `False` | Enable the slots monitoring endpoint (env: `LLAMA_ARG_ENDPOINT_SLOTS`). |
| `draft` | integer | no | `None` | Number of tokens to draft for speculative decoding (default: 16) (env: `LLAMA_ARG_DRAFT_MAX`). |
| `draft_max` | integer | no | `None` | Alias for `draft`. |
| `draft_min` | integer | no | `None` | Minimum number of draft tokens to use for speculative decoding (default: 5). |
| `api_key` | string | no | `None` | API key to use for authentication (env: `LLAMA_API_KEY`). |
| `lora_files` | string[] | no | `[]` | Paths to LoRA adapters; multiple adapters may be specified. |
| `no_context_shift` | boolean | no | `False` | Disable context shift during infinite text generation. |
| `no_webui` | boolean | no | `None` | Disable the web UI. |
| `startup_timeout` | integer | no | `None` | Server startup timeout, in seconds. |
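A second hedged sketch combines the Hugging Face download fields with speculative decoding via a draft model. The repository, file, and path values are placeholders, not recommendations from this page, and the import path remains an assumption.

```python
from llama import LlamaServerParameters  # hypothetical import path

# Sketch only: download the main model from Hugging Face and pair it with a
# local draft model for speculative decoding. Repo/file names are placeholders.
params = LlamaServerParameters(
    name="llama-3.1-8b-instruct",
    model_hf_repo="someuser/Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder repo
    model_hf_file="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",    # file in that repo
    model_draft="/models/llama-3.2-1b-instruct-q4_k_m.gguf",   # draft model path
    draft_max=16,   # tokens drafted per step (alias of draft, LLAMA_ARG_DRAFT_MAX)
    draft_min=5,    # minimum draft tokens per step
)
```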
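Finally, a sketch of a headless, embedding-only server with authentication and the monitoring endpoints enabled; all values are illustrative and the import path is again an assumption.

```python
from llama import LlamaServerParameters  # hypothetical import path

# Sketch: dedicated embedding server with an API key, Prometheus-compatible
# metrics, and slots monitoring; web UI disabled for headless operation.
params = LlamaServerParameters(
    name="bge-m3",
    path="/models/bge-m3-q8_0.gguf",  # placeholder embedding model path
    embedding=True,        # restrict the server to embedding use cases
    api_key="change-me",   # authentication key (env: LLAMA_API_KEY)
    metrics=True,          # Prometheus-compatible metrics endpoint
    slots=True,            # slots monitoring endpoint
    no_webui=True,         # disable the web UI
    startup_timeout=120,   # give up if the server is not up within 120 s
)
```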