LlamaServerParameters Configuration
LlamaServerParameters(name: str, provider: str = 'llama.cpp.server', verbose: Optional[bool] = False, concurrency: Optional[int] = 20, backend: Optional[str] = None, prompt_template: Optional[str] = None, context_length: Optional[int] = None, path: Optional[str] = None, model_hf_repo: Optional[str] = None, model_hf_file: Optional[str] = None, device: Optional[str] = None, server_bin_path: Optional[str] = None, server_host: str = '127.0.0.1', server_port: int = 0, temperature: float = 0.8, seed: int = 42, debug: bool = False, model_url: Optional[str] = None, model_draft: Optional[str] = None, threads: Optional[int] = None, n_gpu_layers: Optional[int] = None, batch_size: Optional[int] = None, ubatch_size: Optional[int] = None, ctx_size: Optional[int] = None, grp_attn_n: Optional[int] = None, grp_attn_w: Optional[int] = None, n_predict: Optional[int] = None, slot_save_path: Optional[str] = None, n_slots: Optional[int] = None, cont_batching: bool = False, embedding: bool = False, reranking: bool = False, metrics: bool = False, slots: bool = False, draft: Optional[int] = None, draft_max: Optional[int] = None, draft_min: Optional[int] = None, api_key: Optional[str] = None, lora_files: List[str] = <factory>, no_context_shift: bool = False, no_webui: Optional[bool] = None, startup_timeout: Optional[int] = None)
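A minimal construction sketch for a locally stored GGUF model. The import path below is an assumption; adjust it to wherever `LlamaServerParameters` lives in your package.

```python
# Sketch: build parameters for a model stored on local disk.
# NOTE: the import path is hypothetical; adjust it to your package layout.
from your_package.parameters import LlamaServerParameters  # hypothetical module path

params = LlamaServerParameters(
    name="qwen2.5-7b-instruct",                       # logical model name (required)
    path="/models/qwen2.5-7b-instruct-q4_k_m.gguf",   # local model file path
    server_port=0,                                    # 0 = pick a random available port
    temperature=0.8,
    seed=42,
)
```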
Parameters
Name | Type | Required | Description |
---|---|---|---|
name | string | ✅ | The name of the model. |
path | string | ❌ | Local model file path |
backend | string | ❌ | The actual model name passed to the provider. Defaults to None; if None, name is used as the model name. |
device | string | ❌ | Device on which to run the model. If None, the device is determined automatically. |
provider | string | ❌ | The provider of the model. If the model is deployed locally, this is the inference type; if it is deployed on a third-party service, this is the platform name ('proxy/<platform>'). Defaults: llama.cpp.server |
verbose | boolean | ❌ | Show verbose output. Defaults: False |
concurrency | integer | ❌ | Model concurrency limit. Defaults: 20 |
prompt_template | string | ❌ | Prompt template. If None, it is determined automatically from the model. Only used for local deployment. |
context_length | integer | ❌ | The context length of the model. If None, it is determined automatically from the model. |
model_hf_repo | string | ❌ | Hugging Face repository for model download |
model_hf_file | string | ❌ | Model file name in the Hugging Face repository |
server_bin_path | string | ❌ | Path to the server binary executable |
server_host | string | ❌ | Host address to bind the server. Defaults: 127.0.0.1 |
server_port | integer | ❌ | Port to bind the server; 0 selects a random available port. Defaults: 0 |
temperature | number | ❌ | Sampling temperature for text generation. Defaults: 0.8 |
seed | integer | ❌ | Random seed for reproducibility. Defaults: 42 |
debug | boolean | ❌ | Enable debug mode. Defaults: False |
model_url | string | ❌ | Model download URL (env: LLAMA_ARG_MODEL_URL) |
model_draft | string | ❌ | Draft model file path |
threads | integer | ❌ | Number of threads to use during generation (default: -1) (env: LLAMA_ARG_THREADS) |
n_gpu_layers | integer | ❌ | Number of layers to store in VRAM (env: LLAMA_ARG_N_GPU_LAYERS); set to 1000000000 to use all layers |
batch_size | integer | ❌ | Logical maximum batch size (default: 2048) (env: LLAMA_ARG_BATCH) |
ubatch_size | integer | ❌ | Physical maximum batch size (default: 512) (env: LLAMA_ARG_UBATCH) |
ctx_size | integer | ❌ | Size of the prompt context (default: 4096, 0 = loaded from model) (env: LLAMA_ARG_CTX_SIZE) |
grp_attn_n | integer | ❌ | Group-attention factor (default: 1) |
grp_attn_w | integer | ❌ | Group-attention width (default: 512) |
n_predict | integer | ❌ | Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled) (env: LLAMA_ARG_N_PREDICT) |
slot_save_path | string | ❌ | Path to save the slot KV cache (default: disabled) |
n_slots | integer | ❌ | Number of slots for the KV cache |
cont_batching | boolean | ❌ | Enable continuous batching (a.k.a. dynamic batching). Defaults: False |
embedding | boolean | ❌ | Restrict the server to embedding use cases only; use only with dedicated embedding models (env: LLAMA_ARG_EMBEDDINGS). Defaults: False |
reranking | boolean | ❌ | Enable the reranking endpoint on the server (env: LLAMA_ARG_RERANKING). Defaults: False |
metrics | boolean | ❌ | Enable the Prometheus-compatible metrics endpoint (env: LLAMA_ARG_ENDPOINT_METRICS). Defaults: False |
slots | boolean | ❌ | Enable the slots monitoring endpoint (env: LLAMA_ARG_ENDPOINT_SLOTS). Defaults: False |
draft | integer | ❌ | Number of tokens to draft for speculative decoding (default: 16) (env: LLAMA_ARG_DRAFT_MAX) |
draft_max | integer | ❌ | Same as draft |
draft_min | integer | ❌ | Minimum number of draft tokens to use for speculative decoding (default: 5) |
api_key | string | ❌ | API key to use for authentication (env: LLAMA_API_KEY) |
lora_files | array of string | ❌ | Paths to LoRA adapters (repeat to use multiple adapters). Defaults: [] |
no_context_shift | boolean | ❌ | Disable context shift on infinite text generation. Defaults: False |
no_webui | boolean | ❌ | Disable the web UI |
startup_timeout | integer | ❌ | Server startup timeout in seconds |
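For reference, a sketch combining several of the options above: Hugging Face download, GPU offload, continuous batching, and speculative decoding with a draft model. Field names match the signature at the top of this page; the import path and the Hugging Face repository/file names are placeholders to substitute with your own.

```python
# Sketch: download from Hugging Face, offload all layers to the GPU,
# and enable speculative decoding with a small draft model.
# NOTE: the import path and repo/file names are illustrative placeholders.
from your_package.parameters import LlamaServerParameters  # hypothetical module path

params = LlamaServerParameters(
    name="llama-3.1-8b-instruct",
    model_hf_repo="your-org/Your-Model-GGUF",          # placeholder repository
    model_hf_file="your-model-q4_k_m.gguf",            # placeholder file name
    n_gpu_layers=1000000000,                           # offload all layers to VRAM
    ctx_size=8192,                                     # prompt context size
    cont_batching=True,                                # continuous (dynamic) batching
    model_draft="/models/draft-model-q8_0.gguf",       # draft model for speculative decoding
    draft_max=16,
    draft_min=5,
    api_key="change-me",                               # clients must present this key
    metrics=True,                                      # Prometheus-compatible metrics endpoint
    startup_timeout=300,                               # seconds to wait for the server to start
)
```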