
# LlamaServerParameters Configuration

```python
LlamaServerParameters(
    name: str,
    provider: str = 'llama.cpp.server',
    verbose: Optional[bool] = False,
    concurrency: Optional[int] = 20,
    backend: Optional[str] = None,
    prompt_template: Optional[str] = None,
    context_length: Optional[int] = None,
    path: Optional[str] = None,
    model_hf_repo: Optional[str] = None,
    model_hf_file: Optional[str] = None,
    device: Optional[str] = None,
    server_bin_path: Optional[str] = None,
    server_host: str = '127.0.0.1',
    server_port: int = 0,
    temperature: float = 0.8,
    seed: int = 42,
    debug: bool = False,
    model_url: Optional[str] = None,
    model_draft: Optional[str] = None,
    threads: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    batch_size: Optional[int] = None,
    ubatch_size: Optional[int] = None,
    ctx_size: Optional[int] = None,
    grp_attn_n: Optional[int] = None,
    grp_attn_w: Optional[int] = None,
    n_predict: Optional[int] = None,
    slot_save_path: Optional[str] = None,
    n_slots: Optional[int] = None,
    cont_batching: bool = False,
    embedding: bool = False,
    reranking: bool = False,
    metrics: bool = False,
    slots: bool = False,
    draft: Optional[int] = None,
    draft_max: Optional[int] = None,
    draft_min: Optional[int] = None,
    api_key: Optional[str] = None,
    lora_files: List[str] = <factory>,
    no_context_shift: bool = False,
    no_webui: Optional[bool] = None,
    startup_timeout: Optional[int] = None,
)
```
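For orientation, below is a minimal sketch of constructing these parameters for a locally stored GGUF model. Only the field names and defaults come from the signature above; the import path and the model/file names are assumptions, so adjust them to your installation.

```python
# Minimal sketch, assuming a hypothetical import path for this package.
from llama import LlamaServerParameters  # hypothetical import path

params = LlamaServerParameters(
    name="qwen2.5-7b-instruct",                       # required: model name
    path="/models/qwen2.5-7b-instruct-q4_k_m.gguf",   # local model file path
    n_gpu_layers=1000000000,   # 1000000000 = store all layers in VRAM
    ctx_size=8192,             # prompt context size
    temperature=0.7,           # sampling temperature
    server_port=0,             # 0 = bind a random available port
)
```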

## Parameters

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | n/a | The name of the model. |
| `path` | string | no | `None` | Local model file path. |
| `backend` | string | no | `None` | The actual model name passed to the provider. If `None`, `name` is used. |
| `device` | string | no | `None` | Device on which to run the model. If `None`, the device is determined automatically. |
| `provider` | string | no | `llama.cpp.server` | The provider of the model. For a locally deployed model, this is the inference type; for a model deployed on a third-party service, it is the platform name (`proxy/<platform>`). |
| `verbose` | boolean | no | `False` | Show verbose output. |
| `concurrency` | integer | no | `20` | Model concurrency limit. |
| `prompt_template` | string | no | `None` | Prompt template. If `None`, it is determined automatically from the model. Local deployment only. |
| `context_length` | integer | no | `None` | The context length of the model. If `None`, it is determined automatically from the model. |
| `model_hf_repo` | string | no | `None` | Hugging Face repository to download the model from. |
| `model_hf_file` | string | no | `None` | Model file name within the Hugging Face repository. |
| `server_bin_path` | string | no | `None` | Path to the server binary executable. |
| `server_host` | string | no | `127.0.0.1` | Host address to bind the server to. |
| `server_port` | integer | no | `0` | Port to bind the server to; `0` selects a random available port. |
| `temperature` | number | no | `0.8` | Sampling temperature for text generation. |
| `seed` | integer | no | `42` | Random seed for reproducibility. |
| `debug` | boolean | no | `False` | Enable debug mode. |
| `model_url` | string | no | `None` | Model download URL (env: `LLAMA_ARG_MODEL_URL`). |
| `model_draft` | string | no | `None` | Draft model file path. |
| `threads` | integer | no | `None` | Number of threads to use during generation (default: -1) (env: `LLAMA_ARG_THREADS`). |
| `n_gpu_layers` | integer | no | `None` | Number of layers to store in VRAM; set to 1000000000 to use all layers (env: `LLAMA_ARG_N_GPU_LAYERS`). |
| `batch_size` | integer | no | `None` | Logical maximum batch size (default: 2048) (env: `LLAMA_ARG_BATCH`). |
| `ubatch_size` | integer | no | `None` | Physical maximum batch size (default: 512) (env: `LLAMA_ARG_UBATCH`). |
| `ctx_size` | integer | no | `None` | Size of the prompt context (default: 4096; 0 = loaded from model) (env: `LLAMA_ARG_CTX_SIZE`). |
| `grp_attn_n` | integer | no | `None` | Group-attention factor (default: 1). |
| `grp_attn_w` | integer | no | `None` | Group-attention width (default: 512). |
| `n_predict` | integer | no | `None` | Number of tokens to predict (default: -1; -1 = infinity, -2 = until context filled) (env: `LLAMA_ARG_N_PREDICT`). |
| `slot_save_path` | string | no | `None` | Path for saving the slot KV cache (default: disabled). |
| `n_slots` | integer | no | `None` | Number of slots for the KV cache. |
| `cont_batching` | boolean | no | `False` | Enable continuous batching (a.k.a. dynamic batching). |
| `embedding` | boolean | no | `False` | Restrict the server to embedding use cases; use only with dedicated embedding models (env: `LLAMA_ARG_EMBEDDINGS`). |
| `reranking` | boolean | no | `False` | Enable the reranking endpoint on the server (env: `LLAMA_ARG_RERANKING`). |
| `metrics` | boolean | no | `False` | Enable a Prometheus-compatible metrics endpoint (env: `LLAMA_ARG_ENDPOINT_METRICS`). |
| `slots` | boolean | no | `False` | Enable the slots monitoring endpoint (env: `LLAMA_ARG_ENDPOINT_SLOTS`). |
| `draft` | integer | no | `None` | Number of tokens to draft for speculative decoding (default: 16) (env: `LLAMA_ARG_DRAFT_MAX`). |
| `draft_max` | integer | no | `None` | Alias for `draft`. |
| `draft_min` | integer | no | `None` | Minimum number of draft tokens to use for speculative decoding (default: 5). |
| `api_key` | string | no | `None` | API key to use for authentication (env: `LLAMA_API_KEY`). |
| `lora_files` | string[] | no | `[]` | Paths to LoRA adapters; multiple adapters may be specified. |
| `no_context_shift` | boolean | no | `False` | Disable context shift during infinite text generation. |
| `no_webui` | boolean | no | `None` | Disable the web UI. |
| `startup_timeout` | integer | no | `None` | Server startup timeout, in seconds. |
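A second hedged sketch combines the Hugging Face download fields with speculative decoding via a draft model. The repository, file, and path values are placeholders, not recommendations from this page, and the import path remains an assumption.

```python
from llama import LlamaServerParameters  # hypothetical import path

# Sketch only: download the main model from Hugging Face and pair it with a
# local draft model for speculative decoding. Repo/file names are placeholders.
params = LlamaServerParameters(
    name="llama-3.1-8b-instruct",
    model_hf_repo="someuser/Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder repo
    model_hf_file="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",    # file in that repo
    model_draft="/models/llama-3.2-1b-instruct-q4_k_m.gguf",   # draft model path
    draft_max=16,   # tokens drafted per step (alias of draft, LLAMA_ARG_DRAFT_MAX)
    draft_min=5,    # minimum draft tokens per step
)
```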
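Finally, a sketch of a headless, embedding-only server with authentication and the monitoring endpoints enabled; all values are illustrative and the import path is again an assumption.

```python
from llama import LlamaServerParameters  # hypothetical import path

# Sketch: dedicated embedding server with an API key, Prometheus-compatible
# metrics, and slots monitoring; web UI disabled for headless operation.
params = LlamaServerParameters(
    name="bge-m3",
    path="/models/bge-m3-q8_0.gguf",  # placeholder embedding model path
    embedding=True,        # restrict the server to embedding use cases
    api_key="change-me",   # authentication key (env: LLAMA_API_KEY)
    metrics=True,          # Prometheus-compatible metrics endpoint
    slots=True,            # slots monitoring endpoint
    no_webui=True,         # disable the web UI
    startup_timeout=120,   # give up if the server is not up within 120 s
)
```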