
LlamaCppModelParameters Configuration

LlamaCppModelParameters(name: str, provider: str = 'llama.cpp', verbose: Optional[bool] = False, concurrency: Optional[int] = 5, backend: Optional[str] = None, prompt_template: Optional[str] = None, context_length: Optional[int] = None, path: Optional[str] = None, device: Optional[str] = None, seed: Optional[int] = -1, n_threads: Optional[int] = None, n_batch: Optional[int] = 512, n_gpu_layers: Optional[int] = 1000000000, n_gqa: Optional[int] = None, rms_norm_eps: Optional[float] = 5e-06, cache_capacity: Optional[str] = None, prefer_cpu: Optional[bool] = False)
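Below is a minimal sketch of constructing these parameters in Python for a locally deployed GGUF model. The import path, model name, and file path are assumptions for illustration and may differ in your DB-GPT version; the field names and defaults follow the signature above.

```python
# Hypothetical import path -- adjust to the module that defines
# LlamaCppModelParameters in your DB-GPT version.
from dbgpt.model.parameter import LlamaCppModelParameters

# Deploy a local GGUF model with llama.cpp, offloading all layers to the GPU.
params = LlamaCppModelParameters(
    name="vicuna-13b-v1.5",                            # logical model name (required); example value
    provider="llama.cpp",                              # local llama.cpp inference
    path="/data/models/vicuna-13b-v1.5.Q4_K_M.gguf",   # hypothetical local model path
    context_length=4096,                               # fixed context window instead of auto-detection
    n_gpu_layers=1000000000,                           # offload every layer to the GPU
    n_batch=512,                                       # prompt tokens batched per llama_eval call
    seed=-1,                                           # -1 means a random seed
    prefer_cpu=False,                                  # keep the GPU preference when one is available
)
print(params.name, params.provider)
```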

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| name | string | Yes | The name of the model. |
| path | string | No | The path of the model, if you want to deploy a local model. |
| backend | string | No | The real model name to pass to the provider. If None, name is used as the real model name. |
| device | string | No | The device to run the model on. If None, the device is determined automatically. |
| provider | string | No | The provider of the model. If the model is deployed locally, this is the inference type. If the model is deployed on a third-party service, this is the platform name ('proxy/<platform>'). Defaults: llama.cpp |
| verbose | boolean | No | Show verbose output. Defaults: False |
| concurrency | integer | No | Model concurrency limit. Defaults: 5 |
| prompt_template | string | No | Prompt template. If None, the prompt template is determined automatically from the model. Only used for local deployment. |
| context_length | integer | No | The context length of the model. If None, it is determined automatically from the model. |
| seed | integer | No | Random seed for llama.cpp models. Use -1 for a random seed. Defaults: -1 |
| n_threads | integer | No | Number of threads to use. If None, the number of threads is determined automatically. |
| n_batch | integer | No | Maximum number of prompt tokens to batch together when calling llama_eval. Defaults: 512 |
| n_gpu_layers | integer | No | Number of layers to offload to the GPU. Set this to 1000000000 to offload all layers. Defaults: 1000000000 |
| n_gqa | integer | No | Grouped-query attention. Must be 8 for llama-2 70b. |
| rms_norm_eps | number | No | RMS norm epsilon; 5e-6 is a good value for llama-2 models. Defaults: 5e-06 |
| cache_capacity | string | No | Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes are assumed. |
| prefer_cpu | boolean | No | If a GPU is available, it is preferred by default; set prefer_cpu=True to run on the CPU instead. Defaults: False |
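
To make the meaning of the llama.cpp-specific fields concrete, here is a rough sketch of how they could map onto the Llama constructor from the llama-cpp-python package. This is an illustrative helper, not DB-GPT's actual loader code; it assumes llama-cpp-python is installed and that a params object like the one above is passed in.

```python
from llama_cpp import Llama  # assumes the llama-cpp-python package is installed


def load_llama(params) -> Llama:
    """Illustrative helper (not DB-GPT's loader): map the fields above onto
    llama-cpp-python's Llama constructor."""
    return Llama(
        model_path=params.path,                  # local GGUF model file
        n_ctx=params.context_length or 512,      # context window; fallback when auto-detection yields None
        n_batch=params.n_batch,                  # prompt tokens batched per eval call
        n_threads=params.n_threads,              # None lets llama.cpp pick the thread count
        n_gpu_layers=0 if params.prefer_cpu else params.n_gpu_layers,  # illustrative handling of prefer_cpu
        seed=params.seed,                        # -1 means a random seed
        verbose=params.verbose,                  # llama.cpp log output
    )
```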