LlamaCppModelParameters Configuration
LlamaCppModelParameters(name: str, provider: str = 'llama.cpp', verbose: Optional[bool] = False, concurrency: Optional[int] = 5, backend: Optional[str] = None, prompt_template: Optional[str] = None, context_length: Optional[int] = None, path: Optional[str] = None, device: Optional[str] = None, seed: Optional[int] = -1, n_threads: Optional[int] = None, n_batch: Optional[int] = 512, n_gpu_layers: Optional[int] = 1000000000, n_gqa: Optional[int] = None, rms_norm_eps: Optional[float] = 5e-06, cache_capacity: Optional[str] = None, prefer_cpu: Optional[bool] = False)
Parameters
Name | Type | Required | Description |
---|---|---|---|
name | string | ✅ | The name of the model. |
path | string | ❌ | The path of the model, if you want to deploy a local model. |
backend | string | ❌ | The real model name passed to the provider. If None, name is used as the real model name. Defaults: None |
device | string | ❌ | Device to run the model on. If None, the device is automatically determined. |
provider | string | ❌ | The provider of the model. For a locally deployed model, this is the inference type; for a model served by a third-party platform, this is the platform name ('proxy/<platform>'). Defaults: llama.cpp |
verbose | boolean | ❌ | Show verbose output. Defaults: False |
concurrency | integer | ❌ | Model concurrency limit. Defaults: 5 |
prompt_template | string | ❌ | Prompt template. If None, the prompt template is automatically determined from the model. Only used for local deployment. |
context_length | integer | ❌ | The context length of the model. If None, it is automatically determined from the model. |
seed | integer | ❌ | Random seed for llama-cpp models; -1 means a random seed. Defaults: -1 |
n_threads | integer | ❌ | Number of threads to use. If None, the number of threads is automatically determined. |
n_batch | integer | ❌ | Maximum number of prompt tokens to batch together when calling llama_eval. Defaults: 512 |
n_gpu_layers | integer | ❌ | Number of layers to offload to the GPU. Set this to 1000000000 to offload all layers to the GPU. Defaults: 1000000000 |
n_gqa | integer | ❌ | Grouped-query attention. Must be 8 for llama-2 70b. |
rms_norm_eps | number | ❌ | RMS norm epsilon; 5e-6 is a good value for llama-2 models. Defaults: 5e-06 |
cache_capacity | string | ❌ | Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
prefer_cpu | boolean | ❌ | If a GPU is available, it is preferred by default; set prefer_cpu=True to prefer the CPU instead. Defaults: False |
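
For reference, below is a minimal sketch of how these parameters might be assembled in Python for a local GGUF model. The import path and the model file path are assumptions, not part of this reference; adjust them to match your installation and the location of your model.

```python
# Minimal sketch: constructing LlamaCppModelParameters for a local GGUF model.
# NOTE: the import path and the model path below are assumptions and may differ
# from your installed package layout.
from dbgpt.model.parameter import LlamaCppModelParameters  # assumed import path

params = LlamaCppModelParameters(
    name="llama-cpp-chat",                       # required logical model name
    provider="llama.cpp",                        # local llama.cpp inference
    path="/models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local model file
    context_length=4096,                         # override the auto-detected context size
    n_gpu_layers=1000000000,                     # offload all layers to the GPU
    n_batch=512,                                 # prompt tokens batched per llama_eval call
    seed=-1,                                     # -1 selects a random seed
    prefer_cpu=False,                            # keep GPU preference when one is available
)
```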