HFLLMDeployModelParameters Configuration
Parameters for deploying a model locally.
Parameters
Name | Type | Required | Description |
---|---|---|---|
name | string | ✅ | The name of the model. |
path | string | ❌ | The path of the model, if you want to deploy a local model. |
backend | string | ❌ | The real model name to pass to the provider. Defaults to None; if None, `name` is used as the real model name. |
device | string | ❌ | Device to run the model on. If None, the device is automatically determined. |
provider | string | ❌ | The provider of the model. For a local deployment this is the inference type; for a model deployed on a third-party service it is the platform name (`proxy/<platform>`). Defaults: hf |
verbose | boolean | ❌ | Show verbose output. Defaults: False |
concurrency | integer | ❌ | Model concurrency limit. Defaults: 5 |
prompt_template | string | ❌ | Prompt template. If None, the prompt template is automatically determined from the model. Only used for local deployment. |
context_length | integer | ❌ | The context length of the model. If None, it is automatically determined from the model. |
trust_remote_code | boolean | ❌ | Trust remote code or not. Defaults: True |
quantization | BaseHFQuantization (bitsandbytes configuration, bitsandbytes_8bits configuration, bitsandbytes_4bits configuration) | ❌ | The quantization parameters. |
low_cpu_mem_usage | boolean | ❌ | Whether to use low CPU memory usage mode, which can reduce memory usage when loading the model. If the model is loaded with quantization, this defaults to True. Requires `accelerate` to be installed. |
num_gpus | integer | ❌ | The number of GPUs you expect to use. If empty, all available GPUs are used where possible. |
max_gpu_memory | string | ❌ | The maximum memory limit for each GPU. Only valid in multi-GPU configurations, e.g. `10GiB`, `24GiB`. |
torch_dtype | string | ❌ | The dtype of the model. Defaults to None. Valid values: `auto`, `float16`, `bfloat16`, `float`, `float32`. |
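
For illustration, here is a minimal sketch of how these parameters might be set in a TOML deployment config. The `[models]` / `[[models.llms]]` section layout, the example model name, and the example path are assumptions about the config file shape; the keys themselves correspond to the parameters documented in the table above.

```toml
# Hypothetical deployment config sketch; the section layout is an assumption,
# the keys map directly to the parameters documented above.
[models]

[[models.llms]]
name = "Qwen2.5-7B-Instruct"       # required: the model name (example placeholder)
provider = "hf"                    # local Hugging Face inference (default provider)
# path = "/data/models/Qwen2.5-7B-Instruct"   # optional: path to a local model
device = "cuda"                    # omit to let the device be determined automatically
context_length = 8192              # omit to infer the context length from the model
trust_remote_code = true
concurrency = 5
torch_dtype = "bfloat16"           # one of: auto, float16, bfloat16, float, float32

# Multi-GPU settings (max_gpu_memory is only valid in multi-GPU configurations)
num_gpus = 2
max_gpu_memory = "24GiB"
```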