VLLMDeployModelParameters Configuration
Local model deployment parameters for the vLLM inference backend. Configuration sketches follow the parameter table below.
Parameters
Name | Type | Required | Description |
---|---|---|---|
name | string | ✅ | The name of the model. |
path | string | ❌ | The path of the model, if you want to deploy a local model. |
backend | string | ❌ | The real model name to pass to the provider, default is None. If backend is None, use name as the real model name. |
device | string | ❌ | Device to run the model. If None, the device is automatically determined. Defaults: auto |
provider | string | ❌ | The provider of the model. If the model is deployed locally, this is the inference type. If the model is deployed on a third-party service, this is the platform name ('proxy/<platform>'). Defaults: vllm |
verbose | boolean | ❌ | Show verbose output. Defaults: False |
concurrency | integer | ❌ | Model concurrency limit. Defaults: 100 |
prompt_template | string | ❌ | Prompt template. If None, the prompt template is automatically determined from the model. Only used for local deployment. |
context_length | integer | ❌ | The context length of the model. If None, it is automatically determined from the model. |
trust_remote_code | boolean | ❌ | Trust remote code or not. Defaults: True |
download_dir | string | ❌ | Directory to download and load the weights; defaults to the default cache directory of Hugging Face. |
load_format | string | ❌ | The format of the model weights to load.<br/>* "auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if the safetensors format is not available.<br/>* "pt" will load the weights in the pytorch bin format.<br/>* "safetensors" will load the weights in the safetensors format.<br/>* "npcache" will load the weights in pytorch format and store a numpy cache to speed up the loading.<br/>* "dummy" will initialize the weights with random values, which is mainly for profiling.<br/>* "tensorizer" will load the weights using tensorizer from CoreWeave. See the Tensorize vLLM Model script in the Examples section for more information.<br/>* "runai_streamer" will load the Safetensors weights using Run:ai Model Streamer.<br/>* "bitsandbytes" will load the weights using bitsandbytes quantization. Valid values: auto, pt, safetensors, npcache, dummy, tensorizer, runai_streamer, bitsandbytes, sharded_state, gguf, mistral. Defaults: auto |
config_format | string | ❌ | The format of the model config to load.<br/>* "auto" will try to load the config in hf format if available, else it will try to load in mistral format. Valid values: auto, hf, mistral. Defaults: auto |
dtype | string | ❌ | Data type for model weights and activations.<br/>* "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.<br/>* "half" for FP16. Recommended for AWQ quantization.<br/>* "float16" is the same as "half".<br/>* "bfloat16" for a balance between precision and range.<br/>* "float" is shorthand for FP32 precision.<br/>* "float32" for FP32 precision. Valid values: auto, half, float16, bfloat16, float, float32. Defaults: auto |
kv_cache_dtype | string | ❌ | Data type for kv cache storage. If "auto", will use the model data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports fp8 (=fp8_e4m3). Valid values: auto, fp8, fp8_e5m2, fp8_e4m3. Defaults: auto |
seed | integer | ❌ | Random seed for operations. Defaults: 0 |
max_model_len | integer | ❌ | Model context length. If unspecified, will be automatically derived from the model config. |
distributed_executor_backend | string | ❌ | Backend to use for distributed model workers, either "ray" or "mp" (multiprocessing). If the product of pipeline_parallel_size and tensor_parallel_size is less than or equal to the number of GPUs available, "mp" will be used to keep processing on a single host. Otherwise, this will default to "ray" if Ray is installed and fail otherwise. Note that TPU only supports Ray for distributed inference. Valid values: ray, mp, uni, external_launcher |
pipeline_parallel_size | integer | ❌ | Number of pipeline stages. Defaults: 1 |
tensor_parallel_size | integer | ❌ | Number of tensor parallel replicas. Defaults: 1 |
max_parallel_loading_workers | integer | ❌ | Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism with large models. |
block_size | integer | ❌ | Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to `--max-model-len`. On CUDA devices, only block sizes up to 32 are supported. On HPU devices, block size defaults to 128. Valid values: 8, 16, 32, 64, 128 |
enable_prefix_caching | boolean | ❌ | Enables automatic prefix caching. |
swap_space | number | ❌ | CPU swap space size (GiB) per GPU. Defaults: 4 |
cpu_offload_gb | number | ❌ | The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass. Defaults: 0 |
gpu_memory_utilization | number | ❌ | The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU. For example, if you have two vLLM instances running on the same GPU, you can set the GPU memory utilization to 0.5 for each instance. Defaults: 0.9 |
max_num_batched_tokens | integer | ❌ | Maximum number of batched tokens per iteration. |
max_num_seqs | integer | ❌ | Maximum number of sequences per iteration. |
max_logprobs | integer | ❌ | Max number of log probs to return when logprobs is specified in SamplingParams. Defaults: 20 |
revision | string | ❌ | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
code_revision | string | ❌ | The specific revision to use for the model code on Hugging Face Hub. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
tokenizer_revision | string | ❌ | Revision of the Hugging Face tokenizer to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
tokenizer_mode | string | ❌ | The tokenizer mode.<br/>* "auto" will use the fast tokenizer if available.<br/>* "slow" will always use the slow tokenizer.<br/>* "mistral" will always use the `mistral_common` tokenizer. Valid values: auto, slow, mistral. Defaults: auto |
quantization | string | ❌ | Method used to quantize the weights. If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the data type of the weights. Valid values: aqlm, awq, deepspeedfp, tpu_int8, fp8, ptpc_fp8, fbgemm_fp8, modelopt, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq, experts_int8, neuron_quant, ipex, quark, moe_wna16 |
max_seq_len_to_capture | integer | ❌ | Maximum sequence length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. Additionally for encoder-decoder models, if the sequence length of the encoder input is larger than this, we fall back to the eager mode. Defaults: 8192 |
worker_cls | string | ❌ | The worker class to use for distributed execution. Defaults: auto |
extras | object | ❌ | Extra parameters; they will be passed to the vLLM engine. |
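Examples
The sketch below shows a minimal local deployment configuration using only fields from the table above. It is illustrative, not authoritative: the model name, path, and the plain-dict representation are assumptions, and the way values are actually supplied (config file, CLI flags, or a parameters class) depends on your framework version.

```python
# A minimal sketch of a local vLLM deployment configuration.
# Field names mirror the parameter table above; the model name and path
# are hypothetical placeholders.
vllm_deploy_params = {
    "name": "qwen2.5-7b-instruct",               # logical model name (hypothetical)
    "path": "/data/models/Qwen2.5-7B-Instruct",  # local weights path (hypothetical)
    "provider": "vllm",                          # local inference backend
    "dtype": "auto",                             # FP16/BF16 chosen from the checkpoint
    "load_format": "auto",                       # prefer safetensors, fall back to pytorch bin
    "trust_remote_code": True,
    "gpu_memory_utilization": 0.9,               # per-instance fraction of GPU memory
    "max_model_len": 8192,                       # context length; omit to derive from the model config
}
```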
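The cpu_offload_gb description gives a worked scenario (a 24 GB GPU plus 10 GB of offload for a 13B BF16 model needing about 26 GB). The helper below is a back-of-the-envelope sanity check of that arithmetic, assuming roughly 2 bytes per parameter for BF16 weights and ignoring KV cache, activations, and framework overhead.

```python
def fits_with_offload(num_params_b: float, bytes_per_param: int,
                      gpu_mem_gb: float, gpu_memory_utilization: float,
                      cpu_offload_gb: float) -> bool:
    """Rough check: do the weights fit in usable GPU memory plus CPU offload?

    This ignores KV cache, activations, and runtime overhead, so treat it
    as an estimate only.
    """
    weights_gb = num_params_b * bytes_per_param                      # e.g. 13B * 2 bytes ~= 26 GB
    usable_gb = gpu_mem_gb * gpu_memory_utilization + cpu_offload_gb
    return weights_gb <= usable_gb


# Scenario from the table: 24 GB GPU, 10 GB offload, 13B BF16 model.
print(fits_with_offload(13, 2, 24, 0.9, 10))   # True:  ~26 GB <= ~31.6 GB
print(fits_with_offload(13, 2, 24, 0.9, 0))    # False: ~26 GB >  ~21.6 GB
```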
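For larger models, the parallelism and memory fields combine as sketched below. The model name, path, and the choice of an AWQ checkpoint are assumptions used only to show how the documented fields relate; the comments restate constraints from the table rather than adding new ones.

```python
# Sketch of a multi-GPU deployment of a larger, pre-quantized model on one host.
# Per the table, when pipeline_parallel_size * tensor_parallel_size <= available
# GPUs, the "mp" executor backend keeps all workers on a single machine.
vllm_tp_params = {
    "name": "qwen2.5-72b-instruct-awq",               # hypothetical model name
    "path": "/data/models/Qwen2.5-72B-Instruct-AWQ",  # hypothetical AWQ checkpoint path
    "provider": "vllm",
    "tensor_parallel_size": 4,                # shard weights across 4 GPUs
    "pipeline_parallel_size": 1,
    "distributed_executor_backend": "mp",     # multiprocessing on a single host
    "quantization": "awq",                    # matches the assumed AWQ checkpoint
    "dtype": "half",                          # FP16 is recommended for AWQ in the table
    "kv_cache_dtype": "fp8",                  # requires CUDA 11.8+ per the table
    "swap_space": 4,                          # GiB of CPU swap space per GPU
    "max_num_seqs": 256,                      # cap on sequences per iteration
}
```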