
BitsandbytesQuantization8bits Configuration

Parameters for bitsandbytes 8-bit quantization. The sketches after the parameter table show how these options are typically used when loading a model.

Parameters

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `load_in_4bits` | boolean | `False` | Whether to load the model in 4-bit. |
| `llm_int8_enable_fp32_cpu_offload` | boolean | `False` | 8-bit models can offload weights between the CPU and GPU so that very large models fit into memory. Weights dispatched to the CPU are actually stored in float32 and are not converted to 8-bit (see the second sketch below). |
| `llm_int8_threshold` | number | `6.0` | An "outlier" is a hidden-state value greater than this threshold; such values are computed in fp16. While hidden-state values are usually normally distributed ([-3.5, 3.5]), the distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values of magnitude ~5, but beyond that there is a significant performance penalty. 6.0 is a good default, but a lower threshold may be needed for less stable models (small models or fine-tuning). |
| `llm_int8_skip_modules` | string | `[]` | An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as Jukebox, which has several heads in different places and not necessarily at the last position. For example, for `CausalLM` models, the last `lm_head` is kept in its original `dtype`. |
| `load_in_8bits` | boolean | `True` | Whether to load the model in 8-bit (the LLM.int8() algorithm). |
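For `llm_int8_enable_fp32_cpu_offload`, the usual pattern in transformers is to pair the flag with a custom `device_map` that pins some modules to the CPU; those modules then stay in float32 instead of being converted to 8-bit. A sketch, with module names taken from the BLOOM architecture used in the transformers documentation:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Everything fits on GPU 0 except lm_head, which stays on the CPU in fp32.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # allow fp32 weights on the CPU
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
```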