BitsandbytesQuantization8bits Configuration
Parameters for bitsandbytes 8-bit quantization.
Parameters
Name | Type | Required | Description |
---|---|---|---|
load_in_4bits | boolean | ❌ | Whether to load the model in 4-bit precision. Defaults: False |
llm_int8_enable_fp32_cpu_offload | boolean | ❌ | 8-bit models can offload weights between the CPU and GPU to fit very large models into memory. The weights dispatched to the CPU are stored in float32 and are not converted to 8-bit. Defaults: False |
llm_int8_threshold | number | ❌ | An “outlier” is a hidden-state value greater than a certain threshold; these values are computed in fp16. While hidden-state values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values of magnitude ~5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold may be needed for less stable models (small models, fine-tuning). Defaults: 6.0 |
llm_int8_skip_modules | array of strings | ❌ | An explicit list of modules that should not be converted to 8-bit. This is useful for models such as Jukebox, which has several heads in different places and not necessarily at the last position. For example, for `CausalLM` models, the last `lm_head` is kept in its original `dtype`. Defaults: [] |
load_in_8bits | boolean | ❌ | Whether to load the model in 8-bit precision (LLM.int8() algorithm). Defaults: True |
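
A minimal sketch of how these options map onto Hugging Face Transformers' `BitsAndBytesConfig`, assuming that is the backend consuming this configuration. Note the Transformers API uses the singular spellings `load_in_8bit` / `load_in_4bit`, which this section's `load_in_8bits` / `load_in_4bits` fields correspond to; the model id `facebook/opt-1.3b` is purely illustrative.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Mirror the defaults listed in the table above.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,                       # load_in_8bits: use the LLM.int8() algorithm
    llm_int8_threshold=6.0,                  # outlier threshold; lower it for less stable models
    llm_int8_skip_modules=["lm_head"],       # keep these modules in their original dtype
    llm_int8_enable_fp32_cpu_offload=False,  # if True, offload fp32 weights to the CPU
)

# "facebook/opt-1.3b" is a placeholder model id for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quant_config,
    device_map="auto",
)
```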