Releases: intel/auto-round
v0.4.4 release
Highlights:
1. Fix the install issue in #387
2. Support exporting the gguf q4_0 and q4_1 formats in #393
3. Fix the llm command-line seqlen issue in #399
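For readers unfamiliar with the two gguf formats named above, the following is a minimal sketch of how a q4_0-style and a q4_1-style block of 32 weights can be quantized and rebuilt. It follows the ggml block layouts as commonly described (one scale for q4_0, a scale plus a minimum for q4_1). It is not auto-round's exporter: the function names are hypothetical, and nibble packing and fp16 storage are left out.

```python
BLOCK = 32  # weights per block in both formats


def quant_q4_0(block):
    """One scale d per block; weights are rebuilt as d * (q - 8)."""
    vmax = max(block, key=abs)           # largest magnitude, sign kept
    d = vmax / -8.0                      # vmax lands exactly on level 0
    inv = 1.0 / d if d != 0 else 0.0
    q = [min(15, int(x * inv + 8.5)) for x in block]
    return d, q


def dequant_q4_0(d, q):
    return [d * (v - 8) for v in q]


def quant_q4_1(block):
    """Scale d and minimum m per block; weights are rebuilt as d * q + m."""
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15.0
    inv = 1.0 / d if d != 0 else 0.0
    q = [min(15, int((x - lo) * inv + 0.5)) for x in block]
    return d, lo, q


def dequant_q4_1(d, m, q):
    return [d * v + m for v in q]


def bits_per_weight(n_fp16_fields):
    """Storage cost: fp16 header fields plus 32 packed 4-bit values."""
    return (2 * n_fp16_fields + BLOCK // 2) * 8 / BLOCK


block = [(i - 16) / 10 for i in range(BLOCK)]
d0, q0 = quant_q4_0(block)
d1, m1, q1 = quant_q4_1(block)
print("q4_0 bits/weight:", bits_per_weight(1))
print("q4_1 bits/weight:", bits_per_weight(2))
```

The extra minimum in q4_1 costs half a bit per weight but lets the 16 levels cover an asymmetric value range.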
What's Changed
- fix a critical bug of static activation quantization by @wenhuach21 in #392
- vlm 70B+ in single card by @n1ck-guo in #395
- enhance calibration dataset and add awq pre quantization warning by @wenhuach21 in #396
- support awq format for vlms by @WeiweiZhang1 in #398
- [critical bug]fix llm example seqlen issue by @WeiweiZhang1 in #399
- fix device auto issue by @wenhuach21 in #400
- Fix auto-round install & bump into 0.4.4 by @XuehaoSun in #387
- fix dtype converting issue by @wenhuach21 in #403
- support for deepseek vl2 by @n1ck-guo in #401
- llm_layer_config_bugfix by @WeiweiZhang1 in #406
- support awq with qbits, only support sym by @wenhuach21 in #402
- support to export gguf q4_0 and q4_1 format by @n1ck-guo in #393
Full Changelog: v0.4.3...v0.4.4
v0.4.3: bug fix release
Highlights:
fix incorrect device setting in autoround format inference by @WeiweiZhang1 in #383
remove the dependency on AutoGPTQ by @XuehaoSun in #380
What's Changed
- support_llava_hf_vlm_example by @WeiweiZhang1 in #381
- fix block_name_to_quantize by @WeiweiZhang1 in #382
- fix incorrect device setting in autoround format inference by @WeiweiZhang1 in #383
- refine homepage, update model links by @WeiweiZhang1 in #385
- update eval basic usage by @n1ck-guo in #384
- refine error msg and dump more log in the tuning by @wenhuach21 in #386
- remove the dependency on AutoGPTQ for CPU and bump to V0.4.3 by @XuehaoSun in #380
Full Changelog: v0.4.2...v0.4.3
v0.4.2: bug fix release
Highlights
1. Fix the autoawq exporting issue
2. Remove bias exporting where possible in the autogptq format
What's Changed
- bump version into v0.4.1 by @XuehaoSun in #350
- Update docker user and remove baseline UT by @XuehaoSun in #347
- delete llm example and refine readme by @wenhuach21 in #354
- Simulated W4Afp8 Quantization by @wenhuach21 in #331
- add QWQ-32B, VLM, Qwen2.5, Llama3.1 int4 models by @wenhuach21 in #356
- fix awq exporting by @wenhuach21 in #358
- Tensor reshape bugfix by @WeiweiZhang1 in #364
- fix awq backend and fp_layers issue by @wenhuach21 in #363
- fix awq exporting bugs by @wenhuach21 in #365
- fix bug of only_text_test check due to inference issue on cpu by @n1ck-guo in #362
- add gpu test by @wenhuach21 in #367
- using multicard when device set to "auto" by @n1ck-guo in #368
- quant_block_names enhancement by @WeiweiZhang1 in #369
- [HPU] Add lazy mode back by @yiliu30 in #371
- remove bias exporting if possible in autogptq format by @wenhuach21 in #375
- save processor automatically by @n1ck-guo in #372
- Add gpu ut by @wenhuach21 in #370
- fix gpu ut by @n1ck-guo in #376
- fix typos by @wenhuach21 in #377
Full Changelog: v0.4.1...v0.4.2
v0.4.1: bug fix release
Highlights:
- Fixed vllm calibration infinite loop issue
- Corrected the default value for the sym argument in the API configuration.
What's Changed
- fix typo by @wenhuach21 in #342
- vllm/llama-vision llava calibration infinite loop fix by @WeiweiZhang1 in #343
- [HPU]Enhance numba check by @yiliu30 in #345
- [VLM]fix bs and grad reset by @n1ck-guo in #344
- [HPU]Enhance installation check by @yiliu30 in #346
- [Critical Bug]API use sym as default by @wenhuach21 in #349
- triton backend requires< 3.0 by @wenhuach21 in #348
Full Changelog: v0.4...v0.4.1
v0.4
Highlights
[Experimental Feature] We provide API support for VLM models
[Kernel] We add ipex support for Intel CPUs
[Bug fix] We fix a tuning bug for the glm4 model
[Enhancement] better align gradient_accumulate_steps behavior for varied length input
What's Changed
- refine AutoRound format and support marlin repacking by @wenhuach21 in #280
- update readme for v0.3.1 release by @wenhuach21 in #283
- update readme for cpu inference by @wenhuach21 in #284
- avoid deterministic algorithm warning in inference by @wenhuach21 in #285
- fix mx_fp issues by @wenhuach21 in #286
- update torch ao integration information by @wenhuach21 in #287
- Refine code by @wenhuach21 in #291
- Add ipex support for intel cpu by @wenhuach21 in #292
- fix ipex tqdm mismatch issue by @wenhuach21 in #293
- fix bug of backend by @wenhuach21 in #294
- [Experimental Feature]support for common hf multimodel by @n1ck-guo in #276
- use torch.compile by default for PyTorch versions 2.6 and above by @wenhuach21 in #295
- refine forward hook by @WeiweiZhang1 in #290
- eval for MLLMs by @n1ck-guo in #296
- mllm eval bug fix by @n1ck-guo in #297
- Port Numba-based packing from INC by @yiliu30 in #301
- refine model config file for mixed precision quantization by @wenhuach21 in #300
- fix glm4-9b batch dim issue by @wenhuach21 in #304
- better align gradient_accumulate_steps for varied length input by @wenhuach21 in #309
- Enable torch.compile on HPU by @yiliu30 in #307
- Update autogptq exporting by @wenhuach21 in #310
- fix typo by @wenhuach21 in #311
- qwen2 vision quantization bugfix by @WeiweiZhang1 in #313
- multiple gpu evaluation/calibration refine by @wenhuach21 in #312
- HPU only release binary by @yiliu30 in #302
- patch 1 for mllm by @n1ck-guo in #298
- add torch compile arg by @wenhuach21 in #314
- fix merge error by @n1ck-guo in #316
- Update the check for HPU by @yiliu30 in #318
- fix eval device issue by @wenhuach21 in #319
- fix multiple device bug by @wenhuach21 in #321
- add warning for no gptq exllamav2 kernel by @wenhuach21 in #324
- add pile calib, rename quant_block_list to to_quant_block_names by @WeiweiZhang1 in #322
- fix autogptq version error by @wenhuach21 in #325
- new mllm eval by @n1ck-guo in #317
- Add cpu only version by @XuehaoSun in #315
- set default mllm dataset by @n1ck-guo in #327
- fix fp_layers issue and force to FP16 on cuda for autoround format inference by @wenhuach21 in #326
- fix the bug of test model support for test-only by @n1ck-guo in #328
- Increase unit test timeout to 120 minutes by @XuehaoSun in #330
- fix mllm dataset config bug and add gptq cuda backend by @wenhuach21 in #329
- add tips and tricks for llm&mllm quantization by @wenhuach21 in #333
- fix eval_bs in fake format and reset auto-gptq exporting max_shard_size by @wenhuach21 in #332
- fix model_dtype issue and reformat mllm code by @wenhuach21 in #335
- Exclude markdown files from unit test pipelines by @XuehaoSun in #337
- refine mllm docs by @WeiweiZhang1 in #336
- cogvlm doc by @n1ck-guo in #339
- add qwen2.5 recipe and refine readme by @WeiweiZhang1 in #338
- add cogvlm recipe and refine readme by @WeiweiZhang1 in #340
- refine mllm API and add help info by @n1ck-guo in #334
Full Changelog: v0.3.1...v0.4
Intel® auto-round v0.3.1 Release
Release Highlights:
New Features:
Full-Range Symmetric Quantization: We’ve introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths, such as 2.
Command-Line Support: You can now quantize models using the command auto-round --model xxx --format xxx
Default Exporting Format Change: The default format has been updated to auto_round instead of auto_gptq.
Multi-thread Packing: Up to 2X speedup in the packing phase.
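To illustrate why the full-range symmetric scheme listed above can help at 2 bits, here is a minimal round-to-nearest sketch. It assumes "full range" means using all 2^bits signed levels, with the largest-magnitude weight mapped onto the extra negative level, versus a restricted range that drops that level. The function name is hypothetical and this is not auto-round's tuned implementation.

```python
def sym_quant_dequant(weights, bits, full_range):
    """Round-to-nearest symmetric quantize/dequantize (assumes bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1                     # largest positive level
    qmin = -(2 ** (bits - 1)) if full_range else -qmax
    vmax = max(weights, key=abs)                   # largest magnitude, sign kept
    if vmax == 0:
        return [0.0] * len(weights), [0] * len(weights)
    # Full range: map vmax onto qmin, so the scale may be negative.
    # Restricted range: map |vmax| onto qmax.
    scale = vmax / qmin if full_range else abs(vmax) / qmax
    q = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q], q


weights = [-1.0, -0.4, 0.0, 0.4]
for full_range in (False, True):
    deq, q = sym_quant_dequant(weights, bits=2, full_range=full_range)
    err = max(abs(a - b) for a, b in zip(weights, deq))
    print(f"full_range={full_range}: levels={q}, max error={err:.2f}")
```

At 2 bits the restricted range has only three usable levels (-1, 0, 1), so recovering the fourth level is a large relative gain; the benefit shrinks as the bit width grows.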
Bug Fixes:
Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers version 4.45.2.
Mutable Default Values Issue: Addressed problems related to mutable default values.
3-bit Packing Bug: Fixed a 3-bit packing bug for the AutoGPTQ format.
What's Changed
- Add setseed in autoround by @WeiweiZhang1 in #201
- support autoawq format by @yintong-lu in #115
- Remove UT coverage check by @XuehaoSun in #202
- set autoround format as default to unify CPU/HPU/CUDA by @wenhuach21 in #205
- add local file of pile-10k by @WeiweiZhang1 in #198
- modify setup.py by @n1ck-guo in #206
- limit the scale minimum value not to 0 by @WeiweiZhang1 in #211
- fix example dataset regression by @WeiweiZhang1 in #212
- remove local pile file by @WeiweiZhang1 in #213
- update xpu format exporting by @WeiweiZhang1 in #214
- fix a bug in autoround format inference by @wenhuach21 in #215
- avoid underflow and overflow for exllamav2 by @wenhuach21 in #218
- add qwen int4 model, refine example by @WeiweiZhang1 in #217
- [Experimental Feature]fast tuning norm/bias at 2 bits by @wenhuach21 in #208
- update readme by @wenhuach21 in #220
- refine eval_042 to enable parallelize evaluation by @WeiweiZhang1 in #221
- Enable phi3v tuning by @WeiweiZhang1 in #197
- Bump setuptools from 69.5.1 to 70.0.0 in /examples/multimodal-modeling/Phi-3-vision by @dependabot in #223
- refine example by @WeiweiZhang1 in #224
- change the scale thresh generally by @WeiweiZhang1 in #229
- add quantized models by 3rd party by @WeiweiZhang1 in #230
- add meta3.1-70B-instruct model, refine docs by @WeiweiZhang1 in #231
- fix model link by @WeiweiZhang1 in #232
- refine docs, add accuracy data, add recipe and eval scripts by @WeiweiZhang1 in #226
- add brief formats introduction by @wenhuach21 in #236
- update readme and add itrex in the requirements.txt by @wenhuach21 in #238
- add tritonv2, improve packing and pbar by @wenhuach21 in #239
- refine the code and the speedup is notable by @wenhuach21 in #240
- move some settings from example to main by @wenhuach21 in #241
- add runnable script for autoround by @n1ck-guo in #225
- update readme by @n1ck-guo in #242
- Add MANIFEST.in file to include requirements.txt by @XuehaoSun in #243
- fix example bug by @n1ck-guo in #245
- enable llava int4 inference with autoround format by @WeiweiZhang1 in #237
- remove autoawq requirement at packing stage by @n1ck-guo in #249
- remove unused log by @n1ck-guo in #252
- support INC API by @WeiweiZhang1 in #255
- avoid potential bug for auto-gptq 0.8 by @wenhuach21 in #250
- fix example by @n1ck-guo in #256
- fix preci by @n1ck-guo in #258
- enable_qwen2-vl_quantization by @WeiweiZhang1 in #248
- update eval and fix example by @n1ck-guo in #260
- refine autoawq exporting code by @wenhuach21 in #261
- better support quant_lm_head for larger models by @wenhuach21 in #263
- Fix 3bit packing for auto-gptq format by @wenhuach21 in #264
- Add a warning for improper export formats. by @wenhuach21 in #265
- Update readme for VLM support and integration by @wenhuach21 in #266
- remove g_idx in gptq format by @wenhuach21 in #267
- keep the dtype after qdq by @wenhuach21 in #268
- enable llama3.2-vision model quantization by @WeiweiZhang1 in #269
- fix mutable default value by @wenhuach21 in #272
- change to even rounding for mantissa of mx_fp by @wenhuach21 in #277
- adamround bugfix, refine import by @WeiweiZhang1 in #275
- [Important Change]set full range sym as the default by @wenhuach21 in #278
- refine eval by @wenhuach21 in #282
- qwen2_bugfix, add adamround vision UT by @WeiweiZhang1 in #281
New Contributors
- @dependabot made their first contribution in #223
Full Changelog: v0.3...v0.3.1
Intel® auto-round v0.3 Release
Highlights:
- Broader Device Support:
- Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue.
- New Recipes and Model Releases:
- Published numerous recipes on the Low Bit Open LLM Leaderboard, showcasing impressive results on LLaMa 3.1 and other leading models.
- Experimental Features:
- Introduced several experimental features, including activation quantization and mx_fp, with promising outcomes with AutoRound.
- Multimodal Model Support:
- Extended capabilities for tuning and inference across several multimodal models.
Lowlights:
- Implemented support for low_cpu_mem_usage, auto_awq format, calibration dataset concatenation, and calibration datasets with chat templates.
Intel® auto-round v0.2 Release
Overview
We supported the Intel XPU format and implemented lm-head quantization and inference, reducing the model size from 5.4GB to 4.7GB for LLAMA3 at W4G128. Additionally, we supported both local and mixed online datasets for calibration. By optimizing memory usage and tuning costs, the calibration process now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models with 512 samples by setting disable_low_gpu_mem_usage.
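The size reduction from lm-head quantization can be sanity-checked with simple arithmetic. The sketch below assumes the Llama 3 8B lm-head shape (vocabulary 128,256, hidden size 4,096) and assumes each group of 128 weights stores an fp16 scale and a 4-bit zero point. Neither assumption comes from the release notes, so treat the result as a rough estimate only.

```python
VOCAB, HIDDEN = 128_256, 4_096   # assumed Llama 3 8B lm-head shape
BITS, GROUP = 4, 128             # W4G128

params = VOCAB * HIDDEN
fp16_bytes = params * 2          # unquantized lm-head in fp16
groups = params // GROUP

# Assumption: per group, one fp16 scale (2 bytes) plus one 4-bit zero point.
q_bytes = params * BITS // 8 + groups * 2 + groups * BITS // 8

saved_gb = (fp16_bytes - q_bytes) / 1e9
print(f"lm-head fp16:   {fp16_bytes / 1e9:.2f} GB")
print(f"lm-head W4G128: {q_bytes / 1e9:.2f} GB")
print(f"saved:          {saved_gb:.2f} GB")
```

Under these assumptions the saving comes out a little under 0.8 GB, the same order as the 0.7 GB difference between 5.4GB and 4.7GB.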
Others:
More accuracy data is presented in the [paper](https://arxiv.org/pdf/2309.05516) and on the [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)
More technical details are presented in the [paper](https://arxiv.org/pdf/2309.05516)
Known issues:
Large discrepancy between gptq model and qdq model for asymmetric quantization in some scenarios. We are working on it.
Intel® auto-round v0.1 Release
Overview
AutoRound introduces an innovative weight-only quantization algorithm designed specifically for low-bit LLM inference, approaching near-lossless compression for a range of popular models including gemma-7B, Mistral-7b, Mixtral-8x7B-v0.1, Mixtral-8x7B-Instruct-v0.1, Phi2, LLAMA2 and more at W4G128. AutoRound consistently outperforms established methods across the majority of scenarios at W4G128, W4G-1, W3G128, and W2G128.
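The W4G128-style labels above encode the weight bit width and the quantization group size. As a reading aid, the sketch below parses such a label and applies plain group-wise round-to-nearest quantization to one weight row. It assumes a group size of -1 means one group per row, the helper names are hypothetical, and round-to-nearest is only the simple baseline here, not the AutoRound algorithm itself.

```python
import re


def parse_scheme(name):
    """'W4G128' -> (4, 128); 'W4G-1' -> (4, -1)."""
    m = re.fullmatch(r"W(\d+)G(-?\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized scheme: {name}")
    return int(m.group(1)), int(m.group(2))


def rtn_groupwise(row, bits, group_size):
    """Asymmetric round-to-nearest over consecutive groups of one row."""
    if group_size == -1:                  # assumed: whole row is one group
        group_size = len(row)
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # constant group: any scale works
        out.extend(lo + scale * round((w - lo) / scale) for w in group)
    return out


row = [i * i / 100 for i in range(256)]
for name in ("W4G128", "W4G-1", "W3G128", "W2G128"):
    bits, group = parse_scheme(name)
    deq = rtn_groupwise(row, bits, group)
    err = max(abs(a - b) for a, b in zip(row, deq))
    print(f"{name}: bits={bits}, group={group}, max error={err:.3f}")
```

Smaller groups track local value ranges more closely at the cost of storing more scales, which is why the group size appears alongside the bit width in every reported configuration.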
Key Features
- Wide Model Support: AutoRound caters to a diverse range of model families. About 20 model families have been verified.
- Export Flexibility: Effortlessly export quantized models to ITREX[1] and AutoGPTQ[2] formats for seamless deployment on Intel CPU and Nvidia GPU platforms respectively.
- Device Compatibility: Compatible with tuning devices including Intel CPUs, Intel Gaudi2, and Nvidia GPUs.
- Dataset Flexibility: AutoRound supports calibration with Pile10k and MBPP datasets, with easy extensibility to incorporate additional datasets.
Examples
- Explore language modeling and code generation examples to unlock the full potential of AutoRound.
Additional Benefits
- PreQuantized Models: Access a variety of pre-quantized models on Hugging Face for immediate integration into your projects, with more models under review and coming soon.
- Comprehensive Accuracy Data: Simplified user deployment with extensive accuracy data provided.
Known issues:
- baichuan-inc/Baichuan2-13B-Chat has some issues; we will support it soon.
Reference:
[1] https://github.com/intel/intel-extension-for-transformers