TensorRT-LLM 0.7.1 Release #749
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.7.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Python binding for `GptManager`
- A Python class `ModelRunnerCpp` that wraps the C++ `gptSession` (a minimal usage sketch follows this list)
- The new `trtllm-build` command (already applied to blip2 and OPT)
- Support for `StoppingCriteria` and `LogitsProcessor` in the Python generate API (thanks to the contribution from @zhang-ge-hao); a second sketch below shows the callable interface
- A fix for the weight-shape mismatch error "the value update is not the same shape as the original. updated: (2560, 3840), original (5120, 3840)" #580
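For those who want to try the new wrapper, here is a minimal sketch of loading a prebuilt engine through `ModelRunnerCpp` and generating a few tokens. The engine directory, the GPT-2 tokenizer, and the exact `generate()` keyword arguments are illustrative assumptions, not an authoritative reference; please check the examples shipped in the repository for version-matched usage.

```python
# Minimal sketch: running generation through the new ModelRunnerCpp wrapper.
# The engine directory, tokenizer name, and keyword arguments are assumptions
# for illustration only.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("gpt2")                # assumed tokenizer
runner = ModelRunnerCpp.from_dir(engine_dir="./trt_engines/gpt2")  # assumed engine path

# Encode a prompt and run generation on the GPU.
input_ids = tokenizer("Hello, TensorRT-LLM!", return_tensors="pt").input_ids.int()
outputs = runner.generate(
    batch_input_ids=[input_ids[0]],
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# Output shape is (batch, beams, sequence length); decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```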
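The new decoding hooks follow a HuggingFace-style callable interface. Below is a hedged sketch of a custom logits processor that bans a token id and a stopping criterion that caps sequence length; the import locations, call signatures, and the `generate()` keyword arguments shown are assumptions based on that convention rather than a verbatim copy of the 0.7.1 API.

```python
# Sketch of the new decoding hooks in the Python generate API.
# The imports, base-class call signatures, and generate() keyword arguments
# below are assumptions modeled on the HuggingFace-style interface that this
# feature mirrors.
import torch
from tensorrt_llm.runtime import ModelRunner, LogitsProcessor, StoppingCriteria  # assumed exports

class BanTokenProcessor(LogitsProcessor):
    """Force one token id to -inf so it is never sampled."""
    def __init__(self, banned_id: int):
        self.banned_id = banned_id

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        scores[..., self.banned_id] = float("-inf")
        return scores

class StopAtLength(StoppingCriteria):
    """Stop decoding once the sequence reaches max_length tokens."""
    def __init__(self, max_length: int):
        self.max_length = max_length

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> bool:
        return input_ids.shape[-1] >= self.max_length

runner = ModelRunner.from_dir(engine_dir="./trt_engines/gpt2")  # assumed engine path
outputs = runner.generate(
    batch_input_ids=[torch.tensor([1, 2, 3], dtype=torch.int32)],
    max_new_tokens=32,
    logits_processor=BanTokenProcessor(banned_id=0),   # assumed keyword
    stopping_criteria=StopAtLength(max_length=64),     # assumed keyword
)
```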
Currently, there are two key branches in the project: the main branch and the stable branch. We are updating the main branch regularly with new features, bug fixes, and performance optimizations. The stable branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team