Update on the development branch #1316
kaiyux
announced in
Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on March 19, 2024.
This update includes:
- Support `GptSession` without OpenMPI (Run GptSession without openmpi? #1220)
- Add Python bindings for the `executor` API, see documentation and examples in `examples/bindings`
- See `examples/gpt/README.md` for the latest commands
- See `examples/qwen/README.md` for the latest commands
- Moved the prompt embedding table size option to the `trtllm-build` command, to generalize the feature better to more models; use `trtllm-build --max_prompt_embedding_table_size` instead
- Renamed the `trtllm-build --world_size` flag to `--auto_parallel`; the option is used for the auto parallel planner only
- `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py`
- `examples/server` is removed, see `examples/app` instead
- Fix `SamplingConfig` tensors in `ModelRunnerCpp` (ModelRunnerCpp does not transfer SamplingConfig Tensor fields correctly #1183)
- Fix `examples/run.py` loading only one line from `--input_file`
- Update `benchmarks/cpp/README.md`
- Update the base Docker image to `nvcr.io/nvidia/pytorch:24.02-py3`
- Update the Triton Inference Server base image to `nvcr.io/nvidia/tritonserver:24.02-py3`
- Add documentation for the `executor` API, see `docs/source/executor.md`
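Two of the changes above affect the `trtllm-build` command line. As a rough sketch only (the checkpoint and output paths are made up, and the full flag set depends on your model; run `trtllm-build --help` for the authoritative options), a build invocation after this update might look like:

```shell
# Hypothetical checkpoint/output paths, for illustration only.
# --max_prompt_embedding_table_size is now a trtllm-build option,
# and --auto_parallel (auto parallel planner) replaces --world_size.
trtllm-build \
    --checkpoint_dir ./gpt_checkpoint \
    --output_dir ./gpt_engine \
    --max_prompt_embedding_table_size 1024 \
    --auto_parallel 2
```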
Thanks,
The TensorRT-LLM Engineering Team