
Commit

Merge branch 'main' of https://github.com/microsoft/Olive into samuel100/readme-update
samuel100 committed Dec 11, 2024
2 parents 52cfb3c + 9191ba6 commit 9ef08c3
Showing 11 changed files with 568 additions and 185 deletions.
15 changes: 11 additions & 4 deletions docs/source/how-to/configure-workflows/pass/convert-onnx.md
@@ -61,19 +61,26 @@ b. More fine-grained control of the conversion conditions is also possible:

See [Float16 Conversion](https://onnxruntime.ai/docs/performance/model-optimizations/float16.html#float16-conversion) for more detailed description of the available configuration parameters.

## Inputs/Outputs Float16 to Float32 Conversion
## Inputs/Outputs DataType Conversion

Certain environments such as Onnxruntime WebGPU prefers Float32 logits. The `OnnxIOFloat16ToFloat32` pass converts the inputs and outputs to use Float32 instead of Float16.
In certain environments, such as Onnxruntime WebGPU, Float32 logits are preferred. The `OnnxIODataTypeConverter` pass enables conversion of model inputs and outputs to a specified data type. This is particularly useful for converting between data types such as Float16 and Float32, or any other supported ONNX data types.

### Example Configuration

a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
The simplest configuration converts all inputs and outputs from Float16 (source_dtype = 10) to Float32 (target_dtype = 1), which is suitable for many models:

```json
{
"type": "OnnxIOFloat16ToFloat32"
"type": "OnnxIODataTypeConverter",
"source_dtype": 10,
"target_dtype": 1
}
```
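
If needed, you can sanity-check the result after the pass runs. The following is a minimal sketch (not part of the pass itself), assuming a hypothetical output path `models/model_fp32.onnx` and that the `onnx` Python package is installed:

```python
import onnx

# Hypothetical path to the model produced by the OnnxIODataTypeConverter pass
model = onnx.load("models/model_fp32.onnx")

# Print the element type of every graph input and output.
# After a source_dtype=10 -> target_dtype=1 conversion these should
# report FLOAT (1) rather than FLOAT16 (10).
for value_info in list(model.graph.input) + list(model.graph.output):
    elem_type = value_info.type.tensor_type.elem_type
    print(value_info.name, onnx.TensorProto.DataType.Name(elem_type))
```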

### Datatype Mapping

The `source_dtype` and `target_dtype` are integers corresponding to ONNX data types. You can find the complete mapping in the ONNX protobuf definition [here](https://github.com/onnx/onnx/blob/96a0ca4374d2198944ff882bd273e64222b59cb9/onnx/onnx.proto3#L503-L551).
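
If you would rather not read the protobuf definition, the same integer codes can be looked up from the `onnx` Python package. This is an illustrative sketch; the enum values are defined by ONNX itself, not by Olive:

```python
import onnx

# TensorProto enum members give the integer codes used by source_dtype / target_dtype
print(onnx.TensorProto.FLOAT16)   # 10
print(onnx.TensorProto.FLOAT)     # 1
print(onnx.TensorProto.BFLOAT16)  # 16

# List every data type name together with its integer code
for name, value in onnx.TensorProto.DataType.items():
    print(value, name)
```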

## Mixed Precision Conversion
Convert the model to mixed precision.

6 changes: 3 additions & 3 deletions docs/source/reference/pass.rst
@@ -43,9 +43,9 @@ OnnxFloatToFloat16

.. _onnx_io_float16_to_float32:

OnnxIOFloat16ToFloat32
----------------------
.. autoconfigclass:: olive.passes.OnnxIOFloat16ToFloat32
OnnxIODataTypeConverter
------------------------
.. autoconfigclass:: olive.passes.OnnxIODataTypeConverter

.. _ort_mixed_precision:

290 changes: 290 additions & 0 deletions examples/getting_started/olive-awq-ft-llama.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tv6vx7wooDfk"
},
"source": [
"# ✨ Quantize & Finetune an SLM with Olive\n",
"\n",
"> ⚠️ **This notebook will quantize an Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU device.**\n",
"\n",
"In this notebook, you will:\n",
"\n",
"1. Quantize Llama-3.2-1B-Instruct model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).\n",
"1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.\n",
"1. Optimize the fine-tuned model for the ONNX Runtime.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🐍 Install Python dependencies\n",
"\n",
"The following cells create a pip requirements file and then install the libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile requirements.txt\n",
"\n",
"olive-ai==0.7.1\n",
"transformers==4.44.2\n",
"autoawq==0.2.6\n",
"optimum==1.23.1\n",
"peft==0.13.2\n",
"accelerate>=0.30.0\n",
"scipy==1.14.1\n",
"onnxruntime-genai==0.5.0\n",
"torchvision==0.18.1\n",
"tabulate==0.9.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZtY3VYxCoDfm"
},
"outputs": [],
"source": [
"%%capture\n",
"\n",
"%pip install -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🤗 Login to Hugging Face\n",
"\n",
"In this notebook you'll be finetuning [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face and therefore you will need to request access to the model. Once you have access to the model, you'll need to log-in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!huggingface-cli login --token USER_ACCESS_TOKEN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🗜️ Quantize the model using AWQ\n",
"First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.\n",
"\n",
"You can choose a different model to quantize from Hugging-Face, just update the `--model_name_or_path` argument.\n",
"> ⏳ **It takes approximately ~6mins to complete the AWQ quantization**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!olive quantize \\\n",
" --model_name_or_path \"meta-llama/Llama-3.2-1B-Instruct\" \\\n",
" --trust_remote_code \\\n",
" --algorithm awq \\\n",
" --output_path models/llama/awq \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nxJCT5wioDfp"
},
"source": [
"## 🏃 Train the model\n",
"\n",
"Fine-tuning language models helps when we desire very specific outputs. In this example, you'll fine-tune the **AWQ quantized model variant** of Llama-3.2-1B-instruct from the previous cell to respond to an English phrase with a single word answer that classifies the phrases into one of surprise/fear/joy/sadness categories. Here is a sample of the data used for fine-tuning:\n",
"\n",
"```jsonl\n",
"{\"phrase\": \"The sudden thunderstorm caught me off guard.\", \"tone\": \"surprise\"}\n",
"{\"phrase\": \"The creaking door at night is quite spooky.\", \"tone\": \"fear\"}\n",
"{\"phrase\": \"Celebrating my birthday with friends is always fun.\", \"tone\": \"joy\"}\n",
"{\"phrase\": \"Saying goodbye to my pet was heart-wrenching.\", \"tone\": \"sadness\"}\n",
"```\n",
"\n",
"Fine-tuning *after* quantization provides an opportunity to recover some of the loss from the quantization process and enhance the model quality. For more details on quantization and finetuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).\n",
"\n",
"In the following `olive finetune` command the `--data_name` argument is a Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.\n",
"\n",
"> ⏳ **It takes ~6mins to complete the Finetuning**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8t36pRF2oDfq"
},
"outputs": [],
"source": [
"!olive finetune \\\n",
" --method lora \\\n",
" --model_name_or_path models/llama/awq \\\n",
" --trust_remote_code \\\n",
" --data_name xxyyzzz/phrase_classification \\\n",
" --text_template \"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n{tone}<|eot_id|>\" \\\n",
" --max_steps 300 \\\n",
" --output_path models/llama/ft \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7woNXLDF0bhh"
},
"source": [
"## 🪄 Automatic model optimization with Olive\n",
"\n",
"Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:\n",
"\n",
"1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.\n",
"1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc).\n",
"1. Extract the fine-tuned LoRA weights and place them into a separate file.\n",
"\n",
"> ⏳**It takes ~2mins for the automatic optimization to complete**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M-prKBy20U5m"
},
"outputs": [],
"source": [
"!olive auto-opt \\\n",
" --model_name_or_path models/llama/ft/model \\\n",
" --adapter_path models/llama/ft/adapter \\\n",
" --device cpu \\\n",
" --provider CPUExecutionProvider \\\n",
" --use_ort_genai \\\n",
" --output_path models/llama/onnx-ao \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8Uwm432loDfr"
},
"source": [
"## 🧠 Inference\n",
"\n",
"The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: \"Cricket is a wonderful game\") and the app will output a chat completion using:\n",
"\n",
"1. The base model only (no adapter). You should notice that the model gives a verbose response.\n",
"1. The base model **plus adapter**. You should notice that we get one word classification. \n",
"\n",
"In the code, you'll notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.\n",
"\n",
"Whilst the inference code uses the Python API for the ONNX Runtime, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).\n",
"\n",
"To exit the chat interface, enter `exit` or select `Ctrl+c`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "puMdoAxjoDfr"
},
"outputs": [],
"source": [
"import onnxruntime_genai as og\n",
"\n",
"model_path = \"models/llama/onnx-ao/model\"\n",
"\n",
"model = og.Model(f'{model_path}')\n",
"adapters = og.Adapters(model)\n",
"adapters.load(f'{model_path}/adapter_weights.onnx_adapter', \"classifier\")\n",
"tokenizer = og.Tokenizer(model)\n",
"tokenizer_stream = tokenizer.create_stream()\n",
"\n",
"# Keep asking for input prompts in a loop\n",
"while True:\n",
" phrase = input(\"Phrase: \")\n",
" prompt = f\"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
" input_tokens = tokenizer.encode(prompt)\n",
" \n",
" # first run without the adapter\n",
" params = og.GeneratorParams(model)\n",
" params.set_search_options(past_present_share_buffer=False)\n",
" params.input_ids = input_tokens\n",
" generator = og.Generator(model, params)\n",
"\n",
" print()\n",
" print(\"Output from Base Model (notice verbosity): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" \n",
" # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
" del generator\n",
" \n",
" # now run with adapter\n",
" generator = og.Generator(model, params)\n",
" # set the adapter to active for this response\n",
" generator.set_active_adapter(adapters, \"classifier\")\n",
"\n",
" print()\n",
" print(\"Output from Base Model + Adapter (notice single word response): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" del generator"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "A100",
"provenance": [],
"toc_visible": true
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
2 changes: 1 addition & 1 deletion examples/phi2/phi2.py
@@ -156,7 +156,7 @@ def main(raw_args=None):
template_json["systems"]["local_system"]["accelerators"] = [
{"device": "GPU", "execution_providers": ["JsExecutionProvider"]}
]
fl_type = {"type": "OnnxIOFloat16ToFloat32"}
fl_type = {"type": "OnnxIODataTypeConverter"}
template_json["passes"]["fp32_logits"] = fl_type
new_json_file = "phi2_web.json"
with open(new_json_file, "w") as f:
2 changes: 1 addition & 1 deletion examples/phi3/phi3_template.json
@@ -95,7 +95,7 @@
"merge_adapter_weights": { "type": "MergeAdapterWeights" },
"awq": { "type": "AutoAWQQuantizer" },
"builder": { "type": "ModelBuilder", "precision": "<place_holder>" },
"fp32_logits": { "type": "OnnxIOFloat16ToFloat32" },
"fp32_logits": { "type": "OnnxIODataTypeConverter" },
"tune_session_params": {
"type": "OrtSessionParamsTuning",
"data_config": "gqa_transformer_prompt_dummy_data",
2 changes: 1 addition & 1 deletion olive/cli/auto_opt.py
@@ -440,7 +440,7 @@ def _get_passes_config(self, config: Dict[str, Any], olive_config: OlivePackageC
),
("peephole_optimizer", {"type": "OnnxPeepholeOptimizer"}),
# change io types to fp32
("fp16_to_fp32", {"type": "OnnxIOFloat16ToFloat32"}),
("fp16_to_fp32", {"type": "OnnxIODataTypeConverter"}),
# qnn preparation passes
("to_fixed_shape", {"type": "DynamicToFixedShape", "dim_param": None, "dim_value": None}),
("qnn_preprocess", {"type": "QNNPreprocess"}),
10 changes: 5 additions & 5 deletions olive/olive_config.json
@@ -94,11 +94,11 @@
"supported_accelerators": [ "cpu" ],
"supported_precisions": [ "fp16" ]
},
"OnnxIOFloat16ToFloat32": {
"module_path": "olive.passes.onnx.float32_conversion.OnnxIOFloat16ToFloat32",
"supported_providers": [ "CPUExecutionProvider" ],
"supported_accelerators": [ "cpu" ],
"supported_precisions": [ "fp32" ]
"OnnxIODataTypeConverter": {
"module_path": "olive.passes.onnx.io_datatype_converter.OnnxIODataTypeConverter",
"supported_providers": [ "*" ],
"supported_accelerators": [ "*" ],
"supported_precisions": [ "*" ]
},
"OnnxMatMul4Quantizer": {
"module_path": "olive.passes.onnx.quantization.OnnxMatMul4Quantizer",