
Commit

Merge branch 'main' of https://github.com/microsoft/Olive into samuel100/readme-update
samuel100 committed Dec 11, 2024
2 parents 52cfb3c + 9191ba6 commit 9ef08c3
Showing 11 changed files with 568 additions and 185 deletions.
15 changes: 11 additions & 4 deletions docs/source/how-to/configure-workflows/pass/convert-onnx.md
@@ -61,19 +61,26 @@ b. More fine-grained control of the conversion conditions is also possible:

See [Float16 Conversion](https://onnxruntime.ai/docs/performance/model-optimizations/float16.html#float16-conversion) for more detailed description of the available configuration parameters.

## Inputs/Outputs Float16 to Float32 Conversion
## Inputs/Outputs DataType Conversion

Certain environments such as Onnxruntime WebGPU prefers Float32 logits. The `OnnxIOFloat16ToFloat32` pass converts the inputs and outputs to use Float32 instead of Float16.
In certain environments, such as Onnxruntime WebGPU, Float32 logits are preferred. The `OnnxIODataTypeConverter` pass enables conversion of model inputs and outputs to a specified data type. This is particularly useful for converting between data types such as Float16 and Float32, or any other supported ONNX data types.

### Example Configuration

a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
The simplest configuration converts all inputs and outputs from Float16 (source_dtype = 10) to Float32 (target_dtype = 1), which is suitable for many models:

```json
{
"type": "OnnxIOFloat16ToFloat32"
"type": "OnnxIODataTypeConverter",
"source_dtype": 10,
"target_dtype": 1
}
```
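
If needed, you can sanity-check the result after the pass runs. The following is a minimal sketch (not part of the pass itself), assuming a hypothetical output path `models/model_fp32.onnx` and that the `onnx` Python package is installed:

```python
import onnx

# Hypothetical path to the model produced by the OnnxIODataTypeConverter pass
model = onnx.load("models/model_fp32.onnx")

# Print the element type of every graph input and output.
# After a source_dtype=10 -> target_dtype=1 conversion these should
# report FLOAT (1) rather than FLOAT16 (10).
for value_info in list(model.graph.input) + list(model.graph.output):
    elem_type = value_info.type.tensor_type.elem_type
    print(value_info.name, onnx.TensorProto.DataType.Name(elem_type))
```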

### Datatype Mapping

The `source_dtype` and `target_dtype` are integers corresponding to ONNX data types. You can find the complete mapping in the ONNX protobuf definition [here](https://github.com/onnx/onnx/blob/96a0ca4374d2198944ff882bd273e64222b59cb9/onnx/onnx.proto3#L503-L551).
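
If you would rather not read the protobuf definition, the same integer codes can be looked up from the `onnx` Python package. This is an illustrative sketch; the enum values are defined by ONNX itself, not by Olive:

```python
import onnx

# TensorProto enum members give the integer codes used by source_dtype / target_dtype
print(onnx.TensorProto.FLOAT16)   # 10
print(onnx.TensorProto.FLOAT)     # 1
print(onnx.TensorProto.BFLOAT16)  # 16

# List every data type name together with its integer code
for name, value in onnx.TensorProto.DataType.items():
    print(value, name)
```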

## Mixed Precision Conversion
Convert the model to mixed precision.

6 changes: 3 additions & 3 deletions docs/source/reference/pass.rst
@@ -43,9 +43,9 @@ OnnxFloatToFloat16

.. _onnx_io_float16_to_float32:

OnnxIOFloat16ToFloat32
----------------------
.. autoconfigclass:: olive.passes.OnnxIOFloat16ToFloat32
OnnxIODataTypeConverter
------------------------
.. autoconfigclass:: olive.passes.OnnxIODataTypeConverter

.. _ort_mixed_precision:

290 changes: 290 additions & 0 deletions examples/getting_started/olive-awq-ft-llama.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tv6vx7wooDfk"
},
"source": [
"# ✨ Quantize & Finetune an SLM with Olive\n",
"\n",
"> ⚠️ **This notebook will quantize an Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU device.**\n",
"\n",
"In this notebook, you will:\n",
"\n",
"1. Quantize Llama-3.2-1B-Instruct model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).\n",
"1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.\n",
"1. Optimize the fine-tuned model for the ONNX Runtime.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🐍 Install Python dependencies\n",
"\n",
"The following cells create a pip requirements file and then install the libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile requirements.txt\n",
"\n",
"olive-ai==0.7.1\n",
"transformers==4.44.2\n",
"autoawq==0.2.6\n",
"optimum==1.23.1\n",
"peft==0.13.2\n",
"accelerate>=0.30.0\n",
"scipy==1.14.1\n",
"onnxruntime-genai==0.5.0\n",
"torchvision==0.18.1\n",
"tabulate==0.9.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZtY3VYxCoDfm"
},
"outputs": [],
"source": [
"%%capture\n",
"\n",
"%pip install -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🤗 Login to Hugging Face\n",
"\n",
"In this notebook you'll be finetuning [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face and therefore you will need to request access to the model. Once you have access to the model, you'll need to log-in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!huggingface-cli login --token USER_ACCESS_TOKEN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🗜️ Quantize the model using AWQ\n",
"First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.\n",
"\n",
"You can choose a different model to quantize from Hugging-Face, just update the `--model_name_or_path` argument.\n",
"> ⏳ **It takes approximately ~6mins to complete the AWQ quantization**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!olive quantize \\\n",
" --model_name_or_path \"meta-llama/Llama-3.2-1B-Instruct\" \\\n",
" --trust_remote_code \\\n",
" --algorithm awq \\\n",
" --output_path models/llama/awq \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nxJCT5wioDfp"
},
"source": [
"## 🏃 Train the model\n",
"\n",
"Fine-tuning language models helps when we desire very specific outputs. In this example, you'll fine-tune the **AWQ quantized model variant** of Llama-3.2-1B-instruct from the previous cell to respond to an English phrase with a single word answer that classifies the phrases into one of surprise/fear/joy/sadness categories. Here is a sample of the data used for fine-tuning:\n",
"\n",
"```jsonl\n",
"{\"phrase\": \"The sudden thunderstorm caught me off guard.\", \"tone\": \"surprise\"}\n",
"{\"phrase\": \"The creaking door at night is quite spooky.\", \"tone\": \"fear\"}\n",
"{\"phrase\": \"Celebrating my birthday with friends is always fun.\", \"tone\": \"joy\"}\n",
"{\"phrase\": \"Saying goodbye to my pet was heart-wrenching.\", \"tone\": \"sadness\"}\n",
"```\n",
"\n",
"Fine-tuning *after* quantization provides an opportunity to recover some of the loss from the quantization process and enhance the model quality. For more details on quantization and finetuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).\n",
"\n",
"In the following `olive finetune` command the `--data_name` argument is a Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.\n",
"\n",
"> ⏳ **It takes ~6mins to complete the Finetuning**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8t36pRF2oDfq"
},
"outputs": [],
"source": [
"!olive finetune \\\n",
" --method lora \\\n",
" --model_name_or_path models/llama/awq \\\n",
" --trust_remote_code \\\n",
" --data_name xxyyzzz/phrase_classification \\\n",
" --text_template \"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n{tone}<|eot_id|>\" \\\n",
" --max_steps 300 \\\n",
" --output_path models/llama/ft \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7woNXLDF0bhh"
},
"source": [
"## 🪄 Automatic model optimization with Olive\n",
"\n",
"Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:\n",
"\n",
"1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.\n",
"1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc).\n",
"1. Extract the fine-tuned LoRA weights and place them into a separate file.\n",
"\n",
"> ⏳**It takes ~2mins for the automatic optimization to complete**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M-prKBy20U5m"
},
"outputs": [],
"source": [
"!olive auto-opt \\\n",
" --model_name_or_path models/llama/ft/model \\\n",
" --adapter_path models/llama/ft/adapter \\\n",
" --device cpu \\\n",
" --provider CPUExecutionProvider \\\n",
" --use_ort_genai \\\n",
" --output_path models/llama/onnx-ao \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8Uwm432loDfr"
},
"source": [
"## 🧠 Inference\n",
"\n",
"The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: \"Cricket is a wonderful game\") and the app will output a chat completion using:\n",
"\n",
"1. The base model only (no adapter). You should notice that the model gives a verbose response.\n",
"1. The base model **plus adapter**. You should notice that we get one word classification. \n",
"\n",
"In the code, you'll notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.\n",
"\n",
"Whilst the inference code uses the Python API for the ONNX Runtime, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).\n",
"\n",
"To exit the chat interface, enter `exit` or select `Ctrl+c`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "puMdoAxjoDfr"
},
"outputs": [],
"source": [
"import onnxruntime_genai as og\n",
"\n",
"model_path = \"models/llama/onnx-ao/model\"\n",
"\n",
"model = og.Model(f'{model_path}')\n",
"adapters = og.Adapters(model)\n",
"adapters.load(f'{model_path}/adapter_weights.onnx_adapter', \"classifier\")\n",
"tokenizer = og.Tokenizer(model)\n",
"tokenizer_stream = tokenizer.create_stream()\n",
"\n",
"# Keep asking for input prompts in a loop\n",
"while True:\n",
" phrase = input(\"Phrase: \")\n",
" prompt = f\"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
" input_tokens = tokenizer.encode(prompt)\n",
" \n",
" # first run without the adapter\n",
" params = og.GeneratorParams(model)\n",
" params.set_search_options(past_present_share_buffer=False)\n",
" params.input_ids = input_tokens\n",
" generator = og.Generator(model, params)\n",
"\n",
" print()\n",
" print(\"Output from Base Model (notice verbosity): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" \n",
" # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
" del generator\n",
" \n",
" # now run with adapter\n",
" generator = og.Generator(model, params)\n",
" # set the adapter to active for this response\n",
" generator.set_active_adapter(adapters, \"classifier\")\n",
"\n",
" print()\n",
" print(\"Output from Base Model + Adapter (notice single word response): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" del generator"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "A100",
"provenance": [],
"toc_visible": true
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
2 changes: 1 addition & 1 deletion examples/phi2/phi2.py
@@ -156,7 +156,7 @@ def main(raw_args=None):
template_json["systems"]["local_system"]["accelerators"] = [
{"device": "GPU", "execution_providers": ["JsExecutionProvider"]}
]
fl_type = {"type": "OnnxIOFloat16ToFloat32"}
fl_type = {"type": "OnnxIODataTypeConverter"}
template_json["passes"]["fp32_logits"] = fl_type
new_json_file = "phi2_web.json"
with open(new_json_file, "w") as f:
2 changes: 1 addition & 1 deletion examples/phi3/phi3_template.json
@@ -95,7 +95,7 @@
"merge_adapter_weights": { "type": "MergeAdapterWeights" },
"awq": { "type": "AutoAWQQuantizer" },
"builder": { "type": "ModelBuilder", "precision": "<place_holder>" },
"fp32_logits": { "type": "OnnxIOFloat16ToFloat32" },
"fp32_logits": { "type": "OnnxIODataTypeConverter" },
"tune_session_params": {
"type": "OrtSessionParamsTuning",
"data_config": "gqa_transformer_prompt_dummy_data",
2 changes: 1 addition & 1 deletion olive/cli/auto_opt.py
@@ -440,7 +440,7 @@ def _get_passes_config(self, config: Dict[str, Any], olive_config: OlivePackageC
),
("peephole_optimizer", {"type": "OnnxPeepholeOptimizer"}),
# change io types to fp32
("fp16_to_fp32", {"type": "OnnxIOFloat16ToFloat32"}),
("fp16_to_fp32", {"type": "OnnxIODataTypeConverter"}),
# qnn preparation passes
("to_fixed_shape", {"type": "DynamicToFixedShape", "dim_param": None, "dim_value": None}),
("qnn_preprocess", {"type": "QNNPreprocess"}),
10 changes: 5 additions & 5 deletions olive/olive_config.json
@@ -94,11 +94,11 @@
"supported_accelerators": [ "cpu" ],
"supported_precisions": [ "fp16" ]
},
"OnnxIOFloat16ToFloat32": {
"module_path": "olive.passes.onnx.float32_conversion.OnnxIOFloat16ToFloat32",
"supported_providers": [ "CPUExecutionProvider" ],
"supported_accelerators": [ "cpu" ],
"supported_precisions": [ "fp32" ]
"OnnxIODataTypeConverter": {
"module_path": "olive.passes.onnx.io_datatype_converter.OnnxIODataTypeConverter",
"supported_providers": [ "*" ],
"supported_accelerators": [ "*" ],
"supported_precisions": [ "*" ]
},
"OnnxMatMul4Quantizer": {
"module_path": "olive.passes.onnx.quantization.OnnxMatMul4Quantizer",