Fix errors in scripts and notebooks in examples/ and drop sparseml dependence #247
base: kylesayrs/update_readme
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. Powered by ReviewNB.
+ # TODO check with team -- move code back from llmcompressor or drop?
  # freeze params after calibration
- model.apply(freeze_module_quantization)
+ # model.apply(freeze_module_quantization)
This is the part I was hoping for feedback on. freeze_module_quantization is part of llm-compressor now; do we need it here for this to work?
I don't think we need to import it from llmcompressor, but yes we shouldn't need it for compression
Definitely in favor of adding back to llm-compressor
@rahul-tuli I think it's worth showing here? Freezing the module is part of the compressed tensors lifecycle, right?
Absolutely, I just meant compression would work regardless
Freezing should be removed. This example does not use observers and therefore, having a freeze step would be irrelevant.
This example actually isn't calibrating anything. It is just running QDQ. To be useful, you would need to introduce some sort of primitive to update scales/zp.
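For context on the freeze step being discussed, here is a rough sketch of what freezing amounts to in the quantization lifecycle. This is not the actual llm-compressor implementation; the attribute names (quantization_scheme, observer, quantization_status) are assumptions made for illustration.

import torch

def freeze_module_quantization_sketch(module: torch.nn.Module) -> None:
    # Skip modules that were never set up for quantization (assumed attribute name).
    if getattr(module, "quantization_scheme", None) is None:
        return
    # Drop the observer so no further calibration statistics are collected.
    if hasattr(module, "observer"):
        delattr(module, "observer")
    # Mark calibration as finished so later steps (e.g. compression) keep the
    # current scale/zero-point as-is.
    module.quantization_status = "frozen"

# usage: model.apply(freeze_module_quantization_sketch)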
model = SparseAutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
data_args = DataTrainingArguments(
    dataset=dataset_name,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
)
dataset_manager = TextGenerationDataset.load_from_registry(
    data_args.dataset,
    data_args=data_args,
    split=split,
    tokenizer=tokenizer,
)
calib_dataset = dataset_manager.tokenize_and_process(
    dataset_manager.get_raw_dataset()
)
None of this seemed to be used in oneshot; any idea what it was in there for?
This is so outdated, thank you for removing!
# The following example shows how the example in `ex_config_quantization.py`
# can be done within vllm's llm-compressor project
# Be sure to `pip install llmcompressor` before running
# See https://github.com/vllm-project/llm-compressor for more information
I figure since this example is explicitly to show sparseml/llmcompressor it's worth leaving the references to llmcompressor, or we can just delete it? what do people think?
Yep definitely! With the other fixed examples, a user should have enough of a sense of how the primitives here work outside of the context of llm-compressor.
# freeze params after calibration
model.apply(freeze_module_quantization)
# apply compression
# TODO this line fails because "fakequant" format is not found in registry
This script fails because the "fakequant" format specified in example_quant_config.json is not in the registry. It looks like some tests use fakequant, but it's not part of source. Is there a default format we should move this to?
I generally use FROZEN, ModelCompressor.compress should update it to COMPRESSED
Same comment as above.
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 0.50%\n" | ||
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n" |
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n" | |
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50%\n" |
nit: no need for so many sig figs :)
Co-authored-by: Kyle Sayers <[email protected]>
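For reference, a minimal sketch of where the printed number comes from: sparsity here is the fraction of exactly-zero entries in the weight tensor, reported as a percentage. The toy tensor below is illustrative, not the actual layer weights.

import torch

def layer_sparsity(weight: torch.Tensor) -> float:
    # fraction of exactly-zero entries, expressed as a percentage
    return (weight == 0).float().mean().item() * 100

weight = torch.tensor([[0.0, 1.5], [2.0, 0.0]])  # toy weight with 50% zeros
print(f"The example layer has sparsity {layer_sparsity(weight):.0f}%")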
@@ -49,39 +50,40 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")
nice!
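As a rough sketch of how the calibration split created above might be consumed: with the quantization config applied, each forward pass runs quantize-dequantize over the sample. The tokenizer and model variables come from the script, and the "text" column name is an assumption; adjust it to the dataset actually used.

import torch

with torch.no_grad():
    for sample in dataset:
        inputs = tokenizer(sample["text"], return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        model(**inputs)  # QDQ forward pass over one calibration sample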
- recipe = "example_quant_recipe.yaml"
+ recipe = str(Path(__file__).parent / "example_quant_recipe.yaml")
nice!
@@ -48,40 +50,38 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split="train[:128]")
Suggested change:
- dataset = load_dataset(dataset_name, split="train[:128]")
+ dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")
Should it be this?
This is really nice, thank you so much for doing this.
I've added my responses to some of the discussion comments, but I agree with these changes. Lastly, I think we should upload the notebooks with cleared cells for consistency.
The goal of these examples is partly to show how you can run calibration independent of llmcompressor. We would need to introduce a primitive that updates the scales and zero-points on the forward passes (for example, by showing hooks).
Let's touch base offline @brian-dellabetta
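To make that suggestion concrete, here is a minimal sketch of such a primitive: a forward pre-hook that tracks the running min/max of a module's input and refreshes a scale and zero-point from it. The parameter names (input_scale, input_zero_point) and the symmetric int8 math are illustrative assumptions, not the compressed-tensors API.

import torch

def attach_minmax_calibration(module: torch.nn.Module):
    state = {"min": None, "max": None}  # running min/max of the first input

    def _hook(mod, args):
        x = args[0].detach()
        x_min, x_max = x.min(), x.max()
        state["min"] = x_min if state["min"] is None else torch.minimum(state["min"], x_min)
        state["max"] = x_max if state["max"] is None else torch.maximum(state["max"], x_max)
        # symmetric int8 example: derive the scale from the widest value seen so far
        scale = torch.maximum(state["max"].abs(), state["min"].abs()) / 127.0
        mod.input_scale = torch.nn.Parameter(scale, requires_grad=False)
        mod.input_zero_point = torch.nn.Parameter(torch.zeros_like(scale), requires_grad=False)

    return module.register_forward_pre_hook(_hook)

# usage sketch: attach to Linear layers, run calibration batches, then remove the hooks
# handles = [attach_minmax_calibration(m) for m in model.modules() if isinstance(m, torch.nn.Linear)]
# ... forward passes over calibration data ...
# for h in handles: h.remove()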
After running some of the examples and discussing with Kyle, this PR proposes some alterations to the examples provided in compressed_tensors. We removed references to llm-compressor, sticking to transformers/torch dependencies only, and dropped the missing reference to compressed_tensors.quantization.freeze_module_quantization. One example was strictly tied to how sparseml could be used, so it was refactored to llmcompressor with a comment.

Other maintenance to do after merging this into Kyle's PR