
Fix errors in scripts and notebooks in examples/ and drop sparseml dependence #247

Open · wants to merge 5 commits into base: kylesayrs/update_readme

Conversation

@brian-dellabetta commented Jan 28, 2025

After running some of the examples and discussing with Kyle, we are proposing some alterations to the examples provided in compressed_tensors. We removed references to llm-compressor, sticking to transformers/torch dependencies only, and dropped the reference to the now-missing compressed_tensors.quantization.freeze_module_quantization. One example was strictly tied to how sparseml could be used, so it was refactored to llmcompressor with a comment.

Other maintenance to do after merging this into Kyle's PR:

  • Add CI/CD to test example/notebook files as PR checks (see the sketch below)
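
A rough sketch of what such a check could look like, assuming a pytest-based setup; the file name, paths, and skip logic here are illustrative, not existing repo code, and notebooks could get similar coverage through a notebook-executing pytest plugin:

# examples/test_examples.py -- hypothetical PR check that each example script
# still runs end to end
import subprocess
import sys
from pathlib import Path

import pytest

EXAMPLES_DIR = Path(__file__).parent
EXAMPLE_SCRIPTS = sorted(EXAMPLES_DIR.glob("*.py"))


@pytest.mark.parametrize("script", EXAMPLE_SCRIPTS, ids=lambda p: p.name)
def test_example_script_runs(script):
    if script.name.startswith("test_"):
        pytest.skip("not an example script")
    # run the example in a subprocess so a crash in one script
    # cannot take down the test session
    result = subprocess.run(
        [sys.executable, str(script)], capture_output=True, text=True
    )
    assert result.returncode == 0, result.stderr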


Comment on lines +82 to +84
# TODO check with team -- move code back from llmcompressor or drop?
# freeze params after calibration
model.apply(freeze_module_quantization)
# model.apply(freeze_module_quantization)
Author:

This is the part I was hoping for feedback on: freeze_module_quantization is part of llm-compressor now; do we need it here for this to work?

Member:

I don't think we need to import it from llmcompressor, but yes, we shouldn't need it for compression.

Contributor:

Definitely in favor of adding it back to llm-compressor.

Contributor:

@rahul-tuli I think it's worth showing here? Freezing the module is part of the compressed tensors lifecycle, right?

Member:

Absolutely, I just meant compression would work regardless.
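
For context, a minimal sketch of what a freeze step amounts to, assuming the compressed-tensors QuantizationStatus lifecycle (INITIALIZED, CALIBRATION, FROZEN, COMPRESSED); the attribute names below are assumptions, not the removed implementation:

import torch
from compressed_tensors.quantization import QuantizationStatus

def freeze_module_quantization(module: torch.nn.Module):
    # skip modules that were never initialized for quantization
    # (quantization_scheme is assumed to be attached during initialization)
    if getattr(module, "quantization_scheme", None) is None:
        return
    # stop updating scales/zero-points; they are now fixed for inference
    module.quantization_status = QuantizationStatus.FROZEN

# model.apply(freeze_module_quantization)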

Contributor:

Freezing should be removed. This example does not use observers, and therefore having a freeze step would be irrelevant.

Contributor:

This example actually isn't calibrating anything. It is just running QDQ. To be useful, you would need to introduce some sort of primitive to update scales/zp.
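
A minimal sketch of such a primitive: a forward pre-hook that re-derives scale/zero-point from the running min/max of activations seen during calibration. The input_scale / input_zero_point parameter names in the usage comment are assumptions about what an initialized module exposes:

import torch

def make_minmax_calibration_hook(scale: torch.Tensor, zero_point: torch.Tensor,
                                 num_bits: int = 8):
    """Forward pre-hook that updates scale/zero-point from the running
    min/max of the activations observed during calibration."""
    q_min, q_max = 0, 2 ** num_bits - 1
    running_min, running_max = None, None

    def hook(module, args):
        nonlocal running_min, running_max
        x = args[0].detach()
        batch_min, batch_max = x.min(), x.max()
        running_min = batch_min if running_min is None else torch.minimum(running_min, batch_min)
        running_max = batch_max if running_max is None else torch.maximum(running_max, batch_max)
        with torch.no_grad():
            # affine mapping of the observed range onto the integer grid
            scale.copy_((running_max - running_min).clamp(min=1e-8) / (q_max - q_min))
            zero_point.copy_(torch.round(q_min - running_min / scale))

    return hook

# hypothetical usage on a module already initialized for activation quantization:
# handle = layer.register_forward_pre_hook(
#     make_minmax_calibration_hook(layer.input_scale, layer.input_zero_point)
# )
# for batch in calib_dataloader:
#     model(**batch)
# handle.remove()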

@brian-dellabetta brian-dellabetta marked this pull request as ready for review January 31, 2025 20:42
Comment on lines -31 to -47
model = SparseAutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
data_args = DataTrainingArguments(
dataset=dataset_name,
max_seq_length=max_seq_length,
pad_to_max_length=pad_to_max_length,
)
dataset_manager = TextGenerationDataset.load_from_registry(
data_args.dataset,
data_args=data_args,
split=split,
tokenizer=tokenizer,
)
calib_dataset = dataset_manager.tokenize_and_process(
dataset_manager.get_raw_dataset()
)
Author:

None of this seemed to be used in oneshot; any idea what it was in there for?

Member:

This is so outdated, thank you for removing!

Comment on lines +17 to +20
# The following example shows how the example in `ex_config_quantization.py`
# can be done within vllm's llm-compressor project
# Be sure to `pip install llmcompressor` before running
# See https://github.com/vllm-project/llm-compressor for more information
Author:

I figure since this example is explicitly meant to show sparseml/llmcompressor, it's worth leaving the references to llmcompressor. Or we can just delete it? What do people think?

Contributor:

Yep, definitely! With the other fixed examples, a user should have enough of a sense of how the primitives here work outside of the context of llm-compressor.

# freeze params after calibration
model.apply(freeze_module_quantization)
# apply compression
# TODO this line fails because "fakequant" format is not found in registry
Author:

This script fails because the "fakequant" format specified in example_quant_config.json is not in the registry. It looks like some tests use fakequant, but it's not part of the source. Is there a default format we should move this to?

Member:

I generally use FROZEN; ModelCompressor.compress should update it to COMPRESSED.
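
For anyone hitting the same failure, a hypothetical patch to example_quant_config.json; the field names and the lowercase "frozen" value are assumptions about the config schema rather than verified behavior:

import json

with open("example_quant_config.json") as f:
    config = json.load(f)

# drop the unregistered "fakequant" format and mark the config FROZEN;
# ModelCompressor.compress is then expected to record the real on-disk
# format as COMPRESSED (assumption, per the comment above)
config.pop("format", None)
config["quantization_status"] = "frozen"

with open("example_quant_config.json", "w") as f:
    json.dump(config, f, indent=2)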

Contributor:

Same comment as above.

@brian-dellabetta brian-dellabetta changed the title [WIP] First pass for updating examples Updating scripts and notebooks in examples/ to drop sparseml dependence and fix errors Jan 31, 2025
@brian-dellabetta brian-dellabetta changed the title Updating scripts and notebooks in examples/ to drop sparseml dependence and fix errors Fix errors in scripts and notebooks in examples/ and drop sparseml dependence Jan 31, 2025
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 0.50%\n"
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n"
Contributor:

Suggested change
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n"
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50%\n"

nit: no need for so many sig figs :)

@rahul-tuli (Member) left a comment:

Great changes, thank you for removing all the unnecessary code!

@@ -49,39 +50,40 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")
Member:

nice!


recipe = "example_quant_recipe.yaml"

recipe = str(Path(__file__).parent / "example_quant_recipe.yaml")
Member:

nice!


@@ -48,40 +50,38 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split="train[:128]")
Member:

Suggested change
dataset = load_dataset(dataset_name, split="train[:128]")
dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")

Should it be this?


@kylesayrs (Contributor) left a comment:

This is really nice, thank you so much for doing this.

I've added my responses to some of the discussion comments, but I agree with these changes. Lastly, I think we should upload the notebooks with cleared cells, for consistency.
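
(Something like jupyter nbconvert --clear-output --inplace examples/*.ipynb before committing should do it.)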

@dsikka (Contributor) left a comment:

The goal of these examples is partly to show how you can run calibration independently of llmcompressor. We would need to introduce a primitive to update the scales and zero-points (such as by showing hooks) on the forward passes.

Let's touch base offline @brian-dellabetta

