Fix errors in scripts and notebooks in examples/ and drop sparseml dependence #247
base: kylesayrs/update_readme
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. Powered by ReviewNB.
+ # TODO check with team -- move code back from llmcompressor or drop?
  # freeze params after calibration
- model.apply(freeze_module_quantization)
+ # model.apply(freeze_module_quantization)
This is the part I was hoping for feedback on. freeze_module_quantization is part of llm-compressor now; do we need it here for this to work?
I don't think we need to import it from llmcompressor, but yes we shouldn't need it for compression
Definitely in favor of adding back to llm-compressor
@rahul-tuli I think it's worth showing here? Freezing the module is part of the compressed tensors lifecycle, right?
Absolutely, I just meant compression would work regardless
Freezing should be removed. This example does not use observers and therefore, having a freeze step would be irrelevant.
This example actually isn't calibrating anything. It is just running QDQ. To be useful, you would need to introduce some sort of primitive to update scales/zp.
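For context on the freeze step being discussed, here is a rough sketch of what freezing amounts to in the quantization lifecycle. This is not the actual llm-compressor implementation; the attribute names (quantization_scheme, observer, quantization_status) are assumptions made for illustration.

import torch

def freeze_module_quantization_sketch(module: torch.nn.Module) -> None:
    # Skip modules that were never set up for quantization (assumed attribute name).
    if getattr(module, "quantization_scheme", None) is None:
        return
    # Drop the observer so no further calibration statistics are collected.
    if hasattr(module, "observer"):
        delattr(module, "observer")
    # Mark calibration as finished so later steps (e.g. compression) keep the
    # current scale/zero-point as-is.
    module.quantization_status = "frozen"

# usage: model.apply(freeze_module_quantization_sketch)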
model = SparseAutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
data_args = DataTrainingArguments(
    dataset=dataset_name,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
)
dataset_manager = TextGenerationDataset.load_from_registry(
    data_args.dataset,
    data_args=data_args,
    split=split,
    tokenizer=tokenizer,
)
calib_dataset = dataset_manager.tokenize_and_process(
    dataset_manager.get_raw_dataset()
)
None of this seemed to be used in oneshot; any idea what it was in there for?
This is so outdated, thank you for removing!
# The following example shows how the example in `ex_config_quantization.py`
# can be done within vllm's llm-compressor project
# Be sure to `pip install llmcompressor` before running
# See https://github.com/vllm-project/llm-compressor for more information
I figure since this example is explicitly to show sparseml/llmcompressor it's worth leaving the references to llmcompressor, or we can just delete it? what do people think?
Yep definitely! With the other fixed examples, a user should have enough of a sense of how the primitives here work outside of the context of llm-compressor.
# freeze params after calibration
model.apply(freeze_module_quantization)
# apply compression
# TODO this line fails because "fakequant" format is not found in registry
This script fails because the "fakequant" format specified in example_quant_config.json is not in the registry. It looks like some tests use fakequant, but it's not part of source. Is there a default format we should move this to?
I generally use FROZEN, ModelCompressor.compress should update it to COMPRESSED
Same comment as above.
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 0.50%\n" | ||
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n" |
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50.00%\n" | |
"The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50%\n" |
nit: no need for so many sig figs :)
Co-authored-by: Kyle Sayers <[email protected]>
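For reference, a minimal sketch of where the printed number comes from: sparsity here is the fraction of exactly-zero entries in the weight tensor, reported as a percentage. The toy tensor below is illustrative, not the actual layer weights.

import torch

def layer_sparsity(weight: torch.Tensor) -> float:
    # fraction of exactly-zero entries, expressed as a percentage
    return (weight == 0).float().mean().item() * 100

weight = torch.tensor([[0.0, 1.5], [2.0, 0.0]])  # toy weight with 50% zeros
print(f"The example layer has sparsity {layer_sparsity(weight):.0f}%")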
@@ -49,39 +50,40 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")
nice!
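As a rough sketch of how the calibration split created above might be consumed: with the quantization config applied, each forward pass runs quantize-dequantize over the sample. The tokenizer and model variables come from the script, and the "text" column name is an assumption; adjust it to the dataset actually used.

import torch

with torch.no_grad():
    for sample in dataset:
        inputs = tokenizer(sample["text"], return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        model(**inputs)  # QDQ forward pass over one calibration sample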
- recipe = "example_quant_recipe.yaml"
+ recipe = str(Path(__file__).parent / "example_quant_recipe.yaml")
nice!
@@ -48,40 +50,38 @@
apply_quantization_config(model, config)

# create dataset
dataset = load_dataset(dataset_name, split="train[:128]")
Suggested change:
- dataset = load_dataset(dataset_name, split="train[:128]")
+ dataset = load_dataset(dataset_name, split=f"train[:{num_calibration_samples}]")
Should it be this?
This is really nice, thank you so much for doing this.
I've added my responses to some of the discussion comments, but I agree with these changes. Lastly, I think we should upload the notebooks with cleared cells for consistency.
The goal of these examples is partly to show how you can run calibration independent of llmcompressor. We would need to introduce a primitive that updates the scales and zero-points on the forward passes (for example, by showing hooks).
Let's touch base offline @brian-dellabetta
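To make that suggestion concrete, here is a minimal sketch of such a primitive: a forward pre-hook that tracks the running min/max of a module's input and refreshes a scale and zero-point from it. The parameter names (input_scale, input_zero_point) and the symmetric int8 math are illustrative assumptions, not the compressed-tensors API.

import torch

def attach_minmax_calibration(module: torch.nn.Module):
    state = {"min": None, "max": None}  # running min/max of the first input

    def _hook(mod, args):
        x = args[0].detach()
        x_min, x_max = x.min(), x.max()
        state["min"] = x_min if state["min"] is None else torch.minimum(state["min"], x_min)
        state["max"] = x_max if state["max"] is None else torch.maximum(state["max"], x_max)
        # symmetric int8 example: derive the scale from the widest value seen so far
        scale = torch.maximum(state["max"].abs(), state["min"].abs()) / 127.0
        mod.input_scale = torch.nn.Parameter(scale, requires_grad=False)
        mod.input_zero_point = torch.nn.Parameter(torch.zeros_like(scale), requires_grad=False)

    return module.register_forward_pre_hook(_hook)

# usage sketch: attach to Linear layers, run calibration batches, then remove the hooks
# handles = [attach_minmax_calibration(m) for m in model.modules() if isinstance(m, torch.nn.Linear)]
# ... forward passes over calibration data ...
# for h in handles: h.remove()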
After running some of the examples and discussing with Kyle, this PR proposes some alterations to the examples provided in compressed_tensors. We removed references to llm-compressor, sticking to transformers/torch dependencies only, and dropped the missing reference to compressed_tensors.quantization.freeze_module_quantization. One example was strictly tied to how sparseml could be used, so it was refactored to llmcompressor with a comment.

Other maintenance to do after merging this into Kyle's PR