In knowledge distillation, it is more efficient to support logits/logprobs that were pre-computed offline with the teacher model beforehand, rather than loading the teacher and forwarding it inside the kernel.
I'd actually like to see both a logit and a logprob implementation, since it's easy to get logprobs offline from vLLM, and that is a faster way to generate the dataset.
So rather than having to keep the teacher model loaded during training, depending on the workload type, it can be faster and more compute-efficient to pre-compute the logits/logprobs offline beforehand. However, vLLM and SGLang only provide the logprobs, and those are not easily back-calculated to logits.
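For illustration, here is a minimal sketch of pre-computing and saving teacher log-probs before training. It uses Hugging Face `transformers` as a stand-in for whatever offline engine is actually used; the checkpoint name and output path are placeholders, not part of the Liger Kernel API.

```python
# Sketch only: pre-compute teacher log-probs offline so the teacher does not
# need to be loaded during student training. Checkpoint and path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/Qwen2.5-0.5B"  # hypothetical teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()

text = "Knowledge distillation transfers the teacher's distribution to the student."
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = teacher(**batch).logits              # (1, T, V) raw teacher logits
    logprobs = torch.log_softmax(logits, dim=-1)  # log-space, numerically stable

# Persist whichever representation the training job expects.
torch.save({"logits": logits, "logprobs": logprobs}, "teacher_outputs.pt")
```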
Some other thoughts on using `logits` or `logprobs`: we scaled by `temperature` here, as @winglian mentioned here, while @shivam15s pointed out a concern regarding temperature-scaled `logprobs` here.
Besides, @Tcc0403 suggested that log-space is the right way to go here. As far as I understand, I agree with this idea, given `temperature=1`.
Sorry for the misleading question and the late response. Passing logprobs is totally fine; it's actually better, since staying in log-space avoids underflow issues. Torch's `KLDivLoss` also expects inputs in log-space, and the extra cost of computing log-softmax instead of softmax shouldn't be an issue anyway. So if most APIs expect logprobs as input, then I think that's the way to go.
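As a small illustration of the log-space path, here is a sketch of consuming pre-computed teacher logprobs with `torch.nn.KLDivLoss` entirely in log-space. The tensor names and shapes are made up for the example and are not Liger Kernel arguments.

```python
# Sketch only: KL divergence with both sides kept in log-space.
import torch
import torch.nn.functional as F

student_logits = torch.randn(4, 32000)                           # (B*T, V) student logits
student_logprobs = F.log_softmax(student_logits, dim=-1)
teacher_logprobs = F.log_softmax(torch.randn(4, 32000), dim=-1)  # loaded from disk in practice

# KLDivLoss expects the input in log-space; log_target=True lets the target
# stay in log-space too, avoiding an exp() and the associated underflow risk.
kldiv = torch.nn.KLDivLoss(reduction="batchmean", log_target=True)
loss = kldiv(student_logprobs, teacher_logprobs)
```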
In my opinion, it's good to support offline pre-computed values (e.g., logits) from the teacher model beforehand. However, I'm unsure how we should support `log_probs`/`probs` as args in `distillation_loss_fn`: since multiple input vectors can yield the same output probabilities due to the normalization step, `softmax` is not invertible in a strict sense, so the original logits cannot be recovered exactly, which makes it hard to apply `temperature` scaling to these post-`softmax` values.
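A quick illustration of the non-invertibility point: logits that differ by a per-row constant produce identical probabilities, so the original logits cannot be recovered from probs (or log-probs) alone.

```python
# Sketch only: softmax is shift-invariant, so it cannot be inverted exactly.
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])
shifted = logits + 3.0  # same distribution, different logits

print(torch.allclose(torch.softmax(logits, dim=-1),
                     torch.softmax(shifted, dim=-1)))  # True
```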
## Summary
Addresses part of the issue raised in #441.

Moving the temperature scaling outside the `distillation_loss_fn` is fine as well: it keeps the `loss_fn` simpler, and the scaling can be handled in the `forward` function beforehand. Thanks to @Tcc0403 for the advice.
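A minimal sketch of that split, with hypothetical function names rather than the exact Liger Kernel signatures: temperature is applied in `forward`, so the `distillation_loss_fn` only receives already-scaled log-probs.

```python
# Sketch only: temperature handled in forward, loss_fn stays a pure log-space KL.
import torch
import torch.nn.functional as F

def distillation_loss_fn(student_logprobs, teacher_logprobs):
    # No temperature handling inside the loss.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    reduction="batchmean", log_target=True)

def forward(student_logits, teacher_logits, temperature=2.0):
    # Temperature is applied to the raw logits here, before the loss_fn.
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * distillation_loss_fn(student_logprobs, teacher_logprobs)
```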
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
---------
Signed-off-by: Austin Liu <[email protected]>