Separate reference model preprocessing #235
Conversation
#229 still works without this, so we can merge in any order. |
@nitsanluke can you please test distillation from this branch? thanks! |
Distillation run with a 5B teacher and 3B student model |
The state memory is low enough, as expected, but the activation memory is suspiciously high. It's for SSM models, though, for which we didn't really measure activation memory, so I don't have a way to tell whether there is a problem with distillation or not. Right now distillation uses a distillation loss exclusively; additional work will be needed if we want to have both. |
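For context, a minimal PyTorch-style sketch of what "having both" could look like: a weighted mix of a distillation (KL) term against the reference model's logits and the usual cross-entropy term. The function name, weighting scheme, and temperature handling here are illustrative assumptions, not Fast-LLM's actual implementation.

```python
# Hypothetical sketch (not the actual Fast-LLM code): combine a distillation
# loss against the teacher's logits with the standard language-modeling loss.
import torch
import torch.nn.functional as F


def combined_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab)
    teacher_logits: torch.Tensor,  # (batch, seq, vocab), from the reference model
    labels: torch.Tensor,          # (batch, seq)
    distillation_weight: float = 1.0,  # 1.0 -> distillation loss only (current behavior)
    temperature: float = 1.0,
) -> torch.Tensor:
    # Forward KL between teacher and student token distributions.
    teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    distillation = F.kl_div(
        student_log_probs, teacher_log_probs, log_target=True, reduction="batchmean"
    ) * temperature**2

    # Standard cross-entropy on the ground-truth tokens.
    lm = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    return distillation_weight * distillation + (1.0 - distillation_weight) * lm
```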
Can we try merging this PR? |
I was able to test most of it (also with Mistral 15B). Happy to merge this and fix issues as they appear. The
|
@nitsanluke It's a problem from #194: the ssm config ended up being added in the base LM config, so it is there for the GPT model even though that model doesn't use it. It's annoying but harmless; I fixed it in #252 by moving to a hybrid ssm config. |
LGTM, thanks!
I think we've got enough evidence now that this code works reliably, let's merge!
✨ Description
Run separate preprocessing for reference models so they can have a different preprocessing scheme and different preprocessing parameters (e.g., rotary).
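A hypothetical illustration of the idea (not Fast-LLM's actual config API): each reference model carries its own preprocessing configuration, so, for example, the teacher can use different rotary parameters than the student instead of sharing one preprocessing pipeline. All class and field names below are assumptions made for the sketch.

```python
# Illustrative only: per-model preprocessing configuration, so reference
# (teacher) models are preprocessed independently from the trained model.
from dataclasses import dataclass, field


@dataclass
class RotaryConfig:
    theta: float = 10000.0


@dataclass
class PreprocessingConfig:
    rotary: RotaryConfig = field(default_factory=RotaryConfig)


@dataclass
class ModelConfig:
    name: str
    preprocessing: PreprocessingConfig = field(default_factory=PreprocessingConfig)


# The student and each reference model get their own preprocessing settings,
# built and applied separately rather than reusing the student's pipeline.
student = ModelConfig(name="student_3b")
teacher = ModelConfig(
    name="teacher_5b",
    preprocessing=PreprocessingConfig(rotary=RotaryConfig(theta=1_000_000.0)),
)

for model in (student, teacher):
    print(model.name, "rotary theta:", model.preprocessing.rotary.theta)
```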
🔍 Type of change
Select all that apply: