Separate reference model preprocessing #235

Merged: 62 commits merged into main from reference_model_preprocessing on May 2, 2025

Conversation

jlamypoirier (Collaborator)

✨ Description

Run separate preprocessing for reference models so they can use a different preprocessing scheme and different preprocessing parameters (e.g. rotary).
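
For illustration, here is a minimal sketch of the idea in plain PyTorch. It is not the Fast-LLM API: the names `RotaryConfig` and `RotaryPreprocessor` and the batch/kwargs layout are hypothetical. The point is simply that the reference (teacher) model runs its own preprocessing pass with its own parameters instead of reusing the trained model's.

```python
# Hypothetical sketch, not Fast-LLM code: each model, including reference
# (teacher) models, gets its own preprocessor, so e.g. rotary parameters
# can differ between the trained model and the reference model.
from dataclasses import dataclass

import torch


@dataclass
class RotaryConfig:
    theta: float = 10_000.0
    scale_factor: float = 1.0


class RotaryPreprocessor:
    def __init__(self, config: RotaryConfig, head_dim: int):
        self.config = config
        self.head_dim = head_dim

    def preprocess(self, batch: dict) -> dict:
        # Rotary frequencies depend on this model's own theta / scale_factor.
        seq_len = batch["tokens"].shape[1]
        positions = torch.arange(seq_len, dtype=torch.float32)
        inv_freq = 1.0 / (
            self.config.theta
            ** (torch.arange(0, self.head_dim, 2, dtype=torch.float32) / self.head_dim)
        )
        angles = positions[:, None] * inv_freq[None, :] / self.config.scale_factor
        return {**batch, "rotary_cos": angles.cos(), "rotary_sin": angles.sin()}


# The trained (student) model and the reference (teacher) model each run
# their own preprocessing, with independent parameters.
batch = {"tokens": torch.randint(0, 32_000, (2, 16))}
student_kwargs = RotaryPreprocessor(RotaryConfig(), head_dim=128).preprocess(batch)
teacher_kwargs = RotaryPreprocessor(
    RotaryConfig(theta=1_000_000.0, scale_factor=32.0), head_dim=128
).preprocess(batch)
```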

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier marked this pull request as ready for review April 18, 2025 16:47
jlamypoirier requested a review from tscholak April 18, 2025 16:48
jlamypoirier (Collaborator, Author)

#229 still works without this, so we can merge in any order.

Base automatically changed from distillation to main April 21, 2025 15:13
tscholak (Collaborator) commented Apr 23, 2025

@nitsanluke can you please test distillation from this branch? Thanks!
Use ghcr.io/servicenow/fast-llm:sha-9b750c2, which matches this commit: 7133e4d.

jlamypoirier mentioned this pull request Apr 29, 2025
nitsanluke (Contributor) commented Apr 29, 2025

Distillation run with a 5B teacher and a 3B student model (4096f1f0-8d01-4bba-b8fc-6bcfc261a63f) works with 8 nodes.
Note: we only get the distillation loss reported (on training and validation data). It would be better to at least have an option to compute the LM loss on the validation (and training) datasets.

jlamypoirier (Collaborator, Author)

2025-04-29 11:16:16,606 [Rank 00] PhaseType.training @ iteration   3210/ 20000 | consumed samples:    9,861,120 | consumed tokens:   40,391,147,520 | batch size: 3072 | step time: 59378.29 ms | throughput: 63.16 tflop/s (model) | 63.91 tflop/s (hardware) | 3311.11 tokens/s/gpu | Memory allocated 9,380.08 MiB | max allocated 71,532.97 MiB | reserved 76,350.00 MiB | max reserved 76,350.00 MiB | global max reserved 76,350.00 MiB | learning rate: 2.625e-04 | loss scale:     1 | grad norm: 0.2160 | skipped iterations:   0 | nan iterations:   0 | average step time 18531.78 ms | remaining 3 days, 14:25:49  | completion 2025-05-03 01:42:05 (16.05 %) | language model loss 0: 1.69111 | run: 6

The state memory is low enough, as expected, but the activation memory is suspiciously high. This is for SSM models, though, for which we haven't really measured activation memory, so I have no way to tell whether there is a problem with distillation or not.

Right now distillation uses the distillation loss exclusively; additional work will be needed if we want both.
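
For reference, below is a minimal sketch in plain PyTorch of one way both losses could be computed and reported together. This is not what the PR implements, and the function name and fixed weighting are assumptions.

```python
# Hypothetical sketch: combine a distillation loss with the standard LM loss
# and return both, so the LM loss can be logged on validation (and training)
# even if only the combined loss is optimized.
import torch
import torch.nn.functional as F


def combined_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab)
    teacher_logits: torch.Tensor,  # (batch, seq, vocab)
    labels: torch.Tensor,          # (batch, seq)
    distillation_weight: float = 0.5,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Standard language-model cross-entropy against the labels.
    lm_loss = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # Distillation loss: KL divergence from the teacher's distribution
    # to the student's distribution, averaged over all positions.
    distillation_loss = F.kl_div(
        F.log_softmax(student_logits.flatten(0, 1), dim=-1),
        F.log_softmax(teacher_logits.flatten(0, 1), dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    total = (
        distillation_weight * distillation_loss
        + (1.0 - distillation_weight) * lm_loss
    )
    return total, lm_loss, distillation_loss
```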

jlamypoirier (Collaborator, Author)

Can we try merging this PR?

nitsanluke (Contributor)

> Can we try merging this PR?

I was able to test most of it (also with Mistral 15B). Happy to merge this and fix issues as they appear.

The ssm part of base_model is showing up in the model config (even when not defined). This should not have an impact on training, but I'm not sure whether memory is allocated for these params:

```yaml
base_model:
  transformer:
    normalization:
      type: rms_norm
      epsilon: 1.0e-05
    peft:
      type: none
    num_layers: 18
    hidden_size: 4096
    num_attention_heads: 24
    head_groups: 8
    add_linear_biases: false
    ffn_hidden_size: 8192
    kv_channels: 128
    rotary:
      type: yarn
      theta: 1000000.0
      scale_factor: 32.0
      low_frequency_factor: 1.0
      high_frequency_factor: 4.0
      original_context_length: 4096
      attention_factor: null
      beta_fast: 32.0
      beta_slow: 1.0
    gated: true
    activation_type: silu
    use_flash_attention: true
  ssm:
    normalization:
      type: layer_norm
    expansion_factor: 2
    state_size: 16
    conv_kernel_dimension: 4
    add_bias_linear: false
    dt_rank: -1
    chunk_size: 256
    n_qk_heads: 32
    n_v_heads: 32
    activation_type: silu
    dt_min: 0.001
    dt_max: 0.1
    dt_init_floor: 0.0001
```

jlamypoirier (Collaborator, Author)

@nitsanluke It's a problem from #194: the ssm config ended up being added to the base LM config, so it is there for the GPT model even though it doesn't use it. It's annoying but harmless; I fixed it in #252 by moving it to the hybrid SSM config.

tscholak (Collaborator) left a comment

LGTM, thanks!
I think we've got enough evidence now that this code works reliably, let's merge!

jlamypoirier merged commit 286f9d3 into main May 2, 2025 (2 checks passed)
jlamypoirier deleted the reference_model_preprocessing branch May 2, 2025 17:42