Separate reference model preprocessing #235

Merged: 62 commits merged into main from reference_model_preprocessing on May 2, 2025

Conversation

jlamypoirier (Collaborator)

✨ Description

Run separate preprocessing for reference models so they can use a different preprocessing scheme and different preprocessing parameters (e.g. rotary).
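
For illustration, here is a minimal sketch of the idea in plain PyTorch. It is not the Fast-LLM API: the names `RotaryConfig` and `RotaryPreprocessor` and the batch/kwargs layout are hypothetical. The point is simply that the reference (teacher) model runs its own preprocessing pass with its own parameters instead of reusing the trained model's.

```python
# Hypothetical sketch, not Fast-LLM code: each model, including reference
# (teacher) models, gets its own preprocessor, so e.g. rotary parameters
# can differ between the trained model and the reference model.
from dataclasses import dataclass

import torch


@dataclass
class RotaryConfig:
    theta: float = 10_000.0
    scale_factor: float = 1.0


class RotaryPreprocessor:
    def __init__(self, config: RotaryConfig, head_dim: int):
        self.config = config
        self.head_dim = head_dim

    def preprocess(self, batch: dict) -> dict:
        # Rotary frequencies depend on this model's own theta / scale_factor.
        seq_len = batch["tokens"].shape[1]
        positions = torch.arange(seq_len, dtype=torch.float32)
        inv_freq = 1.0 / (
            self.config.theta
            ** (torch.arange(0, self.head_dim, 2, dtype=torch.float32) / self.head_dim)
        )
        angles = positions[:, None] * inv_freq[None, :] / self.config.scale_factor
        return {**batch, "rotary_cos": angles.cos(), "rotary_sin": angles.sin()}


# The trained (student) model and the reference (teacher) model each run
# their own preprocessing, with independent parameters.
batch = {"tokens": torch.randint(0, 32_000, (2, 16))}
student_kwargs = RotaryPreprocessor(RotaryConfig(), head_dim=128).preprocess(batch)
teacher_kwargs = RotaryPreprocessor(
    RotaryConfig(theta=1_000_000.0, scale_factor=32.0), head_dim=128
).preprocess(batch)
```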

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier marked this pull request as ready for review April 18, 2025 16:47
jlamypoirier requested a review from tscholak April 18, 2025 16:48
jlamypoirier (Collaborator, Author)

#229 still works without this, so we can merge in any order.

Base automatically changed from distillation to main April 21, 2025 15:13
tscholak (Collaborator) commented Apr 23, 2025

@nitsanluke can you please test distillation from this branch? Thanks!
Use ghcr.io/servicenow/fast-llm:sha-9b750c2, which matches this commit: 7133e4d.

jlamypoirier mentioned this pull request Apr 29, 2025
nitsanluke (Contributor) commented Apr 29, 2025

Distillation run with a 5B teacher and a 3B student model (4096f1f0-8d01-4bba-b8fc-6bcfc261a63f) works with 8 nodes.
Note: we only get the distillation loss reported (on training and validation data). It would be better to at least have an option to compute the LM loss on the validation (and training) datasets.

jlamypoirier (Collaborator, Author)

2025-04-29 11:16:16,606 [Rank 00] PhaseType.training @ iteration   3210/ 20000 | consumed samples:    9,861,120 | consumed tokens:   40,391,147,520 | batch size: 3072 | step time: 59378.29 ms | throughput: 63.16 tflop/s (model) | 63.91 tflop/s (hardware) | 3311.11 tokens/s/gpu | Memory allocated 9,380.08 MiB | max allocated 71,532.97 MiB | reserved 76,350.00 MiB | max reserved 76,350.00 MiB | global max reserved 76,350.00 MiB | learning rate: 2.625e-04 | loss scale:     1 | grad norm: 0.2160 | skipped iterations:   0 | nan iterations:   0 | average step time 18531.78 ms | remaining 3 days, 14:25:49  | completion 2025-05-03 01:42:05 (16.05 %) | language model loss 0: 1.69111 | run: 6

The state memory is low enough, as expected, but the activation memory is suspiciously high. This is for SSM models, though, for which we haven't really measured activation memory, so I have no way to tell whether there is a problem with distillation or not.

Right now distillation uses the distillation loss exclusively; additional work will be needed if we want both.
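
For reference, below is a minimal sketch in plain PyTorch of one way both losses could be computed and reported together. This is not what the PR implements, and the function name and fixed weighting are assumptions.

```python
# Hypothetical sketch: combine a distillation loss with the standard LM loss
# and return both, so the LM loss can be logged on validation (and training)
# even if only the combined loss is optimized.
import torch
import torch.nn.functional as F


def combined_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab)
    teacher_logits: torch.Tensor,  # (batch, seq, vocab)
    labels: torch.Tensor,          # (batch, seq)
    distillation_weight: float = 0.5,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Standard language-model cross-entropy against the labels.
    lm_loss = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # Distillation loss: KL divergence from the teacher's distribution
    # to the student's distribution, averaged over all positions.
    distillation_loss = F.kl_div(
        F.log_softmax(student_logits.flatten(0, 1), dim=-1),
        F.log_softmax(teacher_logits.flatten(0, 1), dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    total = (
        distillation_weight * distillation_loss
        + (1.0 - distillation_weight) * lm_loss
    )
    return total, lm_loss, distillation_loss
```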

jlamypoirier (Collaborator, Author)

Can we try merging this PR?

nitsanluke (Contributor)

> Can we try merging this PR?

I was able to test most of it (also with Mistral 15B). Happy to merge this and fix issues as they appear.

The ssm part of base_model is showing up in the model config (even when not defined). This should not have an impact on training, but I'm not sure whether memory is allocated for these params:

```yaml
base_model:
  transformer:
    normalization:
      type: rms_norm
      epsilon: 1.0e-05
    peft:
      type: none
    num_layers: 18
    hidden_size: 4096
    num_attention_heads: 24
    head_groups: 8
    add_linear_biases: false
    ffn_hidden_size: 8192
    kv_channels: 128
    rotary:
      type: yarn
      theta: 1000000.0
      scale_factor: 32.0
      low_frequency_factor: 1.0
      high_frequency_factor: 4.0
      original_context_length: 4096
      attention_factor: null
      beta_fast: 32.0
      beta_slow: 1.0
    gated: true
    activation_type: silu
    use_flash_attention: true
  ssm:
    normalization:
      type: layer_norm
    expansion_factor: 2
    state_size: 16
    conv_kernel_dimension: 4
    add_bias_linear: false
    dt_rank: -1
    chunk_size: 256
    n_qk_heads: 32
    n_v_heads: 32
    activation_type: silu
    dt_min: 0.001
    dt_max: 0.1
    dt_init_floor: 0.0001
```

jlamypoirier (Collaborator, Author)

@nitsanluke It's a problem from #194: the ssm config ended up being added to the base LM config, so it is there for the GPT model even though it doesn't use it. It's annoying but harmless; I fixed it in #252 by moving it to the hybrid SSM config.

tscholak (Collaborator) left a comment

LGTM, thanks!
I think we've got enough evidence now that this code works reliably, let's merge!

jlamypoirier merged commit 286f9d3 into main May 2, 2025 (2 checks passed)
jlamypoirier deleted the reference_model_preprocessing branch May 2, 2025 17:42