Commit 332285e

committed
update
1 parent c0397a4 commit 332285e

File tree

7 files changed: +183 -56 lines changed


3.test_cases/torchtune/README.md

Lines changed: 3 additions & 3 deletions
@@ -4,11 +4,11 @@ This guide demonstrates the comprehensive process of developing a Large Language
 
 ![LLMOps](docs/LLMOps.png)
 
-1. **Data Preparation**: The journey begins with the collection and preparation of data for training. This step is crucial as it involves exploring the data's characteristics, performing necessary cleaning, and applying preprocessing techniques to ensure the data is in the right shape for model training.
+1. **(Continuous) Pretraining the Language Model**: The language model first undergoes pretraining on a vast corpus of text data. This step can be bypassed if starting with an already pretrained model. Pretraining is essential for the model to learn the general patterns and structures of language. Refer to the `torchtitan` test case for large-scale pretraining with the latest techniques, such as 3D parallelism and `torch.compile`.
 
-2. **Pretraining the Language Model**: Next, the language model undergoes pretraining on a vast corpus of text data. This step can be bypassed if starting with an already pretrained model. Pretraining is essential for the model to learn the general patterns and structures of language. Refer `torchtitan` test case for the large scale pretraining with the latest techniques such as 3D parallelism and `torch.compile`.
+2. **Instruction Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application.
 
-3. **Fine-Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application.
+3. **Alignment**: The instruction-tuned model is then aligned with human preferences, typically with techniques such as RLHF or DPO, refining its behavior on samples that exemplify the desired responses.
 
 4. **Evaluation**: Evaluating the LLM's performance is a critical step. It involves using various metrics to assess the model's accuracy and effectiveness. This step is vital for validating new techniques and objectively comparing different model releases.

(Binary image file changed: 11.3 KB — preview not included)

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

Lines changed: 31 additions & 8 deletions
@@ -1,10 +1,11 @@
 # End-to-End LLama3-70B model development with Torchtune <!-- omit in toc -->
 
 In this tutorial, you will see how to:
-* Pretrain
-* Finetune
-* Evaluate
-* Deploy
+* Continuous Pretraining
+* Instruction Finetuning
+* Alignment
+* Evaluation
+* Deployment
 
 ## 1. Prerequisites
 Before starting, ensure you have requested access to Meta-Llama-3-70B by visiting [Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) on Hugging Face and following the access request instructions. Additionally, make sure all prerequisites described in the [slurm](..) directory are set up.
@@ -64,16 +65,16 @@ This output confirms that the `torchtune download` command has been executed wit
 By following these steps, you ensure that the necessary model components are in place, setting the stage for subsequent tasks such as pretraining, finetuning, evaluation, and deployment.
 
 
-## 3. Full-parameter finetuning
+## 3. Continuous Pretraining
 
-WIP In this step, you will author Llama3 model using c4 dataset.
+In this step, you will continue pretraining the Llama model. Specifically, this step performs full-parameter training, which updates all the parameters in the original model.
 
 ```bash
 sbatch tutorials/e2e-llama3-70b-development/pretrain.sbatch
 ```
 
 
-## 4. Lora parameter efficient finetuning
+## 4. Instruction-tuning
 
 In this step, you will fine-tune the LLaMA model using Low-Rank Adaptation (LoRA) with the Alpaca dataset. We will first cover the basic concepts and relevant configurations found in the [config file](configs/lora_finetune_distributed.yaml), followed by a detailed fine-tuning tutorial.
 
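As background for the LoRA configuration referenced in the hunk above, here is a minimal, self-contained sketch of the low-rank update LoRA adds to a frozen linear layer. It is illustrative only: the dimensions are made up and this is not torchtune's implementation (the `lora_finetune_distributed` recipe handles all of this internally).

```python
# Illustrative LoRA sketch: a frozen weight W is augmented with a low-rank
# update B @ A scaled by alpha / r, and only A and B receive gradients.
import torch

def lora_linear(x, W, A, B, alpha=16.0, r=8):
    # x: (batch, in_dim), W: (out_dim, in_dim) frozen pretrained weight
    # A: (r, in_dim), B: (out_dim, r) trainable low-rank factors
    base = x @ W.T                    # frozen projection
    update = (x @ A.T) @ B.T          # low-rank path of rank r
    return base + (alpha / r) * update

x = torch.randn(2, 4096)
W = torch.randn(8192, 4096)           # frozen pretrained weight
A = torch.randn(8, 4096) * 0.01       # small random init
B = torch.zeros(8192, 8)              # zero init, so training starts from the base model
print(lora_linear(x, W, A, B).shape)  # torch.Size([2, 8192])
```

Because only `A` and `B` are trained, the number of updated parameters is a small fraction of the full weight matrix, which is what makes LoRA parameter-efficient.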

@@ -111,6 +112,10 @@ dataset:
 
 As the config suggests, we use a predefined dataset class prepared in torchtune.
 
+## 5. Alignment
+
+
+
 ### Submit Finetuning job
 
 You can submit the finetuning job with the following command:
@@ -226,15 +231,33 @@ quantizer:
   groupsize: 256
 ```
 
-`Int4WeightOnlyQuantizer` performs per-axis group quantization, which means it quantizes weights in groups rather than individually. This helps maintain a balance between compression and model accuracy.
+`Int4WeightOnlyQuantizer` performs per-axis group quantization, which means it quantizes weights in groups rather than individually. By adjusting the `groupsize`, one can control the trade-off between compression ratio and accuracy: smaller group sizes typically lead to higher accuracy but lower compression, while larger group sizes achieve higher compression at the potential cost of accuracy (a short illustrative sketch follows this file's diff).
 
 ```bash
 sbatch quantize.sbatch
 ```
 
 
+```bash
+Executing following command:
+torchtune run quantize --config /fsx/ubuntu/awsome-distributed-training/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml tokenizer.path=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B/original/tokenizer.model checkpointer.checkpoint_dir=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-tuned checkpointer.output_dir=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-quantized
+```
+
+The resulting quantized weights are saved as follows:
+
+```bash
+0: 2024-05-31:02:10:46,964 DEBUG [seed.py:60] Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
+0: 2024-05-31:02:18:17,728 INFO [quantize.py:90] Model is initialized with precision torch.bfloat16.
+0: 2024-05-31:02:20:33,576 INFO [quantize.py:98] Time for quantization: 133.08 sec
+0: 2024-05-31:02:20:33,577 INFO [quantize.py:99] Memory used: 40.03 GB
+0: 2024-05-31:02:21:18,609 INFO [quantize.py:112] Model checkpoint of size 37.94 GB saved to /fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-quantized/hf_model_0001_0-4w.pt
+```
+
+
 ## 7. Generation
 
+Now that you have a production-ready quantized model, this last step tests text generation with it.
+
 ```bash
 sbatch 7.generate.sbatch --config configs/generate_llama3.yaml --prompt "Hello, my name is"
 ```
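To make the `groupsize` trade-off discussed above concrete, below is a small, self-contained sketch of symmetric per-group 4-bit weight quantization. It is not torchao's `Int4WeightOnlyQuantizer`; the scheme and shapes are simplified for illustration.

```python
# Group-wise int4 quantization sketch: each group of `groupsize` weights shares
# one scale, so smaller groups store more scales (less compression, lower error).
import torch

def quantize_int4_groupwise(w: torch.Tensor, groupsize: int = 256):
    out_dim, in_dim = w.shape
    assert in_dim % groupsize == 0
    groups = w.reshape(out_dim, in_dim // groupsize, groupsize)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(128, 1024)
for gs in (64, 256, 1024):
    q, scale = quantize_int4_groupwise(w, groupsize=gs)
    err = (dequantize(q, scale) - w).abs().mean().item()
    print(f"groupsize={gs:5d}  scales stored={scale.numel():5d}  mean abs error={err:.4f}")
```

Larger groups share one scale across more weights, which is why reconstruction error tends to grow as `groupsize` increases.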

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml

Lines changed: 32 additions & 31 deletions
@@ -12,42 +12,43 @@ checkpointer:
   _component_: torchtune.utils.FullModelHFCheckpointer
   checkpoint_dir: ${MODEL_PATH}/${HF_MODEL}
   checkpoint_files: [
-    model-00001-of-00030.safetensors,
-    model-00002-of-00030.safetensors,
-    model-00003-of-00030.safetensors,
-    model-00004-of-00030.safetensors,
-    model-00005-of-00030.safetensors,
-    model-00006-of-00030.safetensors,
-    model-00007-of-00030.safetensors,
-    model-00008-of-00030.safetensors,
-    model-00009-of-00030.safetensors,
-    model-00010-of-00030.safetensors,
-    model-00011-of-00030.safetensors,
-    model-00012-of-00030.safetensors,
-    model-00013-of-00030.safetensors,
-    model-00014-of-00030.safetensors,
-    model-00015-of-00030.safetensors,
-    model-00016-of-00030.safetensors,
-    model-00017-of-00030.safetensors,
-    model-00018-of-00030.safetensors,
-    model-00019-of-00030.safetensors,
-    model-00020-of-00030.safetensors,
-    model-00021-of-00030.safetensors,
-    model-00022-of-00030.safetensors,
-    model-00023-of-00030.safetensors,
-    model-00024-of-00030.safetensors,
-    model-00025-of-00030.safetensors,
-    model-00026-of-00030.safetensors,
-    model-00027-of-00030.safetensors,
-    model-00028-of-00030.safetensors,
-    model-00029-of-00030.safetensors,
-    model-00030-of-00030.safetensors,
+    hf_model_0001_0.pt,
+    hf_model_0002_0.pt,
+    hf_model_0003_0.pt,
+    hf_model_0004_0.pt,
+    hf_model_0005_0.pt,
+    hf_model_0006_0.pt,
+    hf_model_0007_0.pt,
+    hf_model_0008_0.pt,
+    hf_model_0009_0.pt,
+    hf_model_0010_0.pt,
+    hf_model_0011_0.pt,
+    hf_model_0012_0.pt,
+    hf_model_0013_0.pt,
+    hf_model_0014_0.pt,
+    hf_model_0015_0.pt,
+    hf_model_0016_0.pt,
+    hf_model_0017_0.pt,
+    hf_model_0018_0.pt,
+    hf_model_0019_0.pt,
+    hf_model_0020_0.pt,
+    hf_model_0021_0.pt,
+    hf_model_0022_0.pt,
+    hf_model_0023_0.pt,
+    hf_model_0024_0.pt,
+    hf_model_0025_0.pt,
+    hf_model_0026_0.pt,
+    hf_model_0027_0.pt,
+    hf_model_0028_0.pt,
+    hf_model_0029_0.pt,
+    hf_model_0030_0.pt,
   ]
   recipe_checkpoint: null
   output_dir: ${MODEL_PATH}/${HF_MODEL}-quantized
   model_type: LLAMA3
 
-device: cuda
+device: cpu
 dtype: bf16
 seed: 1234
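A practical note on the change above: quantization now reads the torchtune-format checkpoints written by the finetuning run (`hf_model_0001_0.pt` through `hf_model_0030_0.pt`) instead of the original Hugging Face `safetensors` shards, and the device is switched from `cuda` to `cpu`. If you prefer not to maintain the 30-entry list by hand, a small helper along these lines (hypothetical, not part of the repository) could generate it from the tuned checkpoint directory:

```python
# Hypothetical helper (not part of this repository) for generating the
# checkpoint_files list in quantize.yaml from a tuned checkpoint directory.
import pathlib

def list_checkpoint_files(ckpt_dir: str) -> list[str]:
    """Return sorted torchtune shard names such as hf_model_0001_0.pt."""
    return sorted(p.name for p in pathlib.Path(ckpt_dir).glob("hf_model_*_0.pt"))

if __name__ == "__main__":
    tuned_dir = "/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-tuned"  # path used in this tutorial
    for name in list_checkpoint_files(tuned_dir):
        print(f"    {name},")
```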

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --job-name=full-finetuning
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-node=8 # Number of GPU per node
#SBATCH --output=logs/%x_%j.out # logfile for stdout
#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs
#SBATCH --wait-all-nodes=1
#SBATCH --exclusive
set -euxo pipefail

##################################################################
########### Check current working directory ######################
##################################################################
if [ $(basename $(pwd)) != "slurm" ]
then
echo "Please run this script from the slurm directory"
exit 1
fi
##################################################################
############# Load environment variables #########################
##################################################################
# Load environment variables
if [ ! -f .env ]
then
echo "Please create a .env file with the required environment variables"
exit 1
else
source .env
fi

##################################################################
######### Define EFA/NCCL/Slurm environment variables ############
##################################################################
## EFA settings
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons
export FI_EFA_USE_HUGE_PAGE=0
# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
# https://github.com/pytorch/pytorch/issues/68893
export NCCL_SOCKET_IFNAME=en
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`
export NODES=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
export NODES_ARRAY=($NODES)
export HEAD_NODE=${NODES_ARRAY[0]}
export MASTER_ADDR=$(hostname --ip-address)
export MASTER_PORT=$RANDOM
export NNODES=$SLURM_JOB_NUM_NODES
export NPROC=$SLURM_GPUS_PER_NODE
export WORLD_SIZE=$(( $NNODES * $NPROC ))

##################################################################
############# Set training arguments #############################
##################################################################
export HF_MODEL="meta-llama/Meta-Llama-3-70B"
: "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"
declare -a SRUN_ARGS=(
--container-image $ENROOT_IMAGE
--container-mounts $CONTAINER_MOUNT
)
declare -a TORCHRUN_ARGS=(
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
# change this to match the number of GPUs per node:
--nproc_per_node=8
--nnodes=$SLURM_JOB_NUM_NODES
--rdzv_backend=c10d
--rdzv_endpoint=$(hostname)
)
declare -a TRAIN_ARGS=(
--config ${PWD}/tutorials/e2e-llama3-70b-development/configs/lora_finetune_distributed.yaml
tokenizer.path=${MODEL_PATH}/${HF_MODEL}/original/tokenizer.model
checkpointer.checkpoint_dir=${MODEL_PATH}/${HF_MODEL}
checkpointer.output_dir=${MODEL_PATH}/${HF_MODEL}-tuned
output_dir=${MODEL_PATH}/${HF_MODEL}-tuned/log
metric_logger.log_dir=${MODEL_PATH}/${HF_MODEL}-tuned/log/metrics
)
##################################################################
################# Run torchtune ##################################
##################################################################
export PYTHONPATH=${PWD}/torchtune
export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
export TORCHTUNE_COMMAND="full_finetune_distributed"
echo "Executing following command:"
echo "torchtune" "run" "${TORCHRUN_ARGS[@]}" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run "${TORCHRUN_ARGS[@]}" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
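For readers unfamiliar with the rendezvous variables exported above (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`), the sketch below shows how a PyTorch worker typically consumes them. torchrun and the torchtune recipe do this internally; it is shown only to make the role of those environment variables concrete.

```python
# Minimal sketch: the "env://" init method reads MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE from the environment prepared by Slurm/torchrun.
import os
import torch.distributed as dist

def init_distributed() -> tuple[int, int]:
    dist.init_process_group(backend="nccl", init_method="env://")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world_size} rendezvous at "
          f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")
    return rank, world_size
```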

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/quantize.sbatch

Lines changed: 22 additions & 14 deletions
@@ -13,6 +13,14 @@
 #SBATCH --exclusive
 set -euxo pipefail
 
+##################################################################
+########### Check current working directory ######################
+##################################################################
+if [ $(basename $(pwd)) != "slurm" ]
+then
+echo "Please run this script from the slurm directory"
+exit 1
+fi
 ##################################################################
 ############# Load environment variables #########################
 ##################################################################
@@ -50,26 +58,26 @@ export NPROC=$SLURM_GPUS_PER_NODE
 export WORLD_SIZE=$(( $NNODES * $NPROC ))
 
 ##################################################################
-############### Create train config ##############################
-##################################################################
-if [ ! -d ${FSX_PATH}/tmp ]; then
-mkdir -p ${FSX_PATH}/tmp
-fi
-cat ${PWD}/train_configs/quantize_llama3.yaml | envsubst > ${FSX_PATH}/tmp/quantize_llama3.yaml
-##################################################################
-################# Set arguments ##################################
+############# Set training arguments #############################
 ##################################################################
+export HF_MODEL="meta-llama/Meta-Llama-3-70B"
 : "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"
 declare -a SRUN_ARGS=(
 --container-image $ENROOT_IMAGE
 --container-mounts $CONTAINER_MOUNT
 )
 declare -a TRAIN_ARGS=(
---config ${FSX_PATH}/tmp/quantize_llama3.yaml
+--config ${PWD}/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
+tokenizer.path=${MODEL_PATH}/${HF_MODEL}/original/tokenizer.model
+checkpointer.checkpoint_dir=${MODEL_PATH}/${HF_MODEL}-tuned
+checkpointer.output_dir=${MODEL_PATH}/${HF_MODEL}-quantized
 )
-
-export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
+##################################################################
+################# Run torchtune ##################################
+##################################################################
 export PYTHONPATH=${PWD}/torchtune
-
-#srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} cp generation /fsx/tmp/generate_llama3.yaml
-srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run quantize "${TRAIN_ARGS[@]}"
+export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
+export TORCHTUNE_COMMAND="quantize"
+echo "Executing following command:"
+echo "torchtune" "run" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
+srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
