3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md
+9-6
@@ -1,12 +1,15 @@
1
1
# End-to-End Llama3-70B model development with Torchtune <!-- omit in toc -->
2
2
3
-
In this tutorial, you will see how to:
3
+
This tutorial guides you through the following LLM development steps using Llama3-70B:
4
+
4
5
* Continuous Pretraining
5
6
* Instruction Finetuning
6
7
* Alignment
7
8
* Evaluation
8
9
* Deployment
9
10
11
+
For details on each step, refer to the [overview documentation](../../README.md).
12
+
10
13
## 1. Prerequisites
11
14
Before starting, ensure you have requested access to Meta-Llama-3-70B by visiting [Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) on Hugging Face and following the access request instructions. Additionally, make sure all prerequisites described in the [slurm](..) directory are set up.
12
15
@@ -22,8 +25,6 @@ Navigate to the [test case path](..) and prepare your environment by sourcing th
22
25
source .env
23
26
```
24
27
25
-
This step is crucial for configuring the necessary paths and credentials for accessing and working with the Llama3-70B model.
26
-
27
28
### Fetching the Model Weights and Tokenizer
28
29
29
30
Execute the `download_hf_model.sh` script with the model identifier as an argument to download the model weights and tokenizer:
@@ -67,13 +68,15 @@ By following these steps, you ensure that the necessary model components are in
67
68
68
69
## 3. Continuous Pretraining
69
70
70
-
In this step, you will fine-tune the Llama model. Specifically, the finetune process in this step is called Full-parameter finetuning, which will update all the parameters in the original model.
71
+
In this step, you will fine-tune the Llama3 model from the original checkpoint. The process used here is called full-parameter fine-tuning, which updates all the parameters of the original model. One of the problems encountered in such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for the model weights in mixed precision training, 8 bytes for the AdamW optimizer states, and 4 bytes for the gradients). For more details on this breakdown, see the [Hugging Face documentation](https://huggingface.co/docs/transformers/model_memory_anatomy). This means that training a 70B-parameter model requires about 1.26 TB of accelerator memory (70 × 10^9 parameters × 18 bytes) before activations, far more than the 80 GB available on a single H100. To tackle this problem, `torchtune` integrates PyTorch Fully Sharded Data Parallel (FSDP), a distributed training feature designed to handle large model training efficiently by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces per-device memory consumption, making it possible to train models that are too large to fit on a single GPU.
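
The memory arithmetic above can be checked with a few lines of Python. This is only a rough sketch: it counts model state (weights, optimizer states, gradients) at 18 bytes per parameter, ignores activation memory, and assumes FSDP shards that state perfectly evenly across GPUs. The function name and the GPU counts are illustrative and do not come from the tutorial's scripts.

```python
# Back-of-the-envelope memory estimate for full-parameter fine-tuning.
# Byte counts follow the per-parameter breakdown quoted in the text;
# activation memory and framework overhead are deliberately ignored.

BYTES_WEIGHTS = 6    # mixed precision: fp32 master copy + 16-bit copy
BYTES_OPTIMIZER = 8  # AdamW keeps two fp32 states per parameter
BYTES_GRADIENTS = 4  # gradients kept in fp32
BYTES_PER_PARAM = BYTES_WEIGHTS + BYTES_OPTIMIZER + BYTES_GRADIENTS


def training_memory_gb(num_params: int, num_gpus: int = 1) -> float:
    """Model-state memory per GPU in GB (10^9 bytes), assuming even sharding."""
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


params_70b = 70 * 10**9
print(f"total model state: {training_memory_gb(params_70b):.0f} GB")

# Hypothetical GPU counts, chosen only for illustration.
for gpus in (8, 16, 32):
    per_gpu = training_memory_gb(params_70b, gpus)
    print(f"{gpus} GPUs -> {per_gpu:.2f} GB per GPU")
```

Under these simplified assumptions, 16 GPUs bring the model state just under 80 GB per device, which leaves almost no room for activations, so a practical run would likely need more devices or additional memory-saving techniques.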
In this step, you will fine-tune the LLaMA model using Low-Rank Adaptation (LoRA) with the Alpaca dataset. We will first cover the basic concepts and relevant configurations found in the [config file](configs/lora_finetune_distributed.yaml), followed by a detailed fine-tuning tutorial.
@@ -82,7 +85,7 @@ In this step, you will fine-tune the LLaMA model using Low-Rank Adaptation (LoRA
82
85
### Basic Concepts and Relevant Configurations
83
86
84
87
**Low-Rank Adaptation (LoRA)** is a Parameter-efficient Fine-tuning (PEFT) technique for fine-tuning large language models efficiently. Instead of updating every weight, LoRA freezes the pretrained weights and trains a pair of small low-rank matrices whose product represents the update to a large weight matrix. It rests on the observation that the weight changes needed to adapt a large model have a low intrinsic rank, so they can be represented with far fewer parameters. This drastically reduces the number of trainable parameters, making fine-tuning faster, less resource-intensive, and more cost-effective.
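
To make the parameter savings concrete, the sketch below compares the trainable parameters of one weight matrix under full fine-tuning and under LoRA. The matrix size and rank are illustrative assumptions, not values read from the tutorial's config file, and the helper names are hypothetical.

```python
# Trainable-parameter count for a single weight matrix.
# LoRA keeps W (d_out x d_in) frozen and trains B (d_out x r) and A (r x d_in),
# so the adapted weight is W + B @ A.


def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole matrix is trained."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B and A."""
    return rank * (d_out + d_in)


# Illustrative sizes only: a square 8192 x 8192 projection with rank 8.
d_out, d_in, rank = 8192, 8192, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

For these example sizes the low-rank factors hold 512 times fewer trainable parameters than the full matrix; the actual saving depends on the rank and on which layers receive adapters.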