
Commit 06487ca

Update README.md
1 parent 0dc4bfd commit 06487ca

File tree

1 file changed: +9 −0 lines changed


template/README.md

Lines changed: 9 additions & 0 deletions
````diff
@@ -65,6 +65,15 @@ To finetune an LLM on remote infrastructure, you can either use a remote orchest
   [-s <STEP_OPERATOR_NAME>]
 ```
 
+## 🗂️ Bring Your Own Data
+
+To fine-tune an LLM using your own datasets, consider adjusting the [`prepare_data` step](steps/prepare_datasets.py) to match your needs:
+- This step loads, tokenizes, and stores the dataset from an external source to the artifact store defined in the ZenML Stack.
+- The dataset can be loaded from Hugging Face by adjusting the `dataset_name` parameter in the configuration file. By default, the step code expects the dataset to have at least three splits: `train`, `validation`, and `test`. If your dataset uses different split naming, you'll need to make the necessary adjustments.
+- If you want to retrieve the dataset from other sources, you'll need to create the relevant code and prepare the splits in a Hugging Face dataset format for further processing.
+- Tokenization occurs in the utility function [`generate_and_tokenize_prompt`](utils/tokenizer.py). It has a default way of formatting the inputs before passing them into the model. If this default logic doesn't fit your use case, you'll also need to adjust this function.
+- The return value is the path to the stored datasets (by default, `train`, `val`, and `test_raw` splits). Note: the test set is not tokenized here and will be tokenized later during evaluation.
+
 ## 📜 Project Structure
 
 The project loosely follows [the recommended ZenML project structure](https://docs.zenml.io/user-guide/starter-guide/follow-best-practices):
````
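The split-renaming adjustment described in the diff's second bullet can be sketched in plain Python. This is an illustrative stand-in, not the template's actual `prepare_data` code: the split names and the rename map below are assumptions, and the list contents are placeholder data rather than a real Hugging Face `DatasetDict`.

```python
# Illustrative only: remapping nonstandard split names onto the
# train/validation/test layout the prepare_data step expects.
raw_splits = {
    "training": ["example 1", "example 2"],  # nonstandard name (assumed)
    "dev": ["example 3"],                    # nonstandard name (assumed)
    "test": ["example 4"],
}

# Map each source split name to the name the step code expects.
RENAME = {"training": "train", "dev": "validation", "test": "test"}
splits = {RENAME[name]: data for name, data in raw_splits.items()}
```

With a real Hugging Face dataset, the same idea applies to the keys of the `DatasetDict` returned by `datasets.load_dataset`, so that downstream steps can keep indexing by `train`, `validation`, and `test`.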

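The tokenization bullet notes that `generate_and_tokenize_prompt` applies a default input format before the model sees the data. A minimal sketch of that kind of formatting hook is shown below; the field names (`instruction`, `response`) and the prompt template are assumptions for illustration, not the template's actual format.

```python
def format_prompt(sample: dict) -> str:
    """Join a raw sample into a single training string.

    Hypothetical formatter in the spirit of generate_and_tokenize_prompt;
    the section headers and field names are assumed, not taken from the repo.
    """
    return (
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['response']}"
    )

sample = {"instruction": "Summarize the text.", "response": "A summary."}
prompt = format_prompt(sample)
```

If the default formatting does not fit your data, a function like this is the place to change: the tokenizer then operates on whatever string the formatter produces.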