Nvingest curator tutorial #584
Closed
ruchaa-apte wants to merge 52 commits into NVIDIA-NeMo:main from ruchaa-apte:nvingest_curator_tutorial
Commits
d216ede Adding files for NVingest portion of tutorial (ruchaa-apte)
26ac579 README for nvingest update (ruchaa-apte)
3f252d9 Adding nemo curator portion of the tutorial (ruchaa-apte)
7bf899f Merge branch 'NVIDIA:main' into nvingest_curator_tutorial (ruchaa-apte)
ee2173a README update (ruchaa-apte)
c192c6a Adding Workflow Image (ruchaa-apte)
b65d6f0 Update README.md (ruchaa-apte)
61b3128 Minor edit to caption for image (ruchaa-apte)
8f7c707 Merge branch 'NVIDIA:main' into nvingest_curator_tutorial (ruchaa-apte)
b33657c [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
62636f6 Update tutorials/multimodal_dapt_curation/ingest/main.py (ruchaa-apte)
d791fc5 Merge branch 'main' into nvingest_curator_tutorial (ruchaa-apte)
dd225a5 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
ff4bf2d Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
7125e23 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
5ea91f2 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
7ff9ff0 Update tutorials/multimodal_dapt_curation/README.md (ruchaa-apte)
6a67597 Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema… (ruchaa-apte)
6bc0a1a Update tutorials/multimodal_dapt_curation/curator/configs/text_semant… (ruchaa-apte)
89051d9 Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema… (ruchaa-apte)
4d42543 Update tutorials/multimodal_dapt_curation/curator/configs/text_semant… (ruchaa-apte)
90df3e6 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
43f739d Update tutorials/multimodal_dapt_curation/curator/main.py (ruchaa-apte)
4674188 Update tutorials/multimodal_dapt_curation/curator/main.py (ruchaa-apte)
206026c Update tutorials/multimodal_dapt_curation/curator/main.py (ruchaa-apte)
56ed2ac Update tutorials/multimodal_dapt_curation/curator/main.py (ruchaa-apte)
1ae9028 Update tutorials/multimodal_dapt_curation/curator/utils.py (ruchaa-apte)
1befa21 Update tutorials/multimodal_dapt_curation/curator/utils.py (ruchaa-apte)
398a31a Update tutorials/multimodal_dapt_curation/curator/README.md (ruchaa-apte)
83c224b Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
fe8ac8f Update tutorials/multimodal_dapt_curation/curator/README.md (ruchaa-apte)
cddf63a Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
cc9c1b2 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
ab31cb7 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
ff38938 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
9870e58 Update tutorials/multimodal_dapt_curation/ingest/README.md (ruchaa-apte)
9d49e09 Addressing comments on PR (ruchaa-apte)
a45fd5d Fixing image display issue (ruchaa-apte)
61741de Addressing README edits (ruchaa-apte)
ec512b6 Making changes to config based on correct keys (ruchaa-apte)
613d3c5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
bcba5f4 Merge branch 'NVIDIA:main' into nvingest_curator_tutorial (ruchaa-apte)
c7283cc Semantic Dedupe for Image Curation (ruchaa-apte)
e2fb216 Config file update (ruchaa-apte)
72a3926 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
06da4da addressing comments on PR (ruchaa-apte)
65ba3a5 Merge branch 'nvingest_curator_tutorial' of https://github.com/ruchaa… (ruchaa-apte)
d54f1da [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
8096500 resolve issues (ruchaa-apte)
e7e4cb4 fix linting issues (ruchaa-apte)
7520f2f Merge branch 'main' into nvingest_curator_tutorial (ruchaa-apte)
7098294 Merge branch 'main' into nvingest_curator_tutorial (ayushdg)
@@ -0,0 +1,20 @@
# Multimodal Extraction and Curation

## Workflow


## Overview
This tutorial is divided into two parts:

### Part 1: Multimodal Extraction
This part guides you through extracting multiple modalities (text, images, tables, etc.) from PDFs using NVIDIA's multimodal extraction framework (`nv-ingest`). To complete the prerequisites and run this part, refer to the README in the `ingest` folder of this tutorial directory.

### Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT)
This part covers best practices for data curation for DAPT. It processes the extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and run this part, follow the README in the `curator` folder of this tutorial directory.

## Instructions
- Complete all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) before proceeding.
- Follow the READMEs in the `ingest` and `curator` folders for step-by-step guidance.

## License
Refer to the respective repositories for licensing information.
@@ -0,0 +1,41 @@
# Multi-Modal Data Curation from PDFs

## Overview
This is Part 2 of the tutorial, which covers best practices for data curation in Domain-Adaptive Pre-Training (DAPT).
The dataset used in this tutorial is small, making it ideal for developing and validating data curation pipelines on either a local machine or a computing cluster. The playbook employs specialized tools and techniques for high-quality text curation and refinement.

## Hardware Requirements
This playbook is compatible with both CPUs and GPUs.
While most steps can run on a CPU, the semantic and fuzzy deduplication modules require a GPU.
If GPUs are available, the PII redaction and exact deduplication steps are also accelerated.

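The CPU/GPU split described above can be summarized as a small lookup. This is a minimal sketch, not code from the tutorial's `main.py`: the step names and the `runnable_steps` helper are hypothetical labels chosen for illustration, and only the GPU requirement for fuzzy and semantic deduplication comes from the text above.

```python
# Hypothetical step names; only the GPU requirement mirrors the README text.
GPU_ONLY_STEPS = {"fuzzy_dedup", "semantic_dedup"}
ALL_STEPS = [
    "exact_dedup",
    "quality_filtering",
    "unicode_fix",
    "pii_redaction",
    "fuzzy_dedup",
    "semantic_dedup",
]


def runnable_steps(device: str) -> list[str]:
    """Return the steps that can run on the given device ("cpu" or "gpu")."""
    if device not in {"cpu", "gpu"}:
        raise ValueError(f"unknown device: {device!r}")
    if device == "gpu":
        return list(ALL_STEPS)
    # On a CPU-only machine, skip the steps that require a GPU.
    return [step for step in ALL_STEPS if step not in GPU_ONLY_STEPS]


print("cpu:", runnable_steps("cpu"))
print("gpu:", runnable_steps("gpu"))
```

On a CPU-only machine this leaves four of the six steps, which is usually enough to validate the earlier stages of a pipeline before moving to a GPU node.
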
## Walkthrough
The datasets used in this tutorial are located in the `NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/separated_extracted_data/data_type_map.json` file.

The tutorial follows these steps:
1. Install requirements and import libraries
2. Convert extracted data: Transform data from `nv-ingest` into Dask DataFrames and convert them to `DocumentDataset`.
3. Examine file types and sizes (optional)
4. Run the data curation pipeline with NeMo Curator:
    - Identify and separate file types
    - Perform document-level exact deduplication
    - Apply heuristic-based quality filtering (e.g., number of lines, word count, top N-grams)
    - Fix Unicode errors using `ftfy`
    - Redact PII
    - Execute GPU-accelerated fuzzy and semantic deduplication
5. Convert images extracted by `nv-ingest` into WebDataset format
6. Apply semantic deduplication to remove duplicate extracted images
7. Save the filtered and curated data

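To make two of the pipeline steps above concrete, here is a minimal sketch of document-level exact deduplication followed by a word-count filter. It uses plain Python rather than the NeMo Curator API, which operates on Dask-backed `DocumentDataset` objects. The function names and the `min_words`/`max_words` thresholds are assumptions chosen for illustration.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing by content hash."""
    seen: set[str] = set()
    unique = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def passes_word_count(text: str, min_words: int = 3, max_words: int = 1000) -> bool:
    """Heuristic filter: keep documents whose word count is within bounds."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words


docs = [
    "NeMo Curator removes duplicate documents.",
    "NeMo Curator removes duplicate documents.",  # exact duplicate of the first
    "Too short",  # fails the word-count filter
    "Tables and charts are extracted from PDFs as text.",
]
curated = [d for d in exact_dedup(docs) if passes_word_count(d)]
print(curated)
```

Hashing the full text only catches byte-identical documents. Near-duplicates are the job of the fuzzy and semantic deduplication steps later in the pipeline.
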
## Usage
After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following commands:
```sh
pip install -r requirements.txt
```

```sh
python main.py --device "gpu"
```

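For readers who want to see how a `--device` flag like the one above might be handled, here is a minimal sketch using the standard-library `argparse` module. It is not taken from the tutorial's `main.py`: the `parse_args` helper, the allowed choices, and the `"gpu"` default are all assumptions.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse a hypothetical --device flag; choices and default are assumptions."""
    parser = argparse.ArgumentParser(description="Run the curation pipeline (sketch).")
    parser.add_argument(
        "--device",
        choices=["cpu", "gpu"],
        default="gpu",
        help="Device to run the pipeline on.",
    )
    return parser.parse_args(argv)


# Equivalent to: python main.py --device "gpu"
args = parse_args(["--device", "gpu"])
print(f"Running on: {args.device}")
```

Restricting `choices` makes a typo such as `--device gpus` fail at startup instead of partway through a long run.
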
## License
Refer to the relevant repository for licensing information.
21 changes: 21 additions & 0 deletions
tutorials/multimodal_dapt_curation/curator/configs/struct_semantic_dedupe_config.yaml
@@ -0,0 +1,21 @@
# Configuration file for struct semantic deduplication
cache_dir: "workspace/semdedup_cache/struct"
num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
write_embeddings_to_disk: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
clustering_input_partition_size: "2gb"

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
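
The following sketch illustrates how a threshold such as `eps_to_extract` can be used with `sim_metric: "cosine"`. It assumes the SemDeDup-style convention in which an item counts as a duplicate when its cosine similarity to an earlier item exceeds `1 - eps`. That reading, the `semantic_dedup` helper, and the toy embeddings are assumptions for illustration. The real pipeline also clusters embeddings first (`n_clusters`) and orders items within each cluster (`which_to_keep`), which this sketch omits.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def semantic_dedup(embeddings: list[list[float]], eps: float) -> list[int]:
    """Return indices to keep.

    An item is dropped when its highest cosine similarity to any earlier
    item exceeds 1 - eps.
    """
    keep = []
    for i, vec in enumerate(embeddings):
        max_sim = max((cosine(vec, embeddings[j]) for j in range(i)), default=-1.0)
        if max_sim <= 1.0 - eps:
            keep.append(i)
    return keep


# Toy 2-D embeddings: the second is nearly identical to the first.
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept = semantic_dedup(embeddings, eps=0.1)
print(kept)  # [0, 2]
```

Under this convention a smaller `eps` is stricter about what counts as a duplicate, so fewer items are removed.
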
21 changes: 21 additions & 0 deletions
tutorials/multimodal_dapt_curation/curator/configs/text_semantic_dedupe_config.yaml
@@ -0,0 +1,21 @@
# Configuration file for text semantic deduplication
cache_dir: "workspace/semdedup_cache/text"
num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
write_embeddings_to_disk: false

# Clustering configuration
max_iter: 100
n_clusters: 15
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
clustering_input_partition_size: "2gb"

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
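
The struct and text configuration files in this PR share every key and differ only in `cache_dir` and `n_clusters`. As a sketch of that relationship, the snippet below expresses them as one shared base plus per-type overrides. The values are copied from the two YAML files, while the dictionary layout and the `build_config` helper are hypothetical and not part of the PR.

```python
# Settings common to both YAML files.
BASE = {
    "num_files": 16,
    "embedding_model_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
    "embedding_batch_size": 128,
    "embeddings_save_loc": "embeddings",
    "write_embeddings_to_disk": False,
    "max_iter": 100,
    "clustering_save_loc": "clustering_results",
    "sim_metric": "cosine",
    "which_to_keep": "hard",
    "batched_cosine_similarity": 1024,
    "clustering_input_partition_size": "2gb",
    "eps_to_extract": 0.1,
}

# The only settings that differ between the two files.
OVERRIDES = {
    "struct": {"cache_dir": "workspace/semdedup_cache/struct", "n_clusters": 5},
    "text": {"cache_dir": "workspace/semdedup_cache/text", "n_clusters": 15},
}


def build_config(kind: str) -> dict:
    """Merge the shared settings with the overrides for "struct" or "text"."""
    return {**BASE, **OVERRIDES[kind]}


struct_cfg = build_config("struct")
text_cfg = build_config("text")
differing = sorted(k for k in struct_cfg if struct_cfg[k] != text_cfg[k])
print(differing)  # ['cache_dir', 'n_clusters']
```
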