Nvingest curator tutorial #584

Closed
wants to merge 52 commits into from
Commits
d216ede
Adding files for NVingest portion of tutorial
ruchaa-apte Feb 25, 2025
26ac579
README for nvingest update
ruchaa-apte Feb 25, 2025
3f252d9
Adding nemo curator portion of the tutorial
ruchaa-apte Feb 25, 2025
7bf899f
Merge branch 'NVIDIA:main' into nvingest_curator_tutorial
ruchaa-apte Feb 28, 2025
ee2173a
README update
ruchaa-apte Feb 28, 2025
c192c6a
Adding Workflow Image
ruchaa-apte Feb 28, 2025
b65d6f0
Update README.md
ruchaa-apte Feb 28, 2025
61b3128
Minor edit to caption for image
ruchaa-apte Mar 11, 2025
8f7c707
Merge branch 'NVIDIA:main' into nvingest_curator_tutorial
ruchaa-apte Mar 11, 2025
b33657c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2025
62636f6
Update tutorials/multimodal_dapt_curation/ingest/main.py
ruchaa-apte Mar 26, 2025
d791fc5
Merge branch 'main' into nvingest_curator_tutorial
ruchaa-apte Mar 26, 2025
dd225a5
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
ff4bf2d
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
7125e23
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
5ea91f2
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
7ff9ff0
Update tutorials/multimodal_dapt_curation/README.md
ruchaa-apte Apr 2, 2025
6a67597
Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema…
ruchaa-apte Apr 2, 2025
6bc0a1a
Update tutorials/multimodal_dapt_curation/curator/configs/text_semant…
ruchaa-apte Apr 2, 2025
89051d9
Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema…
ruchaa-apte Apr 2, 2025
4d42543
Update tutorials/multimodal_dapt_curation/curator/configs/text_semant…
ruchaa-apte Apr 2, 2025
90df3e6
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
43f739d
Update tutorials/multimodal_dapt_curation/curator/main.py
ruchaa-apte Apr 2, 2025
4674188
Update tutorials/multimodal_dapt_curation/curator/main.py
ruchaa-apte Apr 2, 2025
206026c
Update tutorials/multimodal_dapt_curation/curator/main.py
ruchaa-apte Apr 2, 2025
56ed2ac
Update tutorials/multimodal_dapt_curation/curator/main.py
ruchaa-apte Apr 2, 2025
1ae9028
Update tutorials/multimodal_dapt_curation/curator/utils.py
ruchaa-apte Apr 2, 2025
1befa21
Update tutorials/multimodal_dapt_curation/curator/utils.py
ruchaa-apte Apr 2, 2025
398a31a
Update tutorials/multimodal_dapt_curation/curator/README.md
ruchaa-apte Apr 2, 2025
83c224b
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
fe8ac8f
Update tutorials/multimodal_dapt_curation/curator/README.md
ruchaa-apte Apr 2, 2025
cddf63a
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
cc9c1b2
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
ab31cb7
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
ff38938
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
9870e58
Update tutorials/multimodal_dapt_curation/ingest/README.md
ruchaa-apte Apr 2, 2025
9d49e09
Addressing comments on PR
ruchaa-apte Apr 8, 2025
a45fd5d
Fixing image display issue
ruchaa-apte Apr 8, 2025
61741de
Addressing README edits
ruchaa-apte Apr 8, 2025
ec512b6
Making changes to config based on correct keys
ruchaa-apte Apr 8, 2025
613d3c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2025
bcba5f4
Merge branch 'NVIDIA:main' into nvingest_curator_tutorial
ruchaa-apte Apr 22, 2025
c7283cc
Semantic Dedupe for Image Curation
ruchaa-apte Apr 22, 2025
e2fb216
Config file update
ruchaa-apte Apr 22, 2025
72a3926
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 22, 2025
06da4da
addressing comments on PR
ruchaa-apte Apr 23, 2025
65ba3a5
Merge branch 'nvingest_curator_tutorial' of https://github.com/ruchaa…
ruchaa-apte Apr 23, 2025
d54f1da
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 23, 2025
8096500
resolve issues
ruchaa-apte Apr 29, 2025
e7e4cb4
fix linting issues
ruchaa-apte Apr 29, 2025
7520f2f
Merge branch 'main' into nvingest_curator_tutorial
ruchaa-apte Apr 29, 2025
7098294
Merge branch 'main' into nvingest_curator_tutorial
ayushdg May 5, 2025
20 changes: 20 additions & 0 deletions tutorials/multimodal_dapt_curation/README.md
@@ -0,0 +1,20 @@
# Multimodal Extraction and Curation

## Workflow
![Workflow Overview](image/workflow.png)

## Overview
This tutorial is divided into two parts:

### Part 1: Multimodal Extraction
Part 1 walks through extracting several modalities (text, images, tables, and more) from PDFs using NVIDIA's multimodal extraction framework, `nv-ingest`. For prerequisites and run instructions, see the README in the `ingest` folder of this tutorial directory.

### Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT)
Part 2 covers best practices for curating data for DAPT. This stage runs the extracted text, tables, charts, and images through the curation pipeline. For prerequisites and run instructions, see the README in the `curator` folder of this tutorial directory.

## Instructions
- Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
- Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.

## License
Refer to the respective repositories for licensing information.
39 changes: 39 additions & 0 deletions tutorials/multimodal_dapt_curation/curator/README.md
@@ -0,0 +1,39 @@
# Multi-Modal Data Curation from PDFs

## Overview
This is Part 2 of the tutorial that provides best practices for data curation in Domain-Adaptive Pre-Training (DAPT).
The dataset used in this tutorial is small, making it ideal for developing and validating data curation pipelines on either a local machine or a computing cluster. The playbook employs specialized tools and techniques for high-quality text curation and refinement.

## Hardware Requirements
This playbook runs on CPUs or GPUs, with one exception: the semantic and fuzzy deduplication modules require a GPU.
All other steps can run on a CPU.
If GPUs are available, PII redaction and exact deduplication are accelerated as well.

## Walkthrough
The datasets used in this tutorial are located in the `NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/separated_extracted_data/data_type_map.json` file.

The tutorial follows these steps:
1. Install requirements and import libraries
2. Convert extracted data: Transform data from `nv-ingest` into Dask DataFrames and convert them to `DocumentDataset`.
3. Examine file types and sizes (optional)
4. Run the data curation pipeline with NeMo Curator:
    - Identify and separate file types
    - Perform document-level exact deduplication
    - Apply heuristic-based quality filtering (e.g., number of lines, word count, top N-grams)
    - Fix Unicode errors using `ftfy`
    - Redact PII
    - Execute GPU-accelerated fuzzy and semantic deduplication
5. Save the filtered and curated data
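
To make two of the pipeline steps above concrete, here is a minimal standard-library sketch of document-level exact deduplication followed by a word-count filter. It is an illustration only, not the NeMo Curator implementation: the function names `exact_dedup` and `word_count_filter`, the `min_words`/`max_words` defaults, and the sample records are all assumptions made for this example.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct text hash."""
    seen = set()
    kept = []
    for doc in docs:
        # Hash the full text so identical documents collapse to one key.
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_count_filter(docs, min_words=3, max_words=100_000):
    """Drop documents whose whitespace-delimited word count is out of range."""
    return [d for d in docs if min_words <= len(d["text"].split()) <= max_words]


# Hypothetical records standing in for extracted text.
docs = [
    {"id": "a", "text": "GPU kernels launch in parallel across many threads."},
    {"id": "b", "text": "GPU kernels launch in parallel across many threads."},
    {"id": "c", "text": "Too short"},
]

curated = word_count_filter(exact_dedup(docs))
print([d["id"] for d in curated])  # prints ['a']
```

Running deduplication before filtering means each distinct text is scored once. Document `b` is removed as an exact copy of `a`, and `c` is removed for having only two words.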

## Usage
After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following commands:
```sh
pip install -r requirements.txt
```

```sh
python main.py --device "gpu"
```
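
As a rough illustration of how a `--device` flag could interact with the hardware requirements, the sketch below parses the flag with `argparse` and drops the GPU-only stages when `cpu` is selected. This is a hypothetical sketch, not the contents of `main.py`: the `cpu` choice, the stage names, and the decision to skip (rather than fail on) GPU-only stages are assumptions.

```python
import argparse

# Stage names are illustrative labels for the walkthrough steps,
# not identifiers taken from the tutorial code.
ALL_STAGES = [
    "exact_dedup",
    "quality_filter",
    "unicode_fix",
    "pii_redaction",
    "fuzzy_dedup",
    "semantic_dedup",
]
# Per the hardware requirements, these two need a GPU.
GPU_ONLY_STAGES = {"fuzzy_dedup", "semantic_dedup"}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Curation pipeline sketch")
    # "cpu" as an accepted value is an assumption for this example.
    parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu")
    return parser.parse_args(argv)


def stages_for(device):
    """Return the stages that can run on the selected device."""
    if device == "gpu":
        return list(ALL_STAGES)
    return [s for s in ALL_STAGES if s not in GPU_ONLY_STAGES]


args = parse_args(["--device", "cpu"])
print(stages_for(args.device))
```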

## License
Refer to the relevant repository for licensing information.
@@ -0,0 +1,26 @@
# Configuration file for struct semantic deduplication
cache_dir: "workspace/semdedup_cache/struct"
num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
write_embeddings_to_disk: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
clustering_input_partition_size: "2gb"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
@@ -0,0 +1,26 @@
# Configuration file for text semantic deduplication
cache_dir: "workspace/semdedup_cache/text"
num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
write_embeddings_to_disk: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
clustering_input_partition_size: "2gb"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
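
Both configuration files list two `eps_thresholds` values and extract with `eps_to_extract: 0.1`. The sketch below shows one way to read such a threshold, assuming the usual semantic-deduplication convention that two items count as duplicates when their cosine similarity exceeds `1 - eps`. It is a simplified greedy version in pure Python: it skips the clustering step and the `which_to_keep` policy, and the function names and toy embeddings are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, eps):
    """Return indices to keep.

    An item is dropped when its similarity to an already kept item
    exceeds 1 - eps (assumed convention, see the lead-in).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= 1.0 - eps for j in kept):
            kept.append(i)
    return kept


# Toy 2-D embeddings: the second is close to the first (similarity ~0.98).
embeddings = [
    [1.0, 0.0],
    [0.98, 0.2],
    [0.0, 1.0],
]

print(semantic_dedup(embeddings, eps=0.1))   # prints [0, 2]
print(semantic_dedup(embeddings, eps=0.01))  # prints [0, 1, 2]
```

Under this reading, a larger `eps` removes more data: at `0.1` the near-duplicate is dropped, while at `0.01` only items with similarity above 0.99 would be treated as duplicates.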