Nvingest curator tutorial #584

ruchaa-apte · 2025-03-11T21:19:55Z

Description

This tutorial is divided into two parts:
Part 1: Multimodal Extraction - guide to extracting various modalities (text, images, tables, etc.) from PDFs using NVIDIA's multimodal extraction (nv-ingest) framework.
Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT) - covers best practices for data curation in DAPT. This stage processes extracted text, tables, charts, and images using the curation pipeline.

Usage

Follow the README in the ingest folder first, then the README in the curator folder, to install the prerequisites.

cd ingest
python main.py --analyze --display

cd curator
python main.py --device "gpu"
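
The two stages above can also be chained from a small driver. This is a minimal sketch, assuming it is launched from the tutorial root (the directory that contains `ingest` and `curator`). `build_stage_commands` and `run_stages` are hypothetical helper names and are not part of this PR; only the folder names and flags are taken from the usage above.

```python
import subprocess
import sys
from pathlib import Path


def build_stage_commands(root: Path) -> list[tuple[Path, list[str]]]:
    """Return (working directory, command) pairs for the two tutorial stages."""
    return [
        # Part 1: multimodal extraction
        (root / "ingest", [sys.executable, "main.py", "--analyze", "--display"]),
        # Part 2: data curation on the extracted output
        (root / "curator", [sys.executable, "main.py", "--device", "gpu"]),
    ]


def run_stages(root: Path) -> None:
    """Run extraction, then curation; check=True stops at the first failure."""
    for cwd, cmd in build_stage_commands(root):
        subprocess.run(cmd, cwd=cwd, check=True)


# Example (assumes the prerequisites from both READMEs are installed):
# run_stages(Path("."))
```

Running each stage with its own working directory mirrors the `cd` steps, so relative paths inside each `main.py` should resolve as they do when the commands are typed by hand.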

Checklist

[Y] I am familiar with the Contributing Guide.
[Y] New or Existing tests cover these changes.
[Y] The documentation is up to date with these changes.

Signed-off-by: Rucha Apte <[email protected]>

ruchaa-apte · 2025-03-11T21:22:59Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

ChrisJar

LGTM! Just had one small nitpick.

tutorials/multimodal_dapt_curation/ingest/main.py

ChrisJar

One more thing

ChrisJar · 2025-03-26T20:18:12Z

tutorials/multimodal_dapt_curation/ingest/README.md

+3. Using Ingestor interface to chain together an extraction task and a deduplication task to ingest PDF
+    - `extract` : Performs multimodal extractions from a document, including text, images, and tables.
+    - `dedup` : Identifies duplicate images in document that can be filtered to remove data redundancy.
+    - `filter` : Filters out images that are likely not useful using some heuristics, including size and aspect ratio.
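
To make the `dedup` and `filter` descriptions concrete, here is a self-contained sketch of the two ideas. It is not the nv-ingest implementation: `ExtractedImage`, `dedup_images`, and `filter_images` are hypothetical names, exact-byte hashing is an assumed stand-in for whatever duplicate detection the framework uses, and the size and aspect-ratio thresholds are illustrative values only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedImage:
    """Hypothetical stand-in for one image extracted from a PDF."""
    data: bytes
    width: int
    height: int


def dedup_images(images):
    """Keep the first occurrence of each byte-identical image."""
    seen = set()
    unique = []
    for img in images:
        digest = hashlib.sha256(img.data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(img)
    return unique


def filter_images(images, min_size=128, max_aspect_ratio=5.0):
    """Drop images that are too small or too elongated (assumed thresholds)."""
    kept = []
    for img in images:
        short, long = sorted((img.width, img.height))
        if short <= 0 or short < min_size:
            continue  # tiny or degenerate images are likely icons or rules
        if long / short > max_aspect_ratio:
            continue  # very elongated images are likely banners or separators
        kept.append(img)
    return kept
```

Deduplicating before filtering, as in this sketch, means each unique image is checked against the heuristics only once.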


I see that the ingest job in main.py has a caption task. It might be helpful to add information about that here.

Good find! I will add the caption task portion to the README and update.

sarahyurick

Very clean code, thank you. Added a bunch of nits.

tutorials/multimodal_dapt_curation/README.md

tutorials/multimodal_dapt_curation/curator/configs/struct_semantic_dedupe_config.yaml

tutorials/multimodal_dapt_curation/curator/configs/text_semantic_dedupe_config.yaml

tutorials/multimodal_dapt_curation/ingest/README.md

sarahyurick · 2025-04-02T19:33:34Z

If needed, could you add pdfminer.six==20221105 as a requirement for the DAPT tutorial?
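
One way to confirm that an environment matches a pin like this is a small check at the start of a script. This is a sketch only: `PINNED` and `check_pins` are hypothetical names, and nothing here is part of the tutorial code. Only the package name and version come from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin requested in the review comment above.
PINNED = {"pdfminer.six": "20221105"}


def check_pins(pins, get_version=version):
    """Return {package: problem} for packages that are missing or off-pin."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


# Example: an empty dict means every pinned package is at the expected version.
# print(check_pins(PINNED))
```

The `get_version` parameter exists only so the check can be exercised without installing anything.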

ruchaa-apte added 9 commits February 25, 2025 13:54

Adding files for NVingest portion of tutorial

d216ede

Signed-off-by: Rucha Apte <[email protected]>

README for nvingest update

26ac579

Signed-off-by: Rucha Apte <[email protected]>

Adding nemo curator portion of the tutorial

3f252d9

Signed-off-by: Rucha Apte <[email protected]>

Merge branch 'NVIDIA:main' into nvingest_curator_tutorial

7bf899f

README update

ee2173a

Signed-off-by: Rucha Apte <[email protected]>

Adding Workflow Image

c192c6a

Signed-off-by: Rucha Apte <[email protected]>

Update README.md

b65d6f0

Signed-off-by: Rucha Apte <[email protected]>

Minor edit to caption for image

61b3128

Signed-off-by: Rucha Apte <[email protected]>

Merge branch 'NVIDIA:main' into nvingest_curator_tutorial

8f7c707

[pre-commit.ci] auto fixes from pre-commit.com hooks

b33657c

for more information, see https://pre-commit.ci

ChrisJar approved these changes Mar 26, 2025

tutorials/multimodal_dapt_curation/ingest/main.py (outdated, resolved)

ChrisJar approved these changes Mar 26, 2025

ruchaa-apte and others added 2 commits March 26, 2025 14:03

Update tutorials/multimodal_dapt_curation/ingest/main.py

62636f6

Co-authored-by: ChrisJar <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Merge branch 'main' into nvingest_curator_tutorial

d791fc5

sarahyurick reviewed Mar 26, 2025

ruchaa-apte and others added 13 commits April 2, 2025 12:49

Update tutorials/multimodal_dapt_curation/ingest/README.md

dd225a5

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

ff4bf2d

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

7125e23

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

5ea91f2

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/README.md

7ff9ff0

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema…

6a67597

…ntic_dedupe_config.yaml Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/configs/text_semant…

6bc0a1a

…ic_dedupe_config.yaml Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/configs/struct_sema…

89051d9

…ntic_dedupe_config.yaml Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/configs/text_semant…

4d42543

…ic_dedupe_config.yaml Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

90df3e6

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/main.py

43f739d

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/main.py

4674188

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/main.py

206026c

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

ruchaa-apte and others added 11 commits April 2, 2025 12:52

Update tutorials/multimodal_dapt_curation/curator/main.py

56ed2ac

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/utils.py

1ae9028

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/utils.py

1befa21

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/README.md

398a31a

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

83c224b

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/curator/README.md

fe8ac8f

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

cddf63a

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

cc9c1b2

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

ab31cb7

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

ff38938

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>

Update tutorials/multimodal_dapt_curation/ingest/README.md

9870e58

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Rucha Apte <[email protected]>
