Add Synthetic Data Generation Module #136

ryantwolf · 2024-07-02T05:33:00Z

Description

Adds a suite of tools for interacting with LLM services. These LLM services are then used to build synthetic data generation tools and example pipelines following the Nemotron 340B Technical Report. The prompt templates used in the report are supplied as defaults throughout the code.
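
As a rough illustration of what "supplied as defaults" can look like in practice, here is a minimal sketch of a prompt builder whose template argument falls back to a module-level default. The names `DEFAULT_MACRO_TOPICS_TEMPLATE` and `build_prompt` and the template wording are hypothetical placeholders, not the identifiers or prompts from this PR or the report.

```python
# Hypothetical default template; the real defaults live in the PR's prompts module
# and use the wording from the report, which is not reproduced here.
DEFAULT_MACRO_TOPICS_TEMPLATE = "List {n_macro_topics} broad topics about the world."


def build_prompt(template: str = DEFAULT_MACRO_TOPICS_TEMPLATE, **prompt_kwargs) -> str:
    """Fill a prompt template, using the default unless the caller overrides it."""
    return template.format(**prompt_kwargs)


# Use the default template.
default_prompt = build_prompt(n_macro_topics=5)

# Override the template while keeping the same keyword arguments.
custom_prompt = build_prompt(
    template="Name {n_macro_topics} topics related to cooking.",
    n_macro_topics=5,
)
```

The point of the pattern is that callers who only want the report's behavior never touch the template, while callers who want a variant change one argument.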

Usage

OpenAI API

import asyncio

from openai import AsyncOpenAI

from nemo_curator import AsyncOpenAIClient
from nemo_curator.synthetic import AsyncNemotronGenerator


async def demo():
    # Any OpenAI-compatible endpoint can be used; fill in your own API key.
    openai_client = AsyncOpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="",
    )
    client = AsyncOpenAIClient(openai_client)
    generator = AsyncNemotronGenerator(client)

    model = "nvidia/nemotron-4-340b-instruct"
    model_kwargs = {
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 1234,
    }

    # Generate open Q&A prompts ("openlines") from macro topics and subtopics.
    openlines = await generator.run_open_qa_pipeline(
        n_macro_topics=5,
        n_subtopics=3,
        n_openlines=3,
        n_revisions=2,
        model=model,
        base_model_kwargs=model_kwargs,
        conversion_model_kwargs=model_kwargs,
        ignore_conversion_failure=True,
    )

    # Build a dialogue starting from the first openline.
    dialogue = await generator.generate_dialogue(
        openline=openlines[0],
        user_model=model,
        assistant_model=model,
        user_model_kwargs=model_kwargs,
        assistant_model_kwargs=model_kwargs,
    )

    print(dialogue)


asyncio.run(demo())

NeMo Deploy

from nemo.deploy.nlp import NemoQueryLLM

from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import NemotronFormatter, NemotronGenerator


def demo():
    model = "local_nemotron"
    model_kwargs = {
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 1234,
        # Locally deployed models need the conversation formatted explicitly.
        "conversation_formatter": NemotronFormatter(),
        "stop": ["<extra_id_1>"],
    }

    nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model)
    client = NemoDeployClient(nemo_client)
    # Assumption: NemoDeployClient is a synchronous client, so it is paired
    # with the synchronous NemotronGenerator and called without await.
    generator = NemotronGenerator(client)

    openlines = generator.run_open_qa_pipeline(
        n_macro_topics=5,
        n_subtopics=3,
        n_openlines=3,
        n_revisions=2,
        model=model,
        base_model_kwargs=model_kwargs,
        conversion_model_kwargs=model_kwargs,
        ignore_conversion_failure=True,
    )
    return openlines

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa

Thanks for working on this, @ryantwolf. This is very important functionality that we are adding, and I'm very excited for it.

My major concern right now is that there is no way to rate limit the number of requests we send. Everything else is mostly nits.
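
For context on the rate-limiting concern, one common way to bound outgoing requests in asyncio code is a semaphore around each call. The sketch below is an illustration only, not the change made in this PR: `fake_query` stands in for a real LLM request, and `bounded_gather` and `max_concurrent` are hypothetical names. It caps the number of requests in flight at once; it does not enforce a requests-per-second quota.

```python
import asyncio


async def fake_query(prompt: str) -> str:
    """Stand-in for a network call to an LLM service (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"response to {prompt!r}"


async def bounded_gather(prompts, max_concurrent: int):
    """Run one query per prompt with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def run_one(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_query(prompt)
            finally:
                in_flight -= 1

    # gather preserves the order of the input prompts in its results.
    responses = await asyncio.gather(*(run_one(p) for p in prompts))
    return responses, peak


responses, peak = asyncio.run(
    bounded_gather([f"p{i}" for i in range(20)], max_concurrent=4)
)
```

Without the semaphore, all 20 coroutines would start their requests immediately, which is the behavior the review comment is worried about when the pipeline fans out over many topics.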

nemo_curator/datasets/doc_dataset.py

nemo_curator/services/openai_client.py

nemo_curator/synthetic/async_nemotron.py

nemo_curator/synthetic/prompts.py

nemo_curator/synthetic/nemotron.py

ayushdg

One minor nit, but at a high level this looks good to me! Thanks a lot for this effort.

docs/user-guide/syntheticdata.rst

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa

LGTM, thanks for working on this Ryan.

* Begin implementation on OpenAI client
* Fix relative import
* Add temperature
* Modify client interface and begin ultrachat
* Change type annotation in openai client
* Make imports easier
* Reformat to match nemotron report
* Add yaml conversion
* Fix index error
* Add error handling for yaml parsing
* Fix error
* Add additional yaml parsing check
* Add more yaml error handling
* Export conversion error
* Change variable naming
* Make error catching more general
* Refactor list out of nemotron
* Add prompt helper function
* Add revisions and writing prompts
* Fix default prompt templates
* Add closed qa
* Fix prompt
* Add math and coding
* Add problem generation
* Rename function
* Add dialogue support
* Fix mispell
* Add two turn generation
* Add reward model as judge
* Refactor reward query
* Add error handling for non-reward models
* Add error handling to sync client
* Add open qa pipeline
* Improve docs and add writing pipeline
* Add closed qa pipeline
* Add math pipeline
* Add python pipeline
* Add async nemotron generator
* Fix await with index
* Add seed parameter
* Add missing await
* Fix parameter names
* Fix subscript await issues
* Switch parsing method for reward model
* Add initial docs
* Add nemo deploy client
* Add easy import
* Move conversation formatter
* Add other file
* Update nemotron import
* Update model client import
* Remove model in query call
* Add extra index
* Fix response indexing
* Add top k
* Remove extras
* Add safe import for nemo deploy
* Add pandas conversions
* Add partition default
* Add no format
* Move no format location
* Use top_k in nemo client
* Address vibhu's review
* Add logging import
* Fix import
* Fix tqdm
* Add missing awaits
* Standardize names
* Address Ayush nit

---------

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf added 30 commits June 26, 2024 09:06

Begin implementation on OpenAI client

4bb6ddc

Signed-off-by: Ryan Wolf <[email protected]>

Fix relative import

850403f

Signed-off-by: Ryan Wolf <[email protected]>

Add temperature

e6cac7a

Signed-off-by: Ryan Wolf <[email protected]>

Modify client interface and begin ultrachat

131b2d6

Signed-off-by: Ryan Wolf <[email protected]>

Change type annotation in openai client

fb82737

Signed-off-by: Ryan Wolf <[email protected]>

Make imports easier

f3e6309

Signed-off-by: Ryan Wolf <[email protected]>

Reformat to match nemotron report

5ad683f

Signed-off-by: Ryan Wolf <[email protected]>

Add yaml conversion

0d552b4

Signed-off-by: Ryan Wolf <[email protected]>

Fix index error

87ebfc4

Signed-off-by: Ryan Wolf <[email protected]>

Add error handling for yaml parsing

bb72a68

Signed-off-by: Ryan Wolf <[email protected]>

Fix error

32c7f55

Signed-off-by: Ryan Wolf <[email protected]>

Add additional yaml parsing check

a6d306e

Signed-off-by: Ryan Wolf <[email protected]>

Add more yaml error handling

ece34b5

Signed-off-by: Ryan Wolf <[email protected]>

Export conversion error

28d3a08

Signed-off-by: Ryan Wolf <[email protected]>

Change variable naming

8cf295e

Signed-off-by: Ryan Wolf <[email protected]>

Make error catching more general

7fcd719

Signed-off-by: Ryan Wolf <[email protected]>

Refactor list out of nemotron

76ddfda

Signed-off-by: Ryan Wolf <[email protected]>

Add prompt helper function

2f7a03b

Signed-off-by: Ryan Wolf <[email protected]>

Add revisions and writing prompts

76c4bdd

Signed-off-by: Ryan Wolf <[email protected]>

Fix default prompt templates

2f15d89

Signed-off-by: Ryan Wolf <[email protected]>

Add closed qa

cc18dfe

Signed-off-by: Ryan Wolf <[email protected]>

Fix prompt

d4755c0

Signed-off-by: Ryan Wolf <[email protected]>

Add math and coding

366fea8

Signed-off-by: Ryan Wolf <[email protected]>

Add problem generation

f563018

Signed-off-by: Ryan Wolf <[email protected]>

Rename function

294a390

Signed-off-by: Ryan Wolf <[email protected]>

Add dialogue support

728d585

Signed-off-by: Ryan Wolf <[email protected]>

Fix mispell

4c64c3a

Signed-off-by: Ryan Wolf <[email protected]>

Add two turn generation

8db6019

Signed-off-by: Ryan Wolf <[email protected]>

Add reward model as judge

2d13d63

Signed-off-by: Ryan Wolf <[email protected]>

Refactor reward query

8336452

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf added 16 commits July 4, 2024 23:17

Add easy import

7daefb7

Signed-off-by: Ryan Wolf <[email protected]>

Move conversation formatter

c0509f9

Signed-off-by: Ryan Wolf <[email protected]>

Add other file

e964712

Signed-off-by: Ryan Wolf <[email protected]>

Update nemotron import

e500814

Signed-off-by: Ryan Wolf <[email protected]>

Update model client import

2b4d3ff

Signed-off-by: Ryan Wolf <[email protected]>

Remove model in query call

7acbee9

Signed-off-by: Ryan Wolf <[email protected]>

Add extra index

06b7310

Signed-off-by: Ryan Wolf <[email protected]>

Fix response indexing

f05b13a

Signed-off-by: Ryan Wolf <[email protected]>

Add top k

0efc808

Signed-off-by: Ryan Wolf <[email protected]>

Remove extras

c8d1419

Signed-off-by: Ryan Wolf <[email protected]>

Add safe import for nemo deploy

2d11a8c

Signed-off-by: Ryan Wolf <[email protected]>

Add pandas conversions

20afd89

Signed-off-by: Ryan Wolf <[email protected]>

Add partition default

2987c9a

Signed-off-by: Ryan Wolf <[email protected]>

Add no format

3f8dcc8

Signed-off-by: Ryan Wolf <[email protected]>

Move no format location

0926cbd

Signed-off-by: Ryan Wolf <[email protected]>

Use top_k in nemo client

e2beb5b

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa requested changes Jul 8, 2024

ayushdg approved these changes Jul 8, 2024

docs/user-guide/syntheticdata.rst (outdated, resolved)

ryantwolf added 7 commits July 8, 2024 17:39

Address vibhu's review

b918c14

Signed-off-by: Ryan Wolf <[email protected]>

Add logging import

b79ce6b

Signed-off-by: Ryan Wolf <[email protected]>

Fix import

8957e12

Signed-off-by: Ryan Wolf <[email protected]>

Fix tqdm

1400a32

Signed-off-by: Ryan Wolf <[email protected]>

Add missing awaits

0926d6e

Signed-off-by: Ryan Wolf <[email protected]>

Standardize names

fbe9292

Signed-off-by: Ryan Wolf <[email protected]>

Address Ayush nit

8f66396

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf requested a review from VibhuJawa July 9, 2024 01:21

VibhuJawa approved these changes Jul 9, 2024

ryantwolf merged commit f572314 into main Jul 9, 2024
3 checks passed

ryantwolf deleted the rywolf/synth-data branch July 9, 2024 02:05
