Add Synthetic Data Generation Module #136

Merged
merged 69 commits into from
Jul 9, 2024

Conversation

@ryantwolf (Collaborator) commented Jul 2, 2024

Description

Adds a suite of tools for interacting with LLM services. These services are then used to build synthetic data generation tools and example pipelines that follow the Nemotron-4 340B Technical Report. The prompt templates used in the report are supplied as defaults throughout the code.
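
The "supplied as defaults" pattern can be pictured with a small sketch. The template text and the function name `build_subtopic_prompt` below are hypothetical, made up for illustration; they are not the prompts from the report or the module's actual API.

```python
# Hypothetical template text; the real defaults come from the report.
DEFAULT_SUBTOPICS_PROMPT_TEMPLATE = (
    "Can you generate {n_subtopics} comprehensive topics that encompass "
    "various aspects of {macro_topic}?"
)


def build_subtopic_prompt(
    macro_topic: str,
    n_subtopics: int,
    prompt_template: str = DEFAULT_SUBTOPICS_PROMPT_TEMPLATE,
) -> str:
    """Fill in a prompt template; callers may pass their own to override the default."""
    return prompt_template.format(
        macro_topic=macro_topic,
        n_subtopics=n_subtopics,
    )


# Default template
print(build_subtopic_prompt("Physics", 3))
# Caller-supplied template using the same placeholders
print(build_subtopic_prompt("Physics", 3, "List {n_subtopics} subtopics of {macro_topic}."))
```

Keeping the template as a keyword argument with a default means a caller can reproduce the report's setup with no extra configuration, yet still swap in a custom prompt per call.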

Usage

OpenAI API

import asyncio

from nemo_curator import AsyncOpenAIClient
from nemo_curator.synthetic import AsyncNemotronGenerator
from openai import AsyncOpenAI

async def demo():
    # Wrap an OpenAI-compatible async client so the generator can use it
    openai_client = AsyncOpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="",  # Fill in your API key
    )
    client = AsyncOpenAIClient(openai_client)
    generator = AsyncNemotronGenerator(client)

    model = "nvidia/nemotron-4-340b-instruct"
    model_kwargs = {
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 1234,
    }

    # Generate open Q&A prompts ("openlines") from macro topics and subtopics
    openlines = await generator.run_open_qa_pipeline(
        n_macro_topics=5,
        n_subtopics=3,
        n_openlines=3,
        n_revisions=2,
        model=model,
        base_model_kwargs=model_kwargs,
        conversion_model_kwargs=model_kwargs,
        ignore_conversion_failure=True,
    )

    # Use the first openline to start a synthetic dialogue
    dialogue = await generator.generate_dialogue(
        openline=openlines[0],
        user_model=model,
        assistant_model=model,
        user_model_kwargs=model_kwargs,
        assistant_model_kwargs=model_kwargs,
    )

    print(dialogue)

asyncio.run(demo())
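
The `ignore_conversion_failure=True` argument in the call above suggests that model responses are converted into Python lists and that a failed conversion can either raise or be skipped. The sketch below illustrates that idea under stated assumptions: `ConversionError`, `convert_response_to_list`, and `convert_all` are hypothetical names, and the parser is a simplified line-based one rather than the YAML parsing mentioned in the commit log.

```python
from typing import List


class ConversionError(Exception):
    """Raised when a model response cannot be turned into a list (hypothetical name)."""


def convert_response_to_list(response: str) -> List[str]:
    """Parse a response made of '- item' lines into a list of strings."""
    items = []
    for line in response.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith("- "):
            raise ConversionError(f"Unexpected line: {line!r}")
        items.append(line[2:].strip())
    if not items:
        raise ConversionError("Response contained no list items")
    return items


def convert_all(
    responses: List[str], ignore_conversion_failure: bool = False
) -> List[List[str]]:
    """Convert every response; skip failures only when explicitly asked to."""
    converted = []
    for response in responses:
        try:
            converted.append(convert_response_to_list(response))
        except ConversionError:
            if not ignore_conversion_failure:
                raise
            # Assumed behavior: drop the unparseable response and keep going
    return converted


responses = ["- a\n- b", "Sure! Here you go", "- c"]
print(convert_all(responses, ignore_conversion_failure=True))
```

Skipping failures trades completeness for robustness: a long pipeline is not aborted by one malformed response, but the output may contain fewer items than requested.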

NeMo Deploy

from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import NemotronFormatter, NemotronGenerator
from nemo.deploy.nlp import NemoQueryLLM

def demo():
    model = "local_nemotron"
    model_kwargs = {
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 1234,
        "conversation_formatter": NemotronFormatter(),
        "stop": ["<extra_id_1>"],
    }

    nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model)
    client = NemoDeployClient(nemo_client)
    # NemoDeployClient is taken to be a synchronous client here (worth
    # checking against the module), so it is paired with the synchronous
    # NemotronGenerator and called without await.
    generator = NemotronGenerator(client)

    openlines = generator.run_open_qa_pipeline(
        n_macro_topics=5,
        n_subtopics=3,
        n_openlines=3,
        n_revisions=2,
        model=model,
        base_model_kwargs=model_kwargs,
        conversion_model_kwargs=model_kwargs,
        ignore_conversion_failure=True,
    )

Checklist

  • I am familiar with the Contributing Guide.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

ryantwolf added 30 commits June 26, 2024 09:06
ryantwolf added 16 commits July 4, 2024 23:17

@VibhuJawa (Collaborator) left a comment

Thanks for working on this, @ryantwolf. This is very important functionality we are adding, and I'm very excited for it.

My major concern currently is that we have no way to rate-limit the number of requests we send; everything else is mostly nits.
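
One common way to address a concern like this is to cap the number of in-flight requests with an `asyncio.Semaphore`. The sketch below is an illustration only and not necessarily the change made in this PR; `query_with_limit`, `max_concurrent_requests`, and `fake_query` are hypothetical names. Note that it limits concurrency, not requests per second.

```python
import asyncio
from typing import Awaitable, Callable, List


async def query_with_limit(
    prompts: List[str],
    query_fn: Callable[[str], Awaitable[str]],
    max_concurrent_requests: int = 4,
) -> List[str]:
    """Run query_fn on every prompt with at most N requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def limited_query(prompt: str) -> str:
        # Wait here until one of the N slots is free
        async with semaphore:
            return await query_fn(prompt)

    # gather returns results in the same order as the prompts
    return await asyncio.gather(*(limited_query(p) for p in prompts))


async def fake_query(prompt: str) -> str:
    """Stand-in for a real LLM request."""
    await asyncio.sleep(0.01)
    return prompt.upper()


print(asyncio.run(query_with_limit(["a", "b", "c"], fake_query, max_concurrent_requests=2)))
```

A semaphore keeps the calling code unchanged apart from the wrapper, which makes it easy to apply around an existing async client call.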

nemo_curator/datasets/doc_dataset.py (outdated, resolved)
nemo_curator/services/openai_client.py (resolved)
nemo_curator/services/openai_client.py (resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/prompts.py (resolved)
nemo_curator/synthetic/nemotron.py (outdated, resolved)

@ayushdg (Collaborator) left a comment

Minor nit, but at a high level this looks good to me! Thanks a lot for this effort.

docs/user-guide/syntheticdata.rst (outdated, resolved)
ryantwolf added 7 commits July 8, 2024 17:39
@ryantwolf ryantwolf requested a review from VibhuJawa July 9, 2024 01:21

@VibhuJawa (Collaborator) left a comment

LGTM, thanks for working on this Ryan.

@ryantwolf ryantwolf merged commit f572314 into main Jul 9, 2024
3 checks passed
@ryantwolf ryantwolf deleted the rywolf/synth-data branch July 9, 2024 02:05
sarahyurick pushed a commit to sarahyurick/NeMo-Curator that referenced this pull request Jul 23, 2024
* Begin implementation on OpenAI client
* Fix relative import
* Add temperature
* Modify client interface and begin ultrachat
* Change type annotation in openai client
* Make imports easier
* Reformat to match nemotron report
* Add yaml conversion
* Fix index error
* Add error handling for yaml parsing
* Fix error
* Add additional yaml parsing check
* Add more yaml error handling
* Export conversion error
* Change variable naming
* Make error catching more general
* Refactor list out of nemotron
* Add prompt helper function
* Add revisions and writing prompts
* Fix default prompt templates
* Add closed qa
* Fix prompt
* Add math and coding
* Add problem generation
* Rename function
* Add dialogue support
* Fix mispell
* Add two turn generation
* Add reward model as judge
* Refactor reward query
* Add error handling for non-reward models
* Add error handling to sync client
* Add open qa pipeline
* Improve docs and add writing pipeline
* Add closed qa pipeline
* Add math pipeline
* Add python pipeline
* Add async nemotron generator
* Fix await with index
* Add seed parameter
* Add missing await
* Fix parameter names
* Fix subscript await issues
* Switch parsing method for reward model
* Add initial docs
* Add nemo deploy client
* Add easy import
* Move conversation formatter
* Add other file
* Update nemotron import
* Update model client import
* Remove model in query call
* Add extra index
* Fix response indexing
* Add top k
* Remove extras
* Add safe import for nemo deploy
* Add pandas conversions
* Add partition default
* Add no format
* Move no format location
* Use top_k in nemo client
* Address vibhu's review
* Add logging import
* Fix import
* Fix tqdm
* Add missing awaits
* Standardize names
* Address Ayush nit

---------

Signed-off-by: Ryan Wolf <[email protected]>
3 participants