adding helper to bootstrap the generation of the meta.yaml file
kjappelbaum committed Aug 13, 2024
1 parent 8f04af5 commit d3eea1b
Showing 4 changed files with 110 additions and 168 deletions.
154 changes: 0 additions & 154 deletions data/tabular/blood_brain_barrier_martins_et_al/meta.yaml

This file was deleted.

42 changes: 42 additions & 0 deletions docs/api/meta_yaml_generator.md
@@ -0,0 +1,42 @@
# Meta YAML Generator

## Overview

The Meta YAML Generator is a tool designed to automatically create a `meta.yaml` file for chemical datasets using Large Language Models (LLMs). It analyzes the structure of a given DataFrame and generates a comprehensive metadata file, including advanced sampling methods and template formats.

The default model is `gpt-4o`. To use it, set the `OPENAI_API_KEY` environment variable.

## `generate_meta_yaml`

::: chemnlp.data.meta_yaml_generator.generate_meta_yaml
    handler: python
    options:
      show_root_heading: true
      show_source: false

## Usage Example

```python
import pandas as pd
from chemnlp.data.meta_yaml_generator import generate_meta_yaml

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Generate meta.yaml
meta_yaml = generate_meta_yaml(
    df,
    dataset_name="Polymer Properties Dataset",
    description="A dataset of polymer properties including glass transition temperatures and densities",
    output_path="path/to/save/meta.yaml"
)

# The meta_yaml variable now contains the dictionary representation of the meta.yaml
# If an output_path was provided, the meta.yaml file has been saved to that location
```

You can also use it as a command-line tool:

```bash
python -m chemnlp.data.meta_yaml_generator path/to/your_dataset.csv --dataset_name "Polymer Properties Dataset" --description "A dataset of polymer properties including glass transition temperatures and densities" --output_path "path/to/save/meta.yaml"
```
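
The docs above state that the default model needs `OPENAI_API_KEY`. A minimal pre-flight check, sketched here with only the standard library, could fail early with a readable message rather than with an authentication error inside the LLM call. The helper name `require_api_key` is hypothetical and is not part of `chemnlp`.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `var_name`, or raise a readable error.

    Hypothetical helper for illustration; not part of chemnlp.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before generating meta.yaml."
        )
    return key
```

Calling `require_api_key()` once before `generate_meta_yaml(...)` would be enough; other providers would need a different variable name.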
51 changes: 37 additions & 14 deletions src/chemnlp/data/helper.py
@@ -2,6 +2,7 @@
 import yaml
 from typing import Dict, Any
 from litellm import completion
+import fire

 CONSTANT_PROMPT = """
@@ -100,7 +101,10 @@
 4. Random Choices:
 - Use {#option1|option2|option3!} for random selection of text.
-Generate a similar meta.yaml structure for the given dataset, including appropriate targets, identifiers, and templates based on the column names and example data provided. Include at least one multiple choice template and one benchmarking template."""
+Generate a similar meta.yaml structure for the given dataset, including appropriate targets, identifiers, and templates based on the column names and example data provided. Include at least one multiple choice template and one benchmarking template.
+Just return raw YAML string, no need to wrap it into backticks or anything else.
+"""


 def generate_meta_yaml(
@@ -140,9 +144,11 @@ def generate_meta_yaml(

     # Call the LLM with the prompt
     llm_response = completion(
-        model=model, messages=[{"role": "user", "content": prompt}]
+        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
     )
+
+    llm_response = llm_response.choices[0].message.content

     # Parse the LLM's response and convert it to a dictionary
     try:
         meta_yaml = yaml.safe_load(llm_response)
@@ -153,24 +159,41 @@ def generate_meta_yaml(
     return meta_yaml


-# Example usage
-if __name__ == "__main__":
-    # Load your DataFrame
-    df = pd.read_csv("your_dataset.csv")
-
-    # Generate meta.yaml
-    meta_yaml = generate_meta_yaml(
-        df,
-        dataset_name="Your Dataset Name",
-        description="A brief description of your dataset",
-    )
+def cli(
+    data_path: str,
+    dataset_name: str,
+    description: str,
+    model: str = "gpt-4o",
+    output_path: str = None,
+):
+    """
+    Generate a meta.yaml structure for a dataset using an LLM based on a CSV file.
+    Args:
+        data_path (str): The path to the CSV file containing the dataset.
+        dataset_name (str): The name of the dataset.
+        description (str): A brief description of the dataset.
+        model (str, optional): The LLM model to use. Defaults to 'gpt-4o'.
+        output_path (str, optional): The path to save the generated meta.yaml. Defaults to None.
+    """
+    # Load the dataset from the CSV file
+    df = pd.read_csv(data_path)
+
+    # Generate the meta.yaml structure
+    meta_yaml = generate_meta_yaml(df, dataset_name, description, model)

+    output_path = output_path or "meta.yaml"
     # Print or save the generated meta.yaml
     if meta_yaml:
         print(yaml.dump(meta_yaml, default_flow_style=False))

         # Optionally, save to a file
-        with open("meta.yaml", "w") as f:
+        with open(output_path, "w") as f:
             yaml.dump(meta_yaml, f, default_flow_style=False)
     else:
         print("Failed to generate meta.yaml")
+
+
+# Example usage
+if __name__ == "__main__":
+    fire.Fire(cli)
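
The prompt change in this file asks the model to return raw YAML, but a model may still wrap its reply in a Markdown fence. As a hedged sketch (not part of this commit), a small helper could strip one surrounding fence before the reply reaches `yaml.safe_load`. The name `strip_code_fence` is hypothetical, and only the standard library is used.

```python
import re

# Build the fence marker programmatically so this snippet embeds cleanly in Markdown.
FENCE = "`" * 3

# One fenced block spanning the whole reply, with an optional language tag.
_FENCED_REPLY = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(reply: str) -> str:
    """Return the fenced body if `reply` is a single fenced block.

    Otherwise return `reply` with surrounding whitespace removed.
    Hypothetical helper for illustration; not part of chemnlp.
    """
    match = _FENCED_REPLY.match(reply)
    return match.group(1) if match else reply.strip()
```

Under this assumption, the parsing step would call `yaml.safe_load(strip_code_fence(llm_response))`. Replies that mix prose with a fenced block are left untouched and would still fail to parse, which seems preferable to guessing.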
31 changes: 31 additions & 0 deletions src/chemnlp/data/meta.yaml
@@ -0,0 +1,31 @@
bibtex:
- "@article{martins2023,\nauthor = {Martins, John and Doe, Jane and Smith, Alice},\ntitle = {Study on Blood-Brain Barrier Penetration of Various Drugs},\njournal = {Journal of Pharmacology},\nvolume = {12},\nnumber = {3},\npages = {123-134},\nyear = {2023},\ndoi = {10.1234/jpharm.2023.56789}}"
description: Describing the ability of different drugs to penetrate the blood-brain barrier.
identifiers:
- description: Simplified Molecular Input Line Entry System
  id: SMILES
  type: SMILES
- description: Name of the compound
  id: compound_name
  names:
  - noun: compound name
  type: Other
license: CC BY 4.0
links:
- description: corresponding publication
  url: https://example.com/publication
- description: data source
  url: https://example.com/data_source
name: blood_brain_barrier_martins_et_al
num_points: 2030
targets:
- description: Indicates whether the compound can penetrate the blood-brain barrier (1 for yes, 0 for no)
  id: penetrate_BBB
  names:
  - noun: blood-brain barrier penetration
  type: integer
templates:
- The compound {compound_name__names__noun} with SMILES {SMILES#} can {#penetrate|not penetrate!} the blood-brain barrier.
- The compound {compound_name__names__noun} with SMILES {SMILES#} is in the {split#} set.
- "Question: Which of the following compounds can penetrate the blood-brain barrier?\nOptions: {%multiple_choice_enum%4%aA1}\n{compound_name%}\nAnswer: {%multiple_choice_result}"
- The compound with SMILES {SMILES#} can penetrate the blood-brain barrier:<EOI>{penetrate_BBB#}
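
Because the file above is produced by an LLM, a quick structural check before committing it may be worthwhile. The sketch below assumes that the top-level keys seen in this generated file are the ones to expect; that list is drawn from this example only and is not the official ChemNLP schema, and `missing_meta_keys` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Top-level keys present in the generated meta.yaml above.
# Assumption: treated here as required; the real schema may differ.
EXPECTED_KEYS = (
    "name",
    "description",
    "targets",
    "identifiers",
    "license",
    "links",
    "num_points",
    "bibtex",
    "templates",
)


def missing_meta_keys(meta: Dict[str, Any]) -> List[str]:
    """Return the expected top-level keys absent from `meta`, in a stable order."""
    return [key for key in EXPECTED_KEYS if key not in meta]
```

An empty return value only means the keys exist. It says nothing about whether citations, links, or `num_points` are correct, so those still need a manual review.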
