added support for hugging face datasets #354

Merged 2 commits on Jun 23, 2023

@Tyrest Tyrest commented Jun 21, 2023

We can now load datasets from Hugging Face, as shown below:

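import json
from pathlib import Path

from autolabel import LabelingAgent
from datasets import load_dataset

# Load the LEDGAR configuration of the lex_glue dataset from Hugging Face
dataset = load_dataset("lex_glue", "ledgar")

test_dataset = dataset["test"]
# map_label_to_string is a helper that is not shown in this description
test_dataset = map_label_to_string(test_dataset, "label")

# Rename the input column from "text" to "example"
test_dataset = test_dataset.rename_column("text", "example")

ledgar_path = Path("../autolabel/examples/ledgar")
with open(ledgar_path / "config_ledgar.json", "r") as f:
    config = json.load(f)
# If the few-shot examples are given as a file path, resolve it against the example directory
if "few_shot_examples" in config["prompt"] and isinstance(config["prompt"]["few_shot_examples"], str):
    config["prompt"]["few_shot_examples"] = str(ledgar_path / config["prompt"]["few_shot_examples"])
agent = LabelingAgent(config)

# Plan, then run, labeling on at most 8 items
agent.plan(test_dataset, max_items=8)
agent.run(test_dataset, max_items=8)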

@Tyrest Tyrest requested a review from nihit June 21, 2023 22:30
Comment on lines 222 to 223
dataset[
start_index : max_items if max_items and max_items > 0 else len(dataset)
Contributor:

Do we need any sanity checks on the dataset object before attempting to convert it to a DataFrame?

Contributor Author:

This seems to be a more robust solution so I'll implement it here as well
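
A sanity check of the kind discussed in this thread might look like the sketch below. The function name `check_dataset` and its `required_columns` argument are hypothetical and not part of autolabel; the sketch only assumes that the object behaves like a Hugging Face `Dataset`, which exposes `column_names`, `to_pandas()`, and `len()`.

```python
from typing import Any, Iterable


def check_dataset(dataset: Any, required_columns: Iterable[str]) -> None:
    """Raise early if `dataset` cannot be safely converted to a DataFrame.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # A Hugging Face Dataset exposes `column_names` and `to_pandas()`.
    for attr in ("column_names", "to_pandas"):
        if not hasattr(dataset, attr):
            raise TypeError(f"expected a Dataset-like object, missing `{attr}`")

    # An empty dataset would produce an empty DataFrame and nothing to label.
    if len(dataset) == 0:
        raise ValueError("dataset is empty")

    # Every column the labeling config refers to must be present.
    missing = [c for c in required_columns if c not in dataset.column_names]
    if missing:
        raise ValueError(f"dataset is missing required columns: {missing}")
```

Calling `check_dataset(test_dataset, ["example"])` before the conversion would turn a confusing failure further down into an early, descriptive error.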


  self.db.initialize()
  self.dataset = self.db.initialize_dataset(
-     dataset, self.config, start_index, max_items
+     dataset_loader.dat, self.config, start_index, max_items
Contributor:

Can we access the DataFrame through a helper method in the dataset loader (instead of accessing it directly here), for readability?

Contributor Author:

Can you give an example of what you want it to look like instead?

Contributor:

I was thinking of just having a def as_dataframe() method in the data loader that returns the internal pandas DataFrame object, instead of the labeling agent accessing it directly.
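
The accessor suggested here could be sketched roughly as follows. The class below is a hypothetical, stripped-down stand-in for the real dataset loader, and the internal object is typed loosely so that the sketch has no third-party dependencies; in the PR it would be a pandas DataFrame.

```python
from typing import Any


class DatasetLoader:
    """Hypothetical, minimal loader that owns the loaded data."""

    def __init__(self, dat: Any) -> None:
        # In the PR this attribute holds a pandas DataFrame.
        self.dat = dat

    def as_dataframe(self) -> Any:
        """Return the internal data object (the DataFrame in the real loader)."""
        return self.dat


# Call site: ask the loader for its data instead of reaching into `.dat`.
rows = [{"example": "placeholder text", "label": "placeholder_label"}]
loader = DatasetLoader(rows)
data = loader.as_dataframe()
```

With this in place, the labeling agent would pass `dataset_loader.as_dataframe()` rather than `dataset_loader.dat`, so the loader can change its internal attribute name without touching the agent.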

        else f"{dataset.replace('.csv','')}_labeled.csv"
    )
else:
    csv_file_name = "labeled.csv"
Contributor:

Let's change this to include a task name prefix (<task_name>_labeled.csv) so that the output files from different tasks don't overwrite each other.
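
One way the naming rule requested here could work is sketched below. The function `labeled_csv_name` and its `task_name` parameter are hypothetical; how the task name is obtained from the config is not shown in this conversation. Unlike the `replace('.csv','')` call in the snippet above, the sketch strips only a trailing `.csv`.

```python
from typing import Any


def labeled_csv_name(dataset: Any, task_name: str) -> str:
    """Pick an output file name for the labeled data (illustrative only)."""
    if isinstance(dataset, str):
        # Input was a CSV path: derive the output name from the input name.
        stem = dataset[:-4] if dataset.endswith(".csv") else dataset
        return f"{stem}_labeled.csv"
    # Input was an in-memory dataset: prefix the file with the task name
    # so that outputs from different tasks do not overwrite each other.
    return f"{task_name}_labeled.csv"
```

For an in-memory dataset and a task named `ledgar`, this would write to `ledgar_labeled.csv` instead of the shared `labeled.csv`.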

@Tyrest Tyrest requested a review from nihit June 23, 2023 21:29
@Tyrest Tyrest merged commit 7784e6d into main Jun 23, 2023
@Tyrest Tyrest deleted the hugging-face-dataset-support branch June 23, 2023 23:23