added support for hugging face datasets #354

Tyrest · 2023-06-21T22:30:34Z

We can now load datasets from Hugging Face, as shown below:

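import json
from pathlib import Path

from autolabel import LabelingAgent
from datasets import load_dataset

# Load the LEDGAR configuration of the lex_glue dataset from the Hugging Face Hub.
dataset = load_dataset("lex_glue", "ledgar")

test_dataset = dataset["test"]
# map_label_to_string is a helper that is not shown in this description.
test_dataset = map_label_to_string(test_dataset, "label")

# Rename the "text" column to "example".
test_dataset = test_dataset.rename_column("text", "example")

ledgar_path = Path("../autolabel/examples/ledgar")
with open(ledgar_path / "config_ledgar.json", "r") as f:
    config = json.load(f)
# If few_shot_examples is a file path, resolve it relative to the example directory.
if "few_shot_examples" in config["prompt"] and isinstance(config["prompt"]["few_shot_examples"], str):
    config["prompt"]["few_shot_examples"] = str(ledgar_path / config["prompt"]["few_shot_examples"])
agent = LabelingAgent(config)

# Plan and run labeling on at most 8 items.
agent.plan(test_dataset, max_items=8)
agent.run(test_dataset, max_items=8)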

nihit · 2023-06-22T17:23:19Z

src/autolabel/dataset_loader.py

+            dataset[
+                start_index : max_items if max_items and max_items > 0 else len(dataset)


Do we need any sanity checks on the dataset object before attempting to convert it to a dataframe?

This seems to be a more robust solution so I'll implement it here as well
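
For reference, the kind of sanity check asked about above might look like the sketch below. It is not the code merged in this PR: `resolve_slice` and `check_dataset` are hypothetical names, and the sketch assumes `max_items` is meant as a row count (so the slice end is `start_index + max_items`), whereas the diff uses `max_items` directly as the slice end. It relies only on the dataset exposing `column_names` and `len()`, which Hugging Face `Dataset` objects do.

```python
from typing import Any, Iterable, Optional, Tuple


def resolve_slice(
    n_rows: int, start_index: int = 0, max_items: Optional[int] = None
) -> Tuple[int, int]:
    """Return (start, end) row bounds, clamped to the dataset length."""
    if not 0 <= start_index <= n_rows:
        raise ValueError(f"start_index {start_index} is outside 0..{n_rows}")
    if max_items and max_items > 0:
        # Assumption: max_items is a row count, so the end is relative to start.
        end = min(start_index + max_items, n_rows)
    else:
        # No limit given (None or 0): take everything from start_index onward.
        end = n_rows
    return start_index, end


def check_dataset(dataset: Any, required_columns: Iterable[str]) -> None:
    """Fail early if the object lacks the columns the config refers to."""
    columns = getattr(dataset, "column_names", None)
    if columns is None:
        raise TypeError("expected an object exposing `column_names`")
    missing = [c for c in required_columns if c not in columns]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    if len(dataset) == 0:
        raise ValueError("dataset is empty")
```

With a Hugging Face dataset, the resulting bounds could then be used as `pd.DataFrame(dataset[start:end])`, since slicing a `datasets.Dataset` returns a dict of column lists.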

nihit · 2023-06-22T17:24:59Z

src/autolabel/labeler.py


        self.db.initialize()
        self.dataset = self.db.initialize_dataset(
-            dataset, self.config, start_index, max_items
+            dataset_loader.dat, self.config, start_index, max_items


Can we access the dataframe through a helper method in the dataset loader (instead of accessing it directly here)? That would help readability.

Can you give an example of what you want it to look like instead?

I was thinking of just adding an as_dataframe() method to the data loader that returns the internal pandas dataframe object, instead of having the labeling agent access it directly.
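
A minimal sketch of that suggestion, for illustration only. The class below is a hypothetical stand-in rather than the actual loader in `src/autolabel/dataset_loader.py`, and the frame is typed as `Any` to keep the example dependency-free; in the real loader it would be the pandas dataframe.

```python
from typing import Any


class DatasetLoader:
    """Hypothetical stand-in for the loader discussed in this thread."""

    def __init__(self, frame: Any) -> None:
        # In autolabel this would be a pandas DataFrame built from the input.
        self._dat = frame

    def as_dataframe(self) -> Any:
        """Return the internal table so callers need not touch `_dat`."""
        return self._dat
```

The labeling agent would then call `dataset_loader.as_dataframe()` instead of reading the attribute directly, which leaves the loader free to rename or restructure its internals later.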

nihit · 2023-06-22T17:26:05Z

src/autolabel/labeler.py

+                else f"{dataset.replace('.csv','')}_labeled.csv"
+            )
+        else:
+            csv_file_name = "labeled.csv"


Let's change this so the file name contains a task-name prefix (something like <task name>_labeled.csv), so that the output files from different tasks don't overwrite each other.
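
One way the naming could work is sketched below. `labeled_csv_name` and its parameters are hypothetical, and the exact naming scheme is an assumption rather than what was merged; the sketch keeps the output next to the input file when the dataset is given as a path.

```python
from pathlib import Path
from typing import Any


def labeled_csv_name(task_name: str, dataset: Any) -> str:
    """Build an output file name that is prefixed with the task name."""
    if isinstance(dataset, str):
        path = Path(dataset)
        # Drop a trailing ".csv" only, rather than replacing it anywhere in the path.
        stem = path.name[:-4] if path.name.endswith(".csv") else path.name
        return str(path.with_name(f"{task_name}_{stem}_labeled.csv"))
    # Non-path inputs (for example an in-memory dataset) have no file name to reuse.
    return f"{task_name}_labeled.csv"
```

Two tasks run on in-memory datasets would then write to, say, `ledgar_labeled.csv` and another task's own prefixed file instead of sharing a single `labeled.csv`.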

added support for hugging face datasets

479d399

Tyrest requested a review from nihit June 21, 2023 22:30

nihit reviewed Jun 22, 2023

addressed comments

d852f69

Tyrest requested a review from nihit June 23, 2023 21:29

nihit approved these changes Jun 23, 2023

Tyrest merged commit 7784e6d into main Jun 23, 2023

Tyrest deleted the hugging-face-dataset-support branch June 23, 2023 23:23
