Add multi label classification task #405
Conversation
Just added some minor comments; haven't gone through the code too deeply.
@@ -0,0 +1,34 @@
{
    "task_name": "EmotionClassification",
    "task_type": "multi_label_classification",
multilabel_classification instead of multi_label_classification? I am looking at sklearn, and they have sklearn.multiclass and sklearn.multioutput (treating it as one word). (If we make this change, we will need to make it everywhere.)
"prompt": { | ||
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", | ||
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", | ||
"labels": [ |
@nihit @Tyrest do we want some parameters that constrain the number of outputs? For example, if a user wanted to ask for:
- exactly three outputs
- at most three outputs
- at least two outputs

This is possible with other ML libraries by just using the predicted probabilities for each class, but curious how we'd want to enable that here (a sketch of the probability-based approach follows).
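For reference, a minimal sketch of how the probability-based version of these constraints usually looks outside of autolabel; the label names, scores, and 0.5 threshold below are made up for illustration:

```python
import numpy as np

# Hypothetical per-class probabilities from some multi-label classifier.
labels = ["anger", "fear", "joy", "optimism", "sadness"]
probs = np.array([0.72, 0.10, 0.58, 0.31, 0.44])

order = np.argsort(probs)[::-1]                        # indices sorted by score, descending
above = [labels[i] for i in order if probs[i] >= 0.5]  # labels over the threshold

exactly_three = [labels[i] for i in order[:3]]                   # exactly three outputs
at_most_three = above[:3]                                        # at most three outputs
at_least_two = [labels[i] for i in order[:max(2, len(above))]]   # at least two outputs
```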
@rishabh-bhargava I think it's reasonable to expect the labeling guidelines to take care of it?
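For illustration only (this wording is not from the PR), the constraint could live entirely in the prompt section of the config:

```python
# Hypothetical config fragment: the output-count constraint is expressed
# purely through the task guidelines, not through extra parameters.
config_fragment = {
    "prompt": {
        "task_guidelines": (
            "Classify the tweet into one or more of the following categories:\n"
            "{labels}\n"
            "Return at least one and at most three labels."
        )
    }
}
```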
"task_name": "EmotionClassification", | ||
"task_type": "multi_label_classification", | ||
"dataset": { | ||
"label_column": "label", |
"label_column": "labels",
(replace label
-> labels
given it's a multi-label classification
@@ -0,0 +1,70 @@
example,label
We should rename the second column to labels.
Also, from what I can tell, the seed.csv and test.csv files that are already checked into the repo are much smaller. Take a look at the assets/ folder in test/?
Yes, these should be just dummy seed and test files for testing (not the entire seed and test sets)
Additionally, @Tyrest I see ~6700 rows in test and only ~70 rows in seed. This doesn't matter for the unit testing assets, but I want to make sure that the actual dataset files being checked into https://s3.console.aws.amazon.com/s3/buckets/autolabel-benchmarking contain 200 labeled examples for seed and 2000 for test for the benchmarking.
Yep, didn't realize that the csv files were smaller. I'll make these changes now!
@@ -0,0 +1,34 @@
{
let's rename this file to twitter_emotion_detection to avoid confusion
tests/unit/tasks_test.py (Outdated)
@@ -19,6 +20,10 @@
SCIQ_CONFIG = json.load(open("tests/assets/sciq/config_sciq.json", "r"))

SEM_EVAL_2018_CONFIG = json.load(
same - rename to twitter emotion detection config
"prompt": { | ||
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", | ||
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", | ||
"labels": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rishabh-bhargava i think it's reasonable to expect labeling guidelines to take care of it?
class MultiLabelClassificationTask(BaseTask):
    DEFAULT_OUTPUT_GUIDELINES = 'You will return the answer as a comma separated list of labels. For example: "label1, label2, label3"'
How well does this guideline do across LLMs? A few other possible choices here are:
- return it separated by newlines
- return it as a list/array

I'm neutral on what we choose ultimately, but whatever the default output guideline is, it should work well across LLMs (including Refuel LLM); a parsing sketch that tolerates all three formats follows below.
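Whichever default wins, the parsing side could be made tolerant of all three formats. A rough sketch, not the PR's actual parsing code:

```python
import re

# Sketch only: accept comma-separated, newline-separated, or JSON-style list
# outputs and normalize them to a sorted, de-duplicated list of labels.
def parse_multilabel_response(text: str) -> list[str]:
    text = text.strip().strip("[]")
    parts = re.split(r"[,\n]", text)
    return sorted({p.strip().strip("\"' ") for p in parts if p.strip()})

parse_multilabel_response('["joy", "anger"]')  # -> ['anger', 'joy']
parse_multilabel_response("joy\nanger")        # -> ['anger', 'joy']
```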
        return answered_gt_labels, answered_llm_preds

    def eval(
@DhruvaBansal00 can you share some resources/pointers for multi-label eval metrics that would be good to add here?
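Not to preempt that, but the usual candidates are the multi-label variants in sklearn; a small sketch with made-up ground truth and predictions, comma-separated as in this task:

```python
from sklearn.metrics import f1_score, hamming_loss, jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up example data, formatted like this task's comma-separated outputs.
gt = ["anger, sadness", "joy", "fear, joy"]
pred = ["anger", "joy", "fear, joy, optimism"]

def to_sets(rows):
    return [set(r.split(", ")) for r in rows]

mlb = MultiLabelBinarizer().fit(to_sets(gt) + to_sets(pred))
y_true, y_pred = mlb.transform(to_sets(gt)), mlb.transform(to_sets(pred))

print(f1_score(y_true, y_pred, average="macro"))         # per-class F1, averaged
print(hamming_loss(y_true, y_pred))                      # fraction of wrong label bits
print(jaccard_score(y_true, y_pred, average="samples"))  # per-example label overlap
```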
        llm_labels (List[LLMAnnotation]): _description_
        gt_labels (List[str]): _description_
Add a description, please. Issue: #391
        confidences = []
        for index, llm_label in enumerate(llm_labels):
            labels.append(llm_label.label.lower() == gt_labels[index].lower())
            confidences.append(llm_label.confidence_score)
Confidence here will be the average of the log probs of all the tokens in the LLM label.
We might instead want to define confidence individually for each label in the list. Not sure if that is in the scope of this PR; if not, let's open a ticket to do this as a follow-up.
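A rough sketch of what per-label confidence could look like, assuming (hypothetically) that token log probs could be grouped by the label span they belong to; the current pipeline does not expose that grouping:

```python
import math
from typing import Dict, List

# Hypothetical input: token log probs grouped per label span.
def per_label_confidence(label_token_logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    # Geometric-mean token probability per label, instead of a single score
    # averaged over the whole comma-separated answer.
    return {
        label: math.exp(sum(lps) / len(lps))
        for label, lps in label_token_logprobs.items()
    }

per_label_confidence({"joy": [-0.05, -0.20], "anger": [-1.30]})
# -> {'joy': ~0.88, 'anger': ~0.27}
```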
Implementing this will require some revamp on the confidence calculation side. We probably want to now move confidence calculation into each task (and the parent class can implement the default confidence calculator)
Just created #406 for this
from autolabel.configs import AutolabelConfig
from autolabel.schema import LLMAnnotation, Metric, MetricResult
from autolabel.tasks import BaseTask
from autolabel.tasks.utils import compute_f1
I think compute_f1 requires some changes as part of this PR:

autolabel/src/autolabel/tasks/utils.py, line 25 (at commit 1079624):
def compute_f1(prediction: str, truth: str) -> float:

That function currently splits the string using .split() with the default separator. We can either change that function to take in a list of tokens that are pre-split, or allow sending a custom separator. I believe this class uses ',' as the separator between labels.
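A minimal sketch of the separator option; this is not the actual utils.compute_f1, whose token-overlap math may differ:

```python
from typing import Optional

# Sketch only: same signature as the existing helper plus an optional separator.
def compute_f1(prediction: str, truth: str, separator: Optional[str] = None) -> float:
    pred_tokens = [t.strip() for t in prediction.split(separator) if t.strip()]
    truth_tokens = [t.strip() for t in truth.split(separator) if t.strip()]
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

compute_f1("anger, joy", "anger, sadness", separator=",")  # -> 0.5
```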
Also, we are currently calculating a micro F1 score there. We can instead use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html to give the user the option to choose between micro, macro, weighted, or class-wise F1 scores.
Since QuestionAnsweringTask uses this method as well, I'm just going to leave compute_f1 as is for now and use sklearn's f1_score directly in MultilabelClassificationTask.eval.
@DhruvaBansal00 will let you take a final pass, lgtm
Lgtm!
However, I do think we should consolidate f1 calculation for multilabel and question answering into one function. We should be using sklearn to calculate it for both. Let's open a ticket to track that?
This PR adds the MultiLabelClassificationTask to autolabel. It currently tests the prompt generation and MultiLabelClassificationTask.eval. Below is an example prompt: