
Add multi label classification task #405

Merged
merged 10 commits into main from add-multi-label-classification-task on Jun 29, 2023

Conversation

@Tyrest Tyrest (Contributor) commented Jun 27, 2023

This PR adds the MultiLabelClassificationTask to autolabel. It currently tests prompt generation and MultiLabelClassificationTask.eval. Below is an example prompt:


You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.
Your job is to correctly label the provided input example into one or more of the following categories:
neutral
anger
anticipation
disgust
fear
joy
love
optimism
pessimism
sadness
surprise
trust

You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: "label1, label2, label3"

Some examples with their output answers are provided below:

Input: In my room 101 would go  Russell Howard,Tom Odell,Michael Buble!!! #pants #nogood 
Output: anger, disgust, pessimism, sadness

Input: @FullTimeDEVILS Memphis looking bright. Rojo looking like Rojo.
Output: joy, optimism

Input: Yet again another night I should've stayed in😊
Output: joy

Input: I hope a goal comes soon on either side. Otherwise there is a serious threat of a dull and frustrating game. #COYS
Output: anger, disgust, pessimism, sadness

Input: Watch this amazing live.ly broadcast by @rosannahill  #musically
Output: joy, surprise

Now I want you to label the following example:
Input: Will be a lively evening at ot that one.
Output: 
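For reference, the config fragments quoted in the review below assemble into roughly the following (a sketch only, shown here as a Python dict; any section not quoted in this thread is omitted):

```python
# Sketch of the task config behind the prompt above, reassembled from the
# fragments discussed in the review comments below.
config = {
    "task_name": "EmotionClassification",
    "task_type": "multi_label_classification",
    "dataset": {"label_column": "label"},  # reviewers below suggest renaming to "labels"
    "prompt": {
        "task_guidelines": (
            "You are an expert at classifying tweets as neutral or one or more of "
            "the given emotions that best represent the mental state of the poster.\n"
            "Your job is to correctly label the provided input example into one or "
            "more of the following categories:\n{labels}"
        ),
        "output_guidelines": (
            "You will return the answer as a comma separated list of labels sorted "
            'in alphabetical order. For example: "label1, label2, label3"'
        ),
        "labels": [
            "neutral", "anger", "anticipation", "disgust", "fear", "joy",
            "love", "optimism", "pessimism", "sadness", "surprise", "trust",
        ],
    },
}
```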

@Tyrest Tyrest requested a review from nihit June 27, 2023 21:53
@rishabh-bhargava rishabh-bhargava (Contributor) left a comment:

just added some minor comments. haven't gone through the code too deeply

@@ -0,0 +1,34 @@
{
"task_name": "EmotionClassification",
"task_type": "multi_label_classification",
Contributor:

multilabel_classification instead of multi_label_classification?

I am looking at sklearn, and they have sklearn.multiclass and sklearn.multioutput (treating it as one word).

(if we make this change, we will need to make it everywhere)

"prompt": {
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}",
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"",
"labels": [
Contributor:

@nihit @Tyrest do we want some parameters that constrain the number of outputs? For example, if a user wanted to ask for:

  • exactly three outputs
  • at most three outputs
  • at least two outputs

this is possible with other ML libraries by just using the predicted probabilities for each class, but curious how we'd want to enable that here.
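For context, a minimal sketch of the probability-based approach mentioned above (pure NumPy, illustrative only; the function name and thresholds are assumptions, not part of this PR):

```python
import numpy as np

def select_labels(probs, labels, min_k=1, max_k=3, threshold=0.5):
    """Keep labels scoring above `threshold`, clamped to between min_k and max_k picks."""
    order = np.argsort(probs)[::-1]                      # most confident first
    above = [i for i in order if probs[i] >= threshold]  # thresholded candidates
    k = min(max(len(above), min_k), max_k)               # enforce count constraints
    return [labels[i] for i in order[:k]]

print(select_labels(np.array([0.7, 0.1, 0.65, 0.2]),
                    ["joy", "fear", "optimism", "anger"]))
# ['joy', 'optimism']
```

With an LLM there are no per-class probabilities by default, so the equivalent constraint would likely live in the labeling guidelines or in post-processing of the returned label list.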

Contributor:

@rishabh-bhargava I think it's reasonable to expect the labeling guidelines to take care of it?

"task_name": "EmotionClassification",
"task_type": "multi_label_classification",
"dataset": {
"label_column": "label",
Contributor:

"label_column": "labels", (replace label -> labels given it's a multi-label classification

@@ -0,0 +1,70 @@
example,label
Contributor:

we should rename the second column as labels

Contributor:

also, from what I can tell, the seed.csv and test.csv files that are being checked into the repo are much smaller. take a look at the assets/ folder in test/?

Contributor:

Yes, these should be just dummy seed and test files for testing (not the entire seed and test sets)

Additionally, @Tyrest I see ~6700 rows in test and only ~70 rows in seed. This doesn't matter for the unit testing assets, but I want to make sure that the actual dataset files being checked into https://s3.console.aws.amazon.com/s3/buckets/autolabel-benchmarking are 200 labeled examples for seed and 2000 for test for the benchmarking.

Contributor Author:

Yep, didn't realize that the csv files were smaller. I'll make these changes now!
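For anyone reproducing the benchmarking files, a rough sketch of the sampling step described above (file names and the random seed are assumptions, not taken from this PR):

```python
import pandas as pd

# Hypothetical full SemEval-2018 emotion dataset exported to CSV.
df = pd.read_csv("sem_eval_2018_full.csv")

seed = df.sample(n=200, random_state=42)                     # 200 labeled seed examples
test = df.drop(seed.index).sample(n=2000, random_state=42)   # 2000 test examples

seed.to_csv("seed.csv", index=False)
test.to_csv("test.csv", index=False)
```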


@@ -0,0 +1,34 @@
{
Contributor:

let's rename this file to twitter_emotion_detection to avoid confusion

@@ -19,6 +20,10 @@

SCIQ_CONFIG = json.load(open("tests/assets/sciq/config_sciq.json", "r"))

SEM_EVAL_2018_CONFIG = json.load(
Contributor:

same - rename to twitter emotion detection config

"prompt": {
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}",
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"",
"labels": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rishabh-bhargava i think it's reasonable to expect labeling guidelines to take care of it?



class MultiLabelClassificationTask(BaseTask):
    DEFAULT_OUTPUT_GUIDELINES = 'You will return the answer as a comma separated list of labels. For example: "label1, label2, label3"'
Contributor:

how well does this guideline do across LLMs? A few other possible choices here are:

  • return it separated by newlines
  • return it as a list/array

I'm neutral to what we choose ultimately, but whatever is the default output guideline should work well across LLMs (including Refuel LLM)
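Whichever default is chosen, the parsing side needs to be forgiving about whitespace and casing. A small illustrative helper for the comma-separated format (not the library's actual parser):

```python
from typing import List

def parse_labels(llm_output: str, valid_labels: List[str]) -> List[str]:
    """Split a comma-separated LLM answer into known labels, sorted alphabetically."""
    valid = {label.lower(): label for label in valid_labels}
    parts = {p.strip().lower() for p in llm_output.split(",") if p.strip()}
    return sorted(valid[p] for p in parts if p in valid)

print(parse_labels("Joy,  optimism , excitement", ["joy", "optimism", "anger"]))
# ['joy', 'optimism']   (unknown labels are dropped)
```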


return answered_gt_labels, answered_llm_preds

def eval(
Contributor:

@DhruvaBansal00 can you share some resources/pointers for multi-label eval metrics that would be good to add here?
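For reference, a few standard multi-label metrics that could slot into eval() (the sklearn calls are real; the surrounding glue is only a sketch):

```python
from sklearn.metrics import accuracy_score, f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

gt = [["joy", "optimism"], ["anger", "sadness"]]
pred = [["joy"], ["anger", "sadness"]]

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(gt)
y_pred = mlb.transform(pred)

print(accuracy_score(y_true, y_pred))    # exact-match (subset) accuracy
print(hamming_loss(y_true, y_pred))      # fraction of wrong label slots
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # per-class F1, averaged
```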

Comment on lines 124 to 125
llm_labels (List[LLMAnnotation]): _description_
gt_labels (List[str]): _description_
Contributor:

add description please

@nihit nihit (Contributor) commented Jun 28, 2023

issue: #391

confidences = []
for index, llm_label in enumerate(llm_labels):
    labels.append(llm_label.label.lower() == gt_labels[index].lower())
    confidences.append(llm_label.confidence_score)
Contributor:

Confidence here will be the average of log probs of all the tokens in the llm labels.

We might instead want to define confidence individually for each label in the list. Not sure if that is in the scope of this PR - if not, let's open a ticket for doing this as a follow-up.

Contributor:

Implementing this will require some revamp on the confidence calculation side. We probably want to now move confidence calculation into each task (and the parent class can implement the default confidence calculator)

Contributor Author:

Just created #406 for this
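For the record, a rough sketch of the per-label idea now tracked in #406 (the helper name and the token/log-prob interface are assumptions, not autolabel's API): average the token log-probs within each comma-separated span of the generation.

```python
import math
from typing import Dict, List

def per_label_confidence(tokens: List[str], logprobs: List[float]) -> Dict[str, float]:
    """Average token log-probs within each comma-separated label span
    (separator tokens are folded into the preceding span)."""
    scores: Dict[str, float] = {}
    span_toks: List[str] = []
    span_lps: List[float] = []

    def flush() -> None:
        label = "".join(span_toks).strip()
        if label:
            scores[label] = math.exp(sum(span_lps) / len(span_lps))
        span_toks.clear()
        span_lps.clear()

    for tok, lp in zip(tokens, logprobs):
        head, comma, tail = tok.partition(",")
        span_toks.append(head)
        span_lps.append(lp)
        if comma:
            flush()
            if tail.strip():
                span_toks.append(tail)
                span_lps.append(lp)
    flush()
    return scores

print(per_label_confidence([" joy", ",", " opt", "imism"], [-0.1, -0.05, -0.3, -0.2]))
# {'joy': 0.93, 'optimism': 0.78} (approximately)
```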

from autolabel.configs import AutolabelConfig
from autolabel.schema import LLMAnnotation, Metric, MetricResult
from autolabel.tasks import BaseTask
from autolabel.tasks.utils import compute_f1
Contributor:

I think compute_f1 requires some changes as a part of this PR:

def compute_f1(prediction: str, truth: str) -> float:

That function currently splits the string using .split() with the default separator. We could either change that function to take in a list of pre-split tokens or allow passing a custom separator. I believe this class uses ',' as the separator between labels.
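A minimal sketch of the separator option suggested above (not the actual autolabel implementation; the token-overlap F1 shape is kept):

```python
from collections import Counter
from typing import Optional

def compute_f1(prediction: str, truth: str, sep: Optional[str] = None) -> float:
    """Token-overlap F1; pass sep="," so multi-label strings split on commas."""
    pred_tokens = [t.strip() for t in prediction.split(sep) if t.strip()]
    truth_tokens = [t.strip() for t in truth.split(sep) if t.strip()]
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(compute_f1("joy, optimism", "joy, sadness", sep=","))  # 0.5
```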

Contributor:

Also, we are currently calculating a micro F1 score there. We could instead use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html to give the user the option to choose between micro, macro, weighted, or class-wise F1 scores.

Contributor Author:

Since QuestionAnsweringTask uses this method as well, I'm just going to leave compute_f1 as is for now and use sklearn's f1_score directly in MultilabelClassificationTask.eval
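Something along these lines, presumably (a sketch of the sklearn route; the helper name and exact wiring are assumptions, not the merged code):

```python
from typing import List
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def multilabel_f1(gt_labels: List[str], llm_labels: List[str], average: str = "macro") -> float:
    """F1 over comma-separated label strings, with selectable averaging."""
    def split(s: str) -> List[str]:
        return sorted({part.strip() for part in s.split(",") if part.strip()})

    mlb = MultiLabelBinarizer()
    y_true = mlb.fit_transform([split(s) for s in gt_labels])
    y_pred = mlb.transform([split(s) for s in llm_labels])
    return f1_score(y_true, y_pred, average=average, zero_division=0)

print(multilabel_f1(["joy, optimism", "anger, sadness"], ["joy", "anger, sadness"]))  # 0.75
```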

@nihit nihit (Contributor) commented Jun 29, 2023

@DhruvaBansal00 will let you take a final pass, lgtm

@DhruvaBansal00 DhruvaBansal00 self-requested a review June 29, 2023 06:10
@DhruvaBansal00 DhruvaBansal00 (Contributor) left a comment:

Lgtm!

However, I do think we should consolidate f1 calculation for multilabel and question answering into one function. We should be using sklearn to calculate it for both. Let's open a ticket to track that?

@Tyrest Tyrest merged commit 2993888 into main Jun 29, 2023
@Tyrest Tyrest deleted the add-multi-label-classification-task branch June 29, 2023 16:54