Add multi label classification task #405
Conversation
Just added some minor comments; haven't gone through the code too deeply.
@@ -0,0 +1,34 @@
{
    "task_name": "EmotionClassification",
    "task_type": "multi_label_classification",
multilabel_classification instead of multi_label_classification? I am looking at sklearn, and they have sklearn.multiclass and sklearn.multioutput (treating it as one word). (If we make this change, we will need to make it everywhere.)
"prompt": { | ||
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", | ||
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", | ||
"labels": [ |
@nihit @Tyrest do we want some parameters that constrain the number of outputs? For example, if a user wanted to ask for:
- exactly three outputs
- at most three outputs
- at least two outputs

This is possible with other ML libraries by just using the predicted probabilities for each class, but curious how we'd want to enable that here (a sketch of the probability-based approach follows).
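For reference, a minimal sketch of how the probability-based version of these constraints usually looks outside of autolabel; the label names, scores, and 0.5 threshold below are made up for illustration:

```python
import numpy as np

# Hypothetical per-class probabilities from some multi-label classifier.
labels = ["anger", "fear", "joy", "optimism", "sadness"]
probs = np.array([0.72, 0.10, 0.58, 0.31, 0.44])

order = np.argsort(probs)[::-1]                        # indices sorted by score, descending
above = [labels[i] for i in order if probs[i] >= 0.5]  # labels over the threshold

exactly_three = [labels[i] for i in order[:3]]                   # exactly three outputs
at_most_three = above[:3]                                        # at most three outputs
at_least_two = [labels[i] for i in order[:max(2, len(above))]]   # at least two outputs
```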
@rishabh-bhargava I think it's reasonable to expect the labeling guidelines to take care of it?
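For illustration only (this wording is not from the PR), the constraint could live entirely in the prompt section of the config:

```python
# Hypothetical config fragment: the output-count constraint is expressed
# purely through the task guidelines, not through extra parameters.
config_fragment = {
    "prompt": {
        "task_guidelines": (
            "Classify the tweet into one or more of the following categories:\n"
            "{labels}\n"
            "Return at least one and at most three labels."
        )
    }
}
```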
"task_name": "EmotionClassification", | ||
"task_type": "multi_label_classification", | ||
"dataset": { | ||
"label_column": "label", |
"label_column": "labels",
(replace label
-> labels
given it's a multi-label classification
@@ -0,0 +1,70 @@
example,label
We should rename the second column to labels.
Also, from what I can tell, the seed.csv and test.csv files that are already checked into the repo are much smaller. Take a look at the assets/ folder in test/?
Yes, these should be just dummy seed and test files for testing (not the entire seed and test sets)
Additionally, @Tyrest I see ~6700 rows in test and only ~70 rows in seed. This doesn't matter for the unit testing assets, but I want to make sure that the actual dataset files being checked into https://s3.console.aws.amazon.com/s3/buckets/autolabel-benchmarking contain 200 labeled examples for seed and 2000 for test for the benchmarking.
Yep, didn't realize that the csv files were smaller. I'll make these changes now!
@@ -0,0 +1,34 @@
{
let's rename this file to twitter_emotion_detection to avoid confusion
tests/unit/tasks_test.py (Outdated)
@@ -19,6 +20,10 @@
SCIQ_CONFIG = json.load(open("tests/assets/sciq/config_sciq.json", "r"))

SEM_EVAL_2018_CONFIG = json.load(
same - rename to twitter emotion detection config
"prompt": { | ||
"task_guidelines": "You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}", | ||
"output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"", | ||
"labels": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rishabh-bhargava i think it's reasonable to expect labeling guidelines to take care of it?
class MultiLabelClassificationTask(BaseTask):
    DEFAULT_OUTPUT_GUIDELINES = 'You will return the answer as a comma separated list of labels. For example: "label1, label2, label3"'
How well does this guideline do across LLMs? A few other possible choices here are:
- return it separated by newlines
- return it as a list/array

I'm neutral on what we choose ultimately, but whatever the default output guideline is, it should work well across LLMs (including Refuel LLM); a parsing sketch that tolerates all three formats follows below.
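Whichever default wins, the parsing side could be made tolerant of all three formats. A rough sketch, not the PR's actual parsing code:

```python
import re

# Sketch only: accept comma-separated, newline-separated, or JSON-style list
# outputs and normalize them to a sorted, de-duplicated list of labels.
def parse_multilabel_response(text: str) -> list[str]:
    text = text.strip().strip("[]")
    parts = re.split(r"[,\n]", text)
    return sorted({p.strip().strip("\"' ") for p in parts if p.strip()})

parse_multilabel_response('["joy", "anger"]')  # -> ['anger', 'joy']
parse_multilabel_response("joy\nanger")        # -> ['anger', 'joy']
```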
        return answered_gt_labels, answered_llm_preds

    def eval(
@DhruvaBansal00 can you share some resources/pointers for multi-label eval metrics that would be good to add here?
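Not to preempt that, but the usual candidates are the multi-label variants in sklearn; a small sketch with made-up ground truth and predictions, comma-separated as in this task:

```python
from sklearn.metrics import f1_score, hamming_loss, jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up example data, formatted like this task's comma-separated outputs.
gt = ["anger, sadness", "joy", "fear, joy"]
pred = ["anger", "joy", "fear, joy, optimism"]

def to_sets(rows):
    return [set(r.split(", ")) for r in rows]

mlb = MultiLabelBinarizer().fit(to_sets(gt) + to_sets(pred))
y_true, y_pred = mlb.transform(to_sets(gt)), mlb.transform(to_sets(pred))

print(f1_score(y_true, y_pred, average="macro"))         # per-class F1, averaged
print(hamming_loss(y_true, y_pred))                      # fraction of wrong label bits
print(jaccard_score(y_true, y_pred, average="samples"))  # per-example label overlap
```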
        llm_labels (List[LLMAnnotation]): _description_
        gt_labels (List[str]): _description_
Add a description, please. Issue: #391
        confidences = []
        for index, llm_label in enumerate(llm_labels):
            labels.append(llm_label.label.lower() == gt_labels[index].lower())
            confidences.append(llm_label.confidence_score)
Confidence here will be the average of the log probs of all the tokens in the LLM label.
We might instead want to define confidence individually for each label in the list. Not sure if that is in the scope of this PR; if not, let's open a ticket to do this as a follow-up.
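A rough sketch of what per-label confidence could look like, assuming (hypothetically) that token log probs could be grouped by the label span they belong to; the current pipeline does not expose that grouping:

```python
import math
from typing import Dict, List

# Hypothetical input: token log probs grouped per label span.
def per_label_confidence(label_token_logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    # Geometric-mean token probability per label, instead of a single score
    # averaged over the whole comma-separated answer.
    return {
        label: math.exp(sum(lps) / len(lps))
        for label, lps in label_token_logprobs.items()
    }

per_label_confidence({"joy": [-0.05, -0.20], "anger": [-1.30]})
# -> {'joy': ~0.88, 'anger': ~0.27}
```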
Implementing this will require some revamp on the confidence calculation side. We probably want to now move confidence calculation into each task (and the parent class can implement the default confidence calculator)
Just created #406 for this
from autolabel.configs import AutolabelConfig
from autolabel.schema import LLMAnnotation, Metric, MetricResult
from autolabel.tasks import BaseTask
from autolabel.tasks.utils import compute_f1
I think compute_f1 requires some changes as part of this PR:

autolabel/src/autolabel/tasks/utils.py, line 25 (at commit 1079624):
def compute_f1(prediction: str, truth: str) -> float:

That function currently splits the string using .split() with the default separator. We can either change that function to take in a list of tokens that are pre-split, or allow sending a custom separator. I believe this class uses ',' as the separator between labels.
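A minimal sketch of the separator option; this is not the actual utils.compute_f1, whose token-overlap math may differ:

```python
from typing import Optional

# Sketch only: same signature as the existing helper plus an optional separator.
def compute_f1(prediction: str, truth: str, separator: Optional[str] = None) -> float:
    pred_tokens = [t.strip() for t in prediction.split(separator) if t.strip()]
    truth_tokens = [t.strip() for t in truth.split(separator) if t.strip()]
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

compute_f1("anger, joy", "anger, sadness", separator=",")  # -> 0.5
```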
Also, we are currently calculating a micro F1 score there. We can instead use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html to give the user the option to choose between micro, macro, weighted, or class-wise F1 scores.
Since QuestionAnsweringTask uses this method as well, I'm just going to leave compute_f1 as is for now and use sklearn's f1_score directly in MultilabelClassificationTask.eval.
@DhruvaBansal00 will let you take a final pass, lgtm
Lgtm!
However, I do think we should consolidate f1 calculation for multilabel and question answering into one function. We should be using sklearn to calculate it for both. Let's open a ticket to track that?
This PR adds the MultiLabelClassificationTask to autolabel. It currently tests the prompt generation and MultiLabelClassificationTask.eval. Below is an example prompt: