
labelmodel.fit on a superset of data changes predictions of subset #1581

Closed
srimugunthan opened this issue Apr 23, 2020 · 5 comments
srimugunthan commented Apr 23, 2020

Issue description

We have a dataset in which each record has either one label or multiple labels.
To verify the label model's predictions, we filtered the original data down to the records with only one label. Running labelmodel.fit on this single-labelled data gave an accuracy of more than 90%.

But when we ran labelmodel.fit on the whole data, the accuracy on those same single-labelled data points dropped drastically to 30%.
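For concreteness, the single-label filtering step described above can be sketched like this (a minimal pandas sketch; the `labels` column name and its values are hypothetical, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical frame where "labels" holds the list of labels per record
df = pd.DataFrame({"labels": [["spam"], ["spam", "ads"], ["ham"]]})

# Keep only the single-labelled records, as in the issue description
df_single = df[df["labels"].str.len() == 1]
print(len(df_single))  # 2 of the 3 records have exactly one label
```

The label model would then be fit once on `df_single`'s label matrix and once on the full data's, and the scores on the single-labelled rows compared.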

Code example/repro steps

I was able to reproduce the bug with a generated label matrix: https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb
Although the accuracy drop on the generated data is not as drastic, it illustrates the scenario.

Expected behavior

The subset of data with single labels should have the same accuracy whether the label model is fit on that subset alone or on the whole data.

System info

Snorkel 0.9.3 on Linux

srimugunthan commented Apr 24, 2020

Hi,
In the original example, where the drop was from 90% to 30%, I found an issue in the code.
I see that it happens only when I use PandasParallelLFApplier to get the label matrix. With PandasLFApplier it is fine.

I checked the matrices generated by PandasLFApplier and PandasParallelLFApplier, and they were different.
Below is the code from the notebook which I used to check.

```python
df_full = pd.concat([df_single, df_multilabel])
df_full.index.is_unique
# True

lm1 = applier.apply(df=df_full)
lm2 = applier_regex.apply(df=df_full, n_parallel=8)

np.array_equal(lm1, lm2)
# False
```

Is there anything I am missing?
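A quick way to narrow down where the two label matrices disagree is to compare them row by row (a minimal sketch; `lm1` and `lm2` here are small stand-in arrays, not the matrices from the notebook):

```python
import numpy as np

# Hypothetical label matrices from the serial and parallel appliers
# (-1 denoting an abstain, as in Snorkel's convention)
lm1 = np.array([[0, 1, -1], [1, -1, 0], [-1, 0, 1]])
lm2 = np.array([[0, 1, -1], [1, 0, 0], [-1, 0, 1]])

# Indices of rows where any labeling function output differs
diff_rows = np.where((lm1 != lm2).any(axis=1))[0]
print(diff_rows)  # [1]
```

Inspecting `df_full.iloc[diff_rows]` would then show which data points the two appliers labeled differently.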

@ajratner
Contributor

Hi @srimugunthan, thanks for surfacing this! At the moment, the master branch version of Snorkel is not configured to support multi-label, though we've certainly applied Snorkel in that setting (e.g. https://www.snorkel.org/blog/superglue via a multi-task formulation). So I'm not surprised there are some issues here. Perhaps, since Snorkel's label model expects a single label, it's just taking e.g. the last one per data point, but this order gets shuffled when applied in parallel?

Either way, we'll look into this to make sure it's not an issue with PandasParallelLFApplier. If, as I suspect, it's just an issue with multi-label support, we'll put it on the roadmap!

srimugunthan commented May 5, 2020

@ajratner @henryre

  1. I have checked in the spam classification example code with PandasParallelLFApplier and the plain PandasLFApplier: https://github.com/srimugunthan/snorkeldebugging/blob/master/spamClassify.ipynb
    I do see the label matrices are different, although the summary metrics are the same.

  2. Isn't the multi-task formulation for hierarchical labelling? For the multilabel (same level, many labels) case, we used the approach suggested in this article: https://towardsdatascience.com/using-snorkel-for-multi-label-annotation-cc2aa217986a We look at the label model's prediction probability values and pick additional labels whose probabilities are close to that of the maximum-probability class. Let me know if this approach can be followed.

  3. In the original example notebook I shared (https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb), I see the single-label accuracy shrink by 4 to 6% when multilabel data is added. This is not much, and I'm not sure it qualifies as an issue. But you can reproduce it from the notebook and let us know your comments.
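The probability-threshold idea from point 2 can be sketched as follows (purely illustrative; the `probs` array and the `margin` tolerance are hypothetical, not values from the notebooks, and in practice `probs` would come from `label_model.predict_proba(L)`):

```python
import numpy as np

# Hypothetical per-class probabilities for two data points
probs = np.array([
    [0.48, 0.47, 0.05],  # two classes nearly tied -> treat as multi-label
    [0.90, 0.05, 0.05],  # one clear winner -> single label
])

margin = 0.05  # hypothetical tolerance for "close to the maximum"

# For each row, keep every class within `margin` of the max probability
multi_labels = [np.where(row >= row.max() - margin)[0].tolist() for row in probs]
print(multi_labels)  # [[0, 1], [0]]
```

The choice of `margin` controls how readily extra labels are attached; too large a value would assign spurious labels to confidently single-labelled points.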

henryre commented May 17, 2020

Hi @srimugunthan, sorry for the delayed reply! In response to the PandasParallelLFApplier issue, I've opened up #1589. In the meantime, you can either use the standard PandasLFApplier or sort the index of the original DF before using the PandasParallelLFApplier so that the index matches.
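The sort-the-index workaround can be sketched like this (pandas only; the data is a stand-in, and the commented-out applier call is illustrative rather than a tested invocation):

```python
import pandas as pd

# Hypothetical frame with a non-monotonic index, as can happen after pd.concat
df_full = pd.concat([
    pd.DataFrame({"text": ["a", "b"]}, index=[3, 1]),
    pd.DataFrame({"text": ["c", "d"]}, index=[0, 2]),
])

# Workaround from this thread: sort the index before the parallel applier
df_sorted = df_full.sort_index()
assert df_sorted.index.is_monotonic_increasing

# L_parallel = PandasParallelLFApplier(lfs).apply(df_sorted, n_parallel=8)
```

With a sorted index, the chunks processed in parallel reassemble in the same row order the serial applier would produce.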

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
