How to create training data for NER task using snorkel ? #1254

thak123 · 2019-07-10T08:17:21Z

I want to create a dataset using Snorkel labeling functions, but I am not able to find any links.
I want to train an NER model using that data.

Can anyone tell me how to proceed?

Mageswaran1989 · 2019-08-09T09:07:31Z

@thak123 Follow the link in #838.
You will find the following notebooks:
1. Crowdsourced_Sentiment_Analysis
2. Categorical_Classes

But I am doubtful about tagging table data from PDFs/receipts.

hpeiyan · 2019-08-20T02:28:57Z

Hi Mageswaran. The notebook links you posted (Crowdsourced_Sentiment_Analysis, Categorical_Classes) come back as not found.

ajratner · 2019-09-01T02:01:53Z

Hi @thak123 while you can hopefully look at some of the existing tutorials to help you in the interim, we're actually planning to release an NER-specific tutorial soon! Marking as "feature request" and will leave open till this is done

marctorsoc · 2019-10-28T16:02:26Z

Hi @ajratner, I'm quite interested in this feature. Do you have an expected timeline for the release of those tutorials? Not a hard deadline, just a rough idea of whether it is weeks, months, or years away.

christopheratfarmjournal · 2019-10-30T16:11:22Z

Hi @ajratner, I'm very interested in this feature. Any idea when the tutorial may be released? Here we are 2 months after your previous mention . . . does it still look months away?

maciejbiesek · 2019-11-20T11:50:14Z

Any update on this issue?

pfllo · 2019-11-21T08:18:26Z

I found two papers on the Snorkel resources page that tackle the NER task.
The SwellShark paper handles the overlapping candidate problem in NER using the maximum marginal likelihood approach.
The MeTaL paper uses the matrix completion-style approach, but I can't find any details on how it handles the overlapping candidate problem in NER.
@ajratner Could you give some hints on how to handle the overlapping candidate problem in the matrix completion-style approach, so that we can try out the NER task before the tutorial comes out?

thak123 · 2020-02-19T12:10:55Z

any update on this issue ?

blah-crusader · 2020-02-27T09:33:38Z

Also interested.. C'mon guys! :D

jason-fries · 2020-02-27T20:48:47Z

The simplest way to do NER/sequence labeling using the off-the-shelf Snorkel label model is to assume each token is independent and define your label matrix L as tokens × LFs. Mechanistically, you can materialize this matrix however you like, but conceptually it's cleaner to define your LFs as accepting sequences as input and returning a vector of token labels. These sequence LFs can apply regular expressions, dictionary matching, arbitrary heuristics, etc. as per typical Snorkel labeling functions.
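
As an illustration of a regular-expression sequence LF, here is a minimal sketch. The pattern `IN_PLACE` and the function name `lf_regex_location` are hypothetical and not part of Snorkel; the LF returns a dict of token index to label, the same shape the dictionary-based LFs in this comment use.

```python
import re

ABSTAIN = -1
LOCATION = 1

# Hypothetical pattern: "in <Capitalized word>" as a weak cue for a location.
IN_PLACE = re.compile(r"\bin ([A-Z][a-z]+)")

def lf_regex_location(tokens):
    """Return {token_index: label}; tokens overlapping a regex match get LOCATION."""
    labels = [ABSTAIN] * len(tokens)
    # Record the character span of every token in the space-joined sentence.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the joining space
    text = " ".join(tokens)
    for match in IN_PLACE.finditer(text):
        start, end = match.span(1)
        for i, (tok_start, tok_end) in enumerate(spans):
            # Label a token if it overlaps the captured group.
            if tok_start < end and tok_end > start:
                labels[i] = LOCATION
    return dict(enumerate(labels))

tokens = "Apple, Inc. is headquartered in Cupertino, California .".split()
labels = lf_regex_location(tokens)
```

Mapping character spans back to token indices keeps the LF independent of the tokenizer, as long as the tokens are joined with single spaces.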

For example, you could build a very simple NER LOCATION model (using binary/IO tagging) as follows:

import numpy as np
from scipy.sparse import dok_matrix, vstack, csr_matrix

ABSTAIN = -1
LOCATION = 1
NOT_LOCATION = 0

# helper functions
def dict_match(sentence, dictionary, max_ngrams=4):
   m = {}
   for i in range(len(sentence)):
       for j in range(i+1, min(len(sentence), i + max_ngrams) + 1):
           term = ' '.join(sentence[i:j])
           if term in dictionary:
               # sentence[i:j] covers tokens i..j-1, so mark exactly those
               m.update({idx: 1 for idx in range(i, j)})
   return m
           
def create_token_L_mat(Xs, Ls, num_lfs):
   """
   Create token-level LF matrix from LFs indexed by sentence
   """
   Yws = []
   for sent_i in range(len(Xs)):
       ys = dok_matrix((len(Xs[sent_i]), num_lfs))
       for lf_i in range(num_lfs):
           for word_i,y in Ls[sent_i][lf_i].items():
               ys[word_i, lf_i] = y
       Yws.append(ys)
   return csr_matrix(vstack(Yws))
  
# labeling functions
def LF_is_location(s):
   locations = {"Big Apple", "Cupertino", "Cupertino, California", "California"}
   matches = dict_match(s, locations)
   return {i:LOCATION if i in matches else ABSTAIN for i in range(len(s))}
   
def LF_is_company(s):
   companies = {"Apple", "Apple, Inc."}
   matches = dict_match(s, companies)
   return {i:NOT_LOCATION if i in matches else ABSTAIN for i in range(len(s))}

def LF_is_titlecase(s):
   return {i:LOCATION if s[i][0].isupper() else ABSTAIN for i in range(len(s))}

# training set
sents = [
   "Apple, Inc. is headquartered in Cupertino, California .".split(),
   "Explore the very best of the Big Apple .".split(),
]

lfs = [
   LF_is_location,
   LF_is_company,
   LF_is_titlecase
]

# apply labeling functions and transform label matrix 
L = [[lf(s) for lf in lfs] for s in sents] 
L = create_token_L_mat(sents, L, len(lfs))

# train your Snorkel label model

Generating weakly labeled sequence data then just requires some bookkeeping to split your predicted token probabilities back into their original sequences.
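
The bookkeeping can be sketched as follows, assuming the label model returns one row of class probabilities per token, in the same order the sentences were stacked. The helper name `split_by_sentence` is hypothetical, and the probabilities here are placeholders rather than real label model output.

```python
import numpy as np

def split_by_sentence(token_probs, sents):
    """Split a (num_tokens, num_classes) array back into per-sentence arrays."""
    lengths = [len(s) for s in sents]
    assert sum(lengths) == len(token_probs), "row count must equal total tokens"
    # Sentence boundaries are the cumulative lengths, without the final total.
    boundaries = np.cumsum(lengths)[:-1]
    return np.split(token_probs, boundaries)

sents = [
    "Apple, Inc. is headquartered in Cupertino, California .".split(),
    "Explore the very best of the Big Apple .".split(),
]
# Placeholder probabilities standing in for label model output.
num_tokens = sum(len(s) for s in sents)
token_probs = np.full((num_tokens, 2), 0.5)
per_sentence = split_by_sentence(token_probs, sents)
```

Each element of `per_sentence` then lines up row by row with the tokens of the corresponding sentence.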

When training the end model, you can either mask tokens that don't have any LF coverage or assume some prior (e.g., all tags are equally likely) and train a BERT, BiLSTM, etc. model.
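
Both options can be sketched with plain NumPy on a dense token-level label matrix. The helper names `coverage_mask` and `fill_uncovered_with_uniform` are hypothetical, and the small arrays are made-up inputs for illustration only.

```python
import numpy as np

ABSTAIN = -1

def coverage_mask(L):
    """True for tokens that at least one LF labeled (did not abstain on)."""
    return (L != ABSTAIN).any(axis=1)

def fill_uncovered_with_uniform(probs, mask):
    """Replace the rows of uncovered tokens with a uniform prior over tags."""
    out = probs.copy()
    out[~mask] = 1.0 / probs.shape[1]
    return out

# Made-up label matrix (3 tokens x 3 LFs) and token probabilities.
L = np.array([[1, -1, 1], [-1, -1, -1], [-1, 0, -1]])
probs = np.array([[0.1, 0.9], [0.3, 0.7], [0.8, 0.2]])

mask = coverage_mask(L)                            # option 1: mask the loss with this
filled = fill_uncovered_with_uniform(probs, mask)  # option 2: uniform prior
```

With the mask, uncovered tokens are simply excluded from the end model's loss; with the prior, they stay in the loss but carry no preference for any tag.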

As @pfllo pointed out, there are dependencies between tokens that we would also like to capture in the label model. The papers Multi-Resolution Weak Supervision for Sequential Data and Weakly Supervised Sequence Tagging from Noisy Rules do handle this (both have code available). In practice, however, treating tokens independently and using the default Snorkel label model works surprisingly well, especially if you come from a domain with rich knowledge base and dictionary resources such as biomedicine, geography, etc.

ajratner · 2020-02-27T21:09:18Z

@jason-fries thanks so much! Just to make sure it's clear: this repo has been generously maintained primarily by researchers like @jason-fries, and we are in general very capacity limited in terms of major changes to the repo. As such, we currently don't have a timeline on an NER tutorial. Contributions are very welcome though!

To additionally be clear: our policy for the issues page is that questions and comments are great, but demands such as "cmon guys" are not appropriate usage. Thanks for your understanding!

ajratner · 2020-02-27T22:06:47Z

And also just to be very clear: we all really want to put more stuff out here... we're working on it, and so grateful to all of you on the issues page for your patience, enthusiasm, and support in trying Snorkel out in the meantime!!! :)

blah-crusader · 2020-02-28T08:24:22Z

Thanks a lot for this response @jason-fries ! @ajratner apologies for coming across impatient/rude; I've been really amazed by the current release, and the corresponding research papers and did not mean anything other than: "I'm also super interested in staying up to date on the topic".

Thanks!

marctorsoc · 2020-02-28T09:25:14Z

@jason-fries thanks for this. I already experimented with a similar approach in the past, but it's really useful to have confirmation that it actually works quite well and that, given enough resources, there's not much difference compared to something specific to sequence data 👍

raj5287 · 2020-07-20T05:41:11Z

@jason-fries thanks for this, but could you please explain how to train the MajorityLabelVoter or LabelModel? I am getting an error with both of these, and even with LFAnalysis(L=L_, lfs=lfs).lf_summary(). I am guessing this is because of the sparse matrix, since the error is: NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported. Could you please help me out with what to do next?

alvin-c-shih · 2020-10-16T17:08:41Z

@raj5287 MajorityLabelVoter requires L be integer type. LFAnalysis requires the matrix be dense. Other operations would prefer np.array instead of np.matrix.

Try this as a tactical fix:

L = np.asarray(L.astype(np.int8).todense())
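
To show what the conversion produces, here is a small sketch that densifies a sparse label matrix and runs a per-token majority vote over it. The `majority_vote` function is a plain NumPy stand-in written for illustration, not Snorkel's MajorityLabelVoter, and the 3 x 3 matrix is made-up data.

```python
import numpy as np
from scipy.sparse import csr_matrix

ABSTAIN = -1

def majority_vote(L, cardinality=2):
    """Majority vote per token over non-abstaining LFs; ties and no votes give ABSTAIN."""
    preds = np.full(L.shape[0], ABSTAIN, dtype=int)
    for row_i, row in enumerate(L):
        votes = row[row != ABSTAIN]
        if votes.size == 0:
            continue
        counts = np.bincount(votes, minlength=cardinality)
        winners = np.flatnonzero(counts == counts.max())
        if winners.size == 1:
            preds[row_i] = winners[0]
    return preds

# Made-up sparse float matrix, like the one create_token_L_mat returns.
L_sparse = csr_matrix(np.array([[1.0, -1.0, 1.0],
                                [-1.0, 0.0, 1.0],
                                [-1.0, -1.0, -1.0]]))

# The tactical fix: integer dtype, dense, and a plain ndarray (not np.matrix).
L_dense = np.asarray(L_sparse.astype(np.int8).todense())
preds = majority_vote(L_dense)
```

After this conversion, `L_dense` is an integer `np.ndarray` with ABSTAIN stored explicitly as -1.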

rjurney · 2022-07-13T02:10:06Z

The thing to do here is to use skweak, not Snorkel. Snorkel is a commercial tool now, and investments in this area are going into other projects.

Github - https://github.com/NorskRegnesentral/skweak
PyPi - https://pypi.org/project/skweak/
Paper - skweak: Weak Supervision Made Easy for NLP - https://arxiv.org/abs/2104.09683
Rubrix and skweak - https://rubrix.readthedocs.io/en/stable/tutorials/skweak.html

paroma added the Q&A label Jul 18, 2019

ajratner added feature request and removed Q&A labels Sep 1, 2019

vincentschen added the no-stale Auto-stale bot skips this issue label Nov 18, 2019

jason-fries mentioned this issue Apr 12, 2021

Tagging sequence markup for entity extraction #810

Closed

Luceliafn mentioned this issue Sep 20, 2021

Ler Artigos da ISSUE 1254 UnB-KnEDLe/timeline-contratos#52

Closed

vitorararuna mentioned this issue Sep 20, 2021

[1] Realizar testes para a extração de contratos com vários pdfs e estudas issue snorkel-ner UnB-KnEDLe/timeline-contratos#53

Closed
