- 1. x-tagger Dataset
- 2. Models
- 3. Metrics
- 4. xtagger.EnglishRegExTagger
- 5. PyTorch Sequence Labeling Wrapper
- 6. xtagger.Checkpointing
x-tagger dataset is basically the simplest possible dataset format for token classification. It is a list of sentences, where each sentence is a list of token/tag tuple pairs:
[
[('This', 'DET'),
('phrase', 'NOUN'),
('once', 'ADV'),
('again', 'ADV')
],
[('their', 'PRON'),
('employees', 'NOUN'),
('help', 'VERB'),
('themselves', 'PRON')
]
]
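Since this is plain Python data, you can inspect it with nothing but the standard library:

```python
from collections import Counter

# the two-sentence dataset shown above
dataset = [
    [("This", "DET"), ("phrase", "NOUN"), ("once", "ADV"), ("again", "ADV")],
    [("their", "PRON"), ("employees", "NOUN"), ("help", "VERB"), ("themselves", "PRON")],
]

print(len(dataset))    # 2 sentences
print(dataset[0][1])   # ('phrase', 'NOUN')

# tag frequencies, computed with ordinary list/tuple operations
tag_counts = Counter(tag for sentence in dataset for _, tag in sentence)
print(tag_counts)      # Counter({'NOUN': 2, 'ADV': 2, 'PRON': 2, 'DET': 1, 'VERB': 1})
```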
x-tagger dataset does not have cool methods like `.map()`, `.build_vocab()`, or `.get_batch_without_pads()`. It is just a plain Python list. Two questions, then: how can you use it with complex models, and how do you get this form from a custom dataset?
from xtagger import df_to_xtagger_dataset
import pandas as pd
df_train = pd.read_csv("/path/to/train.csv")
df_test = pd.read_csv("/path/to/test.csv")
data_train = df_to_xtagger_dataset(df_train)
data_test = df_to_xtagger_dataset(df_test)
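To get the x-tagger form from a custom dataset, your DataFrame needs the tokens and tags per sentence. The sketch below is an illustration only: the column names mirror the "sentences"/"tags" layout shown later for `xtagger_dataset_to_df`, and the exact schema `df_to_xtagger_dataset` expects is an assumption here:

```python
import pandas as pd
from xtagger import df_to_xtagger_dataset

# Hypothetical schema: one row per sentence, token and tag lists aligned
df = pd.DataFrame({
    "sentences": [["This", "phrase", "once", "again"]],
    "tags": [["DET", "NOUN", "ADV", "ADV"]],
})
data = df_to_xtagger_dataset(df)
```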
The NLTK Penn Treebank is the most widely used treebank for POS tagging:
import nltk
from sklearn.model_selection import train_test_split

nltk.download("treebank")          # download once if the corpus is missing
nltk.download("universal_tagset")  # needed for tagset='universal'

nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(nltk_data, train_size=0.8, test_size=0.2, random_state=2112)
Now the `nltk_data` variable is in x-tagger dataset form. Throughout this documentation, the `train_set` and `test_set` variables refer to the ones defined here.
xtagger.xtagger_dataset_to_df(dataset, row_as_list = False)

- `dataset`: an x-tagger dataset.
- `row_as_list`: if True, returns each sample as a list.
Example:
from xtagger import xtagger_dataset_to_df
df_train = xtagger_dataset_to_df(train_set, row_as_list = True)
df_test = xtagger_dataset_to_df(test_set, row_as_list = False)
| sentences | tags |
|---|---|
| ["This", "phrase", "once", "again"] | ["DET", "NOUN", "ADV", "ADV"] |
| ["their", "employees", "help", "themselves"] | ["PRON", "NOUN", "VERB", "PRON"] |
Both calls return `df`, a pandas DataFrame with columns "sentences" and "tags".
The easiest way to train PyTorch models is through torchtext. You can train any token classification model that supports torchtext by converting the simple x-tagger dataset into a torchtext dataset. The procedure has two steps: first convert the x-tagger dataset to a `pandas.DataFrame`, then convert the `pandas.DataFrame` to a `torchtext.data.BucketIterator`.
xtagger.df_to_torchtext_data(df_train, df_test, device, batch_size, pretrained_embeddings = False)
- `df_train`: pandas DataFrame with `row_as_list = False`.
- `df_test`: pandas DataFrame with `row_as_list = False`.
- `device`: hardware variable `torch.device`.
- `batch_size`: batch size for both df_train and df_test.
- `pretrained_embeddings`: default glove.6B.100d embeddings (not tested).
- returns: three `torchtext.data.iterator.BucketIterator` objects for train, validation, and test, and two `torchtext.data.field.Field` objects for the TEXT and TAG vocabularies.
Example:
import torch
from xtagger import xtagger_dataset_to_df
from xtagger import df_to_torchtext_data
df_train = xtagger_dataset_to_df(train_set)
df_test = xtagger_dataset_to_df(test_set)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator, valid_iterator, test_iterator, TEXT, TAGS = df_to_torchtext_data(
df_train,
df_test,
device,
batch_size=32
)
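The returned `Field` objects carry the vocabularies. The sizes and pad indices that the models below expect can be read off directly:

```python
# Vocabulary sizes and pad indices, read from the returned Field objects
# (torchtext legacy Field API, as used throughout this documentation)
input_dim = len(TEXT.vocab)
output_dim = len(TAGS.vocab)
tag_pad_idx = TAGS.vocab.stoi[TAGS.pad_token]
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
print(input_dim, output_dim, tag_pad_idx, pad_idx)
```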
xtagger.df_to_hf_dataset(df, tags, tokenizer, device)

- `df`: pandas DataFrame with `row_as_list = True`.
- `tags`: a list of tags in the dataset.
- `tokenizer`: an object of `transformers.AutoTokenizer`.
- `device`: hardware variable `torch.device`.
Example:
import torch
from transformers import AutoTokenizer
from xtagger import xtagger_dataset_to_df
from xtagger import df_to_hf_dataset

df_train = xtagger_dataset_to_df(train_set, row_as_list=True)
df_test = xtagger_dataset_to_df(test_set, row_as_list=True)

train_tagged_words = [tup for sent in train_set for tup in sent]
tags = {tag for word, tag in train_tagged_words}
tags = list(tags)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("./path_to_tokenizer")

dataset_train = df_to_hf_dataset(df_train, tags, tokenizer, device)
dataset_test = df_to_hf_dataset(df_test, tags, tokenizer, device)
x-tagger supports many algorithms! From deep learning to computational linguistics: x-tagger has LSTMs, BERTs, and different types of Hidden Markov Models. Besides all that, you can train any PyTorch model for POS tagging with the x-tagger PyTorchTrainer wrapper! Before diving into those model types and wrappers, we introduce the basics of x-tagger models.
- `extend_to`: type of HiddenMarkovModel. Current implementations: ["bigram", "trigram", "deleted_interpolation"].
- `language`: language of the model. Not strictly required, but good practice to set.
- `morphological`: an `xtagger.[Language]RegexTagger` object with the `mode = "morphological"` parameter.
- `prior`: an `xtagger.[Language]RegexTagger` object with the `mode = "prior"` parameter.

HiddenMarkovModel.fit(train_set, start_token = ".")
- `train_set`: x-tagger dataset for training.
- `start_token`: start token in your training tags.
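For instance, fitting a bigram model on the Penn Treebank split from above (see also the full example further below):

```python
from xtagger import HiddenMarkovModel

# Fit a bigram HMM on the train split created earlier
model = HiddenMarkovModel(extend_to="bigram", language="en")
model.fit(train_set)
```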
HiddenMarkovModel.evaluate(test_set, random_size = 30, seed = None, eval_metrics=["acc"], result_type = "%", morphological = True, prior = True)
- `test_set`: x-tagger dataset for evaluation.
- `random_size`: number of randomly selected samples to evaluate on, for efficiency.
- `seed`: random seed.
- `eval_metrics`: evaluation metrics. See more at the `xtagger.utils.metrics` section.
- `result_type`: "%" for percentages, otherwise decimal numbers.
- `morphological`: whether to use the initialized `xtagger.[Language]RegexTagger`. The flexibility here is that you can pass the tagger at initialization but still disable it with `morphological = False`.
- `prior`: whether to use the initialized `xtagger.[Language]RegexTagger`. The flexibility here is that you can pass the tagger at initialization but still disable it with `prior = False`.
HiddenMarkovModel.predict(words, morphological = False, prior = False)
- `words`: list of words for your sentence.
- `morphological`: whether to use the initialized `xtagger.[Language]RegexTagger`.
- `prior`: whether to use the initialized `xtagger.[Language]RegexTagger`.
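Assuming the fitted model from above:

```python
# Tag a tokenized sentence with the fitted HMM
predictions = model.predict(["They", "help", "themselves"])
print(predictions)
```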
Other helper methods:

- HiddenMarkovModel.set_test_set(test)
- HiddenMarkovModel.get_tags(test)
- HiddenMarkovModel.get_start_token(test)
- HiddenMarkovModel.get_extension(test)
- HiddenMarkovModel.get_transition_matrix(test)
- HiddenMarkovModel.get_eval_metrics(test)
- HiddenMarkovModel.get_metric_onehot_indexing(test)
- HiddenMarkovModel.eval_indices(test)
Note: evaluation takes much more time than fitting, because of the complexity of Viterbi decoding. A modest `random_size` can still give convergent results, by the law of large numbers. Likewise, the complexity of trigram and deleted_interpolation models is higher than that of bigram. We will release benchmarks of x-tagger.
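For example, evaluating on a random subsample rather than the full test set:

```python
# Evaluate on 30 random test sentences instead of the whole test set
results = model.evaluate(test_set, random_size=30, seed=2112, eval_metrics=["acc"])
```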
xtagger.LSTMForTagging is initialized with the following parameters:

- `input_dim`: size of your vocabulary. Should be `len(TEXT.vocab)`.
- `output_dim`: number of tags + the <pad> token. Should be `len(TAGS.vocab)`.
- `TEXT`: word vocabulary with `torchtext.data.field.Field`.
- `TAGS`: tag vocabulary with `torchtext.data.field.Field`.
- `embedding_dim`: embedding dimension of `torch.nn.Embedding`.
- `hidden_dim`: hidden dimension of `torch.nn.LSTM`.
- `n_layers`: number of LSTM layers.
- `bidirectional`: True for a bidirectional LSTM, else False.
- `dropout`: dropout rate for the fully connected output network.
- `cuda`: whether to use CUDA; set False if you have CUDA but do not want to use it.
- `tag_pad_idx`: tag pad index. Should be `TAGS.vocab.stoi[TAGS.pad_token]`.
- `pad_idx`: text pad index. Should be `TEXT.vocab.stoi[TEXT.pad_token]`.

LSTMForTagging.fit(train_set, test_set, epochs = 10, eval_metrics = ["acc"], result_type = "%", checkpointing = None)
- `train_set`: training dataset with `torchtext.data.iterator.BucketIterator`.
- `test_set`: evaluation dataset with `torchtext.data.iterator.BucketIterator`.
- `epochs`: number of epochs for training.
- `eval_metrics`: evaluation metrics. See more at the `xtagger.utils.metrics` section.
- `result_type`: "%" for percentages, otherwise decimal numbers.
- `checkpointing`: checkpointing object. See more at the `xtagger.utils.callbacks.Checkpointing` section.
LSTMForTagging.evaluate(test_set = None, eval_metrics = ["acc"], result_type = "%")
- `test_set`: test dataset with `torchtext.data.iterator.BucketIterator`. If None, automatically uses the `test_set` from initialization.
- `eval_metrics`: evaluation metrics. See more at the `xtagger.utils.metrics` section.
- `result_type`: "%" for percentages, otherwise decimal numbers.
LSTMForTagging.predict(sentence)

- `sentence`: a list of words or a single string.
- returns the zipped words and tags, plus the unknown words:

zipped, unk = model.predict("hello world")
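Putting the pieces together, a minimal training sketch. The keyword-argument call below is an assumption based on the parameter list above, and the hyperparameter values are arbitrary; the iterators and fields come from `df_to_torchtext_data`:

```python
import torch
from xtagger import LSTMForTagging

model = LSTMForTagging(
    input_dim=len(TEXT.vocab),
    output_dim=len(TAGS.vocab),
    TEXT=TEXT,
    TAGS=TAGS,
    embedding_dim=100,
    hidden_dim=128,
    n_layers=2,
    bidirectional=True,
    dropout=0.25,
    cuda=torch.cuda.is_available(),
    tag_pad_idx=TAGS.vocab.stoi[TAGS.pad_token],
    pad_idx=TEXT.vocab.stoi[TEXT.pad_token],
)
model.fit(train_iterator, test_iterator, epochs=10, eval_metrics=["acc"])
zipped, unk = model.predict("hello world")
```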
BERT models: not tested yet :( Working on a new release.
`xtagger.utils.metrics` is a module that is hidden from the user. It provides metrics during training and evaluation. There are 8 built-in metrics you can choose for the `eval_metrics` parameter of the fit and initialization methods:
- avg_f1
- avg_precision
- avg_recall
- acc
- classwise_f1
- classwise_precision
- classwise_recall
- report
Metrics that start with "avg" (f1, precision, recall) provide micro, macro, and weighted calculations. Metrics that start with "classwise" (f1, precision, recall) return per-class metrics, for example the F1 score of the "ADV" tag. The "report" metric prints a beautiful `sklearn.metrics.classification_report`. For LSTM and BERT, these metrics are calculated for both the train set and the eval set during training; `model.evaluate` returns them only for the test set. Each score is provided as a dictionary.
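For example, requesting several metrics in one evaluation call (a sketch; the exact keys of the returned dictionary may differ):

```python
# Request accuracy, averaged F1, and per-class F1 at once
results = model.evaluate(test_set, eval_metrics=["acc", "avg_f1", "classwise_f1"])
print(results)  # a dictionary of scores
```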
User might want to calculate a specific metric for the task. `xtagger.utils.metrics.xMetrics` handles it. A subclass is initialized with:

- `y_true`: think of it as a one-hot representation of each class.
- `y_pred`: think of it as a one-hot representation of each class.
- `tags`: not necessary, but can be useful for beautiful printing.
Example: accuracy from scratch for the HMM.
import numpy as np

from xtagger import xMetrics
from xtagger import HiddenMarkovModel

class MyAcc(xMetrics):
    def __init__(self, y_true, y_pred, tags):
        super(MyAcc, self).__init__(y_true, y_pred, tags)

    def __call__(self):
        acc = 0
        for gt, pred in zip(self.y_true, self.y_pred):
            gt_index = np.where(gt == 1)[0].item()
            pred_index = np.where(pred == 1)[0].item()
            if gt_index == pred_index:
                acc += 1
        return acc / self.y_true.shape[0]

model = HiddenMarkovModel(
    extend_to = "bigram",
    language = "en"
)
model.fit(train_set)
model.evaluate(
    test_set,
    random_size = -1,
    seed = 15,
    eval_metrics = ['acc', 'avg_f1', MyAcc],
    result_type = "%"
)
Handling unknown words and the computation time of an HMM can be stressful. `xtagger.EnglishRegExTagger` is a helper module for both. With its morphological mode, it handles unknown words at evaluation time, words that were never seen in the train set; this increases the precision of the task. With its prior mode, it cuts the computational complexity of decoding: if a prior regular expression rule matches a word, the HMM does not compute probabilities for that word and assigns the RegEx tag instead. But everything has a price: with a weak prior, you can end up with poor precision.
- `rules`: a list of rules, given as tuples.
- `use_default`: default RegEx matching: [(r'.*ing$', 'VERB'), (r'.*ed$', 'VERB'), (r'.*es$', 'VERB'), (r'.*\'s$', 'NOUN'), (r'.*s$', 'NOUN'), (r'\*T?\*?-[0-9]+$', 'X'), (r'^-?[0-9]+(.[0-9]+)?$', 'NUM'), (r'.*', 'NOUN')]
- `mode`: "prior" or "morphological".
Example:
from xtagger import EnglishRegExTagger
from xtagger import HiddenMarkovModel

rules = [
    (r'.*ing$', 'VERB'),
    (r'.*ed$', 'VERB'),
    (r'.*es$', 'VERB')
]

morphological_tagger = EnglishRegExTagger(
    rules = rules,
    use_default = False,
    mode = "morphological"
)

model = HiddenMarkovModel(
    extend_to = "bigram",
    language = "en",
    morphological = morphological_tagger
)

model.fit(train_set)
model.evaluate(
    test_set,
    random_size = -1,
    seed = 15,
    eval_metrics = ['acc', 'avg_f1'],
    result_type = "%",
    morphological = True,
)
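The prior mode is set up the same way; a minimal sketch mirroring the example above:

```python
from xtagger import EnglishRegExTagger, HiddenMarkovModel

# Words matching a prior rule skip Viterbi probability computation entirely
prior_tagger = EnglishRegExTagger(
    rules = [(r'.*ing$', 'VERB'), (r'^-?[0-9]+(.[0-9]+)?$', 'NUM')],
    use_default = False,
    mode = "prior"
)

model = HiddenMarkovModel(
    extend_to = "bigram",
    language = "en",
    prior = prior_tagger
)
model.fit(train_set)
model.evaluate(test_set, random_size = 30, eval_metrics = ['acc'], prior = True)
```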
User might want to develop and train a novel POS-tagging model. `xtagger.PyTorchTrainer` handles this.
xtagger.PyTorchTrainer(model, optimizer, criterion, train_iterator, val_iterator, TEXT, TAGS, device, eval_metrics = ["acc"], checkpointing = None, result_type = "%")
- `model`: a `torch.nn.Module` model obeying the `batch_first = True` rule.
- `optimizer`: an object from `torch.optim`.
- `criterion`: a loss function from `torch.nn`.
- `train_iterator`: training dataset with `torchtext.data.iterator.BucketIterator`.
- `val_iterator`: evaluation dataset with `torchtext.data.iterator.BucketIterator`.
- `TEXT`: word vocabulary with `torchtext.data.field.Field`.
- `TAGS`: tag vocabulary with `torchtext.data.field.Field`.
- `device`: hardware variable `torch.device`.
- `eval_metrics`: evaluation metrics. See more at the `xtagger.utils.metrics` section.
- `checkpointing`: checkpointing object. See more at the `xtagger.utils.callbacks.Checkpointing` section.
- `result_type`: "%" for percentages, otherwise decimal numbers.

PyTorchTrainer.train(epochs = 10)

- `epochs`: number of epochs for training.
- This method computes metrics for both train and validation sets, like the other `.fit()` methods.
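A minimal usage sketch: the tiny model below is illustrative only (not part of x-tagger), and the iterators and fields come from `df_to_torchtext_data` above:

```python
import torch
import torch.nn as nn
from xtagger import PyTorchTrainer

class TinyTagger(nn.Module):
    """A deliberately small batch_first tagger: embedding -> linear over tags."""
    def __init__(self, vocab_size, n_tags, emb_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, n_tags)

    def forward(self, x):             # x: [batch, seq_len]
        return self.out(self.emb(x))  # [batch, seq_len, n_tags]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyTagger(len(TEXT.vocab), len(TAGS.vocab)).to(device)

trainer = PyTorchTrainer(
    model = model,
    optimizer = torch.optim.Adam(model.parameters()),
    criterion = nn.CrossEntropyLoss(ignore_index=TAGS.vocab.stoi[TAGS.pad_token]),
    train_iterator = train_iterator,
    val_iterator = valid_iterator,
    TEXT = TEXT,
    TAGS = TAGS,
    device = device,
    eval_metrics = ["acc"]
)
trainer.train(epochs = 10)
```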
xtagger.Checkpointing(model_path, model_name, monitor = None, mode = "maximize", save_best = True, verbose = 0)
- `model_path`: path to save checkpoints.
- `model_name`: checkpoint file name. Should end with ".pt".
- `monitor`: metric to monitor for saving the best model. Must be one of:
  - train_loss
  - train_acc
  - train_avg_f1
  - train_avg_precision
  - train_avg_recall
  - eval_loss
  - eval_acc
  - eval_avg_f1
  - eval_avg_precision
  - eval_avg_recall
- `mode`: whether to maximize or minimize that metric.
- `save_best`: if False, saves the model at each epoch instead of only the best one.
- `verbose`: if true, prints the best model's monitored metric score.

Checkpointing.load(model)

- `model`: a PyTorch model.
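For example, keeping only the best model by evaluation accuracy during `LSTMForTagging.fit`; the path and file name are placeholders, and the assumption that `load` returns the restored model is noted in the comment:

```python
from xtagger import Checkpointing

checkpointing = Checkpointing(
    model_path = "./checkpoints/",
    model_name = "lstm_tagger.pt",
    monitor = "eval_acc",
    mode = "maximize",
    save_best = True,
    verbose = 1
)
model.fit(train_iterator, test_iterator, checkpointing = checkpointing)

# Restore the saved weights (assuming load returns the model)
model = checkpointing.load(model)
```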