Penelope

Natural Language Processing (NLP) and Machine Learning (ML) library for Elixir. Penelope provides a scikit-learn-inspired interface to the LIBSVM, LIBLINEAR, and CRFsuite C/C++ libraries in Elixir, which can be used for many ML/NLP applications.

Status

Hex CircleCI Coverage

The API reference is available here.

Installation

Dependencies

First, clone the project's submodules.

git submodule update --init

This package requires an implementation of BLAS for efficient matrix math. It can be installed on each platform as follows:

OSX

BLAS is built into OSX.

Alpine

Install openblas-dev via apk.

sudo apk add openblas-dev

Ubuntu

Install libblas-dev via apt.

sudo apt install libblas-dev

Hex

def deps do
  [
    {:penelope, "~> 0.4"}
  ]
end

Usage

Intent Classification/Entity Recognition

Penelope can be used to build a machine learning model that classifies natural language utterances by intent and extracts parameters from them. The Penelope.NLP.IntentClassifier module uses a classifier pipeline to recognize intents and a recognizer pipeline to extract named entities from the utterance. The following contrived example classifies intents based on the number of tokens in the utterance.

alias Penelope.NLP.IntentClassifier

pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}],
}
x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]
classifier = IntentClassifier.fit(%{}, x, y, pipeline)

{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)

Pipeline Definition

pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}],
}

This block configures the tokenizer, classifier, and recognizer pipelines used by the intent classifier. A pipeline in Penelope is a list of components, each paired with its configuration, that is used to train a machine learning model and to predict with it. The interface is similar to the one used in scikit-learn.

The tokenizer converts a string utterance into a sequence of tokens. In this example, we use the Penn Treebank tokenizer (:ptb_tokenizer). The tokenizer pipeline is run before either of the other two pipelines, so that they can share its output.

The classifier pipeline receives a tokenized utterance (x) and class labels (y) and learns a model that can predict the label from the utterance. In this example, we use a simple token count vectorizer (number of tokens in the utterance) and a logistic regression classifier to predict the class labels.
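To illustrate what a token-count feature looks like, here is a minimal sketch in plain Elixir. It does not call Penelope: the `tokenize` and `token_count` functions are hypothetical stand-ins, and whitespace splitting only approximates the Penn Treebank tokenizer, which also separates punctuation and contractions.

```elixir
# Stand-in tokenizer (assumption): splits on whitespace only.
# The real :ptb_tokenizer may produce different tokens for punctuation.
tokenize = fn utterance -> String.split(utterance) end

# Token-count feature: one number per utterance.
token_count = fn utterance -> utterance |> tokenize.() |> length() end

utterances = [
  "you have four pears",
  "three hundred apples would be a lot",
  "I have three bananas"
]

# One count per utterance, in the same order as the inputs.
counts = Enum.map(utterances, token_count)
IO.inspect(counts)
```

Under this approximation, the query utterance has the same token count as the first training utterance, which is the only signal a classifier built on this single feature can use.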

Finally, the recognizer pipeline receives a tokenized sequence (x) and sequence tags (y) and learns a model that can predict the tag of each token in the sequence. This allows the recognizer to extract slot values from natural language utterances. This example uses a Conditional Random Field (CRF) model, which can be thought of as a sequence extension of logistic regression, to tag the tokens in the utterance.

Training

x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]
classifier = IntentClassifier.fit(%{}, x, y, pipeline)

Inputs (x) to the intent classifier are simple natural language utterances. These inputs are tokenized and converted to feature vectors/maps as needed by the classifier/recognizer.

Each label (y) is a tuple of {intent, tags}. Here, intent is the class label of the intent for the corresponding x value, and tags is a list of token tags, one for each token in the utterance x. Tag labels are expressed using the Inside-Outside-Beginning (IOB) format. In the above snippet, the following are the token tags for the first utterance.

| token | tag     |
|-------|---------|
| you   | o       |
| have  | o       |
| four  | b_count |
| pears | b_fruit |
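To make the IOB convention concrete, the sketch below groups tagged tokens into entity values using plain Elixir. `IOBSketch` and `entities/2` are hypothetical names used for illustration and are not part of Penelope. Penelope's own decoding may differ, for example in how it joins multi-token entities or handles repeated entity names.

```elixir
defmodule IOBSketch do
  @moduledoc "Hypothetical helper (not part of Penelope): groups IOB-tagged tokens."

  # Walks the {token, tag} pairs once:
  #   "b_<name>" starts a new entity,
  #   "i_<name>" extends the entity that was started most recently,
  #   anything else ("o", or an "i_" tag with no matching start) is skipped.
  def entities(tokens, tags) when length(tokens) == length(tags) do
    tokens
    |> Enum.zip(tags)
    |> Enum.reduce([], fn
      {token, "b_" <> name}, acc ->
        [{name, [token]} | acc]

      {token, "i_" <> name}, [{current, parts} | rest] when current == name ->
        [{current, [token | parts]} | rest]

      {_token, _tag}, acc ->
        acc
    end)
    |> Enum.reverse()
    # Join multi-token entities with a space (assumption). If a name
    # occurs more than once, the last occurrence wins.
    |> Map.new(fn {name, parts} ->
      {name, parts |> Enum.reverse() |> Enum.join(" ")}
    end)
  end
end

IO.inspect(IOBSketch.entities(
  ["three", "hundred", "apples", "would", "be", "a", "lot"],
  ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]
))
```

For the second training utterance, the `b_count`/`i_count` pair is merged into a single `count` entity, while each `o` token is ignored.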

Prediction

{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)

The snippet above returns the following intents map and params map for the utterance. The intents map contains the posterior probability of each intent; together these sum to 1.0. The params map associates each entity name with the text extracted for it from the utterance, using the entity names specified in the training examples.

{
    %{
        "intent_1" => 0.6666666661872298,
        "intent_2" => 0.3333333338127702
    },
    %{
        "count" => "three",
        "fruit" => "bananas"
    }
}
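A caller will usually want a single intent rather than the full distribution. The sketch below, in plain Elixir, selects the most probable intent from a result of the shape shown above. The literal values are copied from the example output, and the `threshold` value is a hypothetical choice for illustration, not a Penelope setting.

```elixir
# Result shape copied from the example output above.
{intents, params} =
  {%{"intent_1" => 0.6666666661872298, "intent_2" => 0.3333333338127702},
   %{"count" => "three", "fruit" => "bananas"}}

# Pick the {intent, probability} pair with the highest posterior probability.
{best_intent, confidence} = Enum.max_by(intents, fn {_intent, prob} -> prob end)

# Hypothetical cutoff: treat low-confidence predictions as unrecognized.
threshold = 0.5

result =
  if confidence >= threshold do
    {:ok, best_intent, params}
  else
    :unknown
  end

IO.inspect(result)
```

Rejecting predictions below a cutoff is one simple way to avoid acting on utterances the model has not really recognized; a suitable value depends on the application.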

Improvements

Obviously, using the token count as the only feature to try to predict an intent is silly, and using only the input tokens to train the entity recognizer will not generalize well. For better classification/recognition, Penelope includes several feature generation components/vectorizers, including support for pretrained embeddings (word vectors) and regexes. Examples of these can be found in the API reference.

License

Copyright 2017 Pylon, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.