Conditional Random Field framework for Persian POS tagging.

MohammadForouhesh/crf-pos-persian

Persian Part-of-Speech Tagger

Mohammad H. Forouhesh

Metodata Inc ®

April 25, 2022

This repository contains a Persian part-of-speech tagger based on Conditional Random Fields, together with a native text normalizer.

Table of Contents

  1. TO-DO
  2. Docker
  3. Installation
    1. Using Pip
    2. From Source
    3. On CoLab
  4. Usage
  5. Implementation Details
  6. Evaluation
  7. How To Contribute

TO-DO:

Docker

A small interactive Docker container is provided for production use.

git clone https://github.com/MohammadForouhesh/crf-pos-persian.git
cd crf-pos-persian
docker-compose run --rm crf-pos

Installation:

Using Pip

$ pip install crf_pos

From Source

$ git clone https://github.com/MohammadForouhesh/crf-pos-persian 
$ cd crf-pos-persian
$ python setup.py install

On CoLab

! pip install git+https://github.com/MohammadForouhesh/crf-pos-persian.git

Usage

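from crf_pos.pos_tagger.wapiti import WapitiPosTagger

# Create the tagger.
pos_tagger = WapitiPosTagger()
# The input is a raw sentence; the built-in normalizer corrects half-spaces (ZWNJ) before tagging.
tokens = 'او رئیس‌جمهور حجتالاسلاموالمسلمین ابرهیم رئیسی رئیس جمهور ایران اسلامی می باشد'
# Indexing the tagger with the sentence returns a list of (token, tag) pairs.
pos_tagger[tokens]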

Output:
[('او', 'PRO'),
('رئیس\u200cجمهور', 'N'),
('حجت\u200cالاسلام\u200cوالمسلمین', 'N'),
('ابرهیم', 'N'),
('رئیسی', 'N'),
('رئیس\u200cجمهور', 'N'),
('ایران', 'N'),
('اسلامی', 'ADJ'),
('می\u200cباشد', 'V')]
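
The tagged output can be post-processed with ordinary Python. The following is a minimal sketch, assuming the tagger returns a list of (token, tag) pairs exactly as printed above. The list is pasted in as a literal so the snippet runs without the package installed, and the helper name `tokens_with_tag` is illustrative, not part of `crf_pos`.

```python
from collections import Counter

# Output copied from the usage example above (assumed shape: list of (token, tag) pairs).
tagged = [
    ('او', 'PRO'),
    ('رئیس\u200cجمهور', 'N'),
    ('حجت\u200cالاسلام\u200cوالمسلمین', 'N'),
    ('ابرهیم', 'N'),
    ('رئیسی', 'N'),
    ('رئیس\u200cجمهور', 'N'),
    ('ایران', 'N'),
    ('اسلامی', 'ADJ'),
    ('می\u200cباشد', 'V'),
]


def tokens_with_tag(pairs, tag):
    """Return the tokens whose POS tag equals `tag`, in sentence order."""
    return [token for token, token_tag in pairs if token_tag == tag]


# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

nouns = tokens_with_tag(tagged, 'N')
verbs = tokens_with_tag(tagged, 'V')

print(tag_counts)
print(nouns)
print(verbs)
```

With a live tagger, `tagged` would instead be assigned from `pos_tagger[tokens]`.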

Implementation Details

Evaluation

Training and testing are performed on Mojgan Seraji's Uppsala Persian Corpus.

Part-of-Speech  Description   precision  recall  f1-score  support
N               Noun          0.985      0.970   0.977     186585
P               Preposition   0.998      0.998   0.998     89450
V               Verb          0.999      0.999   0.999     87762
ADV             Adverb        0.976      0.972   0.974     15983
FW              Foreign Word  0.989      0.992   0.991     2784
DET             Determiner    0.973      0.977   0.975     19786
ADJ             Adjective     0.978      0.975   0.977     61526
INT             Interjection  1.000      1.000   1.000     73
CONJ            Conjunction   0.996      0.997   0.997     74796
PRO             Pronoun       0.973      0.974   0.973     23094
NUM             Numeral       0.988      0.992   0.990     24864
avg/total       -             0.985      0.985   0.985     586703
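
As a sanity check on the table, each f1-score should be the harmonic mean of the precision and recall on the same row, up to rounding, and the per-tag supports should sum to the total. The sketch below recomputes both from values copied out of the table. The helper name `f1_score` is illustrative and not part of `crf_pos`, and the 0.001 tolerance is an assumption chosen because the published figures are rounded to three decimals.

```python
# Rows copied from the evaluation table:
# tag -> (precision, recall, reported f1-score, support)
ROWS = {
    "N":    (0.985, 0.970, 0.977, 186585),
    "P":    (0.998, 0.998, 0.998, 89450),
    "V":    (0.999, 0.999, 0.999, 87762),
    "ADV":  (0.976, 0.972, 0.974, 15983),
    "FW":   (0.989, 0.992, 0.991, 2784),
    "DET":  (0.973, 0.977, 0.975, 19786),
    "ADJ":  (0.978, 0.975, 0.977, 61526),
    "INT":  (1.000, 1.000, 1.000, 73),
    "CONJ": (0.996, 0.997, 0.997, 74796),
    "PRO":  (0.973, 0.974, 0.973, 23094),
    "NUM":  (0.988, 0.992, 0.990, 24864),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported f1-score, per tag.
diffs = {}
for tag, (precision, recall, reported, _support) in ROWS.items():
    recomputed = f1_score(precision, recall)
    diffs[tag] = abs(recomputed - reported)
    print(f"{tag:5s} reported={reported:.3f} recomputed={recomputed:.4f}")

# The supports of the individual tags should add up to the avg/total row.
total_support = sum(row[3] for row in ROWS.values())
print("total support:", total_support)
```

Small gaps in the third decimal are expected, since the precision and recall inputs are themselves rounded.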

How To Contribute

  1. Report any error you encounter through [BUG]
  2. Report any half-space correction the Normalizer misses through [ZWNJ]
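
When preparing a half-space report, it helps to make the zero-width non-joiner (U+200C) visible, since it does not show up in most editors. The following is a small sketch using only the standard `str` methods; the helper names `show_zwnj` and `count_zwnj` and the `<ZWNJ>` marker are illustrative choices, not part of `crf_pos`.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space


def show_zwnj(text, marker="<ZWNJ>"):
    """Replace every ZWNJ in `text` with a visible marker."""
    return text.replace(ZWNJ, marker)


def count_zwnj(text):
    """Count how many ZWNJ characters `text` contains."""
    return text.count(ZWNJ)


# Tokens taken from the usage example: one raw input form, one normalized form.
raw = "می باشد"
normalized = "می\u200cباشد"

print(show_zwnj(raw), count_zwnj(raw))
print(show_zwnj(normalized), count_zwnj(normalized))
```

Pasting the marked-up input and output into the report makes it clear which half-space was expected and which was produced.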