segment

Introduction

The segment program splits text into segments, for example sentences.

Splitting rules are read from an SRX file, which is the standard format for this task.

This fork provides a custom version of loomchild/segment, enhanced so that it can be easily wrapped and called from a Python program (for example, Bifixer). For this purpose, a new option (-c) has been added. It loads the segmenter into memory, ready to be invoked one line at a time whenever it is needed. See below for an example of usage from Python.

Requirements

Running the project requires Java Runtime Environment (JRE) 1.8.

The program should run on any operating system supported by Java.

Installation

Clone this repository:

git clone https://github.com/mbanon/segment.git

Make sure the 'JAVA_HOME' variable is set properly, for example:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

Now compile and install:

cd segment/segment
mvn clean install
cd ../segment-ui
mvn clean install
cd target
unzip segment-2.0.4-SNAPSHOT.zip

Running

Example:

LC=en; \
java -cp segment-ui-2.0.4-SNAPSHOT.jar:./segment-2.0.4-SNAPSHOT/lib/* net.loomchild.segment.ui.console.Segment \
-l $LC \
-i inputfile \
-o outputfile \
-s ../../srx/language_tools.segment.srx
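When the segmenter has to be launched from a script rather than typed by hand, the same command line can be assembled programmatically. The following is a minimal sketch in Python, not part of this repository: the helper names build_segment_command and run_segment, and their target_dir and version parameters, are assumptions made for illustration. Only the main class and the -l, -i, -o and -s options are taken from the example above.

```python
import os
import subprocess

# Main class taken from the command-line example in this README.
MAIN_CLASS = "net.loomchild.segment.ui.console.Segment"


def build_segment_command(lang, input_path, output_path, srx_path=None,
                          target_dir=".", version="2.0.4-SNAPSHOT"):
    """Return the java command as an argument list (hypothetical helper).

    target_dir is assumed to be segment-ui/target, where the zip was unpacked.
    """
    classpath = os.pathsep.join([
        os.path.join(target_dir, "segment-ui-" + version + ".jar"),
        # Java expands the trailing * itself; no shell globbing is involved.
        os.path.join(target_dir, "segment-" + version, "lib", "*"),
    ])
    cmd = ["java", "-cp", classpath, MAIN_CLASS,
           "-l", lang, "-i", str(input_path), "-o", str(output_path)]
    if srx_path is not None:
        # Without -s the program falls back to its default SRX file.
        cmd += ["-s", str(srx_path)]
    return cmd


def run_segment(*args, **kwargs):
    """Run the segmenter and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_segment_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. A call such as run_segment("en", "inputfile", "outputfile", "../../srx/language_tools.segment.srx") would correspond to the example above.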

SRX files

Some SRX files are provided in the srx folder:

language_tools.segment.srx: LanguageTool rules
OmegaT.srx: official OmegaT segmentation rules
PTDR.srx: modified OmegaT segmentation rules
Aggressive.srx: segments at every punctuation mark: .,;:!?
NonAggressive.srx: segments at .;:!? (that is, every punctuation mark except the comma)
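To make the difference between the Aggressive and NonAggressive rule sets concrete, here is a small Python illustration. It is only a sketch of the behaviour described above and does not read or reproduce the actual SRX files: the function split_after_marks is hypothetical, and it assumes a break happens only where a punctuation mark is followed by whitespace.

```python
import re

# Punctuation sets as described for the two rule files.
AGGRESSIVE_MARKS = ".,;:!?"
NON_AGGRESSIVE_MARKS = ".;:!?"  # same set without the comma


def split_after_marks(text, marks):
    """Split text at whitespace that directly follows one of the marks.

    Hypothetical helper for illustration; real SRX rules are richer
    (they combine break and no-break rules per language).
    """
    pattern = r"(?<=[" + re.escape(marks) + r"])\s+"
    return [segment for segment in re.split(pattern, text) if segment]


text = "Hello, world. How are you? Fine; thanks."
aggressive = split_after_marks(text, AGGRESSIVE_MARKS)
non_aggressive = split_after_marks(text, NON_AGGRESSIVE_MARKS)
```

With this input, the aggressive set also breaks after "Hello," while the non-aggressive set keeps "Hello, world." as one segment.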

If the -s parameter is omitted, a default SRX file is used.

Don't hesitate to build your own SRX files! Standard SRX 2.0 specs can be found here.

Benchmarks

Some benchmark results can be found here.

Benchmarks were run on an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz machine, using Universal Dependencies datasets and gold standards.

Example: testing the English dataset with the LanguageTool SRX rules:

L=English; LC=en;\
time java -cp segment-ui-2.0.2-SNAPSHOT.jar:./segment-2.0.2-SNAPSHOT/lib/* net.loomchild.segment.ui.console.Segment \
-l $LC -i pipeline_evaluation_data/sentence_splitting/UD_$L.dataset -o loomchild.language-tools.$LC \
-s ../../srx/language_tools.segment.srx  \
&& python3.6 segmenteval.py pipeline_evaluation_data/sentence_splitting/UD_$L.dataset.gold loomchild.language-tools.$LC

The segmenteval.py script can be downloaded from here.

(Other segmenting tools, such as Moses, Ulysses and NLTK, are included in the benchmarks.)

Python wrapping example

Visit this branch of Bifixer to find a simple example of a Python 3 program that reads a file line by line, performs some operations on each line, and then calls the Java segmenter on each line using ToolWrapper.
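The general pattern behind such a wrapper is to start the child process once, keep its pipes open, and exchange one line at a time. The sketch below is not Bifixer's ToolWrapper; the class LineTool is a hypothetical stand-in written for illustration. It assumes that the child answers every input line with exactly one output line and flushes it immediately; how the segmenter formats its reply under -c is not described in this README, so check the Bifixer example before relying on that assumption. To stay runnable without Java, the demo uses a tiny Python child that upper-cases each line instead of the real segmenter command.

```python
import subprocess
import sys


class LineTool:
    """Keep a child process alive and talk to it one line at a time.

    Hypothetical helper; assumes one output line per input line.
    """

    def __init__(self, cmd):
        self.proc = subprocess.Popen(
            cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,   # exchange str instead of bytes
            bufsize=1,   # line-buffered pipes on the parent side
        )

    def process(self, line):
        # Send exactly one line, then block until one line comes back.
        self.proc.stdin.write(line.rstrip("\n") + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input so the child can exit.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


# Stand-in child used for the demo: echoes each line in upper case.
# In real use, cmd would be the java command line with the -c option.
STAND_IN = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n",
]
```

The expensive start-up (JVM launch and rule loading) is paid once in the constructor, and each later call to process costs only a pipe round trip.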

More information

More detailed information can be found here.

You can reach me by email at mbanon[at]prompsit[dot]com
