segment

Introduction

The segment program splits text into segments, for example sentences.

Splitting rules are read from an SRX file, which is the standard format for this task.

This fork provides a custom version of loomchild/segment, enhanced so that it can be easily wrapped and called from a Python program (for example, Bifixer). For this purpose, a new option (-c) has been added. It loads the segmenter in memory once, ready to be invoked one line at a time whenever it is needed. See below for an example of usage from Python.

Requirements

To run the project, Java Runtime Environment (JRE) 1.8 is required.

The program should run on any operating system supported by Java.

Installation

Clone this repository:

git clone https://github.com/mbanon/segment.git

Make sure the JAVA_HOME variable is set correctly, for example:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ 

Now compile and install:

cd segment/segment
mvn clean install
cd ../segment-ui
mvn clean install
cd target
unzip segment-2.0.4-SNAPSHOT.zip

Running

Example:

LC=en; \
java -cp "segment-ui-2.0.4-SNAPSHOT.jar:./segment-2.0.4-SNAPSHOT/lib/*" net.loomchild.segment.ui.console.Segment \
-l "$LC" \
-i inputfile \
-o outputfile \
-s ../../srx/language_tools.segment.srx
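The same invocation can be assembled from Python. The following is a minimal sketch, not part of this repository: the function names `build_segment_command` and `run_segmenter` are hypothetical, and the default classpath simply mirrors the example above (it assumes you run from the segment-ui/target directory after unzipping). Only the -l, -i, -o and -s options shown above are used.

```python
import os
import subprocess

# Classpath entries copied from the shell example above; adjust to your build.
DEFAULT_CLASSPATH = os.pathsep.join([
    "segment-ui-2.0.4-SNAPSHOT.jar",
    "./segment-2.0.4-SNAPSHOT/lib/*",
])

MAIN_CLASS = "net.loomchild.segment.ui.console.Segment"


def build_segment_command(lang, input_path, output_path, srx_path=None,
                          classpath=DEFAULT_CLASSPATH):
    """Return the argument list for one run of the console segmenter."""
    cmd = ["java", "-cp", classpath, MAIN_CLASS,
           "-l", lang, "-i", input_path, "-o", output_path]
    if srx_path is not None:
        # Without -s the segmenter falls back to its default SRX file.
        cmd += ["-s", srx_path]
    return cmd


def run_segmenter(lang, input_path, output_path, srx_path=None):
    """Run the segmenter and raise CalledProcessError if it fails."""
    subprocess.run(
        build_segment_command(lang, input_path, output_path, srx_path),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting issues: no shell is involved, so the `lib/*` classpath entry reaches Java unchanged and Java expands it itself.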

SRX files

Some SRX files are provided in the srx folder:

  • language_tools.segment.srx: LanguageTool rules
  • OmegaT.srx: official OmegaT segmentation rules
  • PTDR.srx: modified OmegaT segmentation rules
  • Aggressive.srx: segments at every punctuation mark: .,;:!?
  • NonAggressive.srx: segments at .;:!? (that is, every punctuation mark except the comma)
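To give a feel for the difference between the last two rule sets, here is a rough Python approximation. It is only a sketch: it does not read the SRX files, and it assumes a break happens only when the punctuation mark is followed by whitespace, which may not match the actual rules. The names `split_after`, `AGGRESSIVE` and `NON_AGGRESSIVE` are made up for this illustration.

```python
import re

AGGRESSIVE = ".,;:!?"      # marks listed for Aggressive.srx
NON_AGGRESSIVE = ".;:!?"   # same marks without the comma


def split_after(text, marks):
    """Split text at whitespace that directly follows one of the given marks."""
    # Lookbehind keeps the punctuation mark attached to the preceding segment.
    pattern = "(?<=[" + re.escape(marks) + r"])\s+"
    return [segment for segment in re.split(pattern, text) if segment]


text = "Hello, world. How are you? Fine; thanks."
print(split_after(text, AGGRESSIVE))
# ['Hello,', 'world.', 'How are you?', 'Fine;', 'thanks.']
print(split_after(text, NON_AGGRESSIVE))
# ['Hello, world.', 'How are you?', 'Fine;', 'thanks.']
```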

If the -s parameter is omitted, a default SRX file is used.

Don't hesitate to build your own SRX files! The standard SRX 2.0 specification can be found here.
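As a starting point for a custom file, the sketch below embeds a small SRX 2.0 document in Python and lists its rules with the standard library XML parser. The rule content (a "Mr." exception plus a generic sentence-end rule) and the rule set name "Default" are invented for illustration and have not been tested against this segmenter; check the element and attribute names against the SRX 2.0 specification before relying on them. `list_rules` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx20"

# Illustrative SRX 2.0 document: one no-break exception, one break rule.
MINIMAL_SRX = r"""<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>\bMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.!?]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>
"""


def list_rules(srx_text):
    """Return (rule set name, break flag, beforebreak, afterbreak) tuples."""
    root = ET.fromstring(srx_text.encode("utf-8"))
    ns = {"srx": SRX_NS}
    rules = []
    for lang_rule in root.iterfind(".//srx:languagerule", ns):
        name = lang_rule.get("languagerulename")
        for rule in lang_rule.iterfind("srx:rule", ns):
            before = rule.findtext("srx:beforebreak", default="", namespaces=ns)
            after = rule.findtext("srx:afterbreak", default="", namespaces=ns)
            rules.append((name, rule.get("break"), before, after))
    return rules


for entry in list_rules(MINIMAL_SRX):
    print(entry)
```

The exception rule is listed before the break rule so that it takes precedence; this assumes rules within a rule set are applied in document order.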

Benchmarks

Some benchmark results can be found here.

Benchmarks were run on an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz machine, using Universal Dependencies datasets and gold standards.

Example: testing the English dataset with the LanguageTool SRX rules:

L=English; LC=en; \
time java -cp segment-ui-2.0.2-SNAPSHOT.jar:./segment-2.0.2-SNAPSHOT/lib/* net.loomchild.segment.ui.console.Segment \
-l $LC -i pipeline_evaluation_data/sentence_splitting/UD_$L.dataset -o loomchild.language-tools.$LC \
-s ../../srx/language_tools.segment.srx \
&& python3.6 segmenteval.py pipeline_evaluation_data/sentence_splitting/UD_$L.dataset.gold loomchild.language-tools.$LC

The segmenteval.py script can be downloaded from here.

(Other segmenting tools, such as Moses, Ulysses and NLTK, are included in the benchmarks.)
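For readers who want to sanity-check a segmenter without segmenteval.py, the sketch below shows one common way to score sentence splitting: compare the positions of segment boundaries in the system output with those in the gold standard and compute precision, recall and F1. This is an assumption about how such an evaluation can be done, not a description of what segmenteval.py actually computes; `boundary_offsets` and `boundary_scores` are hypothetical names.

```python
def boundary_offsets(segments):
    """Return the set of segment-end offsets, counted in non-whitespace characters."""
    offsets = set()
    position = 0
    for segment in segments:
        # Ignore whitespace so that spacing differences do not shift offsets.
        position += len("".join(segment.split()))
        offsets.add(position)
    return offsets


def boundary_scores(predicted, gold):
    """Return (precision, recall, f1) of predicted boundaries against gold."""
    pred = boundary_offsets(predicted)
    ref = boundary_offsets(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["Hello, world.", "How are you?"]
predicted = ["Hello,", "world.", "How are you?"]
print(boundary_scores(predicted, gold))
```

In this example the over-segmented output finds both gold boundaries but adds a spurious one after "Hello,", so recall is perfect while precision drops. Note that the final boundary at the end of the text always matches, which slightly inflates the scores on short inputs.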

Python wrapping example

Visit this branch of Bifixer for a very simple example of a Python 3 program that reads a file line by line, performs some operations on each line, and then calls the Java segmenter on each line using ToolWrapper.
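The general idea of keeping one segmenter process alive and feeding it lines can be sketched with the standard library alone. This is not ToolWrapper and not the Bifixer code. It assumes, without confirmation from this README, that the wrapped process reads one line from standard input and answers with exactly one line on standard output; if the segmenter emits several lines per input line, the reading logic must be adapted. The class name `LineSegmenter` is hypothetical, and the command (the Java invocation including -c) is left to the caller.

```python
import subprocess


class LineSegmenter:
    """Keep a child process running and exchange one line at a time with it."""

    def __init__(self, command):
        # command: argument list that starts the long-running process.
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on the Python side
        )

    def segment(self, line):
        """Send one line and return the single line the process answers with."""
        # ASSUMPTION: exactly one output line per input line.
        self.proc.stdin.write(line.rstrip("\n") + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Signal end of input and wait for the process to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

The benefit of this design is that the JVM start-up and SRX loading cost is paid once, in the constructor, instead of once per line.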

More information

More detailed information can be found here.

You can reach me by email at mbanon[at]prompsit[dot]com
