The Segment program splits text into segments, for example sentences.
Splitting rules are read from an SRX file, which is the standard format for this task.
This fork provides a custom version of loomchild/segment, enhanced so that it can be easily wrapped and called from a Python program (for example, Bifixer). For this purpose, a new option (-c) has been added. It loads the segmenter in memory, ready to be invoked one line at a time whenever it is needed. See below for an example of usage from Python.
To run the project, Java Runtime Environment (JRE) 1.8 is required.
The program should run on any operating system supported by Java.
Clone this repository:
git clone https://github.com/mbanon/segment.git
Make sure the 'JAVA_HOME' variable is properly set, for example:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
Now compile and install:
cd segment/segment
mvn clean install
cd ../segment-ui
mvn clean install
cd target
unzip segment-2.0.4-SNAPSHOT.zip
Example:
LC=en; \
java -cp segment-ui-2.0.4-SNAPSHOT.jar:./segment-2.0.4-SNAPSHOT/lib/* net.loomchild.segment.ui.console.Segment \
-l $LC \
-i inputfile \
-o outputfile \
-s ../../srx/language_tools.segment.srx
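To start the same batch run from Python, the command above can be assembled as an argument list. This is a minimal sketch: `build_segment_command` is a hypothetical helper that is not part of this repository, and the classpath and class name are copied from the example above. No shell is involved, so the `lib/*` entry reaches Java unexpanded, which is the form Java's own classpath wildcard expects.

```python
from typing import List, Optional

# Values copied from the example above; adjust them to the version you built.
CLASSPATH = "segment-ui-2.0.4-SNAPSHOT.jar:./segment-2.0.4-SNAPSHOT/lib/*"
MAIN_CLASS = "net.loomchild.segment.ui.console.Segment"


def build_segment_command(lang: str, input_path: str, output_path: str,
                          srx_path: Optional[str] = None) -> List[str]:
    """Return the argv list for one batch run (hypothetical helper)."""
    cmd = ["java", "-cp", CLASSPATH, MAIN_CLASS,
           "-l", lang, "-i", input_path, "-o", output_path]
    # -s is only added when an SRX file is given explicitly.
    if srx_path is not None:
        cmd += ["-s", srx_path]
    return cmd
```

Run it from the `target` directory, for example with `subprocess.run(build_segment_command("en", "inputfile", "outputfile"), check=True)`.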
Some SRX files are provided in the srx folder:
- language_tools.segment.srx: LanguageTool rules
- OmegaT.srx: official OmegaT segmentation rules
- PTDR.srx: modified OmegaT segmentation rules
- Aggressive.srx: segments by all punctuation marks: .,;:!?
- NonAggressive.srx: segments by .;:!? (that is, all punctuation marks except comma)
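The difference between the last two rule sets can be illustrated with a rough Python approximation. This sketch does not read the SRX files and is not how the segmenter works internally. It only splits after one of the listed marks when whitespace follows, which is an assumption about the rules and not something taken from them. `split_after_marks` is a hypothetical name.

```python
import re
from typing import List

AGGRESSIVE_MARKS = ".,;:!?"      # marks listed for Aggressive.srx
NON_AGGRESSIVE_MARKS = ".;:!?"   # marks listed for NonAggressive.srx (no comma)


def split_after_marks(text: str, marks: str) -> List[str]:
    """Split after any of `marks` when followed by whitespace (illustration only)."""
    # Lookbehind keeps the punctuation mark attached to the preceding segment.
    pattern = r"(?<=[" + re.escape(marks) + r"])\s+"
    return [part for part in re.split(pattern, text) if part]
```

For the input `"Hello, world. How are you?"`, the aggressive marks give `["Hello,", "world.", "How are you?"]`, while the non-aggressive marks keep `"Hello, world."` together.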
If the parameter -s is not used, a default SRX file will be used.
Don't hesitate to build your own SRX files! Standard SRX 2.0 specs can be found here.
Some benchmark results can be found here.
Benchmarks were run on an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz machine, using Universal Dependencies datasets and gold standards.
Example: testing the English dataset with the Language Tools SRX rules:
L=English; LC=en;\
time java -cp segment-ui-2.0.2-SNAPSHOT.jar:./segment-2.0.2-SNAPSHOT/lib/* net.loomchild.segment.ui.console.Segment \
-l $LC -i pipeline_evaluation_data/sentence_splitting/UD_$L.dataset -o loomchild.language-tools.$LC \
-s ../../srx/language_tools.segment.srx \
&& python3.6 segmenteval.py pipeline_evaluation_data/sentence_splitting/UD_$L.dataset.gold loomchild.language-tools.$LC
The segmenteval.py script can be downloaded from here.
(Other segmenting tools, such as Moses, Ulysses and NLTK, are included in the benchmarks.)
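For readers who want a feel for how a segmenter's output can be compared against a gold standard, here is a small sketch. It is not segmenteval.py, whose scoring logic is not described in this README. It assumes that both files contain the same text and differ only in where the segments end, and it scores segment boundaries by precision, recall and F1. All function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def boundary_offsets(segments: Iterable[str]) -> Set[int]:
    """Return segment end positions, counted in non-whitespace characters."""
    offsets = set()
    position = 0
    for segment in segments:
        if not segment.strip():
            continue  # ignore empty lines
        position += sum(1 for ch in segment if not ch.isspace())
        offsets.add(position)
    return offsets


def boundary_scores(gold: Iterable[str],
                    predicted: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted boundaries against gold ones."""
    gold_b = boundary_offsets(gold)
    pred_b = boundary_offsets(predicted)
    hits = len(gold_b & pred_b)
    precision = hits / len(pred_b) if pred_b else 0.0
    recall = hits / len(gold_b) if gold_b else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Counting positions in non-whitespace characters keeps the comparison independent of how each tool trims or joins spaces at segment edges.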
Visit this branch of Bifixer to find a super simple example of a Python3 program that reads a file line-by-line, performs some operations on each line, and then calls the Java segmenter for each line, by using ToolWrapper.
More detailed information can be found here.
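The general idea of keeping a tool loaded and feeding it one line at a time can be sketched with the standard library alone. This is not the ToolWrapper code used by Bifixer. `LineTool` is a hypothetical class, and it assumes that the child process reads lines from standard input and answers every input line with exactly one output line. Whether the segmenter's -c mode frames its output this way has not been checked here. `ECHO_UPPER` is a stand-in child that lets you try the wrapper without Java.

```python
import subprocess
import sys
from typing import List

# Stand-in child for trying the wrapper without Java: echoes each line upper-cased.
ECHO_UPPER = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for l in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(l.upper())\n"
    "    sys.stdout.flush()\n",
]


class LineTool:
    """Keep a line-oriented child process alive and query it one line at a time."""

    def __init__(self, argv: List[str]) -> None:
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,  # exchange text, not bytes
            bufsize=1,                # line-buffered on the Python side
        )

    def process(self, line: str) -> str:
        # Assumption: one output line comes back for every input line.
        self.proc.stdin.write(line.rstrip("\n") + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        # Closing stdin signals end of input so the child can exit.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

With the stand-in, `LineTool(ECHO_UPPER).process("hello")` returns `"HELLO"`. For the segmenter, `argv` would presumably be the java command from the example above with -c added and without -i and -o, but treat that as an assumption and check the linked example first.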
You can reach me by email at mbanon[at]prompsit[dot]com