The Termium Ruby gem parses data exported from TERMIUM Plus, the terminology database service of the Government of Canada.
The default Termium XML output is invalid where a term domain is written in angle brackets and the "greater than" sign is left unescaped:
<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence> operation that allows the firing of a rule, or the
  invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>
The remedy is to escape the "greater than" sign manually, using a find/replace or a regular expression:
string.gsub(/&lt;([^>]+)>/, '&lt;\1&gt;')
This results in:
<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence&gt; operation that allows the firing of a rule, or the
  invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>
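As a minimal sketch, the substitution can be wrapped in a small helper and applied to the raw export text before it is parsed. The helper name `escape_domain_brackets` is hypothetical and not part of the gem. The sketch assumes the opening bracket of the domain is already escaped as `&lt;` and only the closing `>` is raw.

```ruby
# Hypothetical helper (not part of the gem): escapes the raw ">" that closes
# a term domain whose opening bracket is already written as "&lt;".
def escape_domain_brackets(xml)
  xml.gsub(/&lt;([^>]+)>/, '&lt;\1&gt;')
end

broken = '<value>&lt;artificial intelligence> operation that allows the firing of a rule</value>'
fixed = escape_domain_brackets(broken)
puts fixed
```

Note that the pattern treats every `&lt;` as the start of a domain. If the text contains an escaped "less than" sign with no matching `>` before the next tag, the substitution would reach into that tag, so it is worth checking the result on a real export.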
termium convert
-
Convert a TERMIUM Plus export XML file to a Paneron Glossarist dataset.
This command converts a TERMIUM Plus export XML (<ns2:termium_extract>) file to a Paneron Glossarist dataset.
The resulting dataset will look like this:
{OUTPUT_PATH}/
├── concepts/
│   ├── {CONCEPT_ID}.yaml
│   ├── ...
├── localized_concepts/
│   ├── {LOCALIZED_CONCEPT_ID}.yaml
│   ├── ...
| Flag | Description |
|---|---|
| | Source path to TERMIUM Plus XML export file. The file needs to start with the |
| | Destination path to Glossarist dataset directory. If the directory doesn’t exist it will be created. If not provided, defaults to the basename of the input file, e.g. |
| | Date of acceptance for the dataset. This fills in the |
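The default for the destination path described in the table can be illustrated with a short sketch. It assumes that "basename" means the input file name without its directory or extension. The gem's actual logic may differ, and both the method name `default_output_path` and the sample file name are hypothetical.

```ruby
# Sketch only: assumes "basename" means the input file name without its
# directory or extension. The gem's actual logic may differ.
def default_output_path(input_path)
  File.basename(input_path, ".*")
end

puts default_output_path("exports/characters.xml")
```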
The data structures of these files can be seen in the following examples.
{CONCEPT_ID}.yaml
This is 88a7dd87-6199-3516-9cec-f4cd79ff09c6.yaml.
---
data:
  identifier: '2120638'
  localized_concepts:
    eng: e114ee44-e601-5623-9099-48cfc2be2224
    fre: 9a7b88cb-4ee6-5d59-89bb-230425a3c96a
related: []
date_accepted: 2015-05-01
status: valid
id: 88a7dd87-6199-3516-9cec-f4cd79ff09c6
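A concept file links to its localized concepts by UUID, so a consumer can resolve the per-language files from it. The following is a rough sketch using Ruby's standard YAML library rather than the gem's own classes. The sample is based on the example above, and the key nesting shown here (`localized_concepts` under `data`, the remaining keys at top level) is an assumption.

```ruby
require "yaml"
require "date"

# Sample based on the concept file above; the nesting is an assumption.
concept_yaml = <<~YAML
  ---
  data:
    identifier: '2120638'
    localized_concepts:
      eng: e114ee44-e601-5623-9099-48cfc2be2224
      fre: 9a7b88cb-4ee6-5d59-89bb-230425a3c96a
  related: []
  date_accepted: 2015-05-01
  status: valid
  id: 88a7dd87-6199-3516-9cec-f4cd79ff09c6
YAML

# date_accepted is an unquoted date, so Date must be permitted explicitly.
concept = YAML.safe_load(concept_yaml, permitted_classes: [Date])

# Map each language code to the path of its localized concept file.
localized_paths = concept["data"]["localized_concepts"].transform_values do |uuid|
  File.join("localized_concepts", "#{uuid}.yaml")
end

localized_paths.each { |lang, path| puts "#{lang}: #{path}" }
```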
{LOCALIZED_CONCEPT_ID}.yaml
This is e114ee44-e601-5623-9099-48cfc2be2224.yaml.
---
data:
  dates: []
  definition:
  - content: layer whose nodes directly communicate with external systems
  examples: []
  id: '2120638'
  notes:
  - content: 'visible layer: term and definition standardized by ISO/IEC [ISO/IEC
      2382-34:1999].'
  - content: 34.02.09 (2382)
  sources:
  - origin:
      ref: ISO/IEC 2382-34:1999
    type: lineage
    status: identical
  - origin:
      ref: Ranger, Natalie * 2006 * Bureau de la traduction / Translation Bureau *
        Services linguistiques / Linguistic Services * Bur. dir. Centre de traduction
        et de terminologie / Dir's Office Translation and Terminology Centre * Div.
        Citoyenneté et Protection civile / Citizen. & Emergency preparedness Div.
        * Normalisation terminologique / Terminology Standardization
    type: lineage
    status: identical
  terms:
  - type: expression
    normative_status: preferred
    designation: visible layer
    grammar_info:
    - preposition: false
      participle: false
      adj: false
      verb: false
      adverb: false
      noun: false
      gender: []
      number:
      - singular
  language_code: eng
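To show how this structure might be consumed, here is a small sketch that reads an abridged copy of the localized concept and pulls out the preferred designations and the definition. It uses only Ruby's standard YAML library, not the gem's classes, and assumes that `terms`, `definition`, and `language_code` all sit under `data`.

```ruby
require "yaml"

# Abridged sample based on the localized concept file above.
localized_yaml = <<~YAML
  ---
  data:
    definition:
    - content: layer whose nodes directly communicate with external systems
    terms:
    - type: expression
      normative_status: preferred
      designation: visible layer
    language_code: eng
YAML

data = YAML.safe_load(localized_yaml).fetch("data")

# Keep only the designations marked as preferred.
preferred = data["terms"]
  .select { |term| term["normative_status"] == "preferred" }
  .map { |term| term["designation"] }

puts "#{data['language_code']}: #{preferred.join(', ')}"
puts data["definition"].map { |d| d["content"] }.join(" ")
```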
This gem makes heavy use of the lutaml-model classes for XML serialization.
The following code converts the Termium extract into a Glossarist dataset.
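require "fileutils"
require "termium"
# Parse the TERMIUM Plus XML export, then convert it to Glossarist concepts.
termium_extract = Termium::Extract.from_xml(File.read(termium_extract_file))
glossarist_col = termium_extract.to_concept
# Create the output directory and write the dataset into it.
FileUtils.mkdir_p(glossarist_output_file)
glossarist_col.save_to_files(glossarist_output_file)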
This gem is developed, maintained and funded by Ribose Inc.
The gem is available as open source under the terms of the 2-Clause BSD License.