Skip to content

Latest commit

 

History

History
200 lines (154 loc) · 4.57 KB

README.adoc

File metadata and controls

200 lines (154 loc) · 4.57 KB

Termium Ruby gem

Purpose

The Termium Ruby gem parses export data formats from the TERMIUM Plus terminology database service from the Government of Canada.

WARNING - Termium XML output requires manual correction

The default Termium XML output is invalid where the term domains using angular brackets have the "greater than" sign not escaped:

<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence> operation that allows the firing of a rule, or the
    invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>

The remedy is to manually escape the "greater than" sign using a find/replace or a regular expression:

string.gsub(/&lt;([^>]+)>/, '<\1>')

Results in:

<textualSupport order="1" type="DEF">
  <value>&lt;artificial intelligence&gt; operation that allows the firing of a rule, or the
    invocation of a program or a subprogram</value>
  <sourceRef order="1" />
</textualSupport>

Commands

termium convert

Convert a TERMIUM Plus export XML file to a Paneron Glossarist dataset.

termium convert

Purpose

This command converts a TERMIUM Plus export XML (<ns2:termium_extract>) file to a Paneron Glossarist dataset.

The resulting dataset will look like this:

{OUTPUT_PATH}/
├── concepts/
│   ├── {CONCEPT_ID}.yaml
│   ├── ...
├── localized_concepts/
    ├── {LOCALIZED_CONCEPT_ID}.yaml
    ├── ...

Usage

$ termium convert -i INPUT_XML_FILE [-o OUTPUT_PATH] [-o DATE_ACCEPTED]

Options

Flag Description

-i, --input-path

Source path to TERMIUM Plus XML export file. The file needs to start with the <ns2:termium_extract> element.

-o, --output-path

Destination path to Glossarist dataset directory. If the directory doesn’t exist it will be created. If not provided, defaults to the basename of the input file, e.g. foo/bar.xml will export to foo/bar/.

--date-accepted

Date of acceptance for the dataset. This fills in the date_accepted value of the universal concept (which is exported to a YAML file).

Examples

The data structures of these files can be seen in the following examples.

Example 1. Sample of {CONCEPT_ID}.yaml

This is 88a7dd87-6199-3516-9cec-f4cd79ff09c6.yaml.

---
data:
  identifier: '2120638'
  localized_concepts:
    eng: e114ee44-e601-5623-9099-48cfc2be2224
    fre: 9a7b88cb-4ee6-5d59-89bb-230425a3c96a
related: []
date_accepted: 2015-05-01
status: valid
id: 88a7dd87-6199-3516-9cec-f4cd79ff09c6
Example 2. Sample of {LOCALIZED_CONCEPT_ID}.yaml

This is e114ee44-e601-5623-9099-48cfc2be2224.yaml.

---
data:
  dates: []
  definition:
  - content: layer whose nodes directly communicate with external systems
  examples: []
  id: '2120638'
  notes:
  - content: 'visible layer: term and definition standardized by ISO/IEC [ISO/IEC
      2382-34:1999].'
  - content: 34.02.09 (2382)
  sources:
  - origin:
      ref: ISO/IEC 2382-34:1999
    type: lineage
    status: identical
  - origin:
      ref: Ranger, Natalie * 2006 * Bureau de la traduction / Translation Bureau *
        Services linguistiques / Linguistic Services * Bur. dir. Centre de traduction
        et de terminologie / Dir's Office Translation and Terminology Centre * Div.
        Citoyenneté et Protection civile / Citizen. & Emergency preparedness Div.
        * Normalisation terminologique / Terminology Standardization
    type: lineage
    status: identical
  terms:
  - type: expression
    normative_status: preferred
    designation: visible layer
    grammar_info:
    - preposition: false
      participle: false
      adj: false
      verb: false
      adverb: false
      noun: false
      gender: []
      number:
      - singular
  language_code: eng

Library

Usage

This gem makes heavy use of the lutaml-model classes for XML serialization.

The following code converts the Termium extract into a Glossarist dataset.

termium_extract = Termium::Extract.from_xml(IO.read(termium_extract_file))
glossarist_col = termium_extract.to_concept
FileUtils.mkdir_p(glossarist_output_file)
glossarist_col.save_to_files(glossarist_output_file)

Credits

This gem is developed, maintained and funded by Ribose Inc.

License

The gem is available as open source under the terms of the 2-Clause BSD License.