How to import parallel corpora? #128

Open
vintagentleman opened this issue Sep 22, 2019 · 0 comments

Hi,

I’m struggling to convert TEI-encoded parallel corpora with Pepper.

The most straightforward approach proposed by TEI seems to be link groups that connect the aligned linguistic units. I found this approach in the Opus-MontenegrinSubs corpus, where, alongside the English and Montenegrin texts, there is a separate file containing nothing but the alignment links:

<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
  ...
</linkGrp>
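
For reference, the alignment pairs in such a link group can be read with a few lines of Python. This is only a minimal sketch using the standard library; the function name `read_alignment_links` and the `SAMPLE_LINKS` string are my own illustration, not part of Pepper or of the corpus tooling:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def read_alignment_links(xml_text):
    """Return a list of (n, [target ids]) for every <link> inside a
    TEI <linkGrp type="alignment">. Hypothetical helper for illustration."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() also yields the root element, so this works whether the
    # linkGrp is the document root or nested deeper in the file.
    for grp in root.iter(TEI_NS + "linkGrp"):
        if grp.get("type") != "alignment":
            continue
        for link in grp.iter(TEI_NS + "link"):
            # @target is a whitespace-separated list of "#id" pointers.
            targets = [t.lstrip("#") for t in link.get("target", "").split()]
            pairs.append((link.get("n"), targets))
    return pairs

# One link copied from the excerpt above; the elided links are left out.
SAMPLE_LINKS = """<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
</linkGrp>"""

for n, targets in read_alignment_links(SAMPLE_LINKS):
    print(n, targets)
```

Each pointer is reduced to a bare identifier so it can be compared directly with the @xml:id values in the two text files.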

Additionally, every aligned segment has a @corresp attribute pointing to the @xml:id of its translation equivalent, like this:

<ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
    corresp="#Damages.S1.dam0101.SL15-en">
  ...
</ab>
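
The same pointers can be collected from the segments themselves. Again a sketch only: `corresp_map` and `SAMPLE_AB` are hypothetical names, the `<body>` wrapper is added just so the fragment parses with the TEI namespace, and the segment content stays elided as in the excerpt:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def corresp_map(xml_text):
    """Map each <ab> segment's xml:id to the id its @corresp points at.
    Hypothetical helper for illustration."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for ab in root.iter(TEI_NS + "ab"):
        seg_id = ab.get(XML_ID)
        target = ab.get("corresp")
        if seg_id and target:
            mapping[seg_id] = target.lstrip("#")
    return mapping

# Assumed wrapper element; only the <ab> attributes come from the corpus.
SAMPLE_AB = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
      corresp="#Damages.S1.dam0101.SL15-en">
    <!-- segment content elided -->
  </ab>
</body>"""

print(corresp_map(SAMPLE_AB))
```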

However, the TEI importer fails to process this corpus, with errors like these:

Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_cnr.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_en.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
An exception was thrown by the mapper threads 'Thread[TEIImporter_mapper(salt:/OpusMonte.TEI/opusmonte_cnr.ana),5,TEIImporter_mapperGroup]'.
org.corpus_tools.pepper.modules.exceptions.PepperModuleXMLResourceException: Cannot read xml-file'file:/D:/Users/k.sipunin/Downloads/OpusMonte.TEI/opusmonte_cnr.ana.xml', because of a nested exception.
        at org.corpus_tools.pepper.common.PepperUtil.readXMLResource(PepperUtil.java:661)
        at org.corpus_tools.pepper.impl.PepperMapperImpl.readXMLResource(PepperMapperImpl.java:278)
        at org.corpus_tools.peppermodules.TEIModules.TEIMapper.mapSDocument(TEIMapper.java:58)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.map(PepperMapperControllerImpl.java:251)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.run(PepperMapperControllerImpl.java:188)
Caused by: org.corpus_tools.salt.exceptions.SaltInsertionException: Cannot insert object 'lemma=opasni' into container 'SStructureImpl(null)[lemma=opasni], salt::unit=word], ana=mte:Agpfpny]'.  Because an id already exists: lemma=opasni.

What might be the problem? And more generally, what is the proper way to encode parallel corpora so that they can be imported into ANNIS (the presence of a sample here suggests that it’s doable)?
