Multilingual support #5

Open
westei opened this issue Jul 3, 2013 · 1 comment

Comments

@westei
Contributor

westei commented Jul 3, 2013

This issue is for discussing how to use the SolrTextTagger to process texts in different languages and tag them against a vocabulary with labels in multiple languages (e.g. freebase.org).

Multilingual Vocabularies

Expected properties of the vocabulary (numbered so they can be referred to later in the text):

  • (1) defines labels in different languages
  • (2) labels without a language tag should be used for all languages
  • (3) not all entities define labels in all languages
  • (4) for less common languages only a few entities define labels

Within the Solr index, labels of different languages will be stored in different fields (as users will want to configure different Analyzers). For some languages a dynamic field with a generic text analyzer could be used, e.g.

<field name="label-en" type="text-en" ... />
<field name="label-de" type="text-de" ... />
<!-- other label fields for specific languages -->
<!-- finally the field for labels without language and
       a dynamic field for other languages -->
<field name="label" type="text-gen" ... />
<dynamicField name="label-*" type="text-gen" ... />
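
As a rough illustration of this field layout, the sketch below maps an entity's labels to Solr field names by language tag. It is written in Python for brevity and is not part of SolrTextTagger; `field_for`, `labels_to_fields`, `SPECIFIC_TYPES`, and the sample entity are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Languages with a dedicated analyzer in the schema above; every other
# language falls through to the dynamic field "label-*" (type text-gen).
SPECIFIC_TYPES = {"en": "text-en", "de": "text-de"}


def field_for(lang):
    """Return (field name, field type) for a label's language tag.

    lang is None for labels without a language tag (property 2).
    """
    if lang is None:
        return "label", "text-gen"
    return f"label-{lang}", SPECIFIC_TYPES.get(lang, "text-gen")


def labels_to_fields(labels):
    """Group (text, lang) pairs into a dict of Solr field name -> values."""
    doc = defaultdict(list)
    for text, lang in labels:
        name, _type = field_for(lang)
        doc[name].append(text)
    return dict(doc)


# Hypothetical entity: labels in three languages plus one untagged label.
entity_labels = [("Vienna", "en"), ("Wien", "de"), ("Dunaj", "sl"), ("VIE", None)]
fields = labels_to_fields(entity_labels)
```

Here the Slovenian label ends up in `label-sl`, which only the dynamic field covers, while the untagged label goes to the plain `label` field.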

Multilingual Tagging Process

Assuming that we know the language of the processed text (passed in or detected), we would like to tag the content using labels of that language as well as default labels (2).

To achieve this I see several options:

  1. Building language-specific FST corpora and calling the SolrTextTagger twice: To allow this, the TaggerFstCorpus needs to be adapted to not throw a RuntimeException on documents where the storedField is not present, which will happen because of (3). Building the FST is also inefficient for (4), as it iterates over all documents in the index and most of them will be skipped because they do not define a label in that language. Another potential drawback is that the TagClusterReducer will only work within a single language, so the results of the two calls will still need to be merged / reduced.
  2. Building language-specific FST corpora that include default labels (2): While this would allow using a single FST corpus for tagging a text based on a multilingual vocabulary, it would cause a lot of duplication, especially for vocabularies that contain a lot of default labels. TaggerFstCorpus would need to learn some new tricks, as it would need to be built from two fields with potentially different analyzers. The problem of different Analyzers would also affect the Tagger, as it uses the same Analyzer to process the text being tagged. If the Tagger only used the Analyzer defined by the field for the language of that text, one would risk missing matches for default labels.
  3. Building a multilingual FST corpus: This would require merging labels in different languages (stored in different fields using different analyzers) into a single FST corpus. This corpus would need to be aware of the languages in which phrases are present, so that it only suggests matches with labels in the language of the text as well as default labels. As with the 2nd option, one would also need to solve the problem of supporting two Analyzers in the Tagger.
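
To make the language-aware lookup of the third option more concrete, here is a minimal Python sketch. It uses a plain dictionary instead of an FST and lowercasing as the only "analysis", so it deliberately sidesteps the two-analyzer problem; the class `MultilingualPhrases`, the `DEFAULT` marker, and the entity ids are all hypothetical.

```python
from collections import defaultdict

DEFAULT = ""  # marker for labels without a language tag (property 2)


class MultilingualPhrases:
    """Toy stand-in for a language-aware corpus: phrase -> lang -> entity ids."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, phrase, lang, entity_id):
        """Register a label; lang is None for labels without a language tag."""
        self._index[phrase.lower()][lang or DEFAULT].add(entity_id)

    def lookup(self, phrase, text_lang):
        """Return ids whose label matches in text_lang or has no language."""
        by_lang = self._index.get(phrase.lower(), {})
        return by_lang.get(text_lang, set()) | by_lang.get(DEFAULT, set())


corpus = MultilingualPhrases()
corpus.add("Gift", "en", "e:present")
corpus.add("Gift", "de", "e:poison")
corpus.add("UNESCO", None, "e:unesco")

english_hits = corpus.lookup("gift", "en")
german_hits = corpus.lookup("gift", "de")
french_hits = corpus.lookup("UNESCO", "fr")
```

The same phrase resolves to different entities depending on the language of the text, while the untagged label matches for any language, including one with no labels of its own.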

For now I am aiming for the first option, as it requires the fewest changes to SolrTextTagger, but I would be eager to have an opinion/feedback on the other two options.

best
Rupert

@dsmiley
Member

dsmiley commented Jul 4, 2013

I'm leaning toward option 1 as well. The tricky part is going to be the TagClusterReducer merging the results of the 2 passes.
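
A minimal Python sketch of what merging the two passes might involve, assuming each pass returns tags as `(start, end, ids)` tuples with character offsets. The reduction rule used here (drop any tag whose span lies inside a longer tag) is a simplification picked for illustration and is not claimed to match what TagClusterReducer does; `merge_passes`, `drop_sub_tags`, and the sample tags are hypothetical.

```python
def merge_passes(*passes):
    """Union tags from several tagger passes, keyed by (start, end) offsets."""
    merged = {}
    for tags in passes:
        for start, end, ids in tags:
            merged.setdefault((start, end), set()).update(ids)
    return merged


def drop_sub_tags(merged):
    """Remove tags whose span lies inside another, longer tag."""
    spans = list(merged)
    kept = []
    for s, e in spans:
        inside_other = any(
            (os, oe) != (s, e) and os <= s and e <= oe for os, oe in spans
        )
        if not inside_other:
            kept.append((s, e, merged[(s, e)]))
    return sorted(kept, key=lambda t: (t[0], t[1]))


# Offsets refer to the text "New York City Hall".
language_pass = [(0, 8, {"e:new_york"}), (0, 13, {"e:nyc"})]
default_pass = [(0, 13, {"e:nyc_alt"}), (9, 18, {"e:city_hall"})]

result = drop_sub_tags(merge_passes(language_pass, default_pass))
```

The reduction has to run after the merge rather than per pass: a tag that survives within one pass can still be a sub-tag of a longer tag found only in the other pass.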

Regarding options 2 and 3, I think the default label could be analyzed with each language's analyzer (as if this field didn't exist and its value had instead been put into all the language variants for each document). Even though each word would end up being analyzed many times, there aren't going to be that many variations of the indexed term for each word. Anyway, it would be a lot of work.
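
The idea of analyzing the default label once per language could be sketched as follows. This is Python with two toy "analyzers" standing in for real Lucene Analyzers (the English one strips a trailing "s", the German one a trailing "en"); `build_corpora`, `ANALYZERS`, and the sample documents are hypothetical, and plain dictionaries stand in for the FST corpora.

```python
def analyze_en(text):
    """Toy English analyzer: lowercase and strip a trailing 's'."""
    return tuple(
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in text.lower().split()
    )


def analyze_de(text):
    """Toy German analyzer: lowercase and strip a trailing 'en'."""
    return tuple(
        w[:-2] if w.endswith("en") and len(w) > 4 else w
        for w in text.lower().split()
    )


ANALYZERS = {"en": analyze_en, "de": analyze_de}


def build_corpora(docs):
    """Build one phrase dictionary per language, including default labels.

    docs: list of (entity_id, {lang_or_None: [labels]}).
    Returns {lang: {analyzed_phrase: set_of_entity_ids}}.
    """
    corpora = {lang: {} for lang in ANALYZERS}
    for entity_id, labels in docs:
        for lang, analyze in ANALYZERS.items():
            # Default labels (key None) are treated as if they had been
            # copied into every language-specific field.
            for label in labels.get(lang, []) + labels.get(None, []):
                corpora[lang].setdefault(analyze(label), set()).add(entity_id)
    return corpora


docs = [
    ("e:1", {"en": ["Alps"], "de": ["Alpen"]}),
    ("e:2", {None: ["The Beatles"]}),
]
corpora = build_corpora(docs)
```

The default label ends up in both corpora but in different analyzed forms, so text in a given language would be tagged with that language's analyzer only, at the cost of storing the default label once per language.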
