Indexing non-English content
Note: the information on this page reflects only our (the Royal Library of Denmark) experience with this problem. Please feel free to edit or update this page with corrections or additional information.
By default, Hydra indexes all text content in Solr dynamic fields matching the name pattern *_tesim:
<dynamicField name="*_tesim" type="text_en" stored="true" indexed="true" multiValued="true"/>
This means that all text stored in such fields is indexed according to the rules of the text_en field type, which applies stemming rules appropriate for English. For example, the text appointment is indexed under its stem appoint, so searches for either form retrieve it.
Obviously, this is inappropriate if your Hydra head stores content in a language other than English, because users then have to type the exact word form in order to retrieve content. To give an example from our case, the search Minister does not retrieve documents with titles such as Ministeren (Danish for "the minister").
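The effect of stemming on matching can be illustrated with a toy sketch. The suffix lists and the names TOY_SUFFIXES, toy_stem and toy_match? below are made up for illustration; they are not the real Snowball rules and are not part of Solr, Hydra or Solrizer. The sketch only shows why a query and an indexed word match when both reduce to the same stem.

```ruby
# Toy suffix lists: NOT the real Snowball rules, just enough to show the idea.
TOY_SUFFIXES = {
  english: %w[ment ing s],
  danish:  %w[eren erne en er e]
}.freeze

# Lower-case the word and strip the first matching suffix,
# keeping at least three characters of stem.
def toy_stem(word, language)
  token = word.downcase
  suffix = TOY_SUFFIXES.fetch(language).find do |s|
    token.end_with?(s) && token.length - s.length >= 3
  end
  suffix ? token[0...-suffix.length] : token
end

# A query matches an indexed word when both reduce to the same stem.
def toy_match?(query, indexed_word, language)
  toy_stem(query, language) == toy_stem(indexed_word, language)
end

toy_match?("appoint", "appointment", :english)  # => true
toy_match?("Minister", "Ministeren", :english)  # => false
toy_match?("Minister", "Ministeren", :danish)   # => true
```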
The dynamic field name *_tesim is generated by Solrizer. The ideal solution would be to pass Solrizer extra arguments so that it generates a different dynamic field name, which would in turn map to a different Solr field type. I couldn't find any obvious way to do this, so I instead customised the text_en field type as follows:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.SnowballPorterFilterFactory" language="Danish"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
Here, I have removed the English-specific stemming filters and added a stemming filter configured for Danish. Solr supports a large number of languages without any extra configuration. See the Language Analysis page in the Solr Wiki and look under your language to see if it is supported.
If your content is already indexed in Solr, you can re-index without needing to re-import. Simply restart Solr with the new configuration, open a Rails console for the appropriate environment, and enter:
ActiveFedora::Base.all.each { |e| e.update_index }
This runs through all objects in your repository and updates the index according to the new configuration. It can take a while if your repository holds a lot of content.
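On a large repository, a single failing object would abort the one-liner part way through. A minimal sketch of a more forgiving variant follows. The helper name reindex_all and its log: parameter are hypothetical and not part of ActiveFedora; the only call it makes on each object is update_index, as in the one-liner above.

```ruby
# Hypothetical helper (not part of ActiveFedora): re-index each object,
# recording failures instead of stopping at the first error.
def reindex_all(objects, log: $stdout)
  succeeded = 0
  failures = []
  objects.each do |object|
    begin
      object.update_index
      succeeded += 1
    rescue StandardError => e
      # Remember the failing object and carry on with the rest.
      failures << { object: object, error: e }
      log.puts "Re-index failed: #{e.class}: #{e.message}"
    end
  end
  log.puts "Re-indexed #{succeeded} objects, #{failures.size} failures"
  failures
end

# In the Rails console (assumes ActiveFedora is loaded, as for the one-liner):
#   failed = reindex_all(ActiveFedora::Base.all)
```

The returned list can then be inspected or retried once the underlying problem has been fixed.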
The above solution is problematic in that it repurposes the text_en field type for non-English content, which is confusing. A better solution would be to define a new field type, e.g. text_da, with the same analyzer configuration, and reference it from the *_tesim dynamic field definition, e.g.
<dynamicField name="*_tesim" type="text_da" stored="true" indexed="true" multiValued="true"/>
Alternatively, if Solrizer can be made to generate custom field names, it should be used to generate a custom dynamic field such as *_tdsim, which in turn references the text_da field type. I don't know how to do this, but anyone who does is more than welcome to update this guide with that information.
In writing this documentation, I discovered that Solr's example schema contains field configurations for a wide range of languages that are more detailed than the example I have provided above. Try applying the configuration for your language and see if it works as expected. Please note, however, that I have not tried these examples myself, so I cannot promise that they will work with the Solr shipped with Jetty.