
<p>
<i>
This is the old “About” page for the now-defunct aimsigh.com search engine that 
I created in 2005. Even though hardly anyone used the site, it had some
nice features, and I'm posting this now since it contains a pretty good summary
of the main issues arising in Irish language Information Retrieval.<br>
Kevin Scannell<br>
October 2015
</i>
</p>
<p>
...[this page] is directed primarily at my colleagues working
on natural language processing and minority languages, particularly
at those who might be interested in creating similar
“linguistically sophisticated” search engines.
Because of this, I won't assume any prior knowledge of 
Irish linguistics in what follows.
</p>
<p>
There were two primary motivations for creating this site;
the first was that existing search engines like
<a target="_blank" href="https://www.google.com/">Google</a> and
<a target="_blank" href="https://www.yahoo.com/">Yahoo!</a>,
as powerful as they can be for English search, are unsuitable for
Irish in various ways that
will be discussed in detail below.
The second was a desire on my part to create a single tool 
that harnesses most of my 
previous work on language technology for Irish. 
These earlier projects
include the first Irish spellchecker
(<a target="_blank" href="/gaelspell/index-en.html">GaelSpell</a>),
part-of-speech tagger, morphological analyzer, grammar checker
(<a target="_blank" href="/gramadoir/index-en.html"><i>An Gramadóir</i></a>),
electronic thesaurus, and web-crawled corpora
(both <a target="_blank" href="http://crubadan.org/">monolingual</a> and
bilingual).
</p>
<p>
Below I've listed some of the features of <i>aimsigh.com</i>
that are not available with standard general-purpose search
engines (and, because of the marginalized position of Irish,
are unlikely ever to be so &mdash; more on this point below).
</p>
<p>
<b>1. Spelling standardization</b><br>
Irish underwent a major spelling reform in the 1940s
and 1950s, introducing the so-called <i>Caighdeán Oifigiúil</i> (Official Standard); compare the same sentence in pre-reform spelling (first) and in the standard (second):<br><br>
<i>An t-árd-cheannas ar na Fórsaíbh Cosanta is le dligheadh a riaghlóchar an modh ar a n-oibreochar é.</i><br>
<i>An t-ardcheannas ar na Fórsaí Cosanta is le dlí a rialófar an modh ar a n-oibreofar é.</i><br><br>
Most writing in Irish today conforms to the standard,
though certainly not 100%: many writers
still prefer to use spellings and grammatical
constructs that reflect more accurately their
own dialect of Irish.  In addition there are quite
a few historical and legal documents on the web that use
pre-standard orthography, most of them produced
and published by
the Irish government (one major source is the site
<a target="_blank" href="http://www.achtanna.ie/">www.achtanna.ie</a>,
which contains the full text of all Acts enacted by the Irish
Parliament since 1922).
So independent of the side one might take in the debate over the
merits of the <i>Caighdeán</i>, these pre-standard and dialect
documents are “out there”, and we are faced with the
inescapable engineering challenge of making them easily available
through a search interface.
</p>
<p>
To achieve this, the <i>aimsigh.com</i> engine employs a sophisticated 
“Irish standardizer” which amounts to a finite state transducer
encoding the morphological rules of non-standard Irish together
with mappings to standardized forms.  These rules are augmented
with a large database of non-standard/standard word pairs that was extracted in
part from a parallel corpus of
English and Irish texts (read more about this here:
<a target="_blank" href="https://kevinscannell.com/files/ccgb.pdf">Applications of parallel corpora to the development of monolingual language technologies</a>).
The end result is that if a user selects the box
<i>Litriú neamhchaighdeánach</i> (Non-standard spelling)
on the main <i>aimsigh.com</i> page, and enters a word like
<i>Gaeilge</i> (Irish language), any documents containing either
<i>Gaeilge</i> or one
of the non-standard spellings <i>Gaolainn, Gaedhilge, Gaedhilg, Gaedhealg, Gaedhealaing, Gaeilic, Gaoidhealg, Gaodhalainn, Gaelainn, Gaeluinn, ...</i>
will be retrieved...
</p>
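<p>
As a rough illustration of the effect on search (not of the transducer itself),
standardization can be thought of as mapping every token to a canonical key
both at indexing time and at query time. The Python sketch below replaces the
finite state transducer with a tiny hand-written table built from spellings
quoted on this page; the names <tt>STANDARD_FORM</tt>, <tt>standardize</tt>,
<tt>build_index</tt> and <tt>search</tt>, and the four toy documents, are
assumptions made for the example.
</p>

```python
from collections import defaultdict

# Toy stand-in for the standardizer: non-standard spelling -> standard spelling.
# A real system would combine rewrite rules with a much larger pair database.
STANDARD_FORM = {
    "gaolainn": "gaeilge",
    "gaedhilge": "gaeilge",
    "gaedhilg": "gaeilge",
    "gaedhealg": "gaeilge",
    "gaelainn": "gaeilge",
    "gaeluinn": "gaeilge",
    "dligheadh": "dlí",
    "fórsaíbh": "fórsaí",
    "ionnanas": "ionannas",  # common misspelling, handled the same way
}

def standardize(token):
    """Map a token to its standard spelling (naive lowercasing, for brevity)."""
    key = token.lower()
    return STANDARD_FORM.get(key, key)

def build_index(documents):
    """Inverted index keyed by standardized form."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[standardize(token.strip(".,;:!?"))].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing any spelling variant of the query."""
    return sorted(index.get(standardize(query), set()))

# Invented toy documents, for illustration only.
docs = {
    "a": "Labhair sí Gaolainn",
    "b": "Leabhar Gaedhilge",
    "c": "Cúrsa Gaeilge",
    "d": "Irish course",
}
index = build_index(docs)
print(search(index, "Gaeilge"))  # ['a', 'b', 'c']
```

<p>
Because the query is standardized as well, a search for one of the older
spellings retrieves the same three documents in this sketch.
</p>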
<p>
A nice side-effect of this feature is that the standardization process
also corrects common spelling errors, so if you can't remember
how to spell <i>ionannas</i> and you search for <i>ionnanas</i>, you will
still retrieve all documents containing the correct spelling.
Conversely, a search for <i>Údarás</i> (“Authority”, spelled correctly)
will turn up documents containing
the misspellings <i>Udarás</i> or <i>Údaras</i>, which is probably the
desired behavior, since such misspellings are remarkably common,
even in presumably-edited texts.
</p>
<p>
<b>2. Initial mutations</b><br>
The beginning of a word in Irish can be written in 
different ways depending on the grammatical context.
For example, <i>bean</i> (woman) becomes <i>an bhean</i> after
the definite article <i>an</i> and <i>ár mbean</i> after the 
possessive pronoun <i>ár</i> (our).    Several other possibilities
occur when a word begins with a vowel: <i>athair</i> (father)
can become <i>t-athair, n-athair, d'athair</i>, etc.
In most cases, the presence or absence of one of these 
mutations has no real effect on the semantics of the word
in question, somewhat like the presence or absence of an
initial capital on a (non-proper) noun in English.
In other words, someone searching for information on lexicography
(<i>foclóireacht</i>) would surely be just as happy
to retrieve documents containing <i>fhoclóireacht</i> or <i>bhfoclóireacht</i>.
</p>
<p>
This behavior can be achieved by selecting
the button <i>Focail chlaochlaithe</i> (Mutated words) on
the main <i>aimsigh.com</i> page (I generally select this button
for all of my own searches).  For example, to find documents concerning
the country of Sudan, one might search for the term <i>Súdáin</i>,
but since this word generally follows the definite article in Irish,
and is therefore prefixed with a “t”, it is much more effective
to search with <i>aimsigh.com</i>...
</p>
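<p>
One simple way to get this behavior is to expand the query term into its
possible mutated forms before consulting the index. The sketch below is a
deliberately simplified version of the rules (it over-generates, and it ignores
the restrictions on which clusters actually mutate); the function name
<tt>mutated_forms</tt> is an assumption and this is not necessarily how
<i>aimsigh.com</i> implemented the feature.
</p>

```python
VOWELS = set("aeiouáéíóú")
LENITABLE = set("bcdfgmpst")
# Letters written in front of the initial consonant under eclipsis.
ECLIPSIS_PREFIX = {"b": "m", "c": "g", "d": "n", "f": "bh", "g": "n", "p": "b", "t": "d"}

def mutated_forms(word):
    """Return the word together with its plausible initial mutations."""
    forms = {word}
    if not word:
        return forms
    first = word[0].lower()
    if first in VOWELS:
        # Hyphen only before a lowercase vowel: t-athair, but tAcht.
        joiner = "" if word[0].isupper() else "-"
        forms.update({"t" + joiner + word, "n" + joiner + word,
                      "h" + word, "d'" + word})
    else:
        if first in LENITABLE:
            forms.add(word[0] + "h" + word[1:])       # lenition: bean -> bhean
        if first in ECLIPSIS_PREFIX:
            forms.add(ECLIPSIS_PREFIX[first] + word)  # eclipsis: bean -> mbean
        if first == "s":
            forms.add("t" + word)                     # Súdáin -> tSúdáin
    return forms

print(sorted(mutated_forms("bean")))  # ['bean', 'bhean', 'mbean']
```

<p>
Over-generation is mostly harmless here, since a form that never occurs in
Irish text simply matches nothing in the index.
</p>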
<p>
<b>3. Inflectional morphology</b><br>
Irish morphology is much more complicated than English
morphology, and because of this, it is desirable to perform
“stemmed” searches in many instances.  For example, 
if one is interested
in Irish language schools it is convenient to be able
to search for the single term <i>gaelscoil</i> and retrieve all documents
containing <i>gaelscoil, gaelscoile</i> (genitive), 
or <i>gaelscoileanna</i> (plural) as well as the mutated forms
of these words (<i>ghaelscoil, ngaelscoil</i>, etc., nine words in all).
Verbal morphology is even more complicated,
with a single root word typically producing more than 50
inflected/mutated forms.
</p>
<p>
Stemmed searches can be performed by selecting
the button <i>Focail chlaochlaithe infhillte</i> (Mutated and inflected words).
For example, if you are interested in monetary policy, you would
naturally try to search for <i>airgeadaíocht</i>; while Google returns 
only a couple of results for this query, we get several hundred with the
<i>aimsigh.com</i> stemming feature activated...<br><br>
</p>
<p>
Stemmed searching has a mixed reputation in information retrieval circles,
though this might be due in part to the fact that most research has
been done on English or other languages with similarly limited 
morphological complexity.
The other issue is that a lot of online stemming is done with
“resource-light” approaches like the Porter algorithm;
<i>aimsigh.com</i> instead uses a full lexicon and morphological analysis to
guarantee correct stemming.
It is worth noting also that Irish morphology is not nearly as complicated
as that of languages like Basque, Swahili, or Hiligaynon, 
where a certain amount of stemming
would seem to be absolutely essential.
</p>
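<p>
To make the contrast with rule-based stemmers concrete, here is a minimal
sketch of lexicon-driven stemming: every surface form is looked up in a table
generated from a lexicon, and unknown words are left alone rather than guessed
at. The one-entry <tt>LEXICON</tt> and the helper names are assumptions for
the example; a real lexicon would also have to deal with surface forms that
belong to more than one headword.
</p>

```python
# Hypothetical miniature lexicon: headword -> inflected forms (headword included).
LEXICON = {
    "gaelscoil": ["gaelscoil", "gaelscoile", "gaelscoileanna"],
}

LENITABLE = set("bcdfgmpst")
ECLIPSIS_PREFIX = {"b": "m", "c": "g", "d": "n", "f": "bh", "g": "n", "p": "b", "t": "d"}

def mutations(form):
    """Unmutated, lenited and eclipsed variants of a consonant-initial form."""
    variants = {form}
    first = form[0]
    if first in LENITABLE:
        variants.add(first + "h" + form[1:])
    if first in ECLIPSIS_PREFIX:
        variants.add(ECLIPSIS_PREFIX[first] + form)
    return variants

def build_stem_table(lexicon):
    """Map every inflected and mutated surface form back to its headword."""
    table = {}
    for headword, forms in lexicon.items():
        for form in forms:
            for surface in mutations(form):
                table[surface] = headword
    return table

STEM = build_stem_table(LEXICON)

def stem(token):
    """Look the token up; unknown words are returned unchanged (lowercased)."""
    key = token.lower()
    return STEM.get(key, key)

print(len(STEM))                # 9
print(stem("ngaelscoileanna"))  # gaelscoil
```

<p>
The table has nine entries for this noun, matching the count given above.
</p>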
<p>
<b>4. Only Irish language documents</b><br>
Irish language documents on the web are drowned in a veritable 
sea of English, Spanish, German, etc., and 
simple searches with Google or Yahoo! are often fruitless 
because of this.
One issue of course is that some Irish words
accidentally coincide with English words: think of <i>bean</i> (woman), 
<i>punt</i> (pound), <i>file</i> (poet), or <i>tine</i> (fire).
And English is not the only problem;
if you search for a very Irish-looking word like <i>luach</i> (value)
with a standard search engine, you'll turn up very few
Irish documents because this word
also means “calendar” in Hebrew.
In addition, there are many lexical conflicts with
Scottish Gaelic, which has a somewhat smaller
but not inconsequential web presence; a search
for a word like <i>ceist</i> will yield documents
split roughly half-and-half between Scottish and
Irish Gaelic.
So perhaps, in frustration, you swear off ever using
any words that happen to coincide with anything from
any other written language, and decide to search for
<i>“Bunreacht na hÉireann”</i> (Constitution of Ireland). 
As it turns out, the first seven hits on Google point
to English documents!
A related (but obviously less important) issue is the irritation
of having to click through
“Choose your language” splash pages on Irish governmental web sites
(when indeed a choice is offered).   You are taken to such a page
for example when you
search for <i>Foras na Gaeilge</i> with Google;
the top hit for <i>aimsigh.com</i> is instead
the Irish language home page....
</p>
<p>
This feature is, of course, not particularly remarkable 
in a technological sense; most search 
engines offer the ability to restrict results
to particular languages.  The problem is that they usually
only offer a selection of the most prominent 30-35 languages
on the web.  Now without too much work (and granted a certain amount
of volunteer help from native speakers) I was able
to train statistical language recognizers for the “next” 150
or so languages and run web crawlers
for each of them (see <a target="_blank" href="http://crubadan.org/">Corpus building for minority languages</a> for more information).  So I suspect that the
restriction to 30+ languages on Google's site must be a user interface
decision on their part, so that 
Swedish speakers won't have to scan through a list of 
200 (or 500 or 1000) languages to find Swedish in a pull-down menu.
</p>
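<p>
For readers who have not built one, a statistical language recognizer can be
very small. The sketch below compares character trigram counts using cosine
similarity, which is one common approach and not necessarily the one used for
the crawlers mentioned above. The two training strings are sentences taken
from this page and are far too short for real use; all names are assumptions.
</p>

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count character trigrams of the lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

# Toy training data: one sentence per language, both quoted from this page.
TRAINING = {
    "ga": "An t-ardcheannas ar na Fórsaí Cosanta is le dlí a rialófar "
          "an modh ar a n-oibreofar é.",
    "en": "The full text of all Acts enacted by the Irish Parliament since 1922.",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))

print(identify("rialófar an modh"))      # ga
print(identify("enacted by the Irish"))  # en
```

<p>
A usable recognizer would be trained on much more text per language and would
reject input that resembles none of the profiles instead of always returning
the best match.
</p>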
<p>
<b>5. Non-standard representations of <i>síntí fada</i></b><br>
For the non-Irish-speaking readers,
<i>síntí fada</i> are the acute accents that appear on many
vowels in Irish.   Back in the day before 8-bit email was
widely available, messages sent to popular email discussion
lists like
<a target="_blank" href="https://listserv.heanet.ie/gaelic-l.html">GAELIC-L</a>
were written with the infamous <i>slaiseanna</i>
to indicate the accents: <tt>u/rsce/alai/</tt> = <i>úrscéalaí</i> (novelist)
or <tt>te/acschomhad</tt> = <i>téacschomhad</i> (text file).
As a consequence, the archives of such mailing lists
(which form the single largest source
of Irish language material on the web as of this writing)
are essentially invisible to standard search engines (which
would all index the above words as separate units
<i>u</i> + <i>rsce</i> + <i>alai</i>
or
<i>te</i> + <i>acschomhad</i>).
In contrast, the <i>aimsigh.com</i> engine automatically detects pages
that use unusual conventions and converts them to a standard format for
indexing....
</p>
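<p>
The conversion step for the slash convention is easy to sketch: a vowel
followed by a slash becomes the corresponding accented vowel. The function
name <tt>restore_fadas</tt> is an assumption, and the sketch covers only this
one convention.
</p>

```python
import re

# Plain vowel -> vowel with síneadh fada.
ACUTE = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")
VOWEL_SLASH = re.compile(r"([aeiouAEIOU])/")

def restore_fadas(text):
    """Rewrite 'vowel + slash' as the accented vowel."""
    return VOWEL_SLASH.sub(lambda m: m.group(1).translate(ACUTE), text)

print(restore_fadas("u/rsce/alai/"))   # úrscéalaí
print(restore_fadas("te/acschomhad"))  # téacschomhad
```

<p>
Applied blindly, the same rule would damage ordinary text such as
“he/she”, which is why the conversion should run only on pages already
detected as using the convention.
</p>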
<p>
<b>6. Decapitalization according to Irish conventions</b><br>
When certain initial mutations (“t” and “n” before vowels) are
added to uppercase words, the mutating letter is written in 
lowercase and without a hyphen: <i>Acht</i> &gt; <i>tAcht</i> (a legislative act),
or <i>Ocht</i> &gt; <i>nOcht</i> (eight).  
On the other hand, the lowercase versions
of the same words are written <em>with</em> hyphens: <i>t-acht</i>, <i>n-ocht</i>.
The really bad news is that in these two cases, naïve conversion to lowercase
produces completely different Irish words (<i>tacht</i> is a verb meaning
“choke” and <i>nocht</i> is either a verb or adjective meaning “bare”). 
So if (for whatever reason)
you enter <i>tacht</i> into Google, the Irish language results that are
retrieved are all incorrect, referring to <i>tAcht</i>...
</p>
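<p>
A minimal sketch of lowercasing that respects this convention follows: when a
lowercase “t” or “n” is attached directly to an uppercase vowel, a hyphen is
inserted before the word is lowercased. The function name
<tt>irish_lower</tt> is an assumption, and the sketch handles single words
only.
</p>

```python
import re

# Lowercase t or n written directly before an uppercase vowel (tAcht, nOcht).
PREFIXED_CAPITAL = re.compile(r"^([tn])([AEIOUÁÉÍÓÚ])")

def irish_lower(word):
    """Lowercase a word, restoring the hyphen after a t- or n- prefix."""
    hyphenated = PREFIXED_CAPITAL.sub(
        lambda m: m.group(1) + "-" + m.group(2), word)
    return hyphenated.lower()

print(irish_lower("tAcht"))  # t-acht
print(irish_lower("nOcht"))  # n-ocht
print(irish_lower("Tacht"))  # tacht
```

<p>
With this in place, <i>tAcht</i> and the verb <i>tacht</i> are indexed under
different keys, so a search for one no longer retrieves the other.
</p>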
<p>
<b>7. Automatic translation and augmentation of document titles</b><br>
One of my main interests is in machine translation, and I have
a rudimentary system in place for translating
English text to Irish. 
This is used when documents are harvested from the web
to translate boilerplate English titles into Irish; 
for example “GAELIC-L Archives — June 2000” would appear in 
<i>aimsigh.com</i>
search results as “Cartlann GAELIC-L — Meitheamh 2000”.
Or bilingual titles like “TG4 — Irish language television channel — Teilifis Gaeilge” [sic] are shortened and corrected to 
“TG4 — Teilifís Ghaeilge”.   This makes for more effective
searching (since we assume primarily Irish language search terms
will be used) and also a more pleasing Irish-only visual experience
when scanning results.
</p>
<p>
In addition, we augment useless titles with additional 
information to help clarify the contents of the document.
Here “useless” means, in the worst case, no title at all, as is the case 
for articles in the newspaper <i>Lá</i>,
but also refers to situations in which the title, even if translated, fails
to provide useful clues as to the contents of the document, or fails
to distinguish the document from others on the site; good examples
are pages from the Irish Times <i>Teanga Bheo</i> site, 
the majority of which are titled simply
“An Teanga Bheo — The Irish Times weekly Irish language site”.
For the <i>Lá</i> articles, we do our best to extract the
title from the body of the HTML document and construct a title from that.
For the <i>Teanga Bheo</i> articles, it is possible to extract the date of
publication from the URL; this is translated to Irish and appended
to the title.  Similar tricks are used for a number of other sites...
</p>
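<p>
The date trick can be sketched as follows. The URL layout assumed here
(<tt>/YYYY/MMDD/</tt>), the example address, and the function name
<tt>augment_title</tt> are all hypothetical; only the Irish month names are
fixed.
</p>

```python
import re

MONTHS_GA = ["Eanáir", "Feabhra", "Márta", "Aibreán", "Bealtaine", "Meitheamh",
             "Iúil", "Lúnasa", "Meán Fómhair", "Deireadh Fómhair", "Samhain",
             "Nollaig"]

# Hypothetical URL layout: .../YYYY/MMDD/...
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})(\d{2})/")

def augment_title(title, url):
    """Append an Irish-language date taken from the URL, if one is found."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return title
    year, month, day = (int(group) for group in match.groups())
    if month not in range(1, 13):
        return title  # not a plausible date; leave the title alone
    return f"{title} ({day} {MONTHS_GA[month - 1]} {year})"

# Made-up address, used only to exercise the pattern.
print(augment_title("An Teanga Bheo",
                    "http://example.com/teangabheo/2005/0413/alt.html"))
# An Teanga Bheo (13 Aibreán 2005)
```

<p>
Titles whose URLs carry no recognizable date are returned unchanged, so the
step is safe to run over every harvested page.
</p>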
<p>
<b>8. Manually-curated document database</b><br>
Above, we bemoaned the fact that Irish language documents form
only the tiniest fraction of all those available on the web.
On the other hand, once the Irish documents have been separated
from the rest, the small size of the resulting database
(hundreds of thousands of documents vs. billions) is a great
advantage. For one thing, there seem to be quite a few
Irish documents (thousands) out there that are not indexed at all by Google
(some examples linked below); I suspect that we had better luck in
turning these up by focusing our crawling on a relatively small number
of especially productive (mostly .ie) domains...
</p>
<p>
Conversely, “spam sites” are becoming more
of a problem for the large search engines, and quite a few searches
for Irish language documents will turn up useless sites that
simply repeat verbatim excerpts from various other pages.
For example, if you search for the title “Irish Times weekly Irish language”
on Google, there are 44 unique results returned, of which only 5 appear
to be legitimate.  Because of the smaller scale of <i>aimsigh.com</i>
it is an easy matter to detect such sites semi-automatically
and then remove them from our indices...
</p>
<p>
<b>9. Irish-centric page ranking</b><br>
Even when there are documents available that contain 
legitimate Irish text, it sometimes happens that the page ranking algorithms
are skewed in favor of sites that are heavily linked
from non-Irish web pages.  A good example is the Open Directory
Project <a target="_blank" href="https://dmoz-odp.org/">dmoz.org</a> which often 
gets high page rankings because of its wide popularity (especially
among English and German speakers).   Unfortunately almost
no Irish language sites are contained in the directory and so
results from dmoz.org or one of its many mirrors are of little
use to an Irish speaker.   This can be illustrated by searching for
something like <i>Eolaíocht</i> with Google; about half of the returned
results are from Open Directory mirror sites...
</p>
<p>
Another example is the “Foras na Gaeilge” query discussed and linked above; 
the default English page is listed as the second search result,
while the default Irish page is nowhere to be found.
When ranking pages, <i>aimsigh.com</i> only considers links
emanating from other Irish language pages; while this system is
still not perfect it seems to improve matters greatly.
</p>
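<p>
The idea of counting only links from Irish language pages can be illustrated
with a plain in-link count, used here as a simple stand-in for whatever
link-based score is actually computed. The toy link graph, the page labels,
and the function name <tt>irish_inlink_counts</tt> are assumptions.
</p>

```python
from collections import Counter

def irish_inlink_counts(links, language_of):
    """Count in-links per target, ignoring links from non-Irish pages.

    links: iterable of (source, target) pairs
    language_of: dict mapping each page to a language code
    """
    counts = Counter()
    for source, target in links:
        if language_of.get(source) == "ga" and source != target:
            counts[target] += 1
    return counts

# Toy graph: a directory page is heavily linked from English pages only.
links = [
    ("en1", "dir"), ("en2", "dir"), ("en3", "dir"),
    ("ga1", "dir"),
    ("ga1", "foras"), ("ga2", "foras"),
]
language_of = {"en1": "en", "en2": "en", "en3": "en",
               "ga1": "ga", "ga2": "ga", "dir": "en", "foras": "ga"}

counts = irish_inlink_counts(links, language_of)
ranked = [page for page, _ in counts.most_common()]
print(ranked)  # ['foras', 'dir']
```

<p>
Counting every link would put the directory page first with four in-links;
restricting the sources to Irish pages reverses the order.
</p>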