A Knowledge-Based CLIR Model for Specific Domain Collections

Chapter

Publication Date:

2015

Abstract:

Cross-Language Information Retrieval (CLIR) applications, which aim to provide access to information on the web in
many languages, are attracting several important players in the Information Retrieval (IR) field, such as Google
and Microsoft. In CLIR applications, information is usually searched by means of a query expressed in the
user’s first language. This query is automatically translated into the desired foreign language, and the results
are translated back into the user’s first language.
This process relies on two distinct translation stages: query translation and document translation.
Query translation is the translation of the query from the user’s first language into the desired foreign
language, whereas document translation is the back translation into the user’s language of the
relevant documents found by means of the translated query. Translation is usually based on bilingual or
multilingual Machine Readable Dictionaries (MRD), Machine Translation (MT) and parallel corpora.
CLIR success clearly depends on the quality of translation, so inaccurate or incorrect translations can
cause serious problems in retrieving relevant information. A very frequent source of mistranslations in
domain-specific texts is multiword units (MWUs), and particularly terminological word
compounds. MWUs cover a wide range of lexical constructions composed of two or more words with an
opaque meaning, i.e. the meaning of a unit is not always the result of the sum of the meanings of the single
words that are part of the unit.
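
The effect of opaque MWUs on query translation can be illustrated with a minimal sketch. It is not the
model described in this chapter: the two toy English-Italian dictionaries, the function names and the greedy
longest-match strategy are all assumptions made for illustration. It compares word-by-word lookup with a
lookup that tries multiword entries first.

```python
# Toy bilingual resources (English -> Italian), invented for illustration only.
WORD_DICT = {"still": "ancora", "life": "vita", "painting": "dipinto"}
MWU_DICT = {("still", "life"): "natura morta"}


def translate_word_by_word(tokens, word_dict):
    """Translate each token independently; unknown tokens pass through."""
    return [word_dict.get(tok, tok) for tok in tokens]


def translate_with_mwus(tokens, word_dict, mwu_dict):
    """Greedy longest-match: try multiword entries before single words."""
    max_len = max((len(key) for key in mwu_dict), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in mwu_dict:
                out.append(mwu_dict[key])
                i += n
                break
        else:
            # No multiword entry starts here: fall back to the single word.
            out.append(word_dict.get(tokens[i], tokens[i]))
            i += 1
    return out


query = "still life painting".split()
print(translate_word_by_word(query, WORD_DICT))          # ['ancora', 'vita', 'dipinto']
print(translate_with_mwus(query, WORD_DICT, MWU_DICT))   # ['natura morta', 'dipinto']
```

The word-by-word output sums the meanings of the single words and misses the compound, while the
MWU-aware lookup returns the lexicalized equivalent. The sketch leaves out target-language word order,
inflection and disambiguation.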
MWUs are not always easy to identify, since co-occurrence among the lexemes forming the units can vary a
great deal. In domain-specific texts, compound terms, primarily noun compounds, are very frequent. In all
languages there is a close relationship between terminology and multiword units, particularly word
compounds. In fact, word compounds account in some cases for 90% of the terms belonging to a
domain-specific language.
Unlike generic simple words, terminological word compounds are mono-referential, i.e. they are
unambiguous and refer to only one specific concept in one special language, even though they may occur in
more than one domain. Their meaning, like that of all compound words, cannot be directly inferred by a non-expert
from the parts of the compound, because it depends on the specific area and the concept it refers
to.
CLIR applications are typically used in domain-specific collections, such as Europeana Connect, which is aimed
at facilitating multilingual access to Europeana.eu, a web portal that acts as an interface to millions of
books, paintings, films, museum objects and archival records that have been digitized throughout Europe,
regardless of the users’ native language. In Europeana Connect, users can submit queries in their native
language, retrieve documents in different languages and acquire information about objects
from several sources across all European countries. The retrieved information is translated back into the user’s
language by means of MT. Typical Europeana item descriptions contain many compound terms and, as shown
in Monti (2013), the translations produced by MT are full of mistranslations. Processing and translating these
forms of compound words is not a straightforward task, since their morpho-syntactic and linguistic behavior is
quite complex and varies across the different types, and their translations are practically unpredictable.

Our contribution focuses on the outline of the knowledge-based resources (dictionary, ontology and rules),
developed by means of NooJ and used in the experimentation of a knowledge-based

Iris type:

2.1 Contribution in volume (Chapter or Essay)

Keywords:

Cross-Language Information Retrieval; Machine Translation; Multiword Expressions; ontology; linguistic resources

List of contributors: