Introduction to Cross-language Information Retrieval
– CLIR is a subfield of information retrieval.
– CLIR deals with retrieving information in a different language from the user’s query.
– Synonyms for CLIR include cross-lingual information retrieval, translingual information retrieval, and multilingual information retrieval.
– The broader term "multilingual information retrieval" covers both retrieval from multilingual collections and technology adapted from handling material in one language to handling material in another.
– CLIR systems use various translation techniques: dictionary-based, parallel-corpora-based, comparable-corpora-based, and machine-translation-based.
Improvements in CLIR Systems
– CLIR systems have improved significantly; the most accurate ones are nearly as effective as monolingual systems.
– CLIR technology benefits users with poor to moderate competence in the target language.
– CLIR services include technologies like morphological analysis, decompounding, and translation mechanisms.
– CLIR systems face challenges with coverage due to variation in human language.
– CLIR is particularly useful when users know the target language only to some extent.
Related Information Access Tasks
– Other information access tasks like media monitoring, information filtering, sentiment analysis, and information extraction require sophisticated models.
– These tasks typically involve more processing and analysis of the information items of interest.
– The processing for these tasks needs to be aware of the specifics of the target languages.
– CLIR technology can be applied to these tasks to improve their effectiveness.
– CLIR systems can handle inflection, compound terms, and translation of queries.
Workshops and Conferences Related to CLIR
– The first workshop on CLIR was held in Zürich during the SIGIR-96 conference.
– Workshops on CLIR have been held yearly since 2000 at the Cross Language Evaluation Forum (CLEF) meetings.
– The Text Retrieval Conference (TREC) serves as a point of reference for the CLIR subfield.
– Early CLIR experiments were conducted at TREC-6 in 1997.
– Researchers discuss their findings regarding different CLIR systems and methods at TREC.
Additional Resources and References
– EXCLAIM (EXtensible Cross-Linguistic Automatic Information Machine) is a related technology.
– CLEF (Conference and Labs of the Evaluation Forum, formerly the Cross-Language Evaluation Forum) is a forum for evaluating CLIR systems.
– References include articles on matching meaning for CLIR, introduction to CLIR approaches, and multilingual information access.
– The proceedings of the first CLIR workshop can be found in the book ‘Cross-Language Information Retrieval.’
Cross-language information retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information written in a language different from that of the user's query. The term has several synonyms, of which the most frequent are perhaps cross-lingual information retrieval, translingual information retrieval, and multilingual information retrieval. "Multilingual information retrieval" (MLIR) is the more general term: it covers both technology for retrieving from multilingual collections and technology that has been adapted from handling material in one language to handling material in another. MLIR studies systems that accept queries in various languages and return objects (text and other media) in various languages, translated into the user's language. Cross-language information retrieval refers more specifically to the use case in which users formulate their information need in one language and the system retrieves relevant documents in another. To do so, most CLIR systems rely on translation. CLIR techniques can be classified by the translation resource they use:
- Dictionary-based CLIR techniques
- Parallel-corpora-based CLIR techniques
- Comparable-corpora-based CLIR techniques
- Machine-translation-based CLIR techniques
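As an illustration of the first category, the following is a minimal sketch of dictionary-based query translation, not a description of any particular system. The bilingual dictionary `EN_DE`, the collection `DOCS`, and the function names are hypothetical, and scoring is a plain count of matching translated terms rather than a real ranking model.

```python
from collections import defaultdict

# Hypothetical toy English -> German dictionary; a real system would use a
# much larger machine-readable dictionary.
EN_DE = {
    "house": ["haus", "gebäude"],
    "price": ["preis"],
}

# Hypothetical target-language (German) collection.
DOCS = {
    "d1": "der preis für ein haus in zürich",
    "d2": "das wetter in zürich",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are kept unchanged, which helps
        # with proper nouns that are spelled the same in both languages.
        translated.extend(dictionary.get(term, [term]))
    return translated


def build_index(docs):
    """Build an inverted index mapping each token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def search(query, dictionary, index):
    """Rank documents by the number of translated query terms they contain."""
    scores = defaultdict(int)
    for term in translate_query(query, dictionary):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


INDEX = build_index(DOCS)
```

With these toy data, `search("house price zürich", EN_DE, INDEX)` ranks `d1` ahead of `d2`, because `d1` matches three translated terms and `d2` only one. Keeping every translation of an ambiguous term is the simplest policy; it trades precision for coverage.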
CLIR systems have improved so much that the most accurate multilingual and cross-lingual ad hoc information retrieval systems today are nearly as effective as monolingual systems. Other related information access tasks, such as media monitoring, information filtering and routing, sentiment analysis, and information extraction, require more sophisticated models and typically more processing and analysis of the information items of interest. Much of that processing needs to be aware of the specifics of the target languages in which it is deployed.
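The text does not say which measure underlies the comparison with monolingual systems. One common way to quantify it, assumed here purely for illustration, is to report the mean average precision (MAP) of a cross-language run as a fraction of the MAP of a monolingual run on the same queries. All data and names below are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of the per-query average precision over all judged queries."""
    per_query = [average_precision(runs[q], judgments[q]) for q in judgments]
    return sum(per_query) / len(per_query)


# Hypothetical relevance judgments and system outputs for two queries.
JUDGMENTS = {"q1": {"d1", "d3"}, "q2": {"d2"}}
MONOLINGUAL_RUN = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1", "d3"]}
CROSS_LANGUAGE_RUN = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d3", "d1"]}

mono_map = mean_average_precision(MONOLINGUAL_RUN, JUDGMENTS)
clir_map = mean_average_precision(CROSS_LANGUAGE_RUN, JUDGMENTS)

# Cross-language effectiveness relative to the monolingual baseline.
relative_effectiveness = clir_map / mono_map
```

In this toy example the cross-language run reaches 11/12 (about 92%) of the monolingual MAP, because it ranks one relevant document for `q1` one position lower.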
The many ways in which human language varies pose coverage challenges for information retrieval systems: texts in a collection may treat a topic of interest but use terms or expressions that do not match the user's expression of the information need. This can be true even in the monolingual case, but it is especially true in cross-lingual information retrieval, where users may know the target language only to some extent. The benefits of CLIR technology have been found to be greater for users with poor to moderate competence in the target language than for those who are fluent. Specific technologies in place for CLIR services include morphological analysis to handle inflection, decompounding (compound splitting) to handle compound terms, and translation mechanisms to translate a query from one language to another.
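The first two of these technologies can be sketched as follows. This is a simplified illustration under stated assumptions: the lexicon `LEXICON`, the suffix list `SUFFIXES`, and both functions are hypothetical, and naive suffix stripping is only a crude stand-in for real morphological analysis.

```python
# Hypothetical German lexicon of known base forms and compound parts.
LEXICON = {"kranken", "haus", "kosten", "preis"}

# Hypothetical list of inflectional suffixes, tried in this order.
SUFFIXES = ("en", "er", "e", "s", "n")


def strip_suffix(word, lexicon):
    """Crude inflection handling: drop one suffix if that yields a known word."""
    word = word.lower()
    if word in lexicon:
        return word
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in lexicon:
            return stem
    return word  # unknown form: leave unchanged


def decompound(word, lexicon, min_len=3):
    """Split a compound into known lexicon parts, or return [word] if no full split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            # Accept the split only if the remainder was fully analysed.
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]
```

For example, `decompound("Krankenhauskosten", LEXICON)` yields the parts `kranken`, `haus`, and `kosten`, each of which can then be matched or translated separately, while `strip_suffix("Preise", LEXICON)` maps the plural back to `preis`. A word with no complete split, such as a proper noun, is returned unchanged.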
The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of the Cross Language Evaluation Forum (CLEF). Researchers also convene at the annual Text Retrieval Conference (TREC) to discuss their findings regarding different systems and methods of information retrieval, and the conference has served as a point of reference for the CLIR subfield. Early CLIR experiments were conducted at TREC-6, held at the National Institute of Standards and Technology (NIST) on November 19–21, 1997.
Google Search had a cross-language search feature that was removed in 2013.