History and Significance of Information Extraction
– Information extraction dates back to the late 1970s in the early days of NLP.
– JASPER, a commercial system built in the mid-1980s, provided real-time financial news to financial traders.
– Message Understanding Conferences (MUC) from 1987 onwards spurred advancements in IE.
– MUC focused on domains such as naval operations messages, terrorism in Latin American countries, joint ventures and microelectronics, news articles on management changes, and satellite launch reports.
– Support from the U.S. Defense Advanced Research Projects Agency (DARPA) helped automate tasks for government analysts.
– IE is significant due to the growing amount of unstructured information available.
– Tim Berners-Lee advocates for more content to be made available as structured data.
– Unstructured documents on the web lack semantic metadata, hindering machine processing.
– IE can transform unstructured data into a more machine-readable format.
– IE is used to scan and extract information from documents to populate databases.
Tasks and Subtasks of Information Extraction
– Template filling involves extracting a fixed set of fields from a document.
– Event extraction aims to output event templates from an input document.
– Knowledge Base Population involves filling a database with facts extracted from documents.
– Named entity recognition identifies known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
– Coreference resolution detects links between previously extracted named entities.
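As a rough illustration of the named entity recognition subtask, the sketch below tags a few temporal and numerical expressions with hand-written patterns. The labels (`DATE`, `MONEY`, `PERCENT`), the patterns, and the sample sentence are assumptions made for this example; practical systems rely on much richer grammars or learned models.

```python
import re

# Illustrative patterns only; they cover a tiny fraction of real-world formats.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = {
    "DATE": re.compile(rf"\b{MONTHS} \d{{1,2}}, \d{{4}}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def tag_expressions(text):
    """Return (label, matched text, start offset) triples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])


# Invented sample sentence.
sample = "On March 3, 2021 the firm paid $4.5 million, a 12% premium."
for label, span, start in tag_expressions(sample):
    print(label, repr(span), start)
```

Keeping the character offset with each match makes it possible for later steps, such as coreference resolution, to refer back to the position of each mention.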
Relationship Extraction and Table Extraction
– Relationship extraction identifies relations between entities.
– Examples include ‘PERSON works for ORGANIZATION’ and ‘PERSON located in LOCATION.’
– Semi-structured information extraction aims to restore information structure that was lost through publication.
– Table extraction involves finding and extracting tables from documents.
– Table information extraction is a more complex task that involves understanding the roles and information presented in the tables.
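Relations such as ‘PERSON works for ORGANIZATION’ can be sketched as rules over entities that an earlier step has already recognized. The example below is a minimal sketch under that assumption: the `Entity` type, the trigger phrases, the relation names, and the sample text are all hypothetical, and only adjacent entity pairs are considered.

```python
import re
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    text: str
    label: str  # e.g. "PERSON", "ORGANIZATION", "LOCATION"
    start: int
    end: int


# (subject label, trigger on the text between the entities, object label, relation)
RULES = [
    ("PERSON", re.compile(r"^\s+(?:works for|is employed by)\s+$"),
     "ORGANIZATION", "works_for"),
    ("PERSON", re.compile(r"^\s+(?:is located in|lives in)\s+$"),
     "LOCATION", "located_in"),
]


def make_entity(text: str, surface: str, label: str) -> Entity:
    """Locate the first occurrence of `surface` and wrap it as an Entity."""
    start = text.index(surface)
    return Entity(surface, label, start, start + len(surface))


def extract_relations(text: str, entities: List[Entity]) -> List[Tuple[str, str, str]]:
    """Apply each rule to every pair of neighbouring entities."""
    relations = []
    ordered = sorted(entities, key=lambda e: e.start)
    for left, right in zip(ordered, ordered[1:]):
        between = text[left.end:right.start]
        for subj_label, trigger, obj_label, name in RULES:
            if (left.label == subj_label and right.label == obj_label
                    and trigger.match(between)):
                relations.append((left.text, name, right.text))
    return relations


# Invented sample text with hand-supplied entity annotations.
text = "Ada Lovelace works for Acme Corp. Grace Hopper lives in Arlington."
entities = [
    make_entity(text, "Ada Lovelace", "PERSON"),
    make_entity(text, "Acme Corp", "ORGANIZATION"),
    make_entity(text, "Grace Hopper", "PERSON"),
    make_entity(text, "Arlington", "LOCATION"),
]
print(extract_relations(text, entities))
```

Separating entity recognition from the relation rules keeps each rule short, at the cost of missing relations whose arguments are not adjacent in the text.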
Language and Vocabulary Analysis
– Terminology extraction finds relevant terms for a given domain or topic.
– Language and vocabulary analysis is crucial for understanding and processing text.
– Comment extraction separates comments from the main content of an article so that each sentence can be linked back to its author.
– IE can help in text simplification to create a structured view of information.
– IE enables machine-readable text processing by transforming unstructured data.
Approaches and Software/Services
– Hand-written regular expressions or nested groups of regular expressions are widely used.
– Classifiers, such as naïve Bayes and maximum entropy models, are common approaches.
– Sequence models like recurrent neural networks and hidden Markov models are utilized.
– Conditional random fields are commonly applied to a variety of IE tasks.
– Hybrid approaches that combine different standard approaches are also employed.
– GATE is bundled with a free Information Extraction system.
– Apache OpenNLP is a Java machine learning toolkit for natural language processing.
– OpenCalais is an automated information extraction web service.
– Mallet is a Java-based package for various natural language processing tasks.
– DBpedia Spotlight is an open source tool for named entity recognition and name resolution.
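To make the sequence-model approach more concrete, here is a minimal sketch of Viterbi decoding for a hidden Markov model that tags tokens as part of an organization name (`ORG`) or not (`O`). All probabilities are hand-set toy values chosen for illustration, not estimates from data, and the `unk` fallback for unseen words is an assumption of this sketch.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{}]
    for s in states:
        emit = emit_p[s].get(observations[0], unk)
        best[0][s] = (math.log(start_p[s]) + math.log(emit), None)
    for t in range(1, len(observations)):
        best.append({})
        for s in states:
            emit = math.log(emit_p[s].get(observations[t], unk))
            best[t][s] = max(
                (best[t - 1][p][0] + math.log(trans_p[p][s]) + emit, p)
                for p in states
            )
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))


# Toy model: hand-set probabilities, purely illustrative.
STATES = ["O", "ORG"]
START_P = {"O": 0.8, "ORG": 0.2}
TRANS_P = {"O": {"O": 0.8, "ORG": 0.2}, "ORG": {"O": 0.4, "ORG": 0.6}}
EMIT_P = {
    "O": {"acquired": 0.3, "yesterday": 0.3, "the": 0.3},
    "ORG": {"foo": 0.3, "bar": 0.3, "inc.": 0.2, "corp.": 0.2},
}

TOKENS = ["foo", "inc.", "acquired", "bar", "corp."]
tags = viterbi(TOKENS, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(TOKENS, tags)))
```

Working in log space avoids numerical underflow on longer sequences; a trained system would estimate the three probability tables from annotated text.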
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.
Recent advances in NLP techniques have allowed for significantly improved performance compared to previous years. An example is the extraction of corporate mergers from newswire reports, denoted by a formal relation such as:
- MergerBetween(company1, company2, date),
from an online news sentence such as:
- "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
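The merger example can be sketched in a few lines. The code below is a minimal illustration, not a robust extractor: the `MergerBetween` tuple, the company-name pattern, the trigger phrase, and the rule that "Yesterday" means one day before a supplied publication date are all assumptions made for this example.

```python
import datetime as dt
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    company1: str
    company2: str
    date: dt.date


# Hypothetical pattern: capitalized words followed by a corporate suffix.
COMPANY = r"(?:[A-Z][A-Za-z]* )+(?:Inc|Corp|Ltd)\."
PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of "
    rf"(?P<target>{COMPANY})"
)


def extract_merger(sentence: str, published: dt.date) -> Optional[MergerBetween]:
    """Return a MergerBetween tuple, or None if the sentence does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # Assumption: a leading "Yesterday" is relative to the publication date;
    # otherwise the publication date itself is used.
    if re.match(r"\s*yesterday\b", sentence, re.IGNORECASE):
        when = published - dt.timedelta(days=1)
    else:
        when = published
    return MergerBetween(match.group("acquirer"), match.group("target"), when)


sentence = ("Yesterday, New York based Foo Inc. announced their "
            "acquisition of Bar Corp.")
# The publication date is invented for the example.
print(extract_merger(sentence, published=dt.date(2024, 5, 2)))
```

Once the sentence has been reduced to a typed tuple like this, downstream code can query or aggregate mergers without touching the original text.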
A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
Information extraction is part of a larger puzzle: devising automatic methods for text management beyond transmission, storage, and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is natural language processing (NLP), which has modelled human language processing with considerable success given the magnitude of the task. In terms of both difficulty and emphasis, IE falls between IR and NLP. In terms of input, IE assumes a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner similar to other documents but differing in the details. As an example, consider a group of newswire articles on Latin American terrorism, each presumed to be based on one or more terrorist acts. For any given IE task we also define a template, which is a case frame (or a set of case frames) that holds the information contained in a single document. For the terrorism example, a template would have slots for the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
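The template described above maps naturally onto a record with one field per slot. The following sketch assumes a hypothetical `AttackTemplate` class and a handful of hand-written slot patterns; the article sentences are invented, and the patterns would fail on most real text.

```python
import re
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AttackTemplate:
    """One record per document; a slot stays None when nothing was found."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


# Hypothetical slot patterns tuned to the invented sentences below.
SLOT_PATTERNS = {
    "date": re.compile(r"\bOn (\d{1,2} [A-Z][a-z]+ \d{4})"),
    "perpetrator": re.compile(r", (\w+(?: \w+)*?) attacked\b"),
    "victim": re.compile(r"\battacked (?:the |a )?(\w+(?: \w+)*?) with\b"),
    "weapon": re.compile(r"\bwith (?:a |an )?(\w+(?: \w+)*?)\."),
}


def fill_template(text: str) -> AttackTemplate:
    """Fill each slot from the first match of its pattern, if any."""
    template = AttackTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(template, slot, match.group(1))
    return template


def missing_slots(template: AttackTemplate):
    """List the slot names that were left unfilled."""
    return [f.name for f in fields(template) if getattr(template, f.name) is None]


articles = [
    "On 12 March 1989, guerrillas attacked the mayor with a car bomb.",
    "On 3 June 1990, unknown gunmen attacked a bus driver with rifles.",
]
for article in articles:
    filled = fill_template(article)
    print(filled, missing_slots(filled))
```

Both invented articles follow the same template and differ only in the details, which is what lets one set of slot patterns serve the whole collection.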