Glossary Term

Information extraction

History and Significance of Information Extraction
- Information extraction (IE) dates back to the late 1970s, in the early days of natural language processing (NLP).
- JASPER, a commercial system built in the mid-1980s, provided real-time financial news to financial traders.
- The Message Understanding Conferences (MUC), held from 1987 onwards, spurred advances in IE.
- MUC domains included naval operations messages, terrorism in Latin American countries, joint ventures and microelectronics, news articles on management changes, and satellite launch reports.
- The U.S. Defense Advanced Research Projects Agency (DARPA) supported this work to help automate tasks for government analysts.
- IE matters because a growing amount of information is available only in unstructured form.
- Tim Berners-Lee advocates making more web content available as structured data.
- Unstructured documents on the web lack semantic metadata, which hinders machine processing.
- IE can transform unstructured data into a more machine-readable format.
- IE is used to scan documents and extract information from them to populate databases.

Tasks and Subtasks of Information Extraction
- Template filling extracts a fixed set of fields from a document.
- Event extraction outputs event templates from an input document.
- Knowledge base population fills a database with facts extracted from documents.
- Named entity recognition identifies known entity names, place names, temporal expressions, and numerical expressions.
- Coreference resolution detects links between previously extracted named entities.

Relationship Extraction and Table Extraction
- Relationship extraction identifies relations between entities.
- Examples include 'PERSON works for ORGANIZATION' and 'PERSON located in LOCATION.'
- Semi-structured information extraction aims to restore information structure that was lost through publication.
- Table extraction finds and extracts tables from documents.
- Table information extraction is a more complex task that requires understanding the roles of cells, rows, and columns and the information presented in the table.

Language and Vocabulary Analysis
- Terminology extraction finds the relevant terms for a given domain or topic.
- Language and vocabulary analysis is crucial for understanding and processing text.
- Comments extraction separates comments from the content of an article in order to link each sentence to its author.
- IE can support text simplification by creating a structured view of the information in a text.
- IE enables machine processing of text by transforming unstructured data into structured form.

Approaches and Software/Services
- Hand-written regular expressions, or nested groups of regular expressions, are widely used.
- Classifiers such as naïve Bayes and maximum entropy models are common approaches.
- Sequence models such as recurrent neural networks and hidden Markov models are also used.
- Conditional random fields (CRFs) are commonly used for a range of IE tasks.
- Hybrid approaches that combine several of these standard approaches are also employed.
- GATE is bundled with a free information extraction system.
- Apache OpenNLP is a Java machine learning toolkit for natural language processing.
- OpenCalais is an automated information extraction web service.
- Mallet is a Java-based package for various natural language processing tasks.
- DBpedia Spotlight is an open source tool for named entity recognition and name resolution.
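
The hand-written regular expression approach and the relation examples listed above ('PERSON works for ORGANIZATION', 'PERSON located in LOCATION') can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `NAME` pattern, the `Relation` type, the `extract_relations` helper, and the exact phrasings "works for" and "is located in" are hypothetical. A real system would rely on named entity recognition rather than a capitalization heuristic.

```python
import re
from typing import List, NamedTuple


class Relation(NamedTuple):
    """One extracted fact: (subject, relation label, object)."""
    subject: str
    label: str
    target: str


# Hypothetical, deliberately naive entity pattern: one or more capitalized
# words separated by single spaces. It stands in for a real NER step.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# One hand-written pattern per relation type (illustrative phrasings only).
PATTERNS = {
    "works_for": re.compile(rf"({NAME}) works for ({NAME})"),
    "located_in": re.compile(rf"({NAME}) is located in ({NAME})"),
}


def extract_relations(text: str) -> List[Relation]:
    """Scan the text with every pattern and collect the matched relations."""
    relations = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            relations.append(Relation(match.group(1), label, match.group(2)))
    return relations


if __name__ == "__main__":
    sample = "Alice Smith works for Acme Corp. Bob Jones is located in Paris."
    for relation in extract_relations(sample):
        print(relation)
```

Each returned tuple could then be written to a database row, which is the "populate databases" use mentioned earlier. Patterns like these are brittle: any rewording of the sentence defeats them, which is one reason classifiers and sequence models are used alongside or instead of them.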