Glossary Term
Information extraction
History and Significance of Information Extraction
- Information extraction dates back to the late 1970s, the early days of NLP.
- JASPER, an early commercial system built for Reuters by the Carnegie Group in the mid-1980s, aimed to provide real-time financial news to financial traders.
- Beginning in 1987, the Message Understanding Conferences (MUC) spurred advances in IE.
- Successive MUC evaluations covered domains such as naval operations messages, terrorism in Latin American countries, joint ventures and microelectronics, news articles on management changes, and satellite launch reports.
- Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which sought to automate routine tasks performed by government analysts.
- IE matters because the amount of information available only in unstructured form keeps growing.
- Tim Berners-Lee advocates making more web content available as structured data.
- Unstructured documents on the web lack semantic metadata, hindering machine processing.
- IE can transform unstructured data into a more machine-readable format.
- IE is used to scan and extract information from documents to populate databases.
Tasks and Subtasks of Information Extraction
- Template filling involves extracting a fixed set of fields from a document.
- Event extraction takes an input document and outputs zero or more event templates.
- Knowledge Base Population involves filling a database with facts extracted from documents.
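- Named entity recognition identifies known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
- Coreference resolution detects coreference and anaphoric links between text entities; in IE this is typically restricted to links between previously extracted named entities.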
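Template filling, the first task listed above, can be illustrated with a minimal sketch. It assumes a hypothetical three-field template for management-change announcements; the field names and regular expressions are invented for illustration and do not come from any MUC specification.

```python
import re

# Hypothetical template: every field name and pattern here is an assumption
# made for this example, not part of any standard.
TEMPLATE_PATTERNS = {
    "person": re.compile(
        r"(?P<value>[A-Z][a-z]+ [A-Z][a-z]+) (?:was|has been) (?:named|appointed)"
    ),
    "position": re.compile(
        r"(?:named|appointed) (?:as )?(?P<value>[a-z]+(?: [a-z]+)*) of"
    ),
    "organization": re.compile(
        r" of (?P<value>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
    ),
}


def fill_template(text):
    """Return one value (or None) for every field of the fixed template."""
    filled = {}
    for field, pattern in TEMPLATE_PATTERNS.items():
        match = pattern.search(text)
        filled[field] = match.group("value") if match else None
    return filled


sentence = "Jane Smith was appointed chief executive of Acme Widgets on Monday."
print(fill_template(sentence))
```

The set of fields is fixed in advance, so a document that mentions none of them still yields a complete template whose values are all None.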
Relationship Extraction and Table Extraction
- Relationship extraction identifies relations between entities.
- Examples include 'PERSON works for ORGANIZATION' and 'PERSON located in LOCATION'.
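- Semi-structured information extraction covers any IE that tries to restore information structure lost through publication.
- Table extraction involves finding and extracting tables from documents.
- Table information extraction is more complex: finding the table is only the first step, and the system must also understand the roles of cells, rows, and columns, link the information inside the table, and interpret what the table presents.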
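The relation patterns mentioned above ('PERSON works for ORGANIZATION', 'PERSON located in LOCATION') can be sketched as rules over entity-tagged text. This sketch assumes that an earlier named entity recognition step has already wrapped entities in `[TYPE text]` brackets; that markup, the rule table, and the trigger phrases are all hypothetical.

```python
import re

# Hypothetical rules: relation name -> (subject type, trigger regex, object type).
RELATION_RULES = {
    "works_for": ("PERSON", r"(?:works for|is employed by)", "ORGANIZATION"),
    "located_in": ("PERSON", r"(?:is located in|lives in)", "LOCATION"),
}


def build_pattern(subj_type, trigger, obj_type):
    """Compile a regex that matches [SUBJ ...] trigger [OBJ ...]."""
    return re.compile(
        rf"\[{subj_type} (?P<subj>[^\]]+)\] {trigger} \[{obj_type} (?P<obj>[^\]]+)\]"
    )


def extract_relations(tagged_text):
    """Return (subject, relation, object) triples found in the tagged text."""
    triples = []
    for relation, (subj_type, trigger, obj_type) in RELATION_RULES.items():
        pattern = build_pattern(subj_type, trigger, obj_type)
        for match in pattern.finditer(tagged_text):
            triples.append((match.group("subj"), relation, match.group("obj")))
    return triples


tagged = (
    "[PERSON Jane Smith] works for [ORGANIZATION Acme Widgets]. "
    "[PERSON John Doe] is located in [LOCATION Lisbon]."
)
print(extract_relations(tagged))
```

Rules of this kind only fire when the trigger phrase sits directly between the two entities, which keeps the sketch short but misses most real-world phrasings.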
Language and Vocabulary Analysis
- Terminology extraction finds relevant terms for a given domain or topic.
- Language and vocabulary analysis is crucial for understanding and processing text.
- Comments extraction separates reader comments from the actual content of an article in order to restore the link between each sentence and its author.
- IE can support text simplification by creating a structured view of the information present in free text.
- IE enables machine-readable text processing by transforming unstructured data.
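One simple way to approximate terminology extraction, as described above, is to rank words by how much more frequent they are in a domain text than in a general background text. The sketch below is an assumption-laden illustration, not a standard algorithm from the source: it uses a plain relative-frequency ratio with add-one smoothing, and the function names are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def rank_terms(domain_text, background_text, top_n=3):
    """Rank domain words by relative frequency against a background text."""
    domain = Counter(tokenize(domain_text))
    background = Counter(tokenize(background_text))
    domain_total = sum(domain.values())
    background_total = sum(background.values())
    vocabulary_size = len(set(domain) | set(background))

    scores = {}
    for word, count in domain.items():
        p_domain = count / domain_total
        # Add-one smoothing so words unseen in the background do not divide by zero.
        p_background = (background[word] + 1) / (background_total + vocabulary_size)
        scores[word] = p_domain / p_background

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


domain_text = "the parser tags the entity and the parser links the entity"
background_text = "the cat sat on the mat and the dog sat on the rug"
print(rank_terms(domain_text, background_text, top_n=2))
```

Common function words such as "the" score low because they are just as frequent in the background text, while words specific to the domain text rise to the top.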
Approaches and Software/Services
- Hand-written regular expressions or nested groups of regular expressions are widely used.
- Classifiers are also common, both generative (naïve Bayes) and discriminative (maximum entropy models).
- Sequence models like recurrent neural networks and hidden Markov models are utilized.
- Conditional random fields are commonly used for IE tasks as varied as extracting information from research papers and extracting navigation instructions.
- Hybrid approaches that combine different standard approaches are also employed.
- GATE (General Architecture for Text Engineering) is bundled with a free information extraction system.
- Apache OpenNLP is a Java machine learning toolkit for natural language processing.
- OpenCalais is an automated information extraction web service.
- Mallet is a Java-based package for various natural language processing tasks.
- DBpedia Spotlight is an open source tool for named entity recognition and name resolution.
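Among the approaches listed above, sequence models such as hidden Markov models label each token while taking the neighboring labels into account. The following is a minimal sketch of Viterbi decoding for a two-state HMM that marks tokens as entity (`ENT`) or other (`O`). All probabilities are hand-set, hypothetical values, and the only observation feature is whether a token is capitalized; a real system would estimate these parameters from annotated data.

```python
import math

STATES = ("O", "ENT")
# Hypothetical, hand-set parameters chosen only for illustration.
START = {"O": 0.8, "ENT": 0.2}
TRANS = {"O": {"O": 0.8, "ENT": 0.2}, "ENT": {"O": 0.5, "ENT": 0.5}}
EMIT = {"O": {"cap": 0.1, "low": 0.9}, "ENT": {"cap": 0.9, "low": 0.1}}


def observe(token):
    """Map a token to its single observation symbol: capitalized or not."""
    return "cap" if token[:1].isupper() else "low"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens."""
    if not tokens:
        return []
    obs = [observe(t) for t in tokens]
    # best[i][s] is the log-probability of the best path ending in state s at i.
    best = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (
                best[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            )
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


tokens = ["shares", "of", "Acme", "Widgets", "rose"]
print(list(zip(tokens, viterbi(tokens))))
```

Working in log space avoids numerical underflow on long token sequences; the decoding logic is otherwise the same as multiplying the probabilities directly.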