Information extraction


History and Significance of Information Extraction
– Information extraction dates back to the late 1970s in the early days of NLP.
– JASPER, a commercial system built in the mid-1980s, provided real-time financial news to financial traders.
– Message Understanding Conferences (MUC) from 1987 onwards spurred advancements in IE.
– MUC focused on domains such as naval operations messages, terrorism in Latin American countries, joint ventures and microelectronics, news articles on management changes, and satellite launch reports.
– Support from the U.S. Defense Advanced Research Projects Agency (DARPA) helped automate tasks for government analysts.
– IE is significant due to the growing amount of unstructured information available.
– Tim Berners-Lee advocates for more content to be made available as structured data.
– Unstructured documents on the web lack semantic metadata, hindering machine processing.
– IE can transform unstructured data into a more machine-readable format.
– IE is used to scan and extract information from documents to populate databases.

Tasks and Subtasks of Information Extraction
– Template filling involves extracting a fixed set of fields from a document.
– Event extraction aims to output event templates from an input document.
– Knowledge Base Population involves filling a database with facts extracted from documents.
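– Named entity recognition identifies known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
– Coreference resolution detects coreference and anaphoric links between text entities; in IE it is typically restricted to finding links between previously extracted named entities.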
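
As a rough illustration of the recognition step, the sketch below tags two of the expression types listed above (temporal and numerical) with hand-written patterns. The pattern set, the labels `DATE` and `MONEY`, and the function name `tag_expressions` are assumptions made for this example; practical recognizers use much richer rules or trained models.

```python
import re

# Hypothetical patterns for illustration only: one date format and one
# money format. Real systems cover far more surface forms.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
PATTERNS = {
    "DATE": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
}


def tag_expressions(text):
    """Return (label, matched_text, start_offset) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


# Invented sample sentence.
sample = "On 3 March 2021 the firm paid $4.5 million for the license."
entities = tag_expressions(sample)
for label, value, offset in entities:
    print(label, value, offset)
```

Each match keeps its character offset so that later stages, such as coreference resolution, can refer back to the exact span in the text.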

Relationship Extraction and Table Extraction
– Relationship extraction identifies relations between entities.
– Examples include ‘PERSON works for ORGANIZATION’ and ‘PERSON located in LOCATION.’
– Semi-structured information extraction aims to restore lost information structure.
– Table extraction involves finding and extracting tables from documents.
– Table information extraction is a more complex task that involves understanding the roles and information presented in the tables.
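
The two relation examples above can be sketched with simple surface patterns. This is a minimal sketch, assuming that an entity name is a run of capitalized words and that the relation is expressed with one of a few fixed phrases; the names `Relation`, `RULES`, and `extract_relations` are hypothetical, and a practical system would run named entity recognition first rather than guess entities from capitalization.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumption: a name is one or more capitalized words separated by spaces.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Each rule pairs a relation label with a pattern that captures both arguments.
RULES = [
    ("works_for", re.compile(rf"(?P<subj>{NAME}) works (?:for|at) (?P<obj>{NAME})")),
    ("located_in", re.compile(rf"(?P<subj>{NAME}) is (?:located|based) in (?P<obj>{NAME})")),
]


def extract_relations(text):
    """Apply every rule to the text and collect the relations it finds."""
    relations = []
    for predicate, pattern in RULES:
        for match in pattern.finditer(text):
            relations.append(
                Relation(match.group("subj"), predicate, match.group("obj"))
            )
    return relations


# Invented sample text.
relations = extract_relations(
    "Alice Smith works for Acme Corp. Bob Jones is located in Berlin."
)
for relation in relations:
    print(relation)
```

Pattern rules like these tend to be precise but brittle: any phrasing not anticipated by a rule is silently missed, which is one reason statistical approaches are often combined with them.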

Language and Vocabulary Analysis
– Terminology extraction finds relevant terms for a given domain or topic.
– Language and vocabulary analysis is crucial for understanding and processing text.
– Comments extraction separates comments from the actual content of an article in order to restore the link between each sentence and its author.
– IE can help in text simplification to create a structured view of information.
– IE enables machine-readable text processing by transforming unstructured data.

Approaches and Software/Services
– Hand-written regular expressions or nested groups of regular expressions are widely used.
– Classifiers, such as naïve Bayes and maximum entropy models, are common approaches.
– Sequence models like recurrent neural networks and hidden Markov models are utilized.
– Conditional random fields (CRFs) are commonly used for IE tasks as varied as extracting information from research papers and extracting navigation instructions.
– Hybrid approaches that combine different standard approaches are also employed.
– GATE is bundled with a free Information Extraction system.
– Apache OpenNLP is a Java machine learning toolkit for natural language processing.
– OpenCalais is an automated information extraction web service.
– Mallet is a Java-based package for various natural language processing tasks.
– DBpedia Spotlight is an open source tool for named entity recognition and name resolution.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

Recent advances in NLP techniques have allowed for significantly improved performance compared to previous years. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation:

MergerBetween(company_1, company_2, date)

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
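
To make the idea of computing on previously unstructured text concrete, here is a minimal sketch that turns the news sentence above into a structured record. It assumes a single hand-written pattern, company names ending in `Inc.`, `Corp.`, or `Ltd.`, and a known publication date for resolving "Yesterday". The function name `extract_merger`, the field names `company_1`, `company_2`, and `date`, and the publication date are all illustrative.

```python
import re
from datetime import date, timedelta

# Assumption: a company is one capitalized word followed by a legal suffix.
COMPANY = r"[A-Z]\w+ (?:Inc|Corp|Ltd)\."
ACQUISITION = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of "
    rf"(?P<target>{COMPANY})"
)


def extract_merger(sentence, published):
    """Return a merger record as a dict, or None if the pattern does not match."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    # Resolve the relative expression "Yesterday" against the publication date.
    if sentence.startswith("Yesterday"):
        when = published - timedelta(days=1)
    else:
        when = published
    return {
        "company_1": match.group("acquirer"),
        "company_2": match.group("target"),
        "date": when.isoformat(),
    }


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
# Hypothetical publication date, chosen only for the example.
record = extract_merger(sentence, published=date(2024, 5, 2))
print(record)
```

Once the sentence is reduced to a record like this, ordinary computation applies: records can be stored in a database, filtered by date, or joined with other facts about the same companies.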

Information extraction is part of a greater puzzle: devising automatic methods for text management beyond transmission, storage, and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is natural language processing (NLP), which has addressed the problem of modelling human language processing with considerable success, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks that lie between IR and NLP.

In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner similar to other documents but differing in the details. As an example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based on one or more terrorist acts. For any given IE task we also define a template, which is a case frame (or a set of case frames) that holds the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
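
The template described above can be sketched as a small record type whose slots start out empty and are filled independently. This is only a sketch under strong assumptions: the class name `AttackTemplate`, the cue patterns, and the sample article are invented for illustration, and real MUC-style systems relied on far more elaborate linguistic analysis.

```python
import re
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AttackTemplate:
    """Case frame with the four slots named in the text; None means unfilled."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


# Hypothetical cue patterns, one per slot. Each captures its value in "value".
SLOT_PATTERNS = {
    "date": re.compile(r"On (?P<value>\d{1,2} [A-Z][a-z]+ \d{4})"),
    "perpetrator": re.compile(r", (?P<value>[a-z ]+?) attacked"),
    "victim": re.compile(r"attacked (?P<value>[a-z ]+?) with"),
    "weapon": re.compile(r"with (?P<value>an? [a-z]+)"),
}


def fill_template(article):
    """Fill each slot whose cue pattern matches; leave the others as None."""
    template = AttackTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(article)
        if match:
            setattr(template, slot, match.group("value"))
    return template


# Invented article text.
article = (
    "On 12 April 1990, unidentified gunmen attacked the local mayor with a rifle. "
    "No group claimed responsibility."
)
filled = fill_template(article)
print(asdict(filled))
```

The second sentence of the sample is ignored entirely, which mirrors the point made above: the system only needs to understand enough of the article to fill the slots.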
