Glossary Term
Evaluation measures (information retrieval)
Background and Importance of Evaluation Measures in Information Retrieval
- Indexing and classification methods have a long history in information retrieval.
- Evaluation measures for IR systems began in the 1950s with the Cranfield paradigm.
- The Cranfield tests established the use of test collections, queries, and relevant items for evaluation.
- Cyril Cleverdon's approach influenced the Text REtrieval Conference (TREC) series.
- Evaluation is crucial wherever retrieval systems are deployed, including internet search engines, website search, databases, and library catalogues.
- Evaluation measures are used in information behavior studies and usability testing.
- Evaluation campaigns and conferences such as TREC, CLEF, and NTCIR are dedicated to evaluation in information retrieval.
- IR research relies on test collections and evaluation measures to measure system effectiveness.
- Evaluation can also inform assessments of operational concerns such as business costs and system efficiency.
- Evaluation measures in information retrieval are used to assess the effectiveness and performance of search algorithms and systems.
- These measures play a crucial role in determining the quality of search results and the overall user experience.
- They provide a standardized way to compare different retrieval systems and algorithms.
- Evaluation measures help researchers and practitioners in making informed decisions regarding the design and improvement of information retrieval systems.
Online and Offline Evaluation Metrics
- Online metrics are derived from search logs.
- Online metrics are used to determine the success of A/B tests.
- Session abandonment rate measures the fraction of search sessions that end without any clicks.
- Click-through rate (CTR) measures the fraction of users who click on a specific link out of those who view it.
- Session success rate measures the ratio of user sessions that lead to a successful result.
- Offline metrics are based on relevance judgment sessions.
- Judges score the quality of search results using binary or multi-level scales.
- Precision measures the fraction of retrieved documents that are relevant.
- Recall measures the fraction of relevant documents successfully retrieved.
- Fall-out measures the proportion of non-relevant documents in the collection that are retrieved; a minimal sketch computing these metrics appears after this list.
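A minimal sketch, on toy data, of how these online rates and the set-based offline metrics are computed; all session contents, document IDs, and counts below are hypothetical:

```python
# Toy search-log sessions: each entry lists the results clicked in one session.
sessions = [["doc1", "doc3"], [], ["doc2"], []]  # hypothetical click logs

# Session abandonment rate: fraction of sessions that end without any clicks.
abandonment_rate = sum(1 for s in sessions if not s) / len(sessions)

# Click-through rate: clicks on a specific link divided by its impressions.
impressions, clicks = 1000, 37  # hypothetical counts for one link
ctr = clicks / impressions

# Offline metrics for one judged query, treating retrieval as an unranked set.
retrieved = {"doc1", "doc2", "doc4", "doc7"}  # what the system returned
relevant = {"doc1", "doc3", "doc4"}           # judged-relevant documents
collection_size = 100                         # total documents in the collection

precision = len(retrieved & relevant) / len(retrieved)                  # 2/4
recall = len(retrieved & relevant) / len(relevant)                      # 2/3
fallout = len(retrieved - relevant) / (collection_size - len(relevant))  # 2/97

print(f"abandonment={abandonment_rate:.2f}, ctr={ctr:.3f}, "
      f"precision={precision:.2f}, recall={recall:.2f}, fallout={fallout:.3f}")
```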
F-score / F-measure
- The F-score is the weighted harmonic mean of precision and recall.
- It provides a balanced evaluation of precision and recall.
- The F-score is commonly used in information retrieval evaluation.
- It helps assess the overall effectiveness of an IR system.
- The F-score is computed from precision (P) and recall (R).
- The balanced form, F1 = 2 · P · R / (P + R), weights precision and recall equally.
- The general form is F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), which weights recall beta times as much as precision.
- Other commonly used variants include the F2 measure (favoring recall) and the F0.5 measure (favoring precision).
- F-measure combines information from both precision and recall into a single score of overall performance; a minimal sketch follows this list.
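A minimal sketch of the general F_beta computation described above; the precision and recall values are hypothetical:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.3  # hypothetical precision and recall
print(f_beta(p, r))       # F1  = 0.4
print(f_beta(p, r, 2.0))  # F2 weights recall more heavily
print(f_beta(p, r, 0.5))  # F0.5 weights precision more heavily
```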
Precision at k and R-precision
- For modern web-scale retrieval, recall is often uninformative, since many queries have thousands of relevant documents and few users will examine all of them.
- Precision at k (P@k) is a useful metric that considers the top k retrieved documents.
- P@k fails to account for the positions of relevant documents among the top k.
- Manual judging is easier for P@k, since only the top k results need to be examined.
- Even a perfect system scores below 1 on queries that have fewer than k relevant documents.
- R-precision requires knowing all relevant documents for a query.
- The number of relevant documents (R) is used as the cutoff for calculation.
- R-precision is equivalent to precision at the R-th position and recall at the R-th position.
- It is often highly correlated to mean average precision.
- R-precision is calculated as the fraction of relevant documents among the top R retrieved documents; a minimal sketch of both metrics follows this list.
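A minimal sketch of P@k and R-precision; the ranking and the relevance judgments are hypothetical:

```python
def precision_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    """P@k: fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking: list[str], relevant: set[str]) -> float:
    """R-precision: precision at cutoff R, where R = number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]  # hypothetical system output
relevant = {"d1", "d3", "d4"}                   # hypothetical judgments, so R = 3
print(precision_at_k(ranking, relevant, 5))     # 3/5 = 0.6
print(r_precision(ranking, relevant))           # 2/3, precision and recall at rank R
```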
Mean Average Precision (MAP) and Discounted Cumulative Gain (DCG)
- MAP is the mean of the average precision scores for a set of queries.
- It provides an overall measure of performance.
- MAP is calculated by summing the average precision scores for each query and dividing by the number of queries.
- It is commonly used in information retrieval evaluation.
- Average precision is the mean of the precision values at the rank of each relevant document, so MAP reflects precision across different recall levels.
- DCG evaluates the usefulness of documents based on their position in the result list.
- It uses a graded relevance scale and discounts the value of relevant documents that appear lower in the result list.
- One common formulation is DCG_p = sum over i = 1..p of rel_i / log2(i + 1), where rel_i is the graded relevance of the document at rank i.
- Normalized DCG (nDCG) divides DCG by the ideal DCG (IDCG) of a perfect ranking, yielding scores in [0, 1].
- nDCG values can be averaged across queries to measure the average performance of a ranking algorithm; a minimal sketch follows this list.
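A minimal sketch of average precision, MAP, DCG, and nDCG under the definitions above (using the common rel_i / log2(i + 1) discount); all rankings and relevance grades are hypothetical:

```python
import math

def average_precision(ranking: list[str], relevant: set[str]) -> float:
    """AP: mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def dcg(relevances: list[float]) -> float:
    """DCG_p = sum of rel_i / log2(i + 1) for ranks i = 1..p."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    """nDCG: DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# MAP: mean of per-query AP scores over a set of queries (hypothetical data).
queries = [(["d1", "d5", "d2"], {"d1", "d2"}),
           (["d9", "d3"], {"d3"})]
map_score = sum(average_precision(rk, rel) for rk, rel in queries) / len(queries)
print(f"MAP = {map_score:.3f}")   # (5/6 + 1/2) / 2 = 0.667

# nDCG over one ranked list with graded relevance (0 = non-relevant).
print(f"nDCG = {ndcg([3, 2, 0, 1, 2]):.3f}")
```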