article

The Probabilistic Relevance Framework: BM25 and Beyond

Foundations and Trends in Information Retrieval Volume 3 Issue 4April 2009pp 333–389https://doi.org/10.1561/1500000019

Published:01 April 2009Publication History

Foundations and Trends in Information Retrieval

Abstract

The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970—1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.

References

S. Agarwal, C. Cortes, and R. Herbrich, eds., Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.Google Scholar
G. Amati, C. J. van Rijsbergen, and C. Joost, "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Transactions on Information Systems, vol. 20, no. 4, pp. 357-389, 2002. Google ScholarDigital Library
M. M. Beaulieu, M. Gatford, X. Huang, S. E. Robertson, S. Walker, and P. Williams, "Okapi at TREC-5," The Fifth Text Retrieval Conference (TREC- 5). NIST Special Publication 500-238, pp. 143-165, 1997.Google Scholar
F. V. Berghen, "Trust Region Algorithms," Webpage, http://www. lemurproject.org.Google Scholar
F. V. Berghen, "CONDOR: A constrained, non-linear, derivative-free parallel optimizer for continuous, high computing load, noisy objective functions," PhD thesis, Université Libre de Bruxelles, 2004.Google Scholar
C. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006. Google Scholar
D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003. Google ScholarDigital Library
D. Bodoff and S. E. Robertson, "A new unified probabilistic model," Journal of the American Society for Information Science and Technology, vol. 55, pp. 471-487, 2004. Google ScholarDigital Library
P. Boldi and S. Vigna, "MG4J at TREC 2005," in The Fourteenth Text Retrieval Conference (TREC 2005) Proceedings, NIST Special Publication 500-266, 2005. http://mg4j.dsi.unimi.it/.Google Scholar
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent," in Proceedings of the International Conference on Machine Learning (ICML), vol. 22, p. 89, 2005. Google Scholar
W. Cooper, "Some inconsistencies and misidentified modelling assumptions in probabilistic information retrieval," ACM Transactions on Information Systems, vol. 13, pp. 110-111, 1995. Google ScholarDigital Library
N. Craswell, S. E. Robertson, H. Zaragoza, and M. Taylor, "Relevance weighting for query independent evidence," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 472-479, ACM, 2005. Google Scholar
N. Craswell, H. Zaragoza, and S. E. Robertson, "Microsoft Cambridge at TREC-14: Enterprise track," in The Fourteenth Text Retrieval Conference (TREC 2005), 2005.Google Scholar
F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell, ""Is this document relevant? ... probably": A survey of probabilistic models in information retrieval," ACM Computing Surveys, vol. 30, no. 4, 1998. Google Scholar
W. B. Croft and D. J. Harper, "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, vol. 35, pp. 285-295, 1979.Google ScholarCross Ref
W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, 1968.Google Scholar
N. Fuhr, "Probabilistic Models in Information Retrieval," The Computer Journal, vol. 35, no. 3, 1992. Google ScholarDigital Library
G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 465-480, ACM, 1988. Google Scholar
S. P. Harter, "A probabilistic approach to automatic keyword indexing (parts 1 and 2)," Journal of the American Society for Information Science, vol. 26, pp. 197-206 and 280-289, 1975.Google ScholarCross Ref
D. Hiemstra, S. E. Robertson, and H. Zaragoza, "Parsimonious language models for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 178-185, ACM, 2004. Google Scholar
T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, ACM, 1999. Google ScholarDigital Library
Indri. Homepage. http://www.lemurproject.org/indri.Google Scholar
T. Joachims, H. Li, T. Y. Liu, and C. Zhai, "Learning to rank for information retrieval (LR4IR 2007)," SIGIR Forum, vol. 41, no. 2, pp. 58-62, 2007. Google ScholarDigital Library
J. Lafferty and C. Zhai, "Document language models, query models, and risk minimization for information retrieval," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2001. Google Scholar
J. Lafferty and C. Zhai, "Probabilistic relevance models based on document and query generation," in Language Modelling for Information Retrieval, (W. B. Croft and J. Lafferty, eds.), pp. 1-10, Kluwer, 2003.Google Scholar
V. Lavrenko and W. B. Croft, "Relevance based language models," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120-127, ACM, 2001. Google Scholar
Lemur Toolkit. Homepage. http://www.lemurproject.org.Google Scholar
H. Li, T. Y. Liu, and C. Zhai, "Learning to rank for information retrieval (LR4IR 2008)," SIGIR Forum, vol. 42, no. 2, pp. 76-79, 2008. Google ScholarDigital Library
Lucene. Homepage. http://lucene.apache.org/.Google Scholar
M. E. Maron and J. L. Kuhns, "On relevance, probabilistic indexing and information retrieval," Journal of the ACM, vol. 7, no. 3, pp. 216-244, 1960. Google ScholarDigital Library
D. Metzler, "Automatic feature selection in the Markov random field model for information retrieval," in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 253-262, ACM New York, NY, USA, 2007. Google Scholar
D. Metzler and W. B. Croft, "A Markov random field model for term dependencies," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 472-479, ACM, 2005. Google Scholar
D. Metzler, T. Strohman, and B. Croft, Information Retrieval in Practice. Pearson Education (US), 2009.Google Scholar
MG4J: Managing gigabytes for java. Homepage. http://mg4j.dsi.unimi.it/.Google Scholar
Okapi-Pack. Homepage. http://www.soi.city.ac.uk/andym/OKAPI-PACK.Google Scholar
J. R. Pérez-Agüera and H. Zaragoza, "UCM-Y!R at CLEF 2008 Robust and WSD tasks," CLEF 2008 Workshop, 2008.Google Scholar
J. R. Pérez-Agüera, H. Zaragoza, and L. Araujo, "Exploiting morphological query structure using genetic optimization," in NLDB 2008 13th International Conference on Applications of Natural Language to Information Systems, Lecture Notes in Computer Science (LNCS), Springer Verlag, 2008. Google Scholar
J. Pérez-Iglesias, "BM25 and BM25F Implementation for Lucene," Webpage, http://nlp.uned.es/~jperezi/Lucene-BM25.Google Scholar
PF-Tijah. Homepage. http://dbappl.cs.utwente.nl/pftijah.Google Scholar
S. E. Robertson, "The probability ranking principle in information retrieval," Journal of Documentation, vol. 33, pp. 294-304, 1977.Google ScholarCross Ref
S. E. Robertson, "On term selection for query expansion," Journal of Documentation, vol. 46, pp. 359-364, 1990. Google ScholarDigital Library
S. E. Robertson, "Threshold setting and performance optimization in adaptive filtering," Information Retrieval, vol. 5, pp. 239-256, 2002. Google ScholarDigital Library
S. E. Robertson, M. E. Maron, and W. S. Cooper, "The unified probabilistic model for IR," in Proceedings of Research and Development in Information Retrieval, (G. Salton and H.-J. Schneider, eds.), pp. 108-117, Berlin: Springer-Verlag, 1983. Google Scholar
S. E. Robertson and K. Sparck Jones, "Relevance weighting of search terms," Journal of the American Society for Information Science, 1977.Google Scholar
S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter, "Probabilistic models of indexing and searching," in Information Retrieval Research (Proceedings of Research and Development in Information Retrieval, Cambridge, 1980), (R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, and P. W. Williams, eds.), pp. 35- 56, London: Butterworths, 1981. Google Scholar
S. E. Robertson and S. Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 232-241, ACM/Springer, 1994. Google Scholar
S. E. Robertson and S. Walker, "On relevance weights with little relevance information," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 16-24, ACM, 2007. Google Scholar
S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, "Okapi at TREC," in The First Text Retrieval Conference (TREC-1), NIST Special Publication 500-207, pp. 21-30, 1992.Google Scholar
S. E. Robertson and H. Zaragoza, "On rank-based effectiveness measures and optimization," Information Retrieval, vol. 10, no. 3, pp. 321-339, 2007. Google ScholarDigital Library
S. E. Robertson, H. Zaragoza, and M. Taylor, "Simple BM25 extension to multiple weighted fields," in Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, pp. 42-49, ACM, 2004. Google Scholar
R. Song, M. J. Taylor, J. R. Wen, H. W. Hon, and Y. Yu, "Viewing term proximity from a different perspective," Advances in Information Retrieval (ECIR 2008), Springer LNCS 4956, pp. 346-357, 2008. Google Scholar
K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments. Part 1," in Information Processing and Management, pp. 779-808, 2000. Google Scholar
K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments. Part 2," in Information Processing and Management, pp. 809-840, 2000. Google Scholar
T. Tao and C. Zhai, "An exploration of proximity measures in information retrieval," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 295-302, ACM, 2007. Google Scholar
M. Taylor, H. Zaragoza, N. Craswell, S. E. Robertson, and C. Burges, "Optimisation methods for ranking functions with multiple parameters," in Fifteenth Conference on Information and Knowledge Management (ACM CIKM), 2006. Google Scholar
Terrier. Homepage. http://ir.dcs.gla.ac.uk/terrier.Google Scholar
R. van Os D. Hiemstra, H. Rode, and J. Flokstra, "PF/Tijah: Text search in an XML database system," Proceedings of the 2nd International Workshop on Open Source Information Retrieval (OSIR), pp. 12-17, http://dbappl.cs.utwente.nl/pftijah, 2006.Google Scholar
C. J. van Rijsbergen, Information Retrieval. Butterworth, 1979. Google Scholar
E. M. Voorhees and D. K. Harman, "Overview of the eighth text retrieval conference (TREC-8)," The Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pp. 1-24, 2000.Google Scholar
Wumpus. Homepage. http://www.wumpus-search.org/.Google Scholar
Xapian. http://xapian.org.Google Scholar
H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. E. Robertson, "Microsoft Cambridge at TREC 2004: Web and HARD track," in The Thirteenth Text Retrieval Conference (TREC 2004), NIST Special Publication, 500-261, 2005.Google Scholar
Zettair. Homepage. http://www.seg.rmit.edu.au/zettair.Google Scholar

Index Terms

The Probabilistic Relevance Framework: BM25 and Beyond
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Index terms have been assigned to the content through auto-classification.

Recommendations

The Probabilistic Relevance Framework
Read More
Probabilistic document-context based relevance feedback with limited relevance judgments
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

This paper presents our novel relevance feedback (RF) algorithm that uses the probabilistic document-context based retrieval model with limited relevance judgments for document re-ranking. Probabilities of the document-context based retrieval model are ...
Read More
A framework of CBIR system based on relevance feedback
IITA'09: Proceedings of the 3rd international conference on Intelligent information technology application

Content-based image retrieval (CBIR) is an effective approach for obtaining desired image, however, due to the semantic gap between low-level visual features and high-level concept of image, CBIR system of state-of-the-art always can't achieve ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Foundations and Trends in Information Retrieval Volume 3, Issue 4
April 2009
59 pages
ISSN:1554-0669
EISSN:1554-0677
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
Now Publishers Inc.
Hanover, MA, United States
Publication History
- Published: 1 April 2009
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 540
  Total Citations
  View Citations
- 0
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

The Probabilistic Relevance Framework: BM25 and Beyond

Foundations and Trends in Information Retrieval

Abstract

References

Cited By

Index Terms

Recommendations

The Probabilistic Relevance Framework

Probabilistic document-context based relevance feedback with limited relevance judgments

A framework of CBIR system based on relevance feedback

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

Digital Edition

Caption

The Probabilistic Relevance Framework: BM25 and Beyond

Foundations and Trends in Information Retrieval

Abstract

References

Cited By

Index Terms

Recommendations

The Probabilistic Relevance Framework

Probabilistic document-context based relevance feedback with limited relevance judgments

A framework of CBIR system based on relevance feedback

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

Digital Edition

Share this Publication link

Share on Social Media