skip to main content
article

The Probabilistic Relevance Framework: BM25 and Beyond

Published:01 April 2009Publication History
Skip Abstract Section

Abstract

The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970—1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.

References

  1. S. Agarwal, C. Cortes, and R. Herbrich, eds., Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.Google ScholarGoogle Scholar
  2. G. Amati, C. J. van Rijsbergen, and C. Joost, "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Transactions on Information Systems, vol. 20, no. 4, pp. 357-389, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. M. Beaulieu, M. Gatford, X. Huang, S. E. Robertson, S. Walker, and P. Williams, "Okapi at TREC-5," The Fifth Text Retrieval Conference (TREC- 5). NIST Special Publication 500-238, pp. 143-165, 1997.Google ScholarGoogle Scholar
  4. F. V. Berghen, "Trust Region Algorithms," Webpage, http://www. lemurproject.org.Google ScholarGoogle Scholar
  5. F. V. Berghen, "CONDOR: A constrained, non-linear, derivative-free parallel optimizer for continuous, high computing load, noisy objective functions," PhD thesis, Université Libre de Bruxelles, 2004.Google ScholarGoogle Scholar
  6. C. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006. Google ScholarGoogle Scholar
  7. D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Bodoff and S. E. Robertson, "A new unified probabilistic model," Journal of the American Society for Information Science and Technology, vol. 55, pp. 471-487, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Boldi and S. Vigna, "MG4J at TREC 2005," in The Fourteenth Text Retrieval Conference (TREC 2005) Proceedings, NIST Special Publication 500-266, 2005. http://mg4j.dsi.unimi.it/.Google ScholarGoogle Scholar
  10. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent," in Proceedings of the International Conference on Machine Learning (ICML), vol. 22, p. 89, 2005. Google ScholarGoogle Scholar
  11. W. Cooper, "Some inconsistencies and misidentified modelling assumptions in probabilistic information retrieval," ACM Transactions on Information Systems, vol. 13, pp. 110-111, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Craswell, S. E. Robertson, H. Zaragoza, and M. Taylor, "Relevance weighting for query independent evidence," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 472-479, ACM, 2005. Google ScholarGoogle Scholar
  13. N. Craswell, H. Zaragoza, and S. E. Robertson, "Microsoft Cambridge at TREC-14: Enterprise track," in The Fourteenth Text Retrieval Conference (TREC 2005), 2005.Google ScholarGoogle Scholar
  14. F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell, ""Is this document relevant? ... probably": A survey of probabilistic models in information retrieval," ACM Computing Surveys, vol. 30, no. 4, 1998. Google ScholarGoogle Scholar
  15. W. B. Croft and D. J. Harper, "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, vol. 35, pp. 285-295, 1979.Google ScholarGoogle ScholarCross RefCross Ref
  16. W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, 1968.Google ScholarGoogle Scholar
  17. N. Fuhr, "Probabilistic Models in Information Retrieval," The Computer Journal, vol. 35, no. 3, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 465-480, ACM, 1988. Google ScholarGoogle Scholar
  19. S. P. Harter, "A probabilistic approach to automatic keyword indexing (parts 1 and 2)," Journal of the American Society for Information Science, vol. 26, pp. 197-206 and 280-289, 1975.Google ScholarGoogle ScholarCross RefCross Ref
  20. D. Hiemstra, S. E. Robertson, and H. Zaragoza, "Parsimonious language models for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 178-185, ACM, 2004. Google ScholarGoogle Scholar
  21. T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, ACM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Indri. Homepage. http://www.lemurproject.org/indri.Google ScholarGoogle Scholar
  23. T. Joachims, H. Li, T. Y. Liu, and C. Zhai, "Learning to rank for information retrieval (LR4IR 2007)," SIGIR Forum, vol. 41, no. 2, pp. 58-62, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Lafferty and C. Zhai, "Document language models, query models, and risk minimization for information retrieval," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2001. Google ScholarGoogle Scholar
  25. J. Lafferty and C. Zhai, "Probabilistic relevance models based on document and query generation," in Language Modelling for Information Retrieval, (W. B. Croft and J. Lafferty, eds.), pp. 1-10, Kluwer, 2003.Google ScholarGoogle Scholar
  26. V. Lavrenko and W. B. Croft, "Relevance based language models," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120-127, ACM, 2001. Google ScholarGoogle Scholar
  27. Lemur Toolkit. Homepage. http://www.lemurproject.org.Google ScholarGoogle Scholar
  28. H. Li, T. Y. Liu, and C. Zhai, "Learning to rank for information retrieval (LR4IR 2008)," SIGIR Forum, vol. 42, no. 2, pp. 76-79, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Lucene. Homepage. http://lucene.apache.org/.Google ScholarGoogle Scholar
  30. M. E. Maron and J. L. Kuhns, "On relevance, probabilistic indexing and information retrieval," Journal of the ACM, vol. 7, no. 3, pp. 216-244, 1960. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Metzler, "Automatic feature selection in the Markov random field model for information retrieval," in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 253-262, ACM New York, NY, USA, 2007. Google ScholarGoogle Scholar
  32. D. Metzler and W. B. Croft, "A Markov random field model for term dependencies," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 472-479, ACM, 2005. Google ScholarGoogle Scholar
  33. D. Metzler, T. Strohman, and B. Croft, Information Retrieval in Practice. Pearson Education (US), 2009.Google ScholarGoogle Scholar
  34. MG4J: Managing gigabytes for java. Homepage. http://mg4j.dsi.unimi.it/.Google ScholarGoogle Scholar
  35. Okapi-Pack. Homepage. http://www.soi.city.ac.uk/andym/OKAPI-PACK.Google ScholarGoogle Scholar
  36. J. R. Pérez-Agüera and H. Zaragoza, "UCM-Y!R at CLEF 2008 Robust and WSD tasks," CLEF 2008 Workshop, 2008.Google ScholarGoogle Scholar
  37. J. R. Pérez-Agüera, H. Zaragoza, and L. Araujo, "Exploiting morphological query structure using genetic optimization," in NLDB 2008 13th International Conference on Applications of Natural Language to Information Systems, Lecture Notes in Computer Science (LNCS), Springer Verlag, 2008. Google ScholarGoogle Scholar
  38. J. Pérez-Iglesias, "BM25 and BM25F Implementation for Lucene," Webpage, http://nlp.uned.es/~jperezi/Lucene-BM25.Google ScholarGoogle Scholar
  39. PF-Tijah. Homepage. http://dbappl.cs.utwente.nl/pftijah.Google ScholarGoogle Scholar
  40. S. E. Robertson, "The probability ranking principle in information retrieval," Journal of Documentation, vol. 33, pp. 294-304, 1977.Google ScholarGoogle ScholarCross RefCross Ref
  41. S. E. Robertson, "On term selection for query expansion," Journal of Documentation, vol. 46, pp. 359-364, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. S. E. Robertson, "Threshold setting and performance optimization in adaptive filtering," Information Retrieval, vol. 5, pp. 239-256, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. S. E. Robertson, M. E. Maron, and W. S. Cooper, "The unified probabilistic model for IR," in Proceedings of Research and Development in Information Retrieval, (G. Salton and H.-J. Schneider, eds.), pp. 108-117, Berlin: Springer-Verlag, 1983. Google ScholarGoogle Scholar
  44. S. E. Robertson and K. Sparck Jones, "Relevance weighting of search terms," Journal of the American Society for Information Science, 1977.Google ScholarGoogle Scholar
  45. S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter, "Probabilistic models of indexing and searching," in Information Retrieval Research (Proceedings of Research and Development in Information Retrieval, Cambridge, 1980), (R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, and P. W. Williams, eds.), pp. 35- 56, London: Butterworths, 1981. Google ScholarGoogle Scholar
  46. S. E. Robertson and S. Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 232-241, ACM/Springer, 1994. Google ScholarGoogle Scholar
  47. S. E. Robertson and S. Walker, "On relevance weights with little relevance information," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 16-24, ACM, 2007. Google ScholarGoogle Scholar
  48. S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, "Okapi at TREC," in The First Text Retrieval Conference (TREC-1), NIST Special Publication 500-207, pp. 21-30, 1992.Google ScholarGoogle Scholar
  49. S. E. Robertson and H. Zaragoza, "On rank-based effectiveness measures and optimization," Information Retrieval, vol. 10, no. 3, pp. 321-339, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. S. E. Robertson, H. Zaragoza, and M. Taylor, "Simple BM25 extension to multiple weighted fields," in Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, pp. 42-49, ACM, 2004. Google ScholarGoogle Scholar
  51. R. Song, M. J. Taylor, J. R. Wen, H. W. Hon, and Y. Yu, "Viewing term proximity from a different perspective," Advances in Information Retrieval (ECIR 2008), Springer LNCS 4956, pp. 346-357, 2008. Google ScholarGoogle Scholar
  52. K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments. Part 1," in Information Processing and Management, pp. 779-808, 2000. Google ScholarGoogle Scholar
  53. K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments. Part 2," in Information Processing and Management, pp. 809-840, 2000. Google ScholarGoogle Scholar
  54. T. Tao and C. Zhai, "An exploration of proximity measures in information retrieval," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 295-302, ACM, 2007. Google ScholarGoogle Scholar
  55. M. Taylor, H. Zaragoza, N. Craswell, S. E. Robertson, and C. Burges, "Optimisation methods for ranking functions with multiple parameters," in Fifteenth Conference on Information and Knowledge Management (ACM CIKM), 2006. Google ScholarGoogle Scholar
  56. Terrier. Homepage. http://ir.dcs.gla.ac.uk/terrier.Google ScholarGoogle Scholar
  57. R. van Os D. Hiemstra, H. Rode, and J. Flokstra, "PF/Tijah: Text search in an XML database system," Proceedings of the 2nd International Workshop on Open Source Information Retrieval (OSIR), pp. 12-17, http://dbappl.cs.utwente.nl/pftijah, 2006.Google ScholarGoogle Scholar
  58. C. J. van Rijsbergen, Information Retrieval. Butterworth, 1979. Google ScholarGoogle Scholar
  59. E. M. Voorhees and D. K. Harman, "Overview of the eighth text retrieval conference (TREC-8)," The Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pp. 1-24, 2000.Google ScholarGoogle Scholar
  60. Wumpus. Homepage. http://www.wumpus-search.org/.Google ScholarGoogle Scholar
  61. Xapian. http://xapian.org.Google ScholarGoogle Scholar
  62. H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. E. Robertson, "Microsoft Cambridge at TREC 2004: Web and HARD track," in The Thirteenth Text Retrieval Conference (TREC 2004), NIST Special Publication, 500-261, 2005.Google ScholarGoogle Scholar
  63. Zettair. Homepage. http://www.seg.rmit.edu.au/zettair.Google ScholarGoogle Scholar

Index Terms

  1. The Probabilistic Relevance Framework: BM25 and Beyond
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Foundations and Trends in Information Retrieval
      Foundations and Trends in Information Retrieval  Volume 3, Issue 4
      April 2009
      59 pages
      ISSN:1554-0669
      EISSN:1554-0677
      Issue’s Table of Contents

      Publisher

      Now Publishers Inc.

      Hanover, MA, United States

      Publication History

      • Published: 1 April 2009

      Qualifiers

      • article