Abstract
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s and 1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). This, in turn, has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features and parameter optimisation for models with free parameters.
Index Terms
- The Probabilistic Relevance Framework: BM25 and Beyond