Collection Construction Methodologies for Learning-to-Rank

Award:		NSF IIS-1017903
PI:		Javed A. Aslam
Institution:		Northeastern University

Summary

Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and the underlying documents (e.g., web pages) in order to assess the relevance of each document to a given query and thus rank the underlying collection. The sheer size of this problem has led to the development of learning-to-rank algorithms that automate the construction of such ranking functions: given a training set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and document features so as to assess the relevance of any document to any query and thus rank a collection in response to a user's query. Much thought and research has been devoted to feature extraction and to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the choice of documents and queries for learning-to-rank data sets, or on the effect of these choices on the ability of a learning-to-rank algorithm to "learn" effectively and efficiently.
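
As a rough illustration of the setup described above, the following sketch trains a ranking function from (feature vector, relevance) pairs and then uses it to order unseen documents. It is a minimal pointwise example that assumes a linear scoring function fit by stochastic gradient descent on squared error; this is one simple learning-to-rank approach chosen for brevity, not the specific algorithms studied in this project. The names train_pointwise and rank, and the toy two-feature data, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# One training example: (feature vector, relevance label).
Example = Tuple[Sequence[float], float]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear scoring function: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pointwise(examples: List[Example], epochs: int = 200,
                    lr: float = 0.05) -> List[float]:
    """Fit linear weights by stochastic gradient descent on squared error.

    Assumption: a linear model with squared loss, used only to keep the
    sketch short.
    """
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, relevance in examples:
            error = score(weights, features) - relevance
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights


def rank(weights: Sequence[float],
         docs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids sorted by decreasing predicted relevance."""
    return sorted(docs, key=lambda doc_id: score(weights, docs[doc_id]),
                  reverse=True)


# Toy data (hypothetical): relevance depends only on the first feature.
training = [([1.0, 0.0], 2.0), ([0.0, 1.0], 0.0),
            ([0.5, 0.5], 1.0), ([1.0, 1.0], 2.0)]
weights = train_pointwise(training)
ranking = rank(weights, {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.2]})
```

In practice the feature vectors would hold hundreds of query and document features, and the learner would typically be a pairwise or listwise method rather than this pointwise regression.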

The proposed work investigates the effect of query, document, and feature selection on the ability of learning-to-rank algorithms to efficiently and effectively learn ranking functions. In preliminary results on document selection, a pilot study has already determined that training sets as small as 2% to 5% of the size of those typically used are just as effective for learning-to-rank purposes. Thus, one can train more efficiently over a much smaller (though effectively equivalent) data set, or, at equal cost, one can train over a far "larger" and more representative data set. In addition to formally characterizing this phenomenon for document selection, the proposed work investigates the same phenomenon for query and feature selection, with the end goals of (1) understanding the effect of document, query, and feature selection on learning-to-rank algorithms and (2) developing collection construction methodologies that are efficient and effective for learning-to-rank purposes.
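
To make the idea of a reduced training set concrete, the sketch below keeps only a small fraction of the judged documents for each query. It assumes uniform random sampling per query, which is just one possible document selection strategy and is not stated here to be the method used in the pilot study; the function name subsample_per_query and the data layout are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# One judged document: (document id, feature vector, relevance label).
Judged = Tuple[str, List[float], int]


def subsample_per_query(training_set: Dict[str, List[Judged]],
                        fraction: float,
                        seed: int = 0) -> Dict[str, List[Judged]]:
    """Keep roughly `fraction` of the judged documents for each query.

    Assumption: uniform random sampling without replacement, with at
    least one document kept for every non-empty query.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reduced = {}
    for query_id, docs in training_set.items():
        keep = min(len(docs), max(1, round(fraction * len(docs))))
        reduced[query_id] = rng.sample(docs, keep)
    return reduced


# Toy collection (hypothetical): 2 queries with 100 judged documents each.
full = {
    q: [(f"{q}-d{i}", [float(i), 1.0], i % 3) for i in range(100)]
    for q in ("q1", "q2")
}
small = subsample_per_query(full, fraction=0.05)
```

With fraction=0.05 each query retains 5 of its 100 documents; a learner can then be trained on the reduced set and compared against one trained on the full set.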

Personnel

Javed A. Aslam (PI)
Virgil Pavlu (research scientist)
Pavel Metrikov (graduate student)
Peter Golbus (graduate student)

Former Personnel

Keshi Dai (now at Intent Media, New York City)
Shahzad Rajput (a Fulbright student, pursuing academic positions in his home country of Pakistan)
Stefan Savev (now at Microsoft Research, Advanced Technology Labs Europe)

Publications

A Modification of LambdaMART to Handle Noisy Crowdsourced Assessments
In Proceedings of the 4th International Conference on the Theory of Information Retrieval (ICTIR), page 31. ACM Press, September 2013.
Optimizing nDCG Gains by Minimizing Effect of Label Inconsistency
In Advances in Information Retrieval: 35th European Conference on IR Research (ECIR), pages 760-763. Lecture Notes in Computer Science, Vol. 7814. Springer-Verlag, March 2013.
Impact of Assessor Disagreement on Ranking Performance
In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1091-1092. ACM Press, August 2012.
IR System Evaluation using Nugget-based Test Collections
In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM), pages 393-402. ACM Press, February 2012.
A Nugget-based Test Collection Construction Paradigm
In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), pages 1945-1948. ACM Press, October 2011.
A Large-scale Study of the Effect of Training Set Characteristics over Learning-to-rank Algorithms
In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1243-1244. ACM Press, July 2011.
Constructing Collections for Learning to Rank
In Proceedings of the 11th Dutch-Belgian Information Retrieval Workshop (DIR), pages 62-63. February 2011.

Acknowledgment and Disclaimer

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1017903. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).