|  | Statistical Analysis to Establish the Importance of Information Retrieval Parameters
               Julie Ayter (Institut National des Sciences Appliquées de Toulouse, France)
 
               Adrian-Gabriel Chifu (Université de Toulouse, France)
 
               Sébastien Déjean (Institut National des Sciences Appliquées de Toulouse, France)
 
               Cecile Desclaux (Institut National des Sciences Appliquées de Toulouse, France)
 
               Josiane Mothe (Université de Toulouse, France)
 
              Abstract: Search engines rely on models to index documents, match queries against documents, and rank documents. Research in Information Retrieval (IR) aims at defining these models and their parameters in order to optimize the results. Using benchmark collections, it has been shown that no single system configuration works best for every query; rather, performance varies from one query to another. It would be interesting if a meta-system could decide which system configuration should process a new query by learning from the context of previous queries. This paper reports an in-depth analysis of more than 80,000 search engine configurations applied to 100 queries, together with the corresponding performance. The goal of the analysis is to identify which configuration responds best to a certain type of query. We considered two approaches to define query types: the first is post-evaluation, based on clustering queries according to their performance as measured by Average Precision; the second is pre-evaluation, using query features (including query difficulty predictors) to cluster queries. Globally, we identified two parameters that should be optimized: retrieving_model and TrecQueryTags_process. One could expect such results, as these two parameters are major components of the IR process. However, our work leads to two main conclusions: (1) with the post-evaluation approach, we found that retrieving_model is the most influential parameter for easy queries, while TrecQueryTags_process is the most influential for hard queries; (2) with the pre-evaluation approach, current query features do not make it possible to cluster queries in a way that reveals differences in the influential parameters.
             
              Keywords: IR system parameters, Random Forest, information retrieval, query clustering, query difficulty 
             Categories: H.3.3, H.3.4  |