Using Explicit Word Co-occurrences to Improve Term-based Text Retrieval

FERILLI, Stefano;BASILE, TERESA MARIA;ESPOSITO, Floriana
2010

Abstract

Achieving high precision and recall in term-based queries on text collections is becoming increasingly crucial as the number of available documents grows and their quality tends to decrease. In particular, retrieval techniques that rely on a strict correspondence between query terms and document terms miss relevant documents whose authors happened to choose terms slightly different from those used by the end user issuing the query. Our proposal is to explicitly consider term co-occurrences when building the vector space. The presence in a document of terms that differ from, but are related to, those in the query should strengthen the confidence that the document is relevant. Likewise, when a query term is missing from a document but several terms closely related to it are present, this should also support the hypothesis that the document is relevant. Computationally, this relatedness is embedded through matrix operations that capture direct or indirect term co-occurrence in the collection. We propose two different approaches to enforce this perspective and run preliminary experiments on a prototype implementation, whose results suggest that the technique is potentially profitable.
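
The abstract does not spell out the two approaches, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a binary term-by-document matrix A, takes C = A·Aᵀ (with a zeroed diagonal) as direct co-occurrence and C·C as one possible form of indirect co-occurrence, and expands the query with row-normalised related terms weighted by a hypothetical parameter `alpha`. The toy documents and all function names are invented for the example.

```python
from math import sqrt

# Hypothetical toy collection: each document is a list of terms.
DOCS = [
    ["car", "engine", "wheel"],
    ["automobile", "engine", "wheel"],
    ["banana", "fruit"],
]
VOCAB = sorted({term for doc in DOCS for term in doc})
INDEX = {term: i for i, term in enumerate(VOCAB)}


def matmul(x, y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def cooccurrence(docs):
    """Return (A, C): binary term-by-document matrix A and C = A * A^T with a zero diagonal."""
    a = [[1.0 if term in doc else 0.0 for doc in docs] for term in VOCAB]
    a_t = [list(col) for col in zip(*a)]
    c = matmul(a, a_t)  # c[i][k] = number of documents containing both term i and term k
    for i in range(len(c)):
        c[i][i] = 0.0  # a term is not counted as co-occurring with itself
    return a, c


def expand_query(terms, relatedness, alpha=0.5):
    """Add to each query term a share (alpha) of its row-normalised related terms."""
    q = [0.0] * len(VOCAB)
    for term in terms:
        q[INDEX[term]] = 1.0
    expanded = list(q)
    for i, weight in enumerate(q):
        total = sum(relatedness[i])
        if weight and total:
            for k, r in enumerate(relatedness[i]):
                expanded[k] += alpha * weight * r / total
    return expanded


def cosine_scores(query_vec, a):
    """Cosine similarity between the query vector and each document column of A."""
    q_norm = sqrt(sum(v * v for v in query_vec))
    scores = []
    for col in zip(*a):
        d_norm = sqrt(sum(v * v for v in col))
        dot = sum(q * d for q, d in zip(query_vec, col))
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


A, C = cooccurrence(DOCS)
C2 = matmul(C, C)  # indirect co-occurrence: terms linked through a shared neighbour

plain = cosine_scores(expand_query(["car"], C, alpha=0.0), A)
expanded = cosine_scores(expand_query(["car"], C, alpha=0.5), A)
print("plain:   ", [round(s, 3) for s in plain])
print("expanded:", [round(s, 3) for s in expanded])
```

In this toy collection the query "car" matches only the first document under strict term correspondence. After expansion, the second document, which never mentions "car" but shares "engine" and "wheel" with a document that does, also receives a nonzero score, while the unrelated third document stays at zero. "car" and "automobile" never co-occur directly, yet the corresponding entry of `C2` is positive because both co-occur with the same terms.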

Use this identifier to cite or link to this document: http://hdl.handle.net/11586/14384