Automatic generation of sitemaps based on navigation systems

Lanotte, Pasqua Fabiana;Fumarola, Fabio;Malerba, Donato;Ceci, Michelangelo
2016-01-01

Abstract

In this paper we present a method for automatically discovering sitemaps from websites. Given a website, existing automatic solutions extract only a flat list of URLs, which does not show the hierarchical structure of its content. Manual approaches, carried out by webmasters, produce deeper sitemaps than automatic methods. However, in many cases, partly because website content evolves naturally over time, the resulting sitemaps stop reflecting the actual content and soon become unhelpful and confusing for users. We propose a different approach that is both automatic and effective. Our solution combines an algorithm that extracts frequent patterns from the navigation systems contained in a website (e.g. menus, nav-bars, content lists) with a hierarchy extraction algorithm able to discover rich hierarchies that reveal relationships among web pages (e.g. super-topic/sub-topic relationships). Experimental results show that our approach discovers high-quality sitemaps that have a deep hierarchy and are complete in the extracted URLs.
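The general idea of combining recurring navigation patterns with hierarchy extraction can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in this abstract. It assumes that navigation systems are nested `<ul>`/`<li>`/`<a>` lists, treats a parent/child link pair as "frequent" when it appears on at least `min_support` pages, and uses a hypothetical `ROOT` marker for top-level entries. The names `NavLinkParser`, `extract_pairs` and `build_sitemap` are illustrative only.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

ROOT = "<root>"  # hypothetical marker for the top level of the sitemap


class NavLinkParser(HTMLParser):
    """Collect (parent_href, child_href) pairs from nested <ul>/<li>/<a> menus."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one entry per open <ul>: href of the enclosing item
        self.last_href = None  # link seen in the current <li>, if any
        self.pairs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            # A nested list hangs under the link of the <li> that contains it.
            self.stack.append(self.last_href)
        elif tag == "li":
            self.last_href = None
        elif tag == "a" and self.stack:
            href = dict(attrs).get("href")
            if href:
                parent = self.stack[-1] or ROOT
                self.pairs.add((parent, href))
                self.last_href = href

    def handle_endtag(self, tag):
        if tag == "ul" and self.stack:
            self.stack.pop()


def extract_pairs(html):
    """Return the set of parent/child link pairs found in one page."""
    parser = NavLinkParser()
    parser.feed(html)
    return parser.pairs


def build_sitemap(pages, min_support=2):
    """Keep pairs seen on at least min_support pages and group them by parent."""
    counts = Counter()
    for html in pages:
        counts.update(extract_pairs(html))  # each page votes once per pair
    tree = defaultdict(list)
    for (parent, child), n in sorted(counts.items()):
        if n >= min_support:
            tree[parent].append(child)
    return dict(tree)
```

Calling `build_sitemap` on a list of HTML strings returns a dictionary that maps each parent URL (or `ROOT`) to its child URLs. The sketch makes no attempt to resolve a page that appears under several parents, which a real hierarchy extraction step would have to handle.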
Year: 2016
ISBN: 9783319514680
Use this identifier to cite or link to this document: https://hdl.handle.net/11586/187811