A Distance-based Technique for non-Manhattan Layout Analysis
FERILLI, Stefano; ESPOSITO, F.; BASILE, Teresa Maria
2009-01-01
Abstract
Layout analysis is a fundamental step in automatic document processing. Many different techniques have been proposed to perform this task. Some follow a top-down approach: they start by identifying the high-level components of the page structure and then recursively split them until basic blocks are found. Bottom-up approaches, on the other hand, start with the smallest elements (e.g., the pixels in the case of digitized documents) and then recursively merge them into higher-level components. A first limitation of such methods is that most of them are designed to deal only with digitized documents, and hence are not applicable to natively digital documents, which are nowadays pervasive. Furthermore, top-down methods and most bottom-up methods can process only documents with a Manhattan layout. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. The approach was successfully embedded and tested in the DOMINUS document management system.
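To make the bottom-up idea concrete, the following is a minimal illustrative sketch, not the algorithm of the paper. It assumes that basic blocks are already available as axis-aligned rectangles and groups those whose mutual gap does not exceed a threshold. The names `box_distance`, `group_blocks`, and `max_gap` are hypothetical and chosen only for this example. Because groups are kept as sets of blocks rather than collapsed into a single enclosing rectangle, a group may take a non-rectangular (e.g., L-shaped) form, which a Manhattan-only method could not represent.

```python
from math import hypot

# A block is an axis-aligned rectangle: (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1.
Box = tuple[float, float, float, float]


def box_distance(a: Box, b: Box) -> float:
    """Euclidean gap between two rectangles (0.0 if they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal gap
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical gap
    return hypot(dx, dy)


def group_blocks(boxes: list[Box], max_gap: float) -> list[list[int]]:
    """Group block indices by transitive closeness (gap <= max_gap).

    Hypothetical helper: uses a union-find structure so that chains of
    nearby blocks end up in the same group.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_distance(boxes[i], boxes[j]) <= max_gap:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the root for stable output.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    blocks = [
        (0, 0, 10, 10),        # top-left block
        (12, 0, 20, 10),       # to its right
        (0, 12, 10, 20),       # below the first one: the three form an L shape
        (100, 100, 110, 110),  # isolated block
    ]
    print(group_blocks(blocks, max_gap=3))
```

The pairwise comparison is quadratic in the number of blocks, which is acceptable for a sketch; a spatial index would be the natural refinement for full pages. The threshold is a free parameter here, and how distances and thresholds are actually chosen is a matter for the paper itself.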