A Run Length Smoothing-based Algorithm for Non-Manhattan Document Segmentation

FERILLI, Stefano;ESPOSITO, Floriana
2012

Abstract

Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle bitmap and PS/PDF sources, respectively. A well-known approach to layout analysis in the literature is the Run Length Smoothing Algorithm (RLSA). Here we consider a variant of RLSA, called RLSO (short for "Run-Length Smoothing with OR"), that uses the OR logical operator instead of AND and is particularly suited to identifying frames in non-Manhattan layouts. Like RLSA, RLSO relies on thresholds, but they are set according to different criteria than those used in RLSA. Since setting such thresholds is a hard and unnatural task even for expert users, and no single threshold can fit all documents, we developed a technique that automatically determines the thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size.
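
The difference between combining the smoothed images with AND and with OR can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details, so the sketch assumes the usual run-length smoothing rule (a white run no longer than a threshold, lying between two black pixels, is filled), applies it once per direction, and omits the final extra horizontal pass of classic RLSA. All function names and the fixed thresholds `th_h` and `th_v` are hypothetical; the automatic threshold selection described above is not reproduced.

```python
def smooth_runs(line, threshold):
    """Fill white runs (0s) of length <= threshold lying between two black pixels (1s)."""
    out = list(line)
    last_black = None
    for i, px in enumerate(line):
        if px == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_horizontal(img, threshold):
    """Apply run-length smoothing to every row."""
    return [smooth_runs(row, threshold) for row in img]


def smooth_vertical(img, threshold):
    """Apply run-length smoothing to every column (via transposition)."""
    cols = [smooth_runs(list(col), threshold) for col in zip(*img)]
    return [list(row) for row in zip(*cols)]


def combine(img, th_h, th_v, op):
    """Smooth in both directions, then merge the two results pixel by pixel."""
    h = smooth_horizontal(img, th_h)
    v = smooth_vertical(img, th_v)
    return [[op(a, b) for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]


def rlso(img, th_h, th_v):
    """OR combination: a pixel is black if either smoothing made it black."""
    return combine(img, th_h, th_v, lambda a, b: a | b)


def rlsa_and(img, th_h, th_v):
    """AND combination, shown for contrast (final horizontal pass omitted)."""
    return combine(img, th_h, th_v, lambda a, b: a & b)
```

With OR, a pixel filled by either directional pass stays black, so the result can only grow relative to each pass; with AND, only pixels filled by both passes survive.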

Use this identifier to cite or link to this document: http://hdl.handle.net/11586/139035