Relational Data Mining Techniques for Historical Document Processing

Ceci, Michelangelo; Malerba, Donato

Document image understanding denotes the recognition of semantically relevant components in the layout extracted from a document image. Automatic approaches to document image understanding are in high demand among organizations involved in the preservation and valorisation of historical documents, which collect ever more document images whose effective use depends critically on fast and accurate indexing and cataloguing. In this context, Data Mining techniques can be profitably applied to support the user in recognizing semantically relevant components in historical document images. However, such an application is not straightforward, and two important aspects have to be considered. First, extracted models should take into account the inherently spatial nature of the layout of a document image and the spatial relations among the layout components of interest. Second, the low layout quality and standard of such material introduce a considerable amount of noise into its description. For these reasons, in this paper we investigate the application of a Statistical Relational Data Mining method, which allows relations between components to be represented effectively and naturally by resorting to the Relational Data Mining framework, and which guarantees robustness to noise by exploiting statistical methods. Experiments are performed on two historical document corpora from the 20's and 30's.