A parallel, distributed algorithm for relational frequent pattern discovery from very large data sets

Appice, Annalisa; Ceci, Michelangelo; Turi, A; Malerba, Donato

doi:10.3233/IDA-2010-0456

The amount of data produced by ubiquitous computing applications is growing quickly, owing to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-)relational approach to their analysis. However, relational data mining algorithms do not scale well, and very large data sets can hardly be processed. In this paper we propose an extension of a relational algorithm for multi-level frequent pattern discovery that resorts to data sampling and distributed computation in Grid environments in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of the approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds, and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Since the first two parameters are difficult to control, we focus our investigation on the third. The theoretically derived conclusions are confirmed experimentally. Moreover, an additional application to event log mining demonstrates the viability of the proposed approach to relational frequent pattern mining from very large data sets.
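
To illustrate how the three parameters named in the abstract interact, the following is a minimal Python sketch of a sample-and-vote scheme. It is not the authors' algorithm: it mines plain itemsets rather than multi-level relational patterns, it runs the samples in a serial loop rather than on Grid nodes, and every name in it (`frequent_itemsets`, `approximate_frequent`, `proportion`, `min_votes`, and so on) is hypothetical and chosen only for this example.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return the itemsets (as frozensets) whose relative support in
    `transactions` is at least `min_support`. Brute-force counting,
    standing in for the serial miner (assumption: itemsets, not
    relational patterns)."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {pattern for pattern, c in counts.items() if c / n >= min_support}


def approximate_frequent(transactions, proportion, min_support,
                         n_samples, min_votes, seed=0):
    """Approximate the frequent itemsets by mining `n_samples` random
    samples and keeping patterns that are frequent in at least
    `min_votes` of them."""
    rng = random.Random(seed)
    sample_size = max(1, round(proportion * len(transactions)))
    votes = Counter()
    for _ in range(n_samples):
        # Each sample is drawn without replacement from the full data set.
        # In a distributed setting each iteration could run on its own node.
        sample = rng.sample(transactions, sample_size)
        votes.update(frequent_itemsets(sample, min_support))
    return {pattern for pattern, v in votes.items() if v >= min_votes}


if __name__ == "__main__":
    data = [["a", "b"], ["a", "b"], ["a", "c"], ["b", "c"]]
    exact = frequent_itemsets(data, min_support=0.5)
    approx = approximate_frequent(data, proportion=0.5, min_support=0.5,
                                  n_samples=5, min_votes=3)
    print(sorted(map(sorted, exact)))
    print(sorted(map(sorted, approx)))
```

In this sketch, `proportion` and `min_support` correspond to the first two parameters and `min_votes` to the third. For a fixed set of samples, raising `min_votes` can only shrink the returned set, so it trades missed patterns against spurious ones.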