SIEVE: Generating a cybersecurity log dataset collection for SIEM event classification

Artioli, P.; Dentamaro, V.; Galantucci, S.; Magri, A.; Pellegrini, G.; Semeraro, G.

doi:10.1016/j.comnet.2025.111330

Effective cyber threat monitoring relies on deploying robust Security Information and Event Management (SIEM) systems. SIEM applications receive security events generated by different devices, systems, and applications, and must correlate them properly to identify potential cyber threats, characterized by their tactics, techniques, and procedures (TTPs), that bypass other security mechanisms (e.g., firewalls, IDS). Given that logs are primarily generated to report relevant system events and activities in a human-readable format, supervised Natural Language Processing (NLP) techniques could be used to train models that complement conventional parsing methodologies by automatically suggesting the classification of events into pre-defined categories. Training such models requires a substantial amount of pre-classified (labeled) data of different types, so that the models can learn the patterns and nuances needed to make accurate predictions. Security event datasets are scarce for privacy or availability reasons, and the few publicly available ones are often limited in event diversity or number of labels, or are simply unfit for the task at hand; an effective synthetic dataset for training SIEM-related machine learning event classification algorithms would therefore be very useful. For these reasons, this paper proposes the generation of a synthetic dataset specifically designed to train SIEM systems for log-type classification. Starting from an in-depth methodological analysis of the prominent cybersecurity-related datasets available in the literature, it introduces SIEVE (Siem Ingesting EVEnts), a synthetic dataset collection built from publicly available log samples using SPICE (Semantic Perturbation and Instantiation for Content Enrichment), a novel text augmentation and perturbation technique. SPICE is shown to be effective in generating realistic logs. Each dataset in the collection has a different level of augmentation.
The collection was then benchmarked with various NLP classification models. The classifiers were trained on SIEVE and tested on both SIEVE logs and real logs. The results show that the best model among those tested is SVM (macro-averaged F1, MaF1, 0.9323 - 0.9737), which maintains its performance with only slight degradation on real logs (MaF1 0.9477 - 0.9636). BERT, on the other hand, performs better than SVM in most of the tests on SIEVE (MaF1 0.9528 - 0.9730) but is not robust when tested on real logs (MaF1 0.8864 - 0.9182).
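To make the reported metric concrete, the following is a minimal sketch of a macro-averaged F1 computation, assuming MaF1 denotes the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy log-type labels are illustrative assumptions; they are not taken from the paper or from the SIEVE collection.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical log-type labels, for illustration only
y_true = ["firewall", "firewall", "auth", "auth", "dns", "dns"]
y_pred = ["firewall", "auth", "auth", "auth", "dns", "firewall"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, a classifier that does poorly on a rare log type is penalized as much as one that does poorly on a frequent type, which suits a setting with many log categories of uneven size.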