A Sentence Structure-based Approach to Unsupervised Author Identification
FERILLI, Stefano
2016-01-01
Abstract
Assessing whether two documents were written by the same author is a crucial task, especially in the Internet age, with possible applications to philology and forensics. The problem has been tackled in the literature by exploiting frequency-based approaches, numeric techniques or writing style analysis. Focusing on this last perspective, this paper proposes a novel technique that takes into account the structure of sentences, assuming that it is closely related to the author's writing style. Specifically, a text, or a collection of texts, in natural language written by a given author is translated into a set of First-Order Logic descriptions, and a model of the author's writing habits is obtained as the result of clustering these descriptions. Then, if an overlap exists between the models of a known author and of an unknown one, the conclusion can be drawn that they are the same person. Among its advantages, this approach does not need a training phase, and it also performs well on short texts and/or small collections.
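The pipeline outlined in the abstract (describe each sentence, cluster the descriptions into an author model, then test two models for overlap) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: each sentence description is approximated as a set of ground atoms such as `("subj", "noun")`, Jaccard similarity stands in for whatever similarity measure is applied to First-Order Logic descriptions, the clustering is a simple greedy pass, and the names `jaccard`, `cluster`, `models_overlap` and the `threshold` parameter are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# ASSUMPTION: a sentence description is a set of ground atoms, a crude
# stand-in for a First-Order Logic description of sentence structure.
Atom = Tuple[str, ...]
Description = FrozenSet[Atom]
Model = List[List[Description]]  # an author model = a list of clusters


def jaccard(a: Description, b: Description) -> float:
    """Similarity of two descriptions as shared atoms over all atoms."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(descriptions: List[Description], threshold: float = 0.5) -> Model:
    """Greedy clustering (an assumed, simplified procedure).

    A description joins the first cluster in which it is at least
    `threshold`-similar to every member; otherwise it starts a new one.
    """
    clusters: Model = []
    for d in descriptions:
        for c in clusters:
            if all(jaccard(d, m) >= threshold for m in c):
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def models_overlap(known: Model, unknown: Model, threshold: float = 0.5) -> bool:
    """Assumed overlap test: some description in one model is
    sufficiently similar to some description in the other."""
    return any(
        jaccard(x, y) >= threshold
        for ck in known for x in ck
        for cu in unknown for y in cu
    )
```

For example, clustering the hand-written descriptions `{("subj", "noun"), ("pred", "verb"), ("obj", "noun")}` and the same set extended with `("mod", "adj")` puts them in one cluster at the default threshold, since they share three of four atoms. No training step is involved, which mirrors the unsupervised setting the abstract describes; how the actual descriptions and similarity are defined is left to the paper.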