In this paper, we describe the creation of a diachronic corpus for Italian by exploiting the digital archive of the newspaper “L’Unità”. We automatically clean and annotate the corpus with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We show some interesting corpus statistics taking into account the temporal dimension and describe some examples of usage of time series.
A diachronic Italian corpus based on “L’Unità”
Basile P.;Caputo A.;Cassotti P.;
2020-01-01
Abstract
In this paper, we describe the creation of a diachronic corpus for Italian by exploiting the digital archive of the newspaper “L’Unità”. We automatically clean and annotate the corpus with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We show some interesting corpus statistics taking into account the temporal dimension and describe some examples of usage of time series.File in questo prodotto:
File | Dimensione | Formato | |
---|---|---|---|
paper_44.pdf
accesso aperto
Tipologia:
Documento in Versione Editoriale
Licenza:
Creative commons
Dimensione
403.16 kB
Formato
Adobe PDF
|
403.16 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.