
Stochastic variational inference for clustering short text data with finite mixtures of Dirichlet-Multinomial distributions

Bilancia, Massimo
2025-01-01

Abstract

Finite mixtures of Multinomial distributions are a valuable tool for analyzing discrete positive data, particularly in text analysis where data is represented as a Bag-of-Words (BOW). In this approach, only term frequencies over a predefined vocabulary are considered, disregarding the positions of terms within the preprocessed document. Dirichlet-Multinomial mixture models, in particular, offer a straightforward yet effective method for text categorization, and often outperform more complex latent variable models when documents are short. The combination of Dirichlet priors and Multinomial likelihoods lends itself naturally to a Bayesian treatment. However, despite the model's simplicity, the exact posterior distribution is intractable, necessitating numerical methods. Variational inference offers a promising approach by approximating the joint posterior with a distribution in which the model parameters are assumed to be independent a posteriori. Under certain conditions, a coordinate ascent variational algorithm can be constructed that yields an approximation close to the true posterior in terms of the reverse Kullback-Leibler divergence. A notable limitation of standard variational algorithms, however, is that they require the entire dataset to compute the iterative updates of the local variational parameters, which poses a significant scalability issue for large text corpora. To address this, we employ stochastic variational inference within the exponential family to develop a scalable estimation algorithm. By leveraging straightforward assumptions about the full conditional distributions of the hierarchical model and the distributions of the variational parameters, we demonstrate that, under the Robbins-Monro conditions, a gradient ascent algorithm can be derived.
This algorithm converges to a local maximum of the approximated posterior surface. Crucially, instead of using all observations, each iteration relies on a noisy yet unbiased estimate of the gradient computed from a single randomly selected data point. Numerical simulations demonstrate the superior per-iteration computational efficiency of stochastic variational inference (SVI). While SVI typically requires more iterations to converge, its advantage extends beyond computational speed: although preliminary, the results suggest that SVI yields higher-quality solutions, as evidenced by both text clustering accuracy and the implicit regularization of weakly identified components.
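The stochastic update described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a mixture of Multinomials with symmetric Dirichlet priors, mean-field factors q(theta) = Dir(a) and q(beta_k) = Dir(lam_k), toy random data, and illustrative hyperparameter and step-size choices. Each iteration processes one randomly chosen document, forms the "intermediate" global parameters the full-data update would give if all N documents looked like that one (a noisy but unbiased estimate), and blends them in with a Robbins-Monro step size rho_t = (t + tau)^(-kappa).

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

# Toy corpus: N documents over a vocabulary of V terms, as BOW count vectors.
N, V, K = 200, 30, 3                 # documents, vocabulary size, components
X = rng.poisson(1.0, size=(N, V)).astype(float)

alpha0, eta0 = 1.0, 0.5              # symmetric Dirichlet hyperparameters (assumed)
tau, kappa = 1.0, 0.7                # Robbins-Monro schedule, kappa in (0.5, 1]

# Global variational parameters: q(theta) = Dir(a), q(beta_k) = Dir(lam[k]).
a = np.full(K, alpha0) + rng.random(K)
lam = np.full((K, V), eta0) + rng.random((K, V))

for t in range(1, 501):
    n = rng.integers(N)              # one randomly selected data point
    x = X[n]

    # Expected log parameters under the current variational factors.
    Elog_theta = digamma(a) - digamma(a.sum())
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))

    # Local step: responsibilities phi_k ∝ exp(E[log theta_k] + x · E[log beta_k]).
    log_phi = Elog_theta + Elog_beta @ x
    phi = np.exp(log_phi - log_phi.max())
    phi /= phi.sum()

    # Intermediate globals: the full-data update rescaled to one document,
    # i.e. a noisy yet unbiased estimate of the natural gradient direction.
    a_hat = alpha0 + N * phi
    lam_hat = eta0 + N * phi[:, None] * x[None, :]

    # Robbins-Monro step: rho_t -> 0, sum rho_t = inf, sum rho_t^2 < inf.
    rho = (t + tau) ** (-kappa)
    a = (1.0 - rho) * a + rho * a_hat
    lam = (1.0 - rho) * lam + rho * lam_hat
```

Because each update is a convex combination of positive quantities, the Dirichlet parameters remain valid throughout; the schedule `(t + tau)^(-kappa)` with kappa in (0.5, 1] is the standard way to satisfy the Robbins-Monro conditions.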
Files in this item:

s00362-025-01702-0.pdf
  Open access
  Type: Published (editorial) version
  License: Creative Commons
  Size: 1.22 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/536840
Citations
  • Scopus: 2
  • Web of Science: 3