Low rank approaches for the analysis of real data from pre to post processing

Flavia Esposito; Laura Selicato; Nicoletta Del Buono
2021-01-01

Abstract

Inspecting data to extract the valuable information embedded in them is a key task in several fields, and it is becoming even more challenging as continually improving technologies supply very large amounts of informative data. Fortunately, most available data have an underlying mathematical structure that can be profitably exploited to investigate the latent patterns hidden in them. The analysis of real data covers a wide range of approaches, from pre-processing to the actual discovery of information. In pre-processing, one of the main problems with real data is the presence of anomalies, which may spoil the resulting analysis but may also contain valuable information. In both cases, the ability to detect them is very important. In the biomedical field in particular, the proper identification of outliers allows the development of novel biological hypotheses that would otherwise not be taken into consideration when experimental biological data are analyzed. The actual process of information discovery, on the other hand, can be formulated as an optimization task with an underlying matrix structure. This can be done with the help of Dimensionality Reduction (DR), one of the most suitable instruments for untangling latent information at different levels. In particular, these methods aim to represent the data under analysis in a low-dimensional space, so that most of the intrinsic knowledge can be regarded as ideal sources (namely, a basis) of the process under consideration [3]. Often these approaches are enriched by penalization terms that enforce particular constraints able to emphasize useful properties. In this context, tuning the hyperparameters that control the weight of the additional constraints is a problematic issue.
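The penalized low-rank representation described above can be illustrated with a small example. What follows is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a nonnegative factorization X ≈ WH with an l1 penalty on H, minimized by standard multiplicative updates. The names `penalized_nmf` and `objective`, the choice of penalty, and the synthetic data are hypothetical and chosen only for illustration. The weight `lam` is fixed by hand here, which is exactly the tuning issue mentioned above.

```python
import numpy as np


def penalized_nmf(X, rank, lam=0.1, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (m x n) as W @ H with W, H >= 0.

    Illustrative model (an assumption, not taken from the abstract):
        minimize 0.5 * ||X - W H||_F^2 + lam * sum(H)
    solved with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random start so the multiplicative updates can move.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # The l1 penalty on H appears as the constant lam in the denominator.
        H *= (W.T @ X) / (W.T @ W @ H + lam + eps)
        # W carries no penalty, so this is the plain least-squares update.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def objective(X, W, H, lam):
    """Value of the penalized objective minimized by penalized_nmf."""
    return 0.5 * np.linalg.norm(X - W @ H, "fro") ** 2 + lam * H.sum()


if __name__ == "__main__":
    # Synthetic nonnegative data of rank 4 (hypothetical test case).
    rng = np.random.default_rng(1)
    X = rng.random((30, 4)) @ rng.random((4, 20))
    for lam in (0.0, 0.1, 1.0):
        W, H = penalized_nmf(X, rank=4, lam=lam)
        print(f"lam={lam}: objective={objective(X, W, H, lam):.4f}")
```

The rows of `H` and the columns of `W` play the role of the low-dimensional basis and its coefficients. Changing `lam` trades reconstruction accuracy against the size of the entries of `H`, so the result depends on a value that has to be chosen somehow.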
In this talk, we focus on Linear DR methods for the analysis of data from pre- to post-processing, which untangle latent information at different levels by representing data in a low-dimensional space. The contribution of this talk is twofold. We will first address the problem of detecting outlier samples, with application to biomedical data, proposing an ensemble approach for anomaly detection in gene expression matrices based on Hierarchical Clustering and Robust Principal Component Analysis, which allows a novel pseudo-mathematical classification of anomalies to be derived [2]. Then, we will focus on Nonnegative Matrix Factorizations (NMFs), which prove to be the most effective among Linear DR methods in analyzing real-life nonnegative data [1]. Some variants of NMF will also be presented as minimization tasks to which penalization terms can be added in accordance with some additional characteristics. In particular, we treat hyperparameter selection from an optimization point of view, incorporating the choice directly into the unsupervised algorithm as part of the updating process in a bilevel formulation, and we provide theoretical and computational results for solving this problem. We will finally sketch future research directions.

References

[1] Gillis, N. 2020. Nonnegative Matrix Factorization. SIAM.
[2] Selicato, L.; Esposito, F.; Gargano, G.; Vegliante, M.C.; Opinto, G.; Zaccaria, G.M.; Ciavarella, S.; Guarini, A.; Del Buono, N. 2021. A New Ensemble Method for Detecting Anomalies in Gene Expression Matrices. Mathematics 9, 882. MDPI.
[3] Berry, M. W.; Drmac, Z.; Jessup, E. R. 1999. Matrices, vector spaces, and information retrieval. SIAM Review 41(2):335-362.
Use this identifier to cite or link to this document: https://hdl.handle.net/11586/380289