One-Class Ensembles for Rare Genomic Sequences Identification

Corizzo R.
2020-01-01

Abstract

Next-generation sequencing has transformed biological research by enabling the collection and analysis of very large datasets. Despite this abundance of data, the computational methods currently used by biologists remain limited in challenging settings, such as extremely imbalanced datasets that consist almost entirely of negative examples. In this paper, we address the problem of identifying sequences from the zebra finch (songbird) germline-restricted chromosome (GRC), which is present only in reproductive tissues and absent from all other cells. Because the germline contains the GRC in addition to the other chromosomes, sequenced germline DNA must subsequently be separated into GRC and non-GRC sequences. The task is difficult because few GRC sequences are known. We propose a one-class ensemble learning method for this problem and compare its performance with state-of-the-art one-class classification methods. Our results show that the proposed method identifies positive sequences with high accuracy when trained only on negative sequences and tuned with a limited number of positive sequences. Moreover, a biological analysis showed that positive sequences from a verified GRC gene ranked in the top third of all sequences, indicating that the method successfully separates GRC from non-GRC sequences. The method is therefore a valuable tool for biologists: its predictions allow them to focus limited resources on the experimental validation of a subset of higher-confidence sequences.
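
To make the one-class setting in the abstract more concrete (train on negative sequences only, tune with a handful of positives, then rank candidates), the following is a minimal Python sketch. It is not the method proposed in the paper, whose details are not given here. The k-mer frequency features, the centroid-distance base model, the subsampling ensemble, the threshold rule, and the synthetic AT-rich and GC-rich sequences are all assumptions chosen purely for illustration.

```python
import itertools
import math
import random

K = 3  # k-mer length (assumed for illustration)
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]


def kmer_profile(seq):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:  # ignore k-mers with symbols other than A, C, G, T
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in KMERS]


class CentroidScorer:
    """Hypothetical one-class base model: distance to the negative centroid."""

    def fit(self, profiles):
        n = len(profiles)
        self.centroid = [sum(col) / n for col in zip(*profiles)]
        return self

    def score(self, profile):
        return math.dist(profile, self.centroid)


class OneClassEnsemble:
    """Averages base models, each fitted on a random subsample of negatives."""

    def __init__(self, n_models=10, sample_frac=0.7, seed=0):
        self.n_models = n_models
        self.sample_frac = sample_frac
        self.rng = random.Random(seed)

    def fit(self, negatives):
        profiles = [kmer_profile(s) for s in negatives]
        size = max(1, int(self.sample_frac * len(profiles)))
        self.models = [
            CentroidScorer().fit(self.rng.sample(profiles, size))
            for _ in range(self.n_models)
        ]
        return self

    def score(self, seq):
        """Higher score = less like the negative (non-GRC) training data."""
        profile = kmer_profile(seq)
        return sum(m.score(profile) for m in self.models) / len(self.models)

    def tune_threshold(self, positives):
        """Assumed rule: lowest score among the few known positives."""
        return min(self.score(s) for s in positives)


def random_seq(rng, length, weights):
    """Draw a synthetic sequence with the given A, C, G, T weights."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


# Synthetic stand-ins: AT-rich "negatives" and GC-rich "positives".
data_rng = random.Random(42)
train_negatives = [random_seq(data_rng, 200, [4, 1, 1, 4]) for _ in range(50)]
tuning_positives = [random_seq(data_rng, 200, [1, 4, 4, 1]) for _ in range(5)]
held_out_negatives = [random_seq(data_rng, 200, [4, 1, 1, 4]) for _ in range(20)]
held_out_positives = [random_seq(data_rng, 200, [1, 4, 4, 1]) for _ in range(5)]

ensemble = OneClassEnsemble().fit(train_negatives)      # negatives only
threshold = ensemble.tune_threshold(tuning_positives)   # few positives

# Rank unlabeled candidates so the most positive-like come first.
candidates = held_out_negatives + held_out_positives
ranked = sorted(candidates, key=ensemble.score, reverse=True)
```

In this toy setup, the sequences at the head of `ranked` are the ones a biologist would prioritize for experimental validation. Real GRC and non-GRC sequences are unlikely to differ as sharply as these synthetic ones.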
Year: 2020
ISBN: 978-3-030-61526-0
ISBN: 978-3-030-61527-7

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/373835
