Conference paper, Year: 2021

Optimal de novo assemblies for chloroplast genomes based on inverted repeats patterns

Abstract

Background: Chloroplast genome assembly remains challenging because the sequencing step outputs short reads from both the plant and plastid genomes. Some recent dedicated assemblers [1,2] exploit the highly conserved circular, quadripartite structure of chloroplast genomes, which contains a pair of dispersed inverted repeat regions.

Materials and methods: We designed a dedicated pattern-driven de novo assembler that requires only short unpaired reads (the distances provided by paired reads are not needed), sequenced from both the plant and its chloroplasts. The first step separates the chloroplast reads from the reads specific to the plant. To this end, we use the observation that the chloroplast genome is over-represented relative to the plant genome: we compute an estimated coverage of the pre-assembled contigs and keep those with the higher coverage. This first step outputs an assembly graph in which each vertex corresponds to a contig and carries an estimated multiplicity. We then work on a second graph in which each vertex is duplicated according to its multiplicity and to the two possible contig orientations; the edges are duplicated accordingly. In our approach, genome assembly is modelled as finding an elementary path in this graph. We formulate the dispersed repeats as linear constraints and search for an elementary path using Integer Linear Programming, similarly to [3]. Inverted repeats correspond to occurrences of contigs paired with other occurrences of the same contigs in reverse orientation. Their positions on the assembled sequence must satisfy a nested-pairs pattern. We formulate these constraints as a linear program whose objective is to maximize the number of nested pairs, thereby generalizing a similar approach applied to RNA folding [4].
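The vertex and edge duplication described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `expand_graph`, the `(contig, occurrence, orientation)` vertex encoding, and the toy contigs `LSC`, `IR` and `SSC` are all assumptions made for illustration, and reverse-complement links are left out for brevity.

```python
from itertools import product


def expand_graph(multiplicity, links):
    """Duplicate contigs by multiplicity and orientation (illustrative only).

    multiplicity: dict mapping contig name -> estimated multiplicity (>= 1).
    links: iterable of ((u, orient_u), (v, orient_v)) with orient in "+-",
           meaning oriented contig u can be followed by oriented contig v.
    Returns (vertices, edges); a vertex is (contig, occurrence, orientation).
    """
    # One vertex per contig occurrence and per orientation.
    vertices = [
        (contig, occ, orient)
        for contig, mult in multiplicity.items()
        for occ in range(mult)
        for orient in "+-"
    ]
    # Each link is copied between every pair of occurrences of its endpoints.
    edges = set()
    for (u, ou), (v, ov) in links:
        for i, j in product(range(multiplicity[u]), range(multiplicity[v])):
            if (u, i) != (v, j):  # an occurrence cannot follow itself
                edges.add(((u, i, ou), (v, j, ov)))
    return vertices, edges


# Hypothetical toy input: a quadripartite layout with one repeated contig.
toy_multiplicity = {"LSC": 1, "IR": 2, "SSC": 1}
toy_links = [
    (("LSC", "+"), ("IR", "+")),
    (("IR", "+"), ("SSC", "+")),
    (("SSC", "+"), ("IR", "-")),
    (("IR", "-"), ("LSC", "+")),
]
toy_vertices, toy_edges = expand_graph(toy_multiplicity, toy_links)
print(len(toy_vertices), len(toy_edges))
```

In such an expanded graph, a path is elementary when it uses each `(contig, occurrence)` at most once and in a single orientation, which is the property the ILP constraints would have to enforce.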
In contrast to the RNA folding approach [4], where the vertices correspond to bases with known sequence indices, in our case the positions of the contigs are variables. Our tool is implemented in Python 3 and uses the open-source PuLP package, which integrates a free solver, to solve the above optimization problem.

Results: We evaluated our program with QUAST [5] and obtained very encouraging preliminary results, with high genome coverage (mostly >99%) and very low mismatch and indel rates.

Conclusions: We designed a pattern-driven de novo assembler dedicated to chloroplast genomes that uses only short unpaired reads. We formulated the conserved circular and quadripartite structure as linear constraints and implemented this model in an open-source program. QUAST evaluation returned encouraging preliminary results.
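To make the nested-pairs objective more concrete, the sketch below scores one fixed, linearised ordering of oriented contigs. It rests on assumptions: the abstract does not give the exact constraints, so "nested" is read here as a chain of pairs (i1, j1), (i2, j2) with i1 < i2 < j2 < j1, and the names `inverted_pairs` and `max_nested_pairs` are hypothetical. In the authors' model the contig positions are decision variables chosen by the solver, whereas this code only evaluates a given ordering.

```python
def inverted_pairs(path):
    """Return positions (i, j), i < j, holding the same contig in opposite orientations."""
    return [
        (i, j)
        for i, (ci, oi) in enumerate(path)
        for j, (cj, oj) in enumerate(path)
        if i < j and ci == cj and oi != oj
    ]


def max_nested_pairs(path):
    """Length of the longest chain of strictly nested inverted pairs in `path`."""
    pairs = sorted(inverted_pairs(path))  # outer pairs (smaller i) come first
    best = [1] * len(pairs)  # best[b]: longest chain whose innermost pair is b
    for b, (i2, j2) in enumerate(pairs):
        for a in range(b):
            i1, j1 = pairs[a]
            if i1 < i2 and j2 < j1:  # pair b lies strictly inside pair a
                best[b] = max(best[b], best[a] + 1)
    return max(best, default=0)


# Hypothetical orderings of oriented contigs: (name, orientation).
nested = [("LSC", "+"), ("a", "+"), ("b", "+"), ("SSC", "+"), ("b", "-"), ("a", "-")]
crossed = [("LSC", "+"), ("a", "+"), ("b", "+"), ("SSC", "+"), ("a", "-"), ("b", "-")]
print(max_nested_pairs(nested), max_nested_pairs(crossed))
```

The first ordering places the second copies of `a` and `b` in reverse order and reverse orientation, as an inverted repeat would, so both pairs nest; in the second ordering the pairs cross and only one can be counted.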
Main file
BiATA_2021_formatted_abstract_ANDONOV_EPAIN.pdf (33.12 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03534195, version 1 (19-01-2022)

Identifiers

  • HAL Id: hal-03534195, version 1

Cite

Rumen Andonov, Victor Epain, Dominique Lavenier. Optimal de novo assemblies for chloroplast genomes based on inverted repeats patterns. BiATA 2021 - 4th International conference Bioinformatics: from Algorithms to Applications, Jul 2021, St. Petersburg, Russia. pp.1-2. ⟨hal-03534195⟩