DSK: k-mer counting with very low memory usage

Guillaume Rizk; Dominique Lavenier; Rayan Chikhi

doi:10.1093/bioinformatics/btt020

Journal article, Bioinformatics, Year: 2013

Guillaume Rizk

Role: Author

Algorizk [Paris]

Dominique Lavenier

Role: Author
PersonId: 1401
IdHAL: dominique-lavenier
ORCID: 0000-0003-2557-680X

Scalable, Optimized and Parallel Algorithms for Genomics

Rayan Chikhi

Role: Corresponding author
PersonId: 14839
IdHAL: rayan-chikhi
ORCID: 0000-0003-1099-8735
IdRef: 16546769X

Scalable, Optimized and Parallel Algorithms for Genomics

Abstract

Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure reside in memory. Such a structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which requires only a fixed, user-defined amount of memory and disk space. This approach realizes a trade-off between memory, time and disk space. The multi-set of all k-mers present in the reads is partitioned, and the partitions are saved to disk. Each partition is then loaded separately into memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
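
The partition-then-count idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DSK implementation: the function names, the CRC32-based partitioning, the plain-text partition files and the `min_abundance` parameter are all assumptions made for illustration, and the sketch does not attempt to reproduce how DSK derives the number of partitions from the user-defined memory and disk budgets.

```python
import os
import tempfile
import zlib
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def partition_of(kmer, n_partitions):
    """Assign a k-mer to a partition (hypothetical choice: CRC32 modulo n)."""
    return zlib.crc32(kmer.encode("ascii")) % n_partitions


def count_kmers_partitioned(reads, k, n_partitions, min_abundance=1):
    """Yield (k-mer, count) pairs, holding one partition in memory at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{p}.txt") for p in range(n_partitions)]

        # Pass 1: stream the reads and write each k-mer to its partition file.
        files = [open(path, "w") for path in paths]
        try:
            for read in reads:
                for kmer in kmers(read, k):
                    files[partition_of(kmer, n_partitions)].write(kmer + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: load each partition into a temporary hash table and count.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.rstrip("\n")] += 1
            # Traverse the table, optionally filtering low-abundance k-mers.
            for kmer, count in table.items():
                if count >= min_abundance:
                    yield kmer, count
```

Because identical k-mers always land in the same partition, each partition can be counted independently. In this sketch, peak memory is governed by the number of distinct k-mers in the largest partition rather than in the whole dataset, at the cost of writing every k-mer to disk once. For example, `dict(count_kmers_partitioned(["ACGTACGT", "CGTA"], k=3, n_partitions=4))` counts the 3-mers of two toy reads.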

Keywords

bioinformatics; k-mer counting; next-generation sequencing

Domains

Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]

Main file

dsk_preprint.pdf (83.95 KB)

Origin: Files produced by the author(s)

Contributor: Rayan Chikhi

https://hal.science/hal-00778473

Submitted on: Sunday, January 20, 2013, 15:54:55

Last modified on: Friday, March 24, 2023, 14:52:56

Long-term archiving on: Sunday, April 21, 2013, 03:52:27

Dates and versions

hal-00778473, version 1 (20-01-2013)

Identifiers

HAL Id: hal-00778473, version 1
DOI: 10.1093/bioinformatics/btt020

Cite

Guillaume Rizk, Dominique Lavenier, Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 2013, 29 (5), pp.652-653. ⟨10.1093/bioinformatics/btt020⟩. ⟨hal-00778473⟩

Collections

INSTITUT-TELECOM EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

640 views

365 downloads
