DSK: k-mer counting with very low memory usage - Université de Rennes
Journal article, Bioinformatics, Year: 2013

DSK: k-mer counting with very low memory usage

Abstract

Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require a large data structure to reside in memory, and this structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which requires only a fixed, user-defined amount of memory and disk space. The approach trades off memory, time and disk usage. The multiset of all k-mers present in the reads is partitioned, and the partitions are saved to disk. Each partition is then loaded separately into memory in a temporary hash table, and the k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
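
The partition-then-count scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the DSK implementation: the function name `count_kmers_partitioned`, its parameters, the use of `zlib.crc32` as the partitioning hash, and the one-k-mer-per-line text format of the partition files are all assumptions made for illustration. The sketch also does not enforce any memory or disk budget; it only shows how splitting the multiset of k-mers across files lets each hash table hold a fraction of the distinct k-mers.

```python
import os
import tempfile
import zlib
from collections import Counter


def count_kmers_partitioned(reads, k, n_partitions=4, min_abundance=1):
    """Yield (kmer, count) pairs, holding one partition's hash table at a time.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"part_{i}.txt") for i in range(n_partitions)]
        files = [open(p, "w") for p in paths]
        try:
            # Pass 1: stream the reads and append every k-mer to the
            # partition file selected by a hash of the k-mer.
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    part = zlib.crc32(kmer.encode("ascii")) % n_partitions
                    files[part].write(kmer + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: load each partition into a temporary hash table,
        # traverse it to report counts, then discard it.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.rstrip("\n")] += 1
            for kmer, count in table.items():
                # Optional filtering of low-abundance k-mers.
                if count >= min_abundance:
                    yield kmer, count


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTA"]
    for kmer, count in sorted(count_kmers_partitioned(reads, k=3)):
        print(kmer, count)
```

Because identical k-mers always hash to the same partition, each partition can be counted independently and the per-partition results never need to be merged. Increasing `n_partitions` shrinks the largest hash table at the cost of more open files.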
Main file
dsk_preprint.pdf (83.95 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00778473 , version 1 (20-01-2013)

Identifiers

HAL Id: hal-00778473
DOI: 10.1093/bioinformatics/btt020

Cite

Guillaume Rizk, Dominique Lavenier, Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 2013, 29 (5), pp.652-653. ⟨10.1093/bioinformatics/btt020⟩. ⟨hal-00778473⟩
