Zhiyang Wang: smallWig: Compression on RNA-seq Expression Data

Wednesday, November 19, 2014 - 12:15pm
Zhiyang Wang

The start of the 21st century has witnessed a dramatic decrease in the cost of DNA/RNA sequencing. As a result, petabytes of genomic data are generated every year. However, current storage solutions and prices do not scale with such massive data surges. It is therefore of great importance to develop efficient and task-oriented compression methods. One important type of genomic data is expression data, which represents the number of RNA molecules transcribed from every nucleotide location in the genome. Apart from the classical BigWig compression method, which is based on gzip, and the very recent work termed cWig, no highly efficient compression methods for expression data are currently known.

We developed a specialized lossless approach to compressing RNA-seq expression and whole-transcriptome data, combining various source coding methods to achieve fast encoding and a compact data representation. We also provided flexible random access that enables fast visualization of subsequences of expression data, with the access time and the compression rate tunable by the application. For archiving purposes, more elaborate context-based compression techniques were implemented as well, which further reduced the compressed file sizes.
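
To make the trade-off between access time and compression rate concrete, here is a minimal Python sketch. It is not the smallWig algorithm: it assumes a simple stand-in scheme in which per-nucleotide expression values are run-length encoded and then compressed with zlib in independent fixed-size blocks. The function names and the block_size parameter are hypothetical and chosen only for illustration. Smaller blocks mean less data to decompress per query, while larger blocks typically compress better.

```python
import struct
import zlib


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def compress_blocks(values, block_size):
    """Compress each block of block_size positions independently.

    Illustrative assumption: values are non-negative integers that fit in
    32 bits. Independent blocks are what make random access possible.
    """
    blocks = []
    for start in range(0, len(values), block_size):
        runs = run_length_encode(values[start:start + block_size])
        raw = b"".join(struct.pack("<II", v, n) for v, n in runs)
        blocks.append(zlib.compress(raw))
    return blocks


def query(blocks, block_size, start, end):
    """Return values[start:end], decompressing only the overlapping blocks."""
    if start >= end or not blocks:
        return []
    first = start // block_size
    last = min((end - 1) // block_size, len(blocks) - 1)
    out = []
    for b in range(first, last + 1):
        raw = zlib.decompress(blocks[b])
        for off in range(0, len(raw), 8):
            v, n = struct.unpack_from("<II", raw, off)
            out.extend([v] * n)
    offset = start - first * block_size
    return out[offset:offset + (end - start)]


# Toy expression track: long runs of equal values, as is typical of coverage.
coverage = [0] * 50 + [3] * 20 + [7] * 10 + [0] * 40
blocks = compress_blocks(coverage, block_size=32)
region = query(blocks, 32, 45, 85)
```

In this sketch, a query for positions 45 to 85 touches only two of the four blocks, which is the kind of partial decoding that fast visualization of a subsequence relies on.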