[Tech Blog] PCIe SSD for Genome assembly

Introduction


Genome assembly software takes a huge number of short DNA sequence fragments (called “reads”) and tries to assemble them into long DNA sequences that represent the original chromosomes. It generally consumes not only a lot of computational power but also a large amount of working memory. The required memory depends on the input data size, but it often runs into terabytes. Although the price of DRAM has dropped dramatically, a workstation or server with a few TB of memory is still very expensive. For example, an IBM System x3690 X5 can be fitted with 2TB of memory, but its list price is more than $300k.

On the other hand, PCI Express SSD boards are on the rise. Many hardware vendors, such as Intel, Fusion-io (acquired by SanDisk), and OCZ (acquired by Toshiba), have released a variety of PCIe-SSD boards. A PCIe-SSD generally has lower bandwidth than DRAM but much higher bandwidth than a standard SATA/SAS SSD. It costs somewhat more than a standard SATA SSD, but a Fusion-io ioFX 2.0 with 1.6TB of capacity is listed at $5,418.95 on Amazon.com. Even if you install this board in a high-end workstation, the total price is still below $10,000, which is much cheaper than a server with 2TB of memory.

In this blog post, I would like to explore whether using a PCIe-SSD in place of DRAM is a viable solution. I will use an open-source genome assembler called “velvet” as the benchmark software.

Methods

First, I downloaded, compiled and installed “velvet” from the velvet web site.

$ tar xvfz velvet_1.2.08.tgz
$ cd velvet_1.2.08
$ make 'MAXKMERLENGTH=51' 'OPENMP=1'
$ sudo cp velvetg velveth /usr/local/bin

In this procedure, ‘MAXKMERLENGTH=51’ sets the maximum k-mer length that the compiled binaries will accept, and ‘OPENMP=1’ enables multi-threading with OpenMP. The k-mer length is a very important parameter in genome assembly, as it affects the quality of the output DNA sequence. For more detail on k-mers, please refer to the velvet user manual.
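
To make the term concrete: a k-mer is simply a substring of length k taken from a read. The following is a minimal sketch and not part of velvet; the kmers helper is a hypothetical name used only for illustration, and it assumes awk is available.

```shell
# Print every k-mer (substring of length k) of a sequence, one per line.
# kmers is a hypothetical helper for illustration; it is not a velvet command.
kmers() {
    # $1 = sequence, $2 = k
    awk -v seq="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}

# A 7-base read contains 7 - 3 + 1 = 5 overlapping 3-mers.
kmers ACGTACG 3
```

A read of length L contains L - k + 1 k-mers, so a larger k gives fewer, more specific k-mers per read.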

Velvet has two processes, velveth and velvetg. The velveth process creates a graph file to prepare the genome assembly. It does not need much memory. The velvetg process, which performs the actual assembly, consumes much more memory and computation time.

Before we can start testing, we need input files. In this experiment, we will use two fastq files, SRR000021 and SRR000026, which were downloaded from this site. I processed these data with velveth as follows:

$ velveth SRR2126.k11  11 -short -fastq SRR2126.fastq

The first argument is the output directory name, the second is the k-mer length, the third specifies short-read input, the fourth specifies the “fastq” file type, and the fifth is the input file.

The next step is assembly. The command to do so is as follows:

$ velvetg SRR2126.k11

The argument is the output directory generated by velveth. I measured the elapsed time of this command in several different hardware configurations:

1. Memory 4GB + Generic SATA HDD
2. Memory 4GB + Fusion IO ioFX
3. Memory 8GB
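
The post does not say how the elapsed time was taken, so the following is only a sketch of one way to do it. The elapsed_seconds helper is a hypothetical name, and whole-second resolution from date is an assumption that seems adequate for runs lasting minutes or hours.

```shell
# Run a command and print its wall-clock time in whole seconds.
# elapsed_seconds is a hypothetical helper; the original measurement
# method is not stated in the post.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >&2                 # send the command's own output to stderr
    status=$?
    end=$(date +%s)
    echo $((end - start))    # only the elapsed time goes to stdout
    return $status           # preserve the command's exit status
}

# Intended use (requires velvet to be installed):
#   elapsed_seconds velvetg SRR2126.k11
```

Because the command's output is redirected to stderr, the seconds value can be captured on its own, for example with secs=$(elapsed_seconds velvetg SRR2126.k11).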

For configuration #2, I created a swap file on the ioFX like this:

# dd if=/dev/zero of=/mnt/iofx/swap0 bs=1024 count=12582912
# chmod 600 /mnt/iofx/swap0
# mkswap /mnt/iofx/swap0 12582912
# swapoff -a
# swapon /mnt/iofx/swap0
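
As a quick sanity check on the dd parameters above, the swap file size is simply bs times count. The arithmetic below is a small sketch; the variable names are only illustrative.

```shell
# Size of the swap file implied by dd's bs and count (values from the dd command).
bs=1024            # bytes per block
count=12582912     # number of blocks
bytes=$((bs * count))
gib=$((bytes / 1024 / 1024 / 1024))
echo "swap file: $bytes bytes = $gib GiB"
```

This works out to 12 GiB, which is comfortably more than the amount that has to spill out of 4GB of RAM. On Linux, the active swap areas can be listed afterwards with cat /proc/swaps.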

Results

The velvetg process uses about 8GB of memory for this input data, so roughly half of its working data spills to swap in configurations 1 and 2. The figure below shows the elapsed time for each configuration. With the generic SATA HDD, the process had not finished after 2 hours, so I killed it.

[Figure: elapsed time of velvetg for each configuration]

Conclusion

So is using a PCIe-SSD a viable solution? It is hard to say, as a 3x difference in elapsed time is not small. In this particular experiment, only about half of the memory used by velvetg was placed on the PCIe-SSD. As mentioned earlier, the real-life data that bioinformaticians deal with can require a few TB of memory. If almost all of the working memory is on the PCIe-SSD, performance is expected to be much worse.

However, considering that the HDD configuration could not even complete the process in a reasonable amount of time, the PCIe-SSD card vastly improved performance over swapping to an HDD. The PCIe-SSD can therefore serve as a good compromise, since DRAM is much more expensive than an SSD.

 

 
