Customized Computing for Big-Data Applications

Project status: 

In the era of big data, many applications present significant computational challenges. For example, in the field of bioinformatics, the computational demand of personalized cancer treatment is prohibitively high for general-purpose computing technologies: tumor heterogeneity requires great sequencing depth, structural aberrations are difficult to detect with today’s algorithms, and tumors can evolve, meaning the same tumor might be assayed many times during the course of treatment. The goal of this research project is to apply the domain-specific customized computing techniques developed by the Center for Domain-Specific Computing (CDSC) to greatly accelerate the fundamental computational tasks in big-data applications. Specifically, we are exploring the acceleration of the following topics:

1. Data compression, which is widely used in data centers to reduce data movement and storage overhead. While FPGAs are well suited to accelerating computation-intensive lossless compression algorithms, big-data compression with parallel requests poses two challenges for an FPGA-based compression system: how to scale the FPGA compression accelerator to support high-throughput compression with multiple parallel threads, and how to trade off the accelerator's compression speed against its compression quality for different application domains. In this project we focus on designing a high-performance compression system with high compression quality for a wide range of big-data applications.

2. Sorting, which is a key operation in many big-data processing systems such as Hive, Spark SQL, and Map-Reduce-Merge. Sorting performance is determined by the sorting algorithm, the problem size, the available hardware computation resources, and the system's memory hierarchy. However, no existing work takes all of these factors into consideration and gives optimal solutions for different scenarios. In this project we define a general approach that helps designers choose the optimal hardware sorting solution based on the available hardware resources and the problem size they face. We also explore the best sorting accelerator architecture across various memory hierarchies, such as DDR DRAM, High Bandwidth Memory (HBM), and solid-state drives (SSDs).

3. Genome sequencing. Genome sequencing analyzes similarities between DNA or protein sequences to evaluate the genetic relationship between organisms or species. It is a crucial way for genetic researchers to unveil the mysteries of life, and it is widely used in bioinformatics, drug and medicine design, and related areas. However, the rate of genome data generation far exceeds the processing power of general-purpose CPUs. Therefore, we turn to FPGAs to exploit the massive but irregular parallelism within genome sequencing algorithms and effectively shorten the end-to-end computation time of the sequencing pipeline.
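The two questions raised under the data compression topic (scaling to parallel requests, and trading compression speed against compression quality) can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the project's FPGA design: it uses Python's standard `zlib` (DEFLATE) as a stand-in lossless compressor, treats the zlib compression level as the speed/quality knob, and uses a thread pool to mimic independent parallel requests. The helper names (`compress_chunk`, `compress_parallel`, `measure`) and the synthetic input are hypothetical.

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunk(chunk: bytes, level: int) -> bytes:
    """Compress one independent request with DEFLATE at the given level (1-9)."""
    return zlib.compress(chunk, level)


def compress_parallel(chunks, level, workers=4):
    """Compress independent chunks concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: compress_chunk(c, level), chunks))


def measure(chunks, level, workers=4):
    """Return the compression ratio and wall-clock time for one level."""
    start = time.perf_counter()
    compressed = compress_parallel(chunks, level, workers)
    elapsed = time.perf_counter() - start
    raw_bytes = sum(len(c) for c in chunks)
    packed_bytes = sum(len(c) for c in compressed)
    return {"level": level, "ratio": raw_bytes / packed_bytes, "seconds": elapsed}


if __name__ == "__main__":
    # Synthetic, made-up input; real workloads will behave differently.
    sample = b"ACGTTGCAAGGCTTAACCGG customized computing for big data " * 4096
    chunks = [sample] * 8  # eight independent "requests"
    for level in (1, 6, 9):  # lower level: faster; higher level: smaller output
        print(measure(chunks, level))
```

Running the script prints one ratio and timing per level, which makes the trade-off visible for a given input. A hardware accelerator faces the same choice in a different form, since the resources spent on one high-quality compression engine could instead be replicated into several faster, lower-quality engines.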

CDSC develops a general methodology for creating novel customizable computing platforms, together with the associated compilation tools and runtime management environment, to support domain-specific computing. The recent focus is on the design and implementation of accelerator-rich architectures, from single chips to data centers. The work also includes highly automated compilation tools and runtime management software systems for customizable heterogeneous platforms, including multi-core CPUs, many-core GPUs, and FPGAs, as well as a general, reusable methodology for customizable computing applicable across different domains. By combining these critical capabilities, the goal is to deliver a supercomputer-in-a-box or supercomputer-in-a-cluster (for data-center-level deployment) that can be customized to an application domain to enable disruptive innovations. Our approach has been successfully demonstrated in the domains of genomic processing (BWA-MEM + GATK) and medical image processing, with over 10X improvement in performance and 100X in energy efficiency.

Invited Talks

"Datacenter-Scale Customizable Computing", First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15), held in conjunction with Supercomputing 2015, November 22, 2015 (Keynote Speech)

"Characterization and Acceleration for Genomic Sequencing and Analysis", IEEE International Symposium on Workload Characterization, Seattle, WA, October 2017 (Keynote Speech)