Publications:
FCCM 2021: FANS: FPGA-Accelerated Near-Storage Sorting
ISCA 2020: Bonsai: High-Performance Adaptive Merge Tree Sorting
USENIX ATC 2019: INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive
FCCM 2018: High-Throughput Lossless Compression on Tightly Coupled CPU-FPGA Platforms
MEMSYS 2017: AIM: Accelerating Computational Genomics through Scalable and Noninvasive Accelerator-Interposed Memory
In the Big Data era, data volumes are exploding, which poses a new challenge to existing computer systems. Traditionally, computer systems are designed to be computing-centric: data from I/O devices is transferred to the CPU and then processed there. However, this data movement has proven to be very expensive and can no longer be ignored. To meet ever-increasing performance needs, we expect computer systems to be redesigned in a data-centric fashion, with computing engines deployed at different levels of the storage hierarchy, including cache, memory, and disk, to form a multi-level data processing system. Performing computation at the most appropriate level of the hierarchy is expected to greatly improve overall system performance and power efficiency.
In our group, we are currently focusing on in-storage computing (ISC) systems. Drive I/O speed plays an important role in overall data processing efficiency, even for in-memory computing frameworks. For decades, improvements in storage technology have continuously pushed drive speeds forward; as a result, the system bottleneck is shifting from the storage drive to the host/drive interconnect and the host I/O stack. This "data movement wall" prevents the high performance of emerging storage from being delivered to end users, which poses a new challenge to system designers. Rather than moving data from drive to host, ISC systems move computation from host to drive to avoid these bottlenecks. However, existing ISC solutions face several system challenges that make them less usable. We are focusing our work in two directions. First, we are addressing the low programmability, limited performance, and lack of system support in existing ISC systems. Second, we are identifying critical applications that would benefit from ISC systems and designing accelerators for them.
In our first thrust, we are working to make it easier for developers to achieve efficient and portable acceleration on heterogeneous in-storage computing (ISC) systems. Our key contributions include INSIDER, a high-performance full-stack ISC system that exposes a POSIX-like virtual file abstraction to interface application programs with ISC accelerators and supports streaming-based kernel development; and EISC, an FPGA-based ISC emulation system, with which we derived an analytical model for accurate quantitative performance analysis of ISC acceleration by evaluating a diverse set of 12 applications. These works enable rapid prototyping of ISC accelerators and are already widely adopted by research groups and industry partners. They enrich the understanding of the benefits and limitations of ISC acceleration and provide useful guidance for selecting applications for ISC-based acceleration. As future directions, we are also working on enabling NVMe-based device-to-device communication and analyzing the impact of such a system.
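To illustrate the general idea of a virtual file abstraction combined with a streaming kernel, the following is a minimal host-side sketch in Python. It is not INSIDER's API: the names `VirtualFile`, `Kernel`, and `grep_kernel` are hypothetical, and the "kernel" here runs in software on the host, whereas in an ISC system the equivalent logic would run on the drive. The point is only that the application issues ordinary file reads and receives the kernel's output rather than the raw data.

```python
import io
from typing import Callable, Iterator

# Hypothetical type: a streaming kernel consumes a stream of input chunks
# and produces a stream of output chunks.
Kernel = Callable[[Iterator[bytes]], Iterator[bytes]]


def grep_kernel(pattern: bytes) -> Kernel:
    """Build a streaming kernel that keeps only lines containing `pattern`."""
    def kernel(lines: Iterator[bytes]) -> Iterator[bytes]:
        for line in lines:
            if pattern in line:
                yield line
    return kernel


class VirtualFile(io.RawIOBase):
    """File-like view whose reads return kernel output, not the raw data."""

    def __init__(self, real_file: io.BufferedIOBase, kernel: Kernel) -> None:
        super().__init__()
        self._out = kernel(iter(real_file))  # lazily evaluated stream
        self._pending = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buf) -> int:
        # Pull the next non-empty chunk from the kernel when needed.
        while not self._pending:
            chunk = next(self._out, None)
            if chunk is None:
                return 0  # end of stream
            self._pending = chunk
        n = min(len(buf), len(self._pending))
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


real = io.BytesIO(b"a ok\nb no\nc ok\n")
filtered = VirtualFile(real, grep_kernel(b"ok")).read()
```

In this sketch the application code only calls `read()`, so swapping the software kernel for a drive-side accelerator would not, in principle, change the calling code.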
In our second thrust, we are identifying workloads suitable for ISC systems and designing state-of-the-art accelerators for them. One successful application we have identified is large-scale sorting. In the Big Data era, sorting tasks consume a growing share of datacenter resources and require large transfers between the host and the drive. In our contribution FANS, we developed a Smart-SSD-based external merge sort accelerator that achieves over 3x speed-up compared to the previous state-of-the-art sorting accelerator. Another interesting application we have identified is computational genomics. Genome applications have a large memory footprint and require complex operations for genome reconstruction/sequencing, making them both memory- and compute-intensive. In our paper AIM, we propose a scalable and non-invasive accelerator-interposed memory architecture that achieves 3.7x speed-up over CPU-side performance. We are also working on identifying other critical workloads that benefit from ISC systems and look forward to continuing to share our findings with the wider community.
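For readers unfamiliar with external merge sort, the algorithm can be sketched in a few lines of Python. This is a plain software illustration of the textbook technique, not the FANS hardware design; the function names and the `run_size` parameter are assumptions made for the example. Data that does not fit in memory is split into sorted runs written to storage, and the runs are then combined with a k-way merge. Both phases stream data to and from the drive, which is why the host/drive transfers dominate the cost.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _write_run(sorted_values: List[int], run_dir: str) -> str:
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(dir=run_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_values)
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream a sorted run back from storage without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_merge_sort(values: Iterable[int], run_size: int) -> Iterator[int]:
    """Sort `values`, buffering at most `run_size` items while forming runs."""
    with tempfile.TemporaryDirectory() as run_dir:
        run_paths: List[str] = []
        buffer: List[int] = []
        # Phase 1: cut the input into in-memory chunks, sort each, spill it.
        for v in values:
            buffer.append(v)
            if len(buffer) == run_size:
                run_paths.append(_write_run(sorted(buffer), run_dir))
                buffer = []
        if buffer:
            run_paths.append(_write_run(sorted(buffer), run_dir))
        # Phase 2: k-way merge of all sorted runs, streamed from storage.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))


result = list(external_merge_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
```

With `run_size=3`, the seven inputs are spilled as three sorted runs and then merged in a single pass.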
In conclusion, our ongoing research seeks to extend the applications and devices supported by the INSIDER framework and to push towards more comprehensive system support, programming model support, and integration support. We also explore opportunities to help the wider research community use ISC technology by accelerating critical applications.