Acceleration of Deep Learning for Cloud and Edge Computing

Project status: 

In this project, we explore efficient algorithms and architectures for state-of-the-art deep learning applications. In the first work, we explore learning algorithms and acceleration techniques for graph learning. The second work, Caffeine, offers a uniform framework to accelerate the full stack of convolutional neural networks (CNNs), including both convolutional layers and fully connected layers. Following this work, we further explore efficient microarchitectures for implementing the computation-intensive kernels in CNNs. A special architecture, the systolic array, which consists of processing elements (PEs) with local interconnects, is thoroughly studied. In the CLINK project, an LSTM inference kernel is designed for EEG signal processing on neurofeedback devices; it demonstrates high speedups and energy efficiency on FPGAs compared to CPU and GPU counterparts. Lastly, FlexCNN performs end-to-end optimization of deep learning applications by integrating a flexible FPGA accelerator into the TensorFlow framework.

Below are the detailed summaries of each project.


1. Acceleration of Graph-based Machine Learning

While much progress has been made in the past decade in hardware acceleration for machine learning (ML) on images, speech, and video, there has been limited focus on accelerating ML for graphs. Graphs are ubiquitous and are often the fundamental data structure in applications ranging from bioinformatics, chemistry, healthcare, recommender systems, and social network analysis to system analysis and network security. Machine learning using graph-based data representations has received increasing attention in the past two years, with the development of several graph neural network (GNN) algorithms. Further, graphs are not only useful as a representation of data (as in GNNs); they have also been shown to be an extremely efficient model representation (e.g., arithmetic circuit representations of probabilistic graphical models). Graph-based ML algorithms pose unique challenges to existing CPUs and GPUs: the graph structure imposes highly irregular memory accesses (poor for GPUs), node-wise convolution and other operations require dense computation (poor for CPUs), and the non-uniform graph structure imposes dynamic fine-grained parallelism (poor for both). In this work, we propose a novel accelerator architecture for graph-based ML, named GraphDeep, that efficiently addresses these challenges. GraphDeep is a programmable heterogeneous multi-accelerator architecture with spatially exposed compute and memory resources, specialization for the relevant access and compute patterns, and an architecture-aware task-based execution model. We will construct a full software stack for realistic benchmarks and will evaluate the design with a cycle-level simulator, a synthesizable hardware design, and FPGA prototyping.
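
To make the mix of irregular access and dense compute concrete, the following minimal sketch shows one generic graph-convolution layer in plain Python. It is not GraphDeep's design or programming model: the CSR graph layout, the mean aggregation, the ReLU, and the names `gnn_layer`, `row_ptr`, and `col_idx` are all assumptions chosen for illustration.

```python
def gnn_layer(row_ptr, col_idx, features, weight):
    """One illustrative graph-convolution layer (hypothetical helper).

    row_ptr, col_idx: graph adjacency in CSR form.
    features: one feature vector per node, shape [num_nodes][in_dim].
    weight: dense matrix, shape [in_dim][out_dim].
    """
    num_nodes = len(row_ptr) - 1
    in_dim = len(features[0])
    out_dim = len(weight[0])
    output = []
    for v in range(num_nodes):
        start, end = row_ptr[v], row_ptr[v + 1]
        degree = end - start  # differs per node: non-uniform amount of work
        agg = [0.0] * in_dim
        for e in range(start, end):
            u = col_idx[e]  # data-dependent index: irregular memory access
            for d in range(in_dim):
                agg[d] += features[u][d]
        if degree > 0:
            agg = [x / degree for x in agg]  # mean over neighbors
        # Dense node-wise transform: regular, compute-heavy part.
        out = [sum(agg[d] * weight[d][o] for d in range(in_dim))
               for o in range(out_dim)]
        output.append([max(0.0, x) for x in out])  # ReLU
    return output
```

The inner gather loop depends on the graph structure, while the final transform is a small dense matrix-vector product; the two phases stress memory and compute in different ways, which is the tension described above.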

2. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved remarkable success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of computation-demanding CNNs, FPGA-based acceleration has emerged as one of the most attractive alternatives. We design and implement Caffeine, a hardware/software co-designed library that efficiently accelerates the entire CNN on FPGAs. First, we propose a uniform convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the utilization of the underlying FPGA computing and bandwidth resources, with a key focus on bandwidth optimization through memory access reorganization, which was not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configuration. Finally, we integrate Caffeine into the industry-standard deep learning software framework Caffe.
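
The idea of treating convolutional and fully connected layers uniformly can be illustrated with a small sketch. The code below only demonstrates the general equivalence between a fully connected layer and a convolution whose kernel covers the whole input; it is not Caffeine's actual mapping, which also involves batching and memory access reorganization. The function names `conv2d`, `fully_connected`, and `fc_as_conv` are hypothetical.

```python
def conv2d(inp, kernels):
    """Naive convolution, stride 1, no padding.

    inp: [C][H][W]; kernels: [M][C][R][S]; returns [M][H-R+1][W-S+1].
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    R, S = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for k in kernels:
        fmap = []
        for y in range(H - R + 1):
            row = []
            for x in range(W - S + 1):
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += inp[c][y + r][x + s] * k[c][r][s]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out


def fully_connected(vec, weights):
    """Reference fully connected layer: one dot product per output neuron."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def fc_as_conv(inp, weights):
    """Run a fully connected layer through conv2d.

    Each weight row (length C*H*W) is reshaped into a C x H x W kernel,
    so every output feature map collapses to a single 1 x 1 value.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    kernels = [
        [[row[(c * H + y) * W:(c * H + y) * W + W] for y in range(H)]
         for c in range(C)]
        for row in weights
    ]
    return [fmap[0][0] for fmap in conv2d(inp, kernels)]
```

Because both layer types reduce to the same loop nest, a single accelerator datapath can in principle serve both; what differs is the ratio of computation to data movement.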

3. Automatic Systolic Array Synthesis

Modern FPGAs are equipped with an enormous amount of resources. However, existing implementations have difficulty fully leveraging the computation power of the latest FPGAs. We implement CNNs on an FPGA using a systolic array architecture, which can achieve a high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a CNN implementation using a systolic array. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFLOPS for the floating-point data type and 1.2 TOPS for 8-16 bit fixed point.

The project above focuses on systolic array synthesis for CNNs. We are also working on improving the generality of the approach so that more general applications can be mapped to systolic arrays. We present our ongoing compilation framework, PolySA, which leverages the power of the polyhedral model to achieve end-to-end compilation for systolic array architectures on FPGAs. PolySA is the first fully automated compilation framework for generating high-performance systolic array architectures on FPGAs, leveraging recent advances in high-level synthesis. We demonstrate PolySA on two key applications: matrix multiplication and convolutional neural networks. PolySA is able to generate optimal designs within one hour, with performance comparable to state-of-the-art manual designs.
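
As an illustration of how a systolic array computes with only local interconnects, the sketch below simulates a generic output-stationary array for matrix multiplication, cycle by cycle. It is a software model under simplifying assumptions (one multiply-accumulate per PE per cycle, inputs skewed at the array edges) and is not the architecture generated by our framework or by PolySA; `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Simulate an M x N output-stationary systolic array computing A @ B.

    A is M x K and B is K x N (lists of lists). PE (i, j) keeps C[i][j]
    locally. Values of A move one PE to the right per cycle and values
    of B move one PE down per cycle. Returns (C, number_of_cycles).
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    a_reg = [[0] * N for _ in range(M)]  # A operand held in each PE
    b_reg = [[0] * N for _ in range(M)]  # B operand held in each PE
    C = [[0] * N for _ in range(M)]
    cycles = M + N + K - 2  # last useful MAC happens at cycle M+N+K-3
    for t in range(cycles):
        for i in range(M):
            # Shift row i to the right, starting from the far edge.
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i  # row i is fed with a delay of i cycles (skew)
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            # Shift column j downward, starting from the bottom edge.
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j  # column j is fed with a delay of j cycles (skew)
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every PE performs one multiply-accumulate on its local operands.
        for i in range(M):
            for j in range(N):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C, cycles
```

Each PE only ever reads from its left and upper neighbors, which is why such arrays avoid long global wires and can sustain high clock frequencies at high resource utilization.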

4. CLINK: Compact LSTM Inference Kernel for Energy Efficient Neurofeedback Devices 

Neurofeedback devices measure brain waves and generate feedback signals in real time, and they can be employed to treat various neurological diseases. Such devices require high energy efficiency because they need to be worn by or surgically implanted into patients and must support a long battery lifetime. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve highly energy-efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% of the computational operations. Based on this method, we propose energy-efficient customizable circuits for realizing the CLINK function. We demonstrated a 128-channel EEG processing engine on a Zynq-7030 at 0.8 W, and the scaled-up 2048-channel evaluation on a Virtex VU9P shows that our design can achieve 215x and 7.9x higher energy efficiency than highly optimized implementations on an E5-2620 CPU and a K80 GPU, respectively. We also carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve an energy efficiency of 272.8 pJ/inference, which outperforms our design on the Virtex VU9P by a further 99x.
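
For readers unfamiliar with LSTM inference, the sketch below implements one time step of a standard LSTM cell in plain Python. It shows only the textbook equations; CLINK's compact kernel, its quantization, and its circuit-level design are not reproduced here. The weight layout (dictionaries keyed by the gate names "i", "f", "o", "g") and the function names are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell (illustrative, unoptimized).

    x: input vector; h_prev, c_prev: previous hidden and cell state.
    W[g]: [hidden][input], U[g]: [hidden][hidden], b[g]: [hidden],
    for each gate g in "i" (input), "f" (forget), "o" (output),
    and "g" (candidate cell value).
    """
    hidden = len(h_prev)

    def affine(gate):
        # W[gate] @ x + U[gate] @ h_prev + b[gate], one entry per unit.
        return [
            sum(w * xi for w, xi in zip(W[gate][n], x))
            + sum(u * hi for u, hi in zip(U[gate][n], h_prev))
            + b[gate][n]
            for n in range(hidden)
        ]

    i_gate = [sigmoid(v) for v in affine("i")]
    f_gate = [sigmoid(v) for v in affine("f")]
    o_gate = [sigmoid(v) for v in affine("o")]
    cand = [math.tanh(v) for v in affine("g")]

    # New cell state: keep part of the old state, add part of the candidate.
    c = [f_gate[n] * c_prev[n] + i_gate[n] * cand[n] for n in range(hidden)]
    # New hidden state: gated, squashed view of the cell state.
    h = [o_gate[n] * math.tanh(c[n]) for n in range(hidden)]
    return h, c
```

For streaming EEG data, a step like this would run once per sample per channel, so the number of multiply-accumulate operations per step largely determines the energy per inference.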

5. FlexCNN: End-to-End Optimization of Deep Learning Applications

The irregularity of recent Convolutional Neural Network (CNN) models, such as the reduced data reuse and parallelism caused by extensive network pruning and simplification, creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there can be significant overheads when integrating FPGAs into existing machine learning frameworks such as TensorFlow. This problem is mostly overlooked by previous studies, yet our study shows that a naive FPGA integration into TensorFlow can lead to up to 8.45x performance degradation. To address these challenges, we propose several SW/HW co-design approaches that perform end-to-end optimization of deep learning applications. We present a flexible and composable architecture called FlexCNN. It delivers high computation efficiency for different types of convolution layers using techniques including dynamic tiling and data layout optimization. FlexCNN is further integrated into the TensorFlow framework with a fully pipelined software-hardware integration flow, which alleviates the high overheads of the TensorFlow-FPGA handshake and other non-CNN processing stages. We use OpenPose, a popular CNN-based application for human pose recognition, as a case study. Experimental results show that the FlexCNN architecture optimizations achieve a 2.3x performance improvement, and the pipelined integration stack leads to a further 5x speedup. Overall, the SW/HW co-optimization produces a speedup of 11.5x and results in an end-to-end performance of 23.8 FPS for OpenPose with floating-point precision, which is the highest performance reported for this application on FPGA in the literature.
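
The general structure of a pipelined software-hardware flow can be sketched with threads and bounded FIFO queues. This is a generic pattern under stated assumptions, not FlexCNN's TensorFlow integration: the stages are arbitrary Python callables standing in for pre-processing, the accelerator call, and post-processing, and `run_pipeline` is a hypothetical helper. In CPython the sketch shows the overlap structure only, not a real speedup for CPU-bound stages.

```python
import queue
import threading


def run_pipeline(frames, stages):
    """Run each stage in its own thread, connected by bounded FIFO queues.

    While stage k works on frame n, stage k-1 can already work on
    frame n+1, so stage latencies overlap instead of adding up.
    Results are returned in input order.
    """
    done = object()  # sentinel marking the end of the stream
    queues = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is done:
                q_out.put(done)  # forward the sentinel downstream
                return
            q_out.put(stage(item))

    def feeder():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(done)

    threads = [
        threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
        for i, s in enumerate(stages)
    ]
    threads.append(threading.Thread(target=feeder))
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

In steady state, the throughput of such a pipeline is bounded by its slowest stage rather than by the sum of all stage latencies, which is why removing handshake stalls between stages matters for end-to-end frame rate.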

Please find the source code of FlexCNN at: