Domain-specific accelerators (DSAs) have been shown to offer significant performance and energy-efficiency gains over general-purpose CPUs and are increasingly needed to meet ever-growing performance demands. However, DSAs implemented in field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) are notoriously hard to design, and achieving high performance requires deep hardware knowledge. Although recent advances in high-level synthesis (HLS) tools have made it possible to compile behavioral-level C/C++ programs into FPGA or ASIC designs, one still needs extensive experience in microarchitecture optimization, applying pragmas and code transformations to the input program, which presents a significant barrier for a typical application domain expert or software developer who wants to design a DSA. Even worse, evaluating each HLS design candidate is time consuming, which makes manual design iteration or automated exploration very difficult. The proposed project addresses these problems by developing a fully automated framework for evaluating and optimizing the microarchitecture of a DSA design without invoking the time-consuming HLS tools. It represents the input C/C++ program as one or more graphs carrying the proper data-flow and control-flow information, including automatically inserted optimization directives (pragmas), and then uses the latest advances in graph-based machine learning (ML) and ML-driven optimization to quickly evaluate each solution candidate and guide the optimization process. The goal of this project is to enable a typical software programmer to design highly efficient hardware DSAs with quality comparable to those designed by experienced circuit designers.
The team led by Professors Jason Cong and Yizhou Sun from the CS Department was recently awarded $1.2M from the National Science Foundation (NSF) for the project entitled “High Level Synthesis via Graph-Centric Deep Learning”.
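To make the graph-plus-ML flow described in the project summary concrete, below is a minimal, hypothetical sketch (not the project's actual code) of how a pragma-annotated kernel might be encoded as a graph with control/data-flow edges and pragma nodes and then scored by a GNN surrogate in milliseconds instead of by a full HLS run. It assumes PyTorch Geometric; the node features, edges, and model sizes are placeholders.

```python
# Minimal, hypothetical sketch (not the project's code): encode a pragma-
# annotated kernel as a graph with control/data-flow edges and pragma nodes,
# then score it with a GNN surrogate instead of running the HLS tool.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool


class QualityPredictor(torch.nn.Module):
    """Predicts a design-quality metric (e.g., latency) from a program graph."""

    def __init__(self, num_node_features: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        return self.head(global_mean_pool(h, batch))  # one scalar per graph


# Toy graph: three statement nodes plus one pragma node (e.g., an unroll
# directive attached to a loop); features are placeholder one-hot codes.
x = torch.eye(4)                                  # 4 nodes, 4 features each
edge_index = torch.tensor([[0, 1, 2, 3],          # edge sources
                           [1, 2, 0, 1]])         # edge targets
graph = Data(x=x, edge_index=edge_index)

model = QualityPredictor(num_node_features=4)
batch = torch.zeros(graph.num_nodes, dtype=torch.long)   # single-graph batch
predicted_latency = model(graph.x, graph.edge_index, batch)
```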
Below are summaries of the papers published under this project:
1. Automated Accelerator Optimization Aided by Graph Neural Networks
High-level synthesis (HLS) has freed computer architects from developing their designs in a very low-level language and from specifying exactly how data should be transferred at the register level. With the help of HLS, hardware designers need only describe a high-level behavioral flow of the design. Even so, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices at the higher level that require substantial time to explore. It also takes several minutes to hours to get feedback from the HLS tool on the quality of each design candidate. In this paper, we propose to solve this problem by modeling the HLS tool with a graph neural network (GNN) that is trained to generalize across a wide range of applications. The experimental results demonstrate that by employing the GNN-based model, we can estimate the quality of a design in milliseconds with high accuracy, which allows us to search through the solution space very quickly.
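As a hedged illustration of how such a millisecond-scale model can drive design space exploration, the sketch below ranks pragma configurations with a stand-in predictor (the trained GNN would take its place) and keeps only the top few candidates for confirmation with the real HLS tool. The pragma names, value ranges, and scoring function are invented for illustration.

```python
# Hedged sketch of surrogate-driven design space exploration. The stub
# predict_quality() stands in for the trained GNN: in the real flow it would
# encode the pragma-annotated program as a graph and run a millisecond-scale
# inference instead of a minutes-to-hours HLS compile.
import itertools
import random

# Hypothetical pragma knobs for one kernel; names and values are illustrative.
design_space = {
    "UNROLL_FACTOR_L1": [1, 2, 4, 8],
    "UNROLL_FACTOR_L2": [1, 2, 4],
    "PIPELINE_L2": ["off", "on"],
    "ARRAY_PARTITION": ["none", "cyclic", "block"],
}


def predict_quality(config: dict) -> float:
    """Return an estimated latency (lower is better) for a pragma configuration."""
    rng = random.Random(hash(frozenset(config.items())))  # deterministic toy score
    return rng.uniform(1e3, 1e6)


candidates = [dict(zip(design_space, values))
              for values in itertools.product(*design_space.values())]
ranked = sorted(candidates, key=predict_quality)

# Only the top few candidates are handed to the real HLS tool for confirmation.
for config in ranked[:3]:
    print(f"{predict_quality(config):10.1f} cycles (predicted)  {config}")
```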
2. Improving GNN-Based Accelerator Design Automation with Meta Learning
Recently, there has been growing interest in developing learning-based models as surrogates of High-Level Synthesis (HLS) tools, where the key objective is rapid prediction of the quality of a candidate HLS design for automated design space exploration (DSE). Training is usually conducted on a given set of computation kernels (or kernels for short) needed for hardware acceleration. However, the model must also perform well on new kernels. The discrepancy between the training set and new kernels, called domain shift, frequently leads to a drop in model accuracy, which in turn negatively impacts DSE performance. In this paper, we investigate the possibility of adapting an existing meta-learning approach, named MAML, to the task of design quality prediction. Experiments show that the MAML-enhanced model outperforms a simple baseline based on fine-tuning, both in offline evaluation on hold-out test sets and in online evaluation of DSE speedup.
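The following is an illustrative MAML-style training loop written under simplifying assumptions: a tiny functional MLP over fixed-size feature vectors and random toy data in place of real HLS-labeled kernels. It is meant only to show the inner-loop adaptation / outer-loop meta-update structure that the paper applies to a GNN-based quality predictor, not the paper's implementation.

```python
# Illustrative MAML-style loop (assumptions: a tiny functional MLP over
# fixed-size feature vectors and random toy data in place of HLS-labeled
# kernels). Only the inner-adaptation / outer-meta-update structure matters.
import torch
import torch.nn.functional as F


def forward(params, x):
    """Functional two-layer MLP so adapted weights stay in the autograd graph."""
    w1, b1, w2, b2 = params
    return F.linear(F.relu(F.linear(x, w1, b1)), w2, b2)


def init_params(in_dim=16, hidden=32):
    return [(torch.randn(hidden, in_dim) * 0.1).requires_grad_(),
            torch.zeros(hidden, requires_grad=True),
            (torch.randn(1, hidden) * 0.1).requires_grad_(),
            torch.zeros(1, requires_grad=True)]


meta_params = init_params()
meta_opt = torch.optim.Adam(meta_params, lr=1e-3)
inner_lr = 0.01

for step in range(100):
    # Each "task" is one kernel: adapt on a small support set of its labeled
    # designs, then evaluate the adapted model on a query set of that kernel.
    support_x, support_y = torch.randn(8, 16), torch.randn(8, 1)
    query_x, query_y = torch.randn(8, 16), torch.randn(8, 1)

    # Inner loop: one gradient step on the support set; create_graph=True keeps
    # the adaptation differentiable for the meta-update.
    support_loss = F.mse_loss(forward(meta_params, support_x), support_y)
    grads = torch.autograd.grad(support_loss, meta_params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(meta_params, grads)]

    # Outer loop: the adapted model's query loss updates the meta-parameters,
    # steering them toward an initialization that adapts quickly to new kernels.
    query_loss = F.mse_loss(forward(adapted, query_x), query_y)
    meta_opt.zero_grad()
    query_loss.backward()
    meta_opt.step()
```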
3. Robust GNN-based Representation Learning for HLS
The efficient and timely optimization of microarchitecture for a target application is hindered by the long evaluation runtime of each design candidate, which creates a serious burden. To tackle this problem, researchers have started using learning algorithms such as graph neural networks (GNNs) to accelerate the process by building a surrogate of the target tool. However, developing such models for HLS tools is challenging because of the program's long-range dependencies and the deep coupling between the input program and its transformations (i.e., pragmas). To address these challenges, we present HARP (Hierarchical Augmentation for Representation with Pragma optimization), which uses a novel hierarchical graph representation of the HLS design, introducing auxiliary nodes that carry high-level hierarchical information about the design. Additionally, HARP decouples the representation of the program from that of its transformations and includes a neural pragma transformer (NPT) to enable a more systematic treatment of the pragmas. The proposed graph representation and model architecture of HARP not only improve the accuracy of the model and the design space exploration built on it, but also strengthen the model's transfer learning capability, enabling easier adaptation to new environments.
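Below is a conceptual sketch of these two ideas under assumed toy data structures, not HARP's released code: auxiliary hierarchy nodes are added on top of a flat control/data-flow graph, and the pragma configuration is encoded separately (here by a small MLP standing in for the neural pragma transformer) before being fused with the program embedding.

```python
# Conceptual sketch of the two HARP ideas (not the released implementation):
# (1) auxiliary hierarchy nodes summarize loops/blocks so long-range
#     dependencies travel fewer hops in message passing, and
# (2) the pragma configuration is encoded separately and fused with the
#     program embedding before quality prediction.
import networkx as nx
import torch

# (1) Hierarchical augmentation: start from a flat control/data-flow graph and
# add one auxiliary node per loop, connected to the statements it contains.
g = nx.DiGraph()
g.add_edges_from([("stmt0", "stmt1"), ("stmt1", "stmt2")])     # flat CDFG edges
loops = {"loop_L1": ["stmt0", "stmt1"], "loop_L2": ["stmt2"]}
for loop, members in loops.items():
    g.add_node(loop, auxiliary=True)               # high-level hierarchy node
    g.add_edges_from((loop, stmt) for stmt in members)

# (2) Decoupled encoding: the program graph and the pragma settings get their
# own encoders; here a small MLP stands in for the neural pragma transformer.
program_embedding = torch.randn(1, 64)             # stand-in for a GNN readout
pragma_vector = torch.tensor([[4.0, 1.0, 0.0]])    # e.g. unroll=4, pipeline=on

pragma_encoder = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))
prediction_head = torch.nn.Linear(128, 1)

fused = torch.cat([program_embedding, pragma_encoder(pragma_vector)], dim=-1)
predicted_quality = prediction_head(fused)
```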
4. Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs
High-level synthesis (HLS) aims to raise the abstraction layer in hardware design, enabling the design of domain-specific accelerators (DSAs) like field-programmable gate arrays (FPGAs) using C/C++ instead of hardware description languages (HDLs). Compiler directives in the form of pragmas play a crucial role in modifying the microarchitecture within the HLS framework. However, the space of possible microarchitectures grows exponentially with the number of pragmas. Moreover, the evaluation of each candidate design using the HLS tool consumes significant time, ranging from minutes to hours, leading to a time-consuming optimization process. To accelerate this process, machine learning models have been used to predict design quality in milliseconds. However, existing open-source datasets for training such models are limited in terms of design complexity and available optimizations. In this paper, we present HLSyn, the first benchmark that addresses these limitations. It contains more complex programs with a wider range of optimization pragmas, making it a comprehensive dataset for training and evaluating design quality prediction models. The HLSyn benchmark consists of 42 unique programs/kernels, resulting in over 42,000 labeled designs. We conduct an extensive comparison of state-of-the-art baselines to assess their effectiveness in predicting design quality. As an ongoing project, we anticipate expanding the HLSyn benchmark in terms of both quantity and variety of programs to further support the development of this field.
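To give a sense of the scale involved, the back-of-the-envelope sketch below enumerates a hypothetical pragma space for a single small kernel. The pragma names, option counts, and per-run HLS time are invented for illustration, but the arithmetic shows why exhaustive evaluation is impractical and why a labeled benchmark such as HLSyn is needed to train fast predictors.

```python
# Back-of-the-envelope illustration of the design-space explosion. The pragma
# names, option counts, and per-run HLS time below are invented, but the
# arithmetic shows why exhaustive HLS evaluation does not scale.
import itertools

pragma_options = {
    "loop1_unroll": [1, 2, 4, 8, 16],
    "loop2_unroll": [1, 2, 4, 8],
    "loop2_pipeline": ["off", "on"],
    "array_a_partition": ["none", "cyclic2", "cyclic4", "block4"],
    "array_b_partition": ["none", "cyclic2", "cyclic4", "block4"],
}

num_designs = 1
for options in pragma_options.values():
    num_designs *= len(options)
print(f"{num_designs} candidate designs for one small kernel")   # 5*4*2*4*4 = 640

# At roughly 30 minutes of HLS time per candidate, even this toy space costs
# ~320 hours to evaluate exhaustively, versus seconds with a learned predictor.
print(f"~{num_designs * 0.5:.0f} hours of HLS runs at 30 minutes per candidate")

# Concrete configurations (e.g., inputs to a trained predictor) can then be
# enumerated lazily:
configs = (dict(zip(pragma_options, values))
           for values in itertools.product(*pragma_options.values()))
first_config = next(configs)
```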