Time: 1:30pm to 5:00pm PST, Sunday February 27, 2022
As high-level synthesis (HLS) tools mature, HLS-synthesizable C/C++/OpenCL is becoming popular as a design entry language for FPGA accelerator implementation. However, the pragmas and coding style of the HLS input program have a significant impact on the quality of the final accelerator design. There is therefore growing interest in source-to-source transformation and optimization tools that automatically generate HLS-friendly C/C++/OpenCL code. The recent community-wide effort on MLIR (Multi-Level Intermediate Representation) and the open-sourcing of the Merlin Compiler by Xilinx (via its acquisition of Falcon Computing Solutions) open up more opportunities for source-to-source transformation and optimization. This workshop features six talks on the latest progress in this area. It ends with a panel of leaders in FPGA synthesis from academia and industry, who will discuss “What’s next on source-to-source transformation?”
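To make the idea of pragma-driven source-to-source transformation concrete, here is a minimal sketch in Python. It is a toy text-based rewriter, not how any of the tools presented in this workshop work (they operate on compiler intermediate representations rather than on raw text). The kernel `vadd` and the helper `insert_loop_pragmas` are hypothetical names chosen for illustration. `#pragma HLS pipeline II=1` is a standard Xilinx HLS loop pragma.

```python
import re

# Toy HLS-style C kernel; illustrative only, not taken from any talk.
KERNEL = """\
void vadd(const int a[1024], const int b[1024], int c[1024]) {
  for (int i = 0; i < 1024; i++) {
    c[i] = a[i] + b[i];
  }
}
"""

# Matches a line that consists only of a for-loop header ending in "{".
FOR_HEADER = re.compile(r"^(\s*)for\s*\(.*\)\s*\{\s*$")


def insert_loop_pragmas(source, pragmas):
    """Insert each pragma as the first line of every for-loop body."""
    out = []
    for line in source.splitlines():
        out.append(line)
        match = FOR_HEADER.match(line)
        if match:
            # Indent the pragma one level deeper than the loop header.
            indent = match.group(1) + "  "
            out.extend(f"{indent}#pragma {p}" for p in pragmas)
    return "\n".join(out) + "\n"


optimized = insert_loop_pragmas(KERNEL, ["HLS pipeline II=1"])
print(optimized)
```

The printed kernel is identical to the input except for the pragma line placed at the top of the loop body. A real tool would also decide which pragmas are legal and profitable for each loop, which is the hard part.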
1:30 PM - 1:35 PM Introduction: Jason Cong (5 mins)
1:35 PM - 3:25 PM Session 1: Latest Progress on Source-to-Source Transformation for HLS (110 mins)
Chair: Jason Anderson (University of Toronto)
Abstract: High-level synthesis (HLS) has been widely adopted because it significantly improves hardware design productivity and enables efficient design space exploration (DSE). Existing HLS tools are built on compiler infrastructures that largely rely on a single level of abstraction, such as LLVM. However, because HLS designs typically have intrinsic structural or functional hierarchies, different HLS optimization problems are often better solved at different levels of abstraction. This tutorial describes ScaleHLS, a new scalable and customizable HLS framework built on top of the multi-level compiler infrastructure MLIR. ScaleHLS can compile HLS C/C++, ONNX models, or PyTorch models to optimized HLS C/C++ in order to generate high-quality RTL designs using downstream tools such as Xilinx Vivado HLS. ScaleHLS represents HLS designs at multiple levels and provides an HLS-dedicated analysis and transform library to solve each optimization problem at a suitable level. Using this library, we provide a DSE engine that generates optimized HLS designs automatically. In addition, we develop an HLS C front-end and a C/C++ emission back-end to translate HLS designs into and from MLIR, enabling a source-to-source compilation flow. Experimental results show that, compared to baseline designs that are optimized only by Xilinx Vivado HLS, without manual code rewriting or pragma insertion, ScaleHLS improves quality of results by up to 768.1x on computation-kernel-level programs and up to 3825x on neural network models. ScaleHLS is open source and has been adopted by researchers and students. In this tutorial, we will introduce the key features of ScaleHLS, its optimization passes at different abstraction levels, the DSE engine, and the end-to-end ScaleHLS flow. We will also demonstrate the steps of using ScaleHLS and discuss how the research community can leverage this framework to build additional features on top of it, including design verification, IP integration, and RTL generation by connecting to the CIRCT framework as a back-end.
Speaker: Hanchen Ye, UIUC
Authors: Hanchen Ye (University of Illinois at Urbana-Champaign), Cong Hao (Georgia Institute of Technology), Jianyi Cheng (Imperial College London), Hyunmin Jeong (University of Illinois at Urbana-Champaign), Jack Huang (University of Illinois at Urbana-Champaign), Stephen Neuendorffer (Xilinx Inc.), Deming Chen (University of Illinois at Urbana-Champaign)
Abstract: High-level synthesis (HLS) has lifted the design abstraction from RTL to C/C++, and HLS tools have been increasingly adopted by FPGA and ASIC designers. However, HLS tools typically require substantial code rewriting to arrive at quality designs. Moreover, the rewriting requires extensive knowledge of not only hardware micro-architecture design but also the coding style of the particular HLS tool chosen, which limits the usability of HLS, especially for software developers. The Merlin Compiler was developed to address this programming challenge in HLS-based design flows. Merlin supports a small set of high-level OpenMP-like directives that let users specify the parallelism to explore. It can also automatically perform a set of advanced hardware-oriented optimizations specific to the input code and the target platform to arrive at high-performance designs. Combined with an HLS tool, Merlin can greatly simplify hardware accelerator design. For FPGAs, Merlin can also generate the code needed to interface CPUs and FPGAs, so that the accelerated functions can be integrated and linked with the CPU code to produce an executable that runs across CPUs and FPGAs. This is especially useful now that FPGAs are increasingly used to accelerate big data applications in data centers. The Merlin Compiler was developed by Falcon Computing Solutions, a startup that originated at UCLA and was acquired by Xilinx in 2020. Merlin was open-sourced in 2021 to enable wide adoption and future research. In this tutorial, we will describe the high-level programming model and the existing source-level optimization transformations in Merlin.
Speaker: Peichen Pan
Authors: Youxiang Chen (Xilinx), Jason Cong (UCLA), Min Gao (Google), Peichen Pan (Leda Technology)
Abstract: Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the difficulty of programming FPGAs creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually restructure code and perform cumbersome parameter tuning to achieve optimal performance. While existing work has leveraged many learning models to automate the design of efficient accelerators, the unpredictability of modern HLS tools is a major obstacle to maintaining high accuracy. To address this problem, we propose an automated DSE framework, AutoDSE, that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design at each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE identifies design points that achieve a geometric-mean speedup of 19.9x over one CPU core on the MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in the Xilinx Vitis libraries, AutoDSE reduces the number of optimization pragmas by 26.38x while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators. AutoDSE is open sourced at: https://github.com/UCLA-VAST/AutoDSE
Speaker: Atefeh Sohrabizadeh, UCLA
Authors: Atefeh Sohrabizadeh (UCLA), Cody Hao Yu (Amazon Research), Min Gao (Google), Jason Cong (UCLA)
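The bottleneck-guided idea in the AutoDSE abstract can be illustrated with a toy greedy search. This is only a sketch in the spirit of that description, not AutoDSE's actual algorithm. The loop names, trip counts, factor options, resource budget, and the analytic cost model are all hypothetical assumptions; a real flow would read latency and resource numbers from the HLS tool's reports.

```python
# Hypothetical cost model: a loop's latency is trip_count / parallel_factor,
# and each unit of parallelism costs one resource unit.
TRIP_COUNTS = {"load": 1024, "compute": 8192, "store": 1024}
FACTOR_OPTIONS = [1, 2, 4, 8, 16]
RESOURCE_BUDGET = 24


def latencies(factors):
    """Estimated latency of each loop under the given parallel factors."""
    return {name: TRIP_COUNTS[name] / f for name, f in factors.items()}


def resources(factors):
    """Estimated resource usage: the sum of all parallel factors."""
    return sum(factors.values())


def bottleneck_guided_search(budget=RESOURCE_BUDGET):
    """Repeatedly bump one parameter, trying the slowest loop first."""
    factors = {name: 1 for name in TRIP_COUNTS}
    while True:
        lat = latencies(factors)
        # Visit loops from slowest to fastest; the slowest is the bottleneck.
        for name in sorted(lat, key=lat.get, reverse=True):
            idx = FACTOR_OPTIONS.index(factors[name])
            if idx + 1 == len(FACTOR_OPTIONS):
                continue  # this parameter is already at its largest option
            trial = dict(factors, **{name: FACTOR_OPTIONS[idx + 1]})
            if resources(trial) <= budget:
                factors = trial  # accept the move and re-detect the bottleneck
                break
        else:
            # No single-parameter move fits the budget: stop.
            return factors


best = bottleneck_guided_search()
print(best, sum(latencies(best).values()), resources(best))
```

Each iteration changes a single parameter, so most of the budget goes to the loop that dominates latency instead of being spread evenly across all parameters.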
Abstract: The exploding complexity and computation efficiency requirements of applications are stimulating a strong demand for hardware acceleration with heterogeneous platforms such as FPGAs. However, a high-quality FPGA design is very hard to create and optimize, as it requires FPGA expertise and long design iteration times. In contrast, software applications are typically developed in short development cycles with high-level languages like Python, which offer much higher levels of abstraction than all existing hardware design flows. To close this gap between hardware design flows and software applications, and to simplify FPGA programming, we create PyLog, a high-level, algorithm-centric, Python-based programming and synthesis flow for FPGAs. PyLog is powered by a set of compiler optimization passes and a type inference system to generate high-quality hardware designs. It abstracts away the implementation details and allows designers to focus on algorithmic specification. The whole FPGA development flow, including synthesis and execution, is automated. Evaluation shows that PyLog significantly improves FPGA design productivity and generates highly efficient FPGA designs that outperform highly optimized CPU implementations and state-of-the-art FPGA implementations by 3.17x and 1.24x on average, respectively. PyLog is flexible and scalable. It can target both edge and cloud devices, and works naturally with the PYNQ runtime environment for whole-system deployment. PyLog has been open-sourced to enable future research. In this tutorial, we will demonstrate the steps of using PyLog, including its synthesis and runtime flows, and present its features, use cases, and the future directions of the project.
Speaker: Sitao Huang (UC Irvine)
Authors: Sitao Huang (University of California, Irvine); Kun Wu, Jialiang Zhang, Deming Chen, and Wen-mei Hwu (ECE, University of Illinois at Urbana-Champaign)
Abstract: With the pursuit of improving compute performance under strict power constraints, there is an increasing need to deploy applications to heterogeneous hardware architectures with accelerators such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially when FPGAs are involved. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and an FPGA-targeted compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization: compute, data types, and memory architectures. HeteroCL further captures the interdependence among these customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact. In this tutorial, we will introduce the key features of HeteroCL, including its mixed-paradigm programming style and decoupled hardware customizations. Moreover, we will demonstrate how one can leverage HeteroCL to describe and optimize a realistic design by applying a variety of customization primitives and targeting spatial architecture templates.
Speaker: Yi-Hsiang Lai, Cornell
Authors: Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, and Zhiru Zhang (Cornell)
Abstract: The hls4ml package translates trained neural network models into synthesizable FPGA firmware. The firmware library targets efficient, ultrafast inference for its original application in real-time processing in particle physics. However, the generality of the package makes it applicable to a wide range of scientific and industrial areas in which real-time on-device processing is needed. The hls4ml package is application-driven and focuses on usability while allowing for deep customization, including tunable parallelism, quantization, and model pruning. It is integrated with quantization-aware training frameworks for a fully codesigned workflow, resulting in low-precision weights and activations and enabling very lightweight inference without loss of model accuracy. We will introduce a few use cases to demonstrate the power and breadth of the workflow.
Speaker: Nhan V Tran, Fermi National Accelerator Laboratory
Authors: The hls4ml team, see fastmachinelearning.org
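As a rough illustration of what "low-precision weights and activations" means in the hls4ml abstract, the sketch below maps a real value onto a signed fixed-point grid with a given total bit width and integer bit width, in the style of an `ap_fixed<W, I>` type. The helper `quantize_fixed` is a hypothetical name and not part of hls4ml. The sketch assumes round-to-nearest and saturation for simplicity, whereas the Xilinx `ap_fixed` defaults are truncation and wrap-around.

```python
import math


def quantize_fixed(x, total_bits=16, int_bits=6):
    """Quantize x to a signed fixed-point value with the given bit widths.

    int_bits counts the sign bit; the remaining bits are fractional.
    Assumes round-half-up and saturation (a simplification).
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                  # number of steps per unit
    lo = -(1 << (total_bits - 1))           # smallest representable code
    hi = (1 << (total_bits - 1)) - 1        # largest representable code
    code = math.floor(x * scale + 0.5)      # round to the nearest step
    code = max(lo, min(hi, code))           # saturate instead of wrapping
    return code / scale


for value in (0.3, 100.0, -100.0):
    print(value, "->", quantize_fixed(value))
```

Fewer total bits mean smaller multipliers and less memory on the FPGA, at the cost of a coarser grid. Quantization-aware training lets the model adapt to that grid during training.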
3:25 PM - 3:40 PM Break (15 mins)
3:40 PM - 4:50 PM Session 2: Panel: What’s next on source-to-source transformation? (70 mins)
Organizer and Moderator: Jason Cong (University of California, Los Angeles)
Panelists: Deming Chen (UIUC), John Freeman (Intel), Stephen Neuendorffer (Xilinx), Zhiru Zhang (Cornell)