HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Yi-Hsiang Lai1*, Yuze Chi2, Yuwei Hu1, Jie Wang2, Cody Hao Yu2, 3, Yuan Zhou1, Jason Cong2, Zhiru Zhang1*

1 School of Electrical and Computer Engineering, Cornell University, USA
2 Computer Science Department, University of California, Los Angeles, USA
3 Falcon Computing Solutions, Inc., USA
*{yl2666,zhiruz}@cornell.edu

ABSTRACT
With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with accelerators, such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge.

To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and an FPGA-targeted compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. HeteroCL further captures the interdependence among these different customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.

ACM Reference Format:

1 INTRODUCTION
Recent trends in technology scaling have led to a growing interest in non-traditional architectures that incorporate heterogeneity

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

FPGA ’19, February 24–26, 2019, Seaside, CA, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6137-8/19/02...
https://doi.org/10.1145/3289602.3293910

and specialization as a means to improve performance under strict power and energy constraints [4, 8–10, 16]. Heterogeneous architectures with extensive use of accelerators, such as GPUs and FPGAs, have shown significant potential in this role to bring in orders-of-magnitude improvement in computing efficiency for a wide range of applications. Along this line, the latest advances in the industry have produced highly integrated heterogeneous hardware platforms, such as the CPU+FPGA multi-chip packages by Intel [19] and the GPU/FPGA-enhanced AWS cloud by Amazon [29]. Although these heterogeneous computing platforms are becoming commercially available to a wide user base, they are challenging to program, especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized knowledge on low-level hardware details.

To address this deficiency, recent years have seen promising development on high-level synthesis (HLS) for FPGAs [12]. This is evidenced by the availability of commercial C++/OpenCL-to-FPGA solutions (e.g., Altera/Intel SDK for OpenCL [19] and Xilinx Vivado HLS [40]) and a rapidly growing number of FPGA designs synthesized by these tools [2, 18, 35, 42]. However, programming high-performance FPGA applications with HLS tools requires a deep understanding of hardware details and is entirely different from traditional software programming. In particular, current programming models for HLS entangle algorithm specifications with hardware customization techniques. This approach has several drawbacks: (1) In order to achieve good quality-of-results (QoRs), programmers are required to use various vendor-specific data types and pragmas/directives [43], rendering FPGA-targeted applications even less flexible and portable; (2) Existing HLS programming models cannot cleanly capture the interdependence among different hardware optimization techniques, thus weakening the support of user-guided or automatic design space exploration. For example, there is no easy way to inform the HLS tool that the shape of an on-chip buffer (e.g., depth and number of banks) directly depends on the degree of parallelization; (3) HLS users need to extensively restructure the source program to guide the tool to realize specialized architectures such as data reuse buffers and systolic arrays, which are nontrivial to describe with imperative code in C/C++.

There exists an active body of work attempting to further democratize accelerator programming by using domain-specific languages (DSLs) to simplify the development and optimization of applications in certain fields. For example, Halide [32] and Spark [41] are widely used in image processing and big data analytics, respectively. Another relevant example is TVM, which is a Python-based DSL for high-performance deep learning applications [6]. Similar to Halide, TVM separates the algorithm from temporal schedule optimization
(e.g., loop tiling and reordering), which significantly improves code portability across different CPU and GPU architectures.

Along this direction, we propose HeteroCL — a multi-paradigm programming infrastructure for software-defined heterogeneous computing, currently targeting CPU+FPGA platforms. HeteroCL builds on the TVM framework and extends it by explicitly exposing heterogeneity in two dimensions: (1) in programming model with mixed declarative and imperative code, and (2) in optimization with decoupled algorithm and compute/data customization. HeteroCL is designed to retain the distinct strengths of each programming paradigm/customization technique, but eliminates the complexity in using them together in a single application. More concretely, our main technical contributions are as follows:

- The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. It further captures the interdependence among different types of hardware customization, enabling productive and systematic design space exploration.
- Unlike existing DSLs which primarily focus on separating algorithm specifications from temporal compute schedule, HeteroCL further supports bit-accurate types and enables decoupling between algorithms and data quantization schemes. This allows the programmer to productively express and explore the rich design trade-offs between performance/area and accuracy.
- HeteroCL nicely blends declarative symbolic expressions with imperative code. It also provides a unified interface to specify customization schemes for both declarative and imperative programs. This allows our framework to support a broad range of applications.
- The HeteroCL framework produces highly efficient spatial architectures by incorporating state-of-the-art HLS optimizations such as PolySA [13] for systolic arrays and SODA [7] for stencil with dataflow architectures. This allows productive and effective acceleration of many popular workloads from image processing and machine learning domains.
- We have developed a fully automated compilation flow from a HeteroCL program to heterogeneous compute platforms integrating CPUs and FPGAs. Our compiler generates LLVM code on CPUs and HLS code for FPGA targets (currently using the Merlin compiler [11]).

The remainder of this paper is organized as follows. — In Section 2, we introduce HeteroCL with a motivating example and then describe each of its features in detail; Section 3 presents the HeteroCL compilation flow; We report the evaluation results in Section 4 and compare with related work in Section 5; Section 6 concludes this work and outlines future research directions.

2 THE PROGRAMMING MODEL

Figure 1 shows the overview of the proposed framework, where the input is a HeteroCL program composed of an algorithm specification and decoupled hardware customization schemes. We then lower it to an intermediate representation (IR) extended from Halide [32]. After that, we compile it to the back end specified by the users.

HeteroCL is a Python-based DSL extended from TVM [6]. We choose TVM for the following reasons: (1) Python-based DSL provides programmers with a rich set of productive language features such as introspection and dynamic type system. (2) TVM is a tensor-oriented declarative DSL. Its declarative style is similar to TensorFlow [1], which compiles and executes the computation graph constructed by a programmer. This approach is beneficial for uncovering more high-level optimization opportunities in extracting parallelism and maximizing data reuse. (3) TVM inherits the idea of decoupling the algorithm specification from the temporal schedule, which is first proposed by Halide [32].

In addition to the features offered by TVM, HeteroCL further exposes heterogeneity in two dimensions: in hardware optimization techniques and programming paradigms. Figure 1 shows the key strength of HeteroCL, where HeteroCL programs can exploit various hardware optimization techniques efficiently by decoupling the algorithm specification from three classes of common hardware customization for FPGAs, which are compute, data type, and memory customization. HeteroCL further provides a clean abstraction to capture the interdependence among different optimization techniques. Moreover, HeteroCL integrates imperative programming with an embedded tensor-oriented programming model for the applications with regular parallelism. Users can choose the programming model that fits best to express a given component.

In the rest of the section, we first use a motivating example to show how HeteroCL abstracts different types of hardware customization and captures their interdependence. We then describe each customization in more detail. Finally, we present the imperative DSL in HeteroCL.

2.1 A Motivating Example

We use dot product operation as a motivating example that utilizes all three types of hardware customization. Figure 2a shows the host program, where we compute the dot product between vectors A and B. We offload function dot_product (L14) to FPGA for acceleration. Note that we need to batch the inputs due to FPGA on-chip resource limitation (L9). Before sending the batched inputs to the accelerator via DMA, we pack them to fully utilize the off-chip memory bandwidth (L12-13).

Figure 2b shows the optimized dot_product program implemented in HLS C++ code, where we apply all three types of hardware customization. First, we utilize data type customization by quantizing the data type of local buffers local_A and local_B from floating to fixed-point type DType (L4). By reducing the bitwidthDW, we increase the number of elements per memory I/O access, which shortens the

Figure 1: Overview of the HeteroCL framework.
Results after unpack
Performance
Data type customization (§2.3)

(a) Host program
(b) Optimized HLS code

Figure 3: Example of compute customization in HeteroCL.

In addition, it is important to balance the computation time and the data communication time to maximize the hardware efficiency.

Specifically, we need to carefully balance the two components by tuning the data bitwidth DW and the number of PEs PAR. We increase the compute throughput by increasing PAR. We also increase the number of elements per I/O access by lowering DW. However, the final performance is bounded by the minimum of the two. We use the Roofline [39] diagram in Figure 2d to show the relation between DW and PAR.

Figure 3 shows how we apply compute customization in HeteroCL. First, we define the algorithm in Figure 3a, where we first import the


define N = 1024
2 define BATCH = 32
3 define MB = 64 /- off-chip memory bandwidth /
4 define DW = 32 /- bitwidth of the data element /
5 define PAR = 8 /- parallelization factor /
6 # parallelization factor
7 typedef MyType ap_uint<MB>;
8 void host_sw(float A[N], float B[N], float & sum)
9 for (int i = 0; i < N; i += BATCH)
10 MyType vec_A;
11 MyType vec_B;
12 pack(A + i, vec_A);
13 pack(B + i, vec_B);
14 sum += dot_product(vec_A, vec_B);
15 }
16

typedef DType fixed<8, 2>;
1 typedef MyType vec_A, MyType vec_B; (a) HeteroCL program
1 # algorithm
2 # primitives applied
3 # algorithm only
4 # quantization scheme
5 # quantization scheme
6 # quantization scheme
7 # quantization scheme
8 # quantization scheme
9 # quantization scheme
10 # quantization scheme
11 # quantization scheme
12 vec_A = hcl.placeholder((128,), UInt(64))
13 vec_B = hcl.placeholder((128,), UInt(64))
14 DType psum = 0;
15 int i = 0; i < N; i += BATCH/PAR; i++)
16 psum += A[i] * B[i];
17 #pragma HLS pipeline II=1
18 #pragma HLS unroll
19 return psum;
20 }
21

Figure 4: Example of data type customization in HeteroCL.
Here we unpack the data sent from DMA vec_A to a local buffer local_A. The shape of the local buffer varies according to the quantization schemes. If we quantize local_A to a 32-bit/8-bit fixed-point buffer, each element of vec_A will be unpacked to two/eight elements in local_A.

HeteroCL module (L1), define the range to be sum up (L3), and use a vector/tensor-oriented compute operation hcl.compute to describe the multiplication and accumulation operation that sums across i and returns a scalar (L4-5). The equivalent HLS code is shown in Figure 3b (L1-3). After that, we apply compute customization primitives, which are called scheduling functions in Halide/TVM, to a customization scheme created in separation of the algorithm (L7). The first primitive is a loop transformation primitive which splits loop 1 into a two-level nested loop 1 and j by a factor PAR (L8). We further apply two parallelization primitives that pipeline the outer loop 1 (L9) and unroll the inner loop j (L10). The equivalent code after applying customization primitives is in Figure 3b (L5-10).

Figure 4 shows the results of applying decoupled quantization schemes in HeteroCL. In the algorithm specification, we unpack data transmitted from the 64-bit DMA vec_A to a local
Then, we create a quantization scheme (L6) and quantize local_A from the hardware optimization specification (L14-32). We first apply roCL, where we cleanly separate the algorithm specification (L1-8) and parallelization factor PAR (L34) and pick the best scheme for final FPGA synthesis (L36-38). We then evaluate the built kernel in parallel. Similar to TVM [6], we decouple the algorithm specification from compute customization schemes. Table 1 lists compute customization primitives currently supported by HeteroCL. The primitives prevent programmers from using vendor-specific pragmas, which makes HeteroCL programs portable to different back ends.

### 2.2 Compute Customization

Compute customization improves the performance of a design by performing loop transformations and executing the computation in parallel. Similar to TVM [6], we decouple the algorithm specification from compute customization schemes. Table 1 lists compute customization primitives currently supported by HeteroCL. The primitives prevent programmers from using vendor-specific pragmas, which makes HeteroCL programs portable to different back ends.
Figure 7 shows an example of combining different types of compute customization primitives, where we implement KNN-based digit recognition in HeteroCL. The knn algorithm contains four operations, which are diff, dist, init, and update, respectively (L1-9). By merging different operations (L13-14), changing the loop order (L15), and applying parallelization schemes (L17-18), we can finally achieve more than $10 \times$ speedup on FPGA comparing with single-core single-thread CPU execution. We further show the step-by-step speedup results in Section 4.

### 2.3 Data Type Customization

Quantized computation using low-bitwidth integers and/or fixed-point types is an essential technique to achieve efficient execution on FPGAs. To represent bit-accurate data types, traditional C-based HLS tools use templates such as ap_int<> and ap_fixed<>. Although this approach allows programmers to parameterize the bitwidths, they need to run a separate script to iterate through different quantization schemes. HeteroCL addresses this challenge by utilizing Python classes to represent the data types, which allows users to try out different quantization schemes within the same program. Table 2 lists the data types currently supported by HeteroCL.

Even with the bit-accurate data type support, it remains a challenge for most application developers to determine the right data types with the right bitwidth to achieve the best trade-off between accuracy and efficiency. To solve this, HeteroCL further decouples the algorithm specification from quantization schemes. HeteroCL provides two quantization primitives in Table 3, where quantize(t, d) quantizes a list of floating-point variables t to a fixed-point type d whose format is defined in Table 2. In addition, downsize(t, d) reduces the precision of a list of integer variables t to an arbitrary-bit integer type d. With quantize and downsize, programmers can explore the trade-off between performance/area and accuracy by tuning the bitwidths of variables in the algorithm. Note that this decoupled approach is well-suited for automated bitwidth-tuning frameworks based on autotuning or rule-based heuristics. Users can further provide domain-specific knowledge such as the numerical range or the distribution of a variable to quantization primitives to guide the bitwidth searching process.

### 2.4 Memory Customization

Accelerating applications on FPGAs usually requires a high on-chip memory bandwidth to match the throughput of massively parallel compute units. Without customized memory architectures such as reuse buffers, the memory bandwidth could become the main hindrance preventing designs from achieving better performance. We decouple the algorithm from the memory customization and provide a set of primitives (Table 4). Moreover, programmers can apply several customization primitives in a user-defined sequence, which is not possible using pragmas supported by existing HLS tools.

Figure 8 shows an example of exploring different quantization schemes in HeteroCL, where we implement LeNet [26], a convolutional neural network (CNN) for digit recognition. The pre-trained model is in floating point. To explore different quantization schemes, we simply use a for loop to iterate through different bitwidths (L11). Since we know the output values of activation are between 1 and −1, we set the integer bitwidth to two (i.e., $i - (i - 2)$) (L13). We further present the evaluation results in Section 4.
2.5 Mapping to Spatial Architecture Templates

Many popular workloads from image/video processing and machine learning domains can be realized in a highly efficient manner on FPGAs using spatial architectures such as systolic arrays [33, 38]. However, with the traditional C-based HLS methodology, it typically requires extensive code restructuring and the insertion of a right combination of pragmas to guide the tool to generate a high-performance dataflow architecture composed of reuse buffers and data streams. Figure 9 shows an example of mapping a Jacobi kernel to the stencil with dataflow architecture by using the macro $s[jacobi.out].stencil()$ specified in Line 9. Given this macro, the HeteroCL back-end compiler will automatically identify the stencil pattern within the computation of out (L3-5) and synthesize the corresponding spatial architecture with the stencil back end (elaborated in Section 3).

```python
1 # algorithm specification
2 def conv(i_img, kernel):
3    ri = hcl.reduce_axis(0, 3, 'ri')
4    rj = hcl.reduce_axis(0, 3, 'rj')
5    return hcl.compute((N-2, N-2), lambda i, j: hcl.sum(
6        i_img[i+ri, j+rj] * kernel[ri, rj],
7        axis=[ri, rj]), name='i_img')
8 # memory customization - custom reuse buffers
9    s = hcl.create_scheme([i_img, kernel], conv)
10   lb = s[i_img].reuse_at(conv.o_img, conv.j)
11   wb = s[lb].reuse_at(conv.o_img, conv.i)
```

Figure 9: Example of defining custom reuse buffers and their hierarchy in HeteroCL.

```python
1 # algorithm specification
2 def mmult(A, B):
3    s = hcl.create_scheme([A, B], mmult)
4    return hcl.compute(A.shape, lambda x, y:
5        hcl.sum([A[x, k] * B[k, y], axis=k], 'C')
6 )
```

Figure 10: Example of mapping computations to stencil with dataflow architectures in HeteroCL.

**Systolic Array** – HeteroCL further provides efficient support for mapping to systolic arrays, which are widely used spatial architectures that consist a group of processing elements locally connected to each other [25]. Featuring local interconnects and modular designs, the systolic array architecture is highly scalable and can take advantage of the enormous amount of computation resources on modern FPGAs. It is particularly suitable for applications having perfectly nested loops with uniform dependency, such as matrix-matrix multiplication. However, it is a complex task to manually create systolic array designs on FPGAs. Recent research from Intel reports that it takes several to tens of months of human effort to implement a high-performance systolic array design, even with an HLS design entry like OpenCL [33].

Similar to stencil optimization, we introduce a systolic macro in the HeteroCL DSL to allow convenient mapping from tensor code to systolic array architectures. Figure 11 shows an example of using the systolic macro for matrix-matrix multiplication. Here we specify the computation of C (L8) to be synthesized with a specialized back end, which incorporates the PolySA framework [13] for automatic systolic array generation (discussed in Section 3).

```python
1 # algorithm specification
2 def jacobi(in_):
3    return hcl.compute(in_.shape, lambda y, x:
4        (in_[y, x-1] + in_[y-1, x] + in_[y, x] +
5        in_[y, x+1] + in_[y+1, x])/5, out")
```

Figure 11: Example of mapping a computation to systolic arrays in HeteroCL.

```
1 # algorithm specification
2 def mmult(A, B):
3    s = hcl.create_schedule([A, B], mmult)
4    # map to systolic arrays
5    s = hcl.create_schedule([A, B], systolic)
6
```
This approach, however, has some drawbacks: (1) The normal Python
names. For example, B.out
of supported compute operations in Table 6.
operations for mixed-paradigm programming. We show examples
compute tends existing compute operations (e.g.,
provides an imperative DSL listed in Figure 12. HeteroCL further ex-
using normal Python to support imperative programming, HeteroCL
semantic is too flexible to be FPGA synthesizable. (2) A designated
port imperative programming, such as TVM [6] and Hot&Spicy [34].
efficiently represented with declarative programming (e.g., sorting
algorithms). It also gives full control to programmers for specifying
the algorithmic details.
Some of the existing Python-based DSLs use normal Python to support
imperative programming, such as TVM [6] and Hot&Spicy [34].
This approach, however, has some drawbacks: (1) The normal Python
semantic is too flexible to be FPGA synthesizable. (2) A designated
parser/compiler must be built, which could be error-prone. Instead of
using normal Python to support imperative programming, HeteroCL
provides an imperative DSL listed in Figure 12. HeteroCL further ex-
tends existing compute operations (e.g., compute) and develops new
operations for mixed-paradigm programming. We show examples of
supported compute operations in Table 6.

Programmers can also apply hardware customization to the
imperative DSL. For instance, Figure 13 shows the popcount algorithm
implemented using the imperative DSL, where we apply both compute
and data type customization. Programmers access the vectors and
loops declared within a HeteroCL compute operation by their
names. For example, B.out in Line 11 refers to the vector out
declared in Line 3; B.loop1 in Line 14 refers to the for loop loop1 in

<table>
<thead>
<tr>
<th>BinOp</th>
<th>:=</th>
<th>*=</th>
<th>/=</th>
<th>%=</th>
<th>&amp;=</th>
<th>^=</th>
<th>&gt;&gt;=</th>
<th>&lt;&lt;=</th>
</tr>
</thead>
</table>

<table>
<thead>
<tr>
<th>Operation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>compute(s, f)</td>
<td>Compute a new tensor of shape s. The value of each element in the new tensor is calculated according to lambda function f.</td>
</tr>
<tr>
<td>update(t, f)</td>
<td>Update each element of tensor t according to lambda function f.</td>
</tr>
<tr>
<td>mutate(s, f)</td>
<td>Write a for loop of shape s in vector code, where f is a lambda function describing the for loop body.</td>
</tr>
</tbody>
</table>

Programmers can also apply hardware customization to the
imperative DSL. For instance, Figure 13 shows the popcount algorithm
implemented using the imperative DSL, where we apply both compute
and data type customization. Programmers access the vectors and
loops declared within a HeteroCL compute operation by their
names. For example, B.out in Line 11 refers to the vector out
declared in Line 3; B.loop1 in Line 14 refers to the for loop loop1 in

Figure 12: Imperative DSL in HeteroCL — We provide equivalent
semantics for commonly used expressions and statements in normal
Python. We also support bit-level operations for bit-accurate data
types. The imperative DSL highly resembles normal Python in that
they use same indentations, same rules for variable scope, and similar
keywords. This relieves new users from learning a whole new set of
syntax and semantics.

Figure 13: Example of applying hardware customization to
imperative DSL — We extend existing compute operations (e.g.,
compute) and customization primitives (e.g., `unroll`) in declarative
code to further support imperative programming.

3 BACK-END CODE GENERATION AND
OPTIMIZATION
The HeteroCL framework has multiple back-end supports including
CPU and HLS flows targeting FPGA. Specifically, we extend the
Halide IR used by TVM [6, 32] for our multi-paradigm programming
model and customization primitives. The extended Halide IR serves
as a unified representation for all back-end flows. In this section, we
briefly summarize our FPGA back-end code generation flow.

General Back End — The HeteroCL compiler can generate a corre-
responding accelerator kernel in many languages, including HLS
C/C++, OpenCL, and Merlin C. Merlin C is an OpenMP-like pro-
gramming model used by the Merlin compiler [11] from Falcon
Computing Solutions. We choose the Merlin compiler as one of
our back-end tools for two reasons. First, it leverages a small set of
OpenMP-like pragmas to apply certain architecture structures
by source-to-source C code transformation. Since Merlin pragmas
share lots of similarity with HeteroCL customization primitives, it is
relatively straightforward to integrate with HeteroCL. Second, the
Merlin compiler generates both HLS C kernels and OpenCL kernels
for FPGAs from the unified Merlin C source code.

Table 8 shows the correspondence between HeteroCL primitives
and Merlin C pragmas. The primitive `unroll` implies fine-grained
parallelism, which indicates all loop body logic to be scheduled in
the same hardware module. As a result, all sub-loops in the target
loop is flattened if a user applies `unroll` to a non-innermost loop.
On the other hand, the primitive `parallel` indicates coarse-grained
parallelism (e.g., a PE array). In addition, if a `pipeline` primitive is
assigned to a non-innermost loop, we map it to a coarse-grained
pipeline architecture.

Since HeteroCL primitives and Merlin C pragmas mainly specify
loop scheduling or memory organization, the implied architecture
can be represented as a composable, parallel, and pipeline (CPP)
arquitectures [14]. The authors in [14] have demonstrated that the
CPP architecture can be applied to broad classes of applications with
a good performance.

Stencil Back End — We incorporate the SODA framework pro-
posed in [7] to implement stencil patterns with optimized dataflow
architecture that minimizes the on-chip reuse buffer size. SODA takes
in a lightweight DSL that describes the stencil compute patterns and
design parameters. After the HeteroCL compiler identifies stencil
patterns according to user-specified macros, it generates the proper
DSL code to the SODA framework for hardware generation. In ad-
dition, hardware customization primitives such as loop unrolling
and data quantization are also reflected in the SODA DSL as design
parameters, which in turn guide the SODA framework for further
optimization.
We run the baseline designs on one CPU core with a single thread, which further performs automated design space exploration that optimizes the systolic array architecture including the shape of it and the interconnection between PEs.

4 EVALUATION

In this section, we evaluate the accelerators generated by HeteroCL. The platform we target is the AWS EC2 f1.2xlarge instance, which has 8 vCPU cores, 122GiB main memory, and a Xilinx Virtex UltraScale+™ VU9P FPGA. The default target frequency for this platform is 250 MHz.

We select several common FPGA benchmarks from a broad range of applications that are applied with either general, stencil, or systolic array back ends. For the general back end, we have (1) KNN-based digit recognition, which is simplified from that of Rosetta [43], (2) K-means algorithm code, and (3) Smith-Waterman [36]. For the stencil back end, we have (1) Gaussian, (2) Jacobi, and (3) Seidel. All of them are from Polybench [30]. For the systolic back end, we use (1) general matrix multiplication (GEMM) and (2) deep learning inference with LeNet model [26]. Among these benchmarks, KNN-based digit recognition, K-means, and Smith-Waterman need to be implemented with the HeteroCL imperative DSL.

Table 7 shows the benchmarks and the overall evaluation results. We run the baseline designs on one CPU core with a single thread. For the two systolic benchmarks, we are comparing our FPGA implementations with the GEMM function provided in Intel MKL [20] and a LeNet model optimized with TVM [6], respectively. We include memory transfer time (i.e., between DDR4 and FPGA) as part of the total run time. After applying proper customization primitives, we achieve up to 20.9× speedup for the benchmarks with the general back end. Moreover, we can achieve up to 13.2× and 10.6× speedup for benchmarks applied with stencil and systolic array back end, respectively. In the rest of this section, we show the detailed evaluation of each back end.

**General Back End** – We first evaluate the impact of HeteroCL customization primitives on performance with the general back end. Table 9 shows the speedup of generated accelerator kernels after step-by-step applications of customization primitives. We first show the speedup without applying any customization primitive and that after applying parallelization primitives such as unroll11 and parallel. However, without applying appropriate loop transformation primitives such as split and reorder, the performance improvement could be limited. Thus, we also show the results after applying those primitives to the benchmarks. Table 9 demonstrates the permutability of HeteroCL, where programmers can easily explore the design space just by adding or removing primitives without changing the algorithm code.

**Stencil Back End** – Table 10 shows the speedup of the benchmarks after applying the stencil macro, customization primitives, and the ideal speedup. If we only apply the macro, we are only up to 1.1× faster because of the limited parallelism. To improve the performance, we apply parallelization primitives (i.e., unroll11), which
Figure 14: Accuracy of LeNet with different quantization schemes.

Table 11: Performance results of systolic array benchmarks – The performance is in terms of Giga operations per second (GOPs).

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Backend</th>
<th>Data Type</th>
<th>Performance (GOPs)</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>LeNet Inference</td>
<td>CPU [6]</td>
<td>float32</td>
<td>15.4</td>
<td>1.0</td>
</tr>
<tr>
<td></td>
<td>FPGA</td>
<td>float32</td>
<td>79.8</td>
<td>5.2</td>
</tr>
<tr>
<td></td>
<td></td>
<td>fixed16</td>
<td>137.8</td>
<td>8.9</td>
</tr>
<tr>
<td>GEMM</td>
<td>CPU [20]</td>
<td>float32</td>
<td>76.0</td>
<td>1.0</td>
</tr>
<tr>
<td></td>
<td>FPGA</td>
<td>float32</td>
<td>245.9</td>
<td>3.2</td>
</tr>
<tr>
<td></td>
<td></td>
<td>fixed16</td>
<td>807.6</td>
<td>10.6</td>
</tr>
</tbody>
</table>

results in up to 6.7× speedup. At this stage, since the performance bottleneck of all benchmarks is the off-chip memory bandwidth (about 13.8GB/s), we cannot get higher throughput by further increasing the unrolling factor. To address this, we quantize the single-precision floating-point numbers to 16-bit fixed-point numbers. As a result, the required external memory bandwidth could be reduced and we can achieve an additional 2× speedup. As a reference, the last column shows the ideal speedup assuming the memory bandwidth is perfectly utilized. In summary, Table 10 shows that by combining spatial architecture macros with other types of hardware customization, we can further improve the performance.

Systolic Array Back End – We finally evaluate the benchmarks applied with systolic macros. For both applications, we evaluate the impact of data type customization on performance, which we present in Table 11. By using both spatial architecture macros and data type customization primitives, we can improve the performance of both designs.

We further show the accuracy results after applying different quantization schemes to LeNet benchmark in Figure 14, where the X-axis shows the number of total bitwidth and the Y-axis shows the accuracy. We demonstrate three different scenarios, which are quantizing the activation, weights, and both, respectively. We observe that with 8-bit fixed-point type, the accuracy degradation is marginal. Moreover, if we choose to quantize the activation only, 4-bit fixed-point type is the best choice.

5 RELATED WORK

There exists a large body of work on HLS and domain-specific programming. In this section, we survey a small subset of representative efforts on C-based HLS, DSLs for hardware accelerator designs, and those that support decoupled algorithm and optimizations.

C-based HLS – HLS tools such as LegUp [5], Intel FPGA SDK [21], and Xilinx Vivado HLS [40] allow developers to write FPGA designs in C/C++ and OpenCL, delivering higher productivity than traditional register-transfer-level (RTL) designs. The recently introduced Merlin compiler greatly simplifies the HLS design by applying source-to-source transformation to automatically generate optimized HLS-C or OpenCL programs [11]. However, to achieve good QoRs, developers are required to use various vendor-specific data types and pragmas/directives, rendering FPGA design with HLS less flexible and portable.

HeteroCL lifts the abstraction level of FPGA programming and provides developers with a systematic way to efficiently explore various trade-offs, making FPGA design more portable and productive.

DSLs for Hardware Accelerator Design – There is a growing interest in compiling programs written in high-level languages (e.g., Python, Scala) into reconfigurable hardware accelerators. Hot & Spicy compiles annotated Python code into HLS C/C++, where the annotations are translated into pragmas [34]. DHDL introduces a representation of hardware using parameterized templates in Scala that captures locality and parallelism information and compiles the representation into FPGAs and CGRAs [24]. Spatial extends DHDL by adding a set of low-level abstractions for control and memory [23]. However, in these DSLs the algorithm specification is tightly entangled with hardware optimizations, making design space exploration less productive.

HeteroCL decouples algorithm specification from hardware customization, and abstracts three important types of hardware customization into a set of customization primitives, enabling productive and systematic design space exploration. HeteroCL further offers additional macros in stencil and systolic for efficient mapping to highly optimized spatial architecture templates.

DSLs with Decoupled Algorithm and Optimization – Most computing patterns in image processing and deep learning can be concisely described as nested loops in a declarative programming paradigm, as illustrated in a lot of DSLs [17, 22, 27, 37]. Halide first proposes to decouple the algorithm specification from the temporal schedule [32]. Tiramisu extends Halide by adding explicit communication, synchronization, and mapping buffers to different memory hierarchies [3, 33]. Jing Pu, et al. also extend Halide to support custom reuse buffers and support FPGAs and CGRAs as back end [31]. T2S extends Halide by decoupling the temporal schedule from the algorithm specification, which allows programmers to define systolic-array-like architectures [33]. TVM builds a deep learning compiler stack on top of Halide IR, supporting both CPUs and GPUs [6]. While the declarative programming paradigm in these DSLs is powerful, it cannot express applications beyond image processing and deep learning.

HeteroCL, as a multi-paradigm programming infrastructure, nicely blends declarative symbolic expressions with imperative code, and provides a unified interface to specify customization schemes for both declarative and imperative programs. This allows HeteroCL to support a broader range of applications.

More specifically, we list the major differences between TVM and HeteroCL as follows: (1) TVM extensively uses declarative programming to target deep learning applications, while HeteroCL supports mixed imperative and declarative programming to target general applications. (2) TVM tries to solve the optimization challenges mainly for CPUs and GPUs, while HeteroCL focuses on hardware customization for FPGA and incorporates advanced spatial architecture templates. (3) TVM programs can target FPGAs as back end by using VTA, a programmable accelerator that uses a RISC-like programming abstraction to describe tensor operations [28]. On the other hand, HeteroCL programs are not limited to tensor operations.
In addition, programmers can apply various hardware customization techniques with provided primitives while the hardware generated by VTA is pre-defined. (4) HeteroCL supports bit-accurate data types, which are not available in TVM. Furthermore, HeteroCL proposes to decouple the quantization scheme from algorithm specification.

6 CONCLUSIONS AND FUTURE WORK

We have presented HeteroCL, a multi-paradigm programming infrastructure for heterogeneous platforms integrating CPUs and FPGAs. HeteroCL not only provides a clean abstraction that decouples the algorithm from compute/data customization, but it also captures the interdependence among them. Moreover, HeteroCL incorporates spatial architecture templates including systolic arrays and stencil with dataflow architectures. We believe HeteroCL can help developers to focus more on designing efficient algorithms rather than being distracted by low-level implementation details.

We are releasing the proposed framework in an open-source format. The programming infrastructure as well as the associated documents and example designs are publicly available on the authors’ website. Additionally, we plan to introduce primitives for data and device placement, and also data streaming interfaces. We will also integrate HeteroCL with more spatial architecture templates, distributed autotuning capabilities, and accurate QoR estimation boosted by machine learning techniques.

ACKNOWLEDGEMENTS

This research was supported in part by CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, NSF/Intel CAPA Award #1723773, DARPA Young Faculty Award D15AP00096, NSF Awards #1453378, #1436827, and #1707408, and research gifts from Intel and Xilinx. We thank Amazon for providing AWS EC2 credits. We thank Prof. Adrian Sampson (Cornell), Dr. Hongbo Rong (Intel), and Dr. Justin Gottschlich (Intel) for providing helpful feedback on the HeteroCL framework. We also thank Ritchie Zhao, Ziyian Feng, Shaojie Xiang, Yichi Zhang, Patrick Clobreidge, and Qing Yu for their contributions to the HeteroCL benchmarks.

REFERENCES