Software Release

EBMF: Exact Binary Matrix Factorization

This project provides an SMT-based solving method and a heuristic, row packing, for the exact binary matrix factorization (EBMF) problem. Additionally, we provide an SMT method to find the fooling set size of a binary matrix.
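
To make the fooling set notion concrete, here is a minimal brute-force sketch. It is not the SMT method of the project: the function names `is_fooling_set` and `max_fooling_set_size` are hypothetical, and the search is exponential, so it only suits tiny matrices. It assumes the usual definition, in which a fooling set is a set of 1-entries such that no two of them fit in a common all-ones rectangle; its size is a lower bound on the number of rectangles in any exact factorization.

```python
from itertools import combinations


def is_fooling_set(matrix, cells):
    """Return True if `cells`, a list of (row, col) pairs, is a fooling set."""
    # Every chosen cell must itself be a 1-entry.
    if any(matrix[r][c] != 1 for r, c in cells):
        return False
    # Two cells share an all-ones rectangle exactly when both "cross"
    # entries are 1, so at least one of them must be 0 for every pair.
    for (r1, c1), (r2, c2) in combinations(cells, 2):
        if matrix[r1][c2] == 1 and matrix[r2][c1] == 1:
            return False
    return True


def max_fooling_set_size(matrix):
    """Exhaustively find the largest fooling set size (tiny inputs only)."""
    ones = [(r, c)
            for r, row in enumerate(matrix)
            for c, value in enumerate(row)
            if value == 1]
    # Try the largest candidate sizes first and stop at the first hit.
    for size in range(len(ones), 0, -1):
        if any(is_fooling_set(matrix, list(subset))
               for subset in combinations(ones, size)):
            return size
    return 0
```

For a 3x3 identity matrix the three diagonal entries form a fooling set, while an all-ones matrix admits only a single-entry fooling set because it is one rectangle.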


Optimal Layout Synthesizer of Quantum Circuits for Dynamically Field-Programmable Qubits Array.


Callipepla & SerpensCG are two conjugate gradient solvers on HBM FPGA.

Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication

Serpens is a high bandwidth memory based accelerator for general-purpose sparse matrix-vector multiplication. We build the Serpens accelerator on a Xilinx Alveo U280 card. Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS).
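
For readers unfamiliar with the kernel, the following is a small software reference of sparse matrix-vector multiplication, y = A x. It is only a sketch of the computation being accelerated: the CSR storage layout and the function name `spmv_csr` are assumptions chosen for illustration and say nothing about how Serpens stores or streams the matrix.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # Accumulate the dot product of row `row` with x, touching
        # only the stored nonzeros.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y
```

Each stored nonzero contributes one multiply and one add, which is why SpMV throughput is commonly reported per nonzero processed.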

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Sextans is an FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM).

Pyxis: An Open-Source Performance Dataset of Sparse Accelerators

Pyxis collects open-source accelerator designs and their performance data.

AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators


Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve the optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient...

Merlin Compiler

We are excited that Xilinx has made the decision to open-source the Merlin compiler under the Apache license. The Merlin compiler was originally developed by Falcon Computing Solutions, a spin-off from the VAST Lab, which was acquired by Xilinx in 2020. Multiple research projects in the VAST Lab, such as [S2FA], [HeteroCL], and...

AutoSA: Polyhedral-Based Systolic Array Auto-Compilation

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

Extending High-Level Synthesis for Task-Parallel Programs


C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly...

OLSQ: Optimal Layout Synthesis for Quantum Computing

Many quantum computers have constraints on the connections between qubits. However, a quantum program may not conform to these constraints. Thus, it is necessary to perform 'layout synthesis for quantum computing', LSQC, which transforms quantum programs prior to execution so that the connectivity issues are resolved. OLSQ can solve LSQC optimally with respect to depth, number of SWAP gates, or fidelity. There is also a transition-based mode (TB) to speed it up with little loss of optimality. [link]
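
The connectivity constraint can be illustrated with a short sketch. It is not OLSQ's formulation or API: the function name `violating_gates` and the data layout (gates as pairs of program qubits, a dictionary mapping program qubits to physical qubits, and the device as a list of undirected edges) are assumptions made for illustration. It only checks whether a given mapping satisfies the constraints, which is the condition layout synthesis has to establish.

```python
def violating_gates(gates, mapping, coupling_edges):
    """List the two-qubit gates that the device cannot execute directly.

    gates:          list of (program_qubit_a, program_qubit_b) pairs
    mapping:        dict from program qubit to physical qubit
    coupling_edges: list of (physical_a, physical_b) undirected edges
    """
    # Store edges as frozensets so that (a, b) and (b, a) compare equal.
    allowed = {frozenset(edge) for edge in coupling_edges}
    return [
        (a, b)
        for a, b in gates
        if frozenset((mapping[a], mapping[b])) not in allowed
    ]
```

On a three-qubit line device with edges (0, 1) and (1, 2), the gates (0, 1) and (0, 2) cannot both run under the identity mapping, but placing program qubit 0 on the middle physical qubit resolves the conflict without any SWAP gates.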

HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration

SODA: Stencil with Optimized Dataflow Architecture

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many of the stencil kernels are complex, usually consist of multiple stages or iterations, and often contain redundant computation. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware accelerators. However, implementing such complex kernels efficiently is not trivial, due to...
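
As a point of reference, a stencil kernel updates every cell from a fixed neighborhood of the previous iteration. The sketch below is a plain software version of an iterated 5-point averaging stencil; the names `stencil_step` and `run_stencil`, the averaging weights, and the fixed boundary cells are assumptions chosen for illustration and are unrelated to the dataflow architecture SODA generates.

```python
def stencil_step(grid):
    """Apply one 5-point averaging step; boundary cells are left unchanged."""
    rows, cols = len(grid), len(grid[0])
    # Write into a copy so every update reads the previous iteration only.
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (grid[i][j]
                         + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def run_stencil(grid, iterations):
    """Chain several stencil steps, as in a multi-iteration kernel."""
    for _ in range(iterations):
        grid = stencil_step(grid)
    return grid
```

Neighboring cells read overlapping inputs, which is the data reuse that a hardware implementation tries to exploit.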

QUEKO benchmarks

QUEKO (QUantum Mapping Examples with Known Optimal) benchmarks are a few families of quantum programs, i.e., quantum circuits, that have known optimal depths and gate counts on corresponding quantum devices in layout synthesis for quantum computing.

FlexCNN: End-to-End Optimization of Deep Learning Applications


Tutorial slides:
Click [here]

HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA

Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing


INSIDER is an FPGA-based full-stack in-storage computing system:

Please click the above link for further details.


HeteroCL is a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow, developed through close collaboration between research groups led by Prof. Zhiru Zhang at Cornell and Prof. Jason Cong at UCLA. The HeteroCL DSL provides a clean abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. HeteroCL further captures the interdependence among these techniques, allowing programmers to explore various trade-offs in a systematic and productive manner. In addition,...

Caffeine 2019

Cloud-Scale BWAMEM

Cloud-scale BWAMEM (CS-BWAMEM) is an ultrafast and highly scalable aligner built on top of cloud infrastructures, including Spark and Hadoop distributed file system (HDFS). It leverages the abundant computing resources in a public or private cloud to fully exploit the parallelism obtained from the enormous number of reads. With CS-BWAMEM, the paired-end whole-genome reads (30x) can be aligned within 80 minutes on a 25-node cluster with 300 cores. The features include: 1) support for both paired-end and single-end alignment; 2) quality similar to BWA-MEM; 3) input: FASTQ files; and 4) output...

Microbenchmarks to Characterize Modern CPU-FPGA Platforms

With the rapid evolution of CPU-FPGA heterogeneous acceleration platforms, it is critical for both platform developers and users to quantify the fundamental microarchitectural features of the platforms. We developed a set of microbenchmarks to evaluate mainstream CPU-FPGA platforms.

The first benchmark is dedicated to the Alpha Data card, which connects a CPU with an FPGA via the PCIe interface. The benchmark follows the Xilinx SDAccel programming model, and...

PARADE: Full-System Accelerator-Rich Architecture Simulator

PARADE is a cycle-accurate full-system simulation platform that enables the design and exploration of the emerging accelerator-rich architectures (ARA). It extends the widely used gem5 simulator with high-level synthesis (HLS) support.

CMOST - System-Level FPGA Synthesis

CMOST is a system-level design automation framework for FPGA. The main features are:

  • Analyze and extract system-level information and generate a task-level data model
  • System-level optimizations for parallelism, task mapping and scheduling, pipelined streaming and data organization
    • Module evaluation using high-level synthesis
    • System-level module selection and duplication
    • ...

PolyOpt/HLS: Polyhedral-Based Data Reuse Optimization for FPGA

PolyOpt/HLS is a polyhedral loop optimization framework dedicated to data reuse optimization for High-Level Synthesis, integrated in the ROSE compiler. The main features are:

  • Automatic extraction of regions that can be optimized in the polyhedral model
  • Full support of PoCC (the Polyhedral Compiler Collection) analysis and optimizations
    • Dependence analysis with Candl
    • Program transformations for tiling and parallelism with Pluto
    • Code generation with CLooG
    • Parametric tiling with PTile
    • Data reuse optimization with LMP
    • ...

LEKO and LEKU Suites [GitHub]

(Logic synthesis Examples with Known Optimal/Upper-bounds)

Director : Prof. Jason Cong

Author : Kirill Minkovich

Copyright 2005-2008 the Regents of University of California


PEKO-MS (placement suboptimality benchmarks with parametrized white space)

Open-source repository:

The generating algorithm is described in

xPilot: Platform-based Behavior Synthesis System

The xPilot Team:

  • Professor Jason Cong
  • Researchers: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Bin Liu, Junjuan Xu, Zhiru Zhang

TPEKO Suite (Timing-driven Placement Example with Known Optimal delay)

PEKO Suite (Placement Example with Known Optimal wirelength)

fpgaEva : A Heterogeneous FPGA Evaluation Tool

fpgaEva is a heterogeneous FPGA evaluation tool that incorporates a set of architecture-evaluation features into a user-friendly Java interface. Modern field-programmable gate arrays (FPGAs) provide, in a single device, both a logic array for general logic functions and embedded memory blocks (EMBs) for efficient implementation of on-chip memory and specialized logic functions. In addition, recent generations of FPGAs take advantage of the speed and density benefits resulting from heterogeneous FPGAs, which provide either an array of homogeneous programmable logic blocks (PLBs), each configured...

MCAS: Multi-Cycle Architectural Synthesis System

The MCAS system accepts behavioral C and VHDL, performs aggressive high-level synthesis and optimization coupled with physical planning to optimize design performance, and generates RTL implementations together with physical constraints and timing constraints (e.g., multi-cycle path constraints) which serve as guidelines for the downstream tools. The underlying theme of this research is to raise the design abstraction from RTL to higher-level description without losing the physical reality.

The Team

3-D IC Physical Design and 3-D Architecture Exploration

3-D ICs have recently attracted great interest from researchers and IC designers. Studies demonstrate a potential performance improvement of up to 65% by transferring a placement from 2-D to 3-D and eliminating long interconnects. Furthermore, the multiple device layer structure of 3-D ICs provides a platform to integrate different components, such as digital ICs, analog ICs, memory, RF modules, and different technologies such as SOI, SiGe HBTs, GaAs, etc., into one single circuit stack. Thus, it is a more flexible vehicle for system-on-chip (SoC) and system-in-package (SiP) designs...

CPMO --- Constrained Placement by Multilevel Optimization

Placement is one of the most important steps in the post-RTL synthesis process, as it directly defines the interconnects, which are now the bottleneck in circuit and system performance in deep submicron technologies. The placement problem has been studied extensively in the past 30 years. However, a study from UCLA shows that existing placement solutions are surprisingly far from optimal. Using a set of cleverly constructed circuit placement examples with known optima (PEKO) that match many industrial circuit characteristics, the study shows that the results of leading placement tools from...

RASP: FPGA/CPLD Technology Mapping and Synthesis Package

RASP, an FPGA/CPLD technology mapping and synthesis package, is the synthesis core of the UCLA RASP System developed at UCLA VLSI CAD LAB. This site is actively updated.


RASP team:

  • Jason Cong

  • Deming Chen

  • Eugene Ding

  • Zhijun Huang

  • Yean-Yow Hwang

  • John Peck

  • Chang Wu

  • Songjie Xu

Copyright (C) 1991-2004 the Regents of University of California


Performance Estimation Models for Optimized Interconnects (IPEM)

IPEM provides a set of procedures that estimate performance under interconnect optimization for deep submicron technology. Adopting models derived from several interconnection optimization algorithms of Trio, IPEM is fast, accurate, and easy to link into users' application programs. The results of IPEM match well with the UCLA Trio package.

IPEM team...

mGP - A Multilevel Global Placement Tool


V4R - Multilayer MCM Router


TRIO - Tree, Repeater and Interconnect Optimization Package
