|Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication||
Serpens is a high bandwidth memory (HBM) based accelerator for general-purpose sparse matrix-vector multiplication (SpMV). The accelerator is built on a Xilinx Alveo U280 card. Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS).
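For readers unfamiliar with the kernel, the following is a minimal software sketch of SpMV over a matrix in compressed sparse row (CSR) form. It only illustrates the computation being accelerated; the function name `spmv_csr`, the CSR layout, and the example matrix are assumptions made for illustration and do not describe how Serpens stores or streams data.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[r]..row_ptr[r+1] delimits the nonzeros of row r;
    col_idx and values hold their column indices and values.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # One multiply and one add per stored nonzero.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative 3x3 matrix (not from the Serpens evaluation):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

The inner loop reads `x` at irregular positions given by `col_idx`, which is why memory bandwidth tends to dominate SpMV performance.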
|Pyxis: An Open-Source Performance Dataset of Sparse Accelerators||
Pyxis collects open-source accelerator designs and their performance data.
|Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication||
Sextans is an FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM).
|AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators||
Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but FPGAs are hard to program, which creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually restructure code and perform cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient...
|OLSQ: Optimal Layout Synthesis for Quantum Computing||
Many quantum computers have constraints on the connections between qubits. However, a quantum program may not conform to these constraints. Thus, it is necessary to perform layout synthesis for quantum computing (LSQC), which transforms quantum programs prior to execution so that the connectivity issues are resolved. OLSQ can solve LSQC optimally with respect to depth, number of SWAP gates, or fidelity. There is also a transition-based (TB) mode that speeds it up with little loss of optimality.
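The connectivity constraint described above can be illustrated with a small check. This is a sketch under stated assumptions, not OLSQ's solver or API: the helper `violating_gates`, the line-topology device, and the identity mapping are all hypothetical. It only reports which two-qubit gates would need to be resolved by layout synthesis for a given qubit mapping.

```python
def violating_gates(gates, coupling_edges, mapping):
    """Return the two-qubit gates whose mapped physical qubits are not adjacent.

    gates: list of (logical_q0, logical_q1) pairs
    coupling_edges: list of (physical_a, physical_b) undirected device edges
    mapping: dict from logical qubit to physical qubit
    """
    allowed = {frozenset(edge) for edge in coupling_edges}
    bad = []
    for q0, q1 in gates:
        if frozenset((mapping[q0], mapping[q1])) not in allowed:
            bad.append((q0, q1))
    return bad


# Hypothetical 4-qubit device with a line topology: 0 - 1 - 2 - 3
coupling_edges = [(0, 1), (1, 2), (2, 3)]
# Hypothetical program: three two-qubit gates on logical qubits
gates = [(0, 1), (1, 2), (0, 3)]
# Identity mapping from logical to physical qubits
mapping = {0: 0, 1: 1, 2: 2, 3: 3}

bad = violating_gates(gates, coupling_edges, mapping)
```

Here the gate on qubits 0 and 3 violates the constraint, so a layout synthesizer would have to pick a different mapping or insert SWAP gates before that gate can run.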
We are excited that Xilinx has decided to open-source the Merlin compiler under the Apache license. The Merlin compiler was originally developed by Falcon Computing Solutions, a spin-off from the VAST Lab that was acquired by Xilinx in 2020. Multiple research projects in the VAST Lab, such as [S2FA], [HeteroCL], and...
|AutoSA: Polyhedral-Based Systolic Array Auto-Compilation|
|AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs|
|Extending High-Level Synthesis for Task-Parallel Programs||
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly...
|HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration|
|SODA: Stencil with Optimized Dataflow Architecture||
Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many stencil kernels are complex, usually consist of multiple stages or iterations, and often contain redundant computation. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware accelerators. However, implementing such complex kernels efficiently is not trivial, due to...
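As a concrete picture of what a stencil kernel computes, here is a minimal sketch of one iteration of a 5-point averaging stencil on a 2-D grid. It is a plain software illustration, not SODA's dataflow architecture; the function name `jacobi_5pt` and the choice to leave boundary cells unchanged are assumptions made for this example.

```python
def jacobi_5pt(grid):
    """Apply one iteration of a 5-point averaging stencil.

    Each interior cell becomes the mean of itself and its four
    neighbors; boundary cells are copied through unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i][j]
                + grid[i - 1][j]
                + grid[i + 1][j]
                + grid[i][j - 1]
                + grid[i][j + 1]
            ) / 5.0
    return out


# Illustrative 3x3 grid with a single nonzero value in the center.
grid = [
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],
]
result = jacobi_5pt(grid)
```

Neighboring output cells read overlapping input cells, which is the data reuse that dedicated stencil hardware tries to exploit.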
QUantum Mapping Examples with Known Optimal (QUEKO) is a collection of families of quantum programs, i.e., quantum circuits, that have known optimal depths and gate counts on the corresponding quantum devices in layout synthesis for quantum computing.
|FlexCNN: End-to-End Optimization of Deep Learning Applications||
|HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA|
|Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing|
INSIDER is an FPGA-based full-stack in-storage computing system:
HeteroCL is a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow through close collaboration by research groups led by Prof. Zhiru Zhang at Cornell and Prof. Jason Cong at UCLA. The HeteroCL DSL provides a clean abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. HeteroCL further captures the interdependence among these techniques, allowing programmers to explore various trade-offs in a systematic and productive manner. In addition,...
Cloud-scale BWAMEM (CS-BWAMEM) is an ultrafast and highly scalable aligner built on top of cloud infrastructures, including Spark and the Hadoop distributed file system (HDFS). It leverages the abundant computing resources in a public or private cloud to fully exploit the parallelism obtained from the enormous number of reads. With CS-BWAMEM, paired-end whole-genome reads (30x) can be aligned within 80 minutes on a 25-node cluster with 300 cores. The features include: 1) support for both paired-end and single-end alignment; 2) quality similar to BWA-MEM; 3) input: FASTQ files; and 4) output...
|Microbenchmarks to Characterize Modern CPU-FPGA Platforms||
With the rapid evolution of CPU-FPGA heterogeneous acceleration platforms, it is critical for both platform developers and users to quantify the fundamental microarchitectural features of the platforms. We developed a set of microbenchmarks to evaluate mainstream CPU-FPGA platforms.
The first benchmark (https://github.com/peterpengwei/Microbench_AlphaData) is dedicated to the Alpha Data card which connects a CPU with an FPGA via the PCIe interface. The benchmark follows the Xilinx SDAccel programming model, and...
|PARADE: Full-System Accelerator-Rich Architecture Simulator||
PARADE is a cycle-accurate full-system simulation platform that enables the design and exploration of the emerging accelerator-rich architectures (ARA). It extends the widely used gem5 simulator with high-level synthesis (HLS) support.
|CMOST - System-Level FPGA Synthesis||
CMOST is a system-level design automation framework for FPGA. The main features are:
|PolyOpt/HLS: Polyhedral-Based Data Reuse Optimization for FPGA||
PolyOpt/HLS is a polyhedral loop optimization framework dedicated to data reuse optimization for High-Level Synthesis, integrated in the ROSE compiler. The main features are:
|xPilot: Platform-based Behavior Synthesis System||
The xPilot Team:
|fpgaEva : A Heterogeneous FPGA Evaluation Tool||
fpgaEva is a heterogeneous FPGA evaluation tool that incorporates a set of architecture evaluation features into a user-friendly Java interface. Modern field programmable gate arrays (FPGAs) provide in a single device both a logic array for general logic functions and embedded memory blocks (EMBs) for efficient implementation of on-chip memory and specialized logic functions. In addition, recent generations of FPGAs take advantage of the speed and density benefits resulting from heterogeneous FPGAs, which provide either an array of homogeneous programmable logic blocks (PLBs), each configured...
|MCAS: Multi-Cycle Architectural Synthesis System||
The MCAS system accepts behavioral C and VHDL, performs aggressive high-level synthesis and optimization coupled with physical planning to optimize design performance, and generates RTL implementations together with physical constraints and timing constraints (e.g., multi-cycle path constraints) which serve as guidelines for the downstream tools. The underlying theme of this research is to raise the design abstraction from RTL to higher-level description without losing the physical reality.
|RASP: FPGA/CPLD Technology Mapping and Synthesis Package||
RASP, an FPGA/CPLD technology mapping and synthesis package, is the synthesis core of the UCLA RASP System developed at UCLA VLSI CAD LAB. This site is actively updated.
Copyright (C) 1991-2004 the Regents of University of California
|Performance Estimation Models for Optimized Interconnects (IPEM)||
IPEM provides a set of procedures that estimate performance under interconnect optimization for deep submicron technology. Adopting models derived from several interconnect optimization algorithms of Trio, IPEM is fast, accurate, and easy to link into a user's application programs. The results of IPEM match well with those of the UCLA Trio package.
|3-D IC Physical Design and 3-D Architecture Exploration||
3-D ICs have recently attracted great interest from researchers and IC designers. Studies demonstrate a potential performance improvement of up to 65% by transferring a placement from 2-D to 3-D and eliminating long interconnects. Furthermore, the multiple device layer structure of 3-D ICs provides a platform to integrate different components, such as digital ICs, analog ICs, memory, RF modules, and different technologies such as SOI, SiGe HBTs, GaAs, etc., into one single circuit stack. Thus, it is a more flexible vehicle for system-on-chip (SoC) and system-in-package (SiP) designs...
|CPMO --- Constrained Placement by Multilevel Optimization||
Placement is one of the most important steps in the post-RTL synthesis process, as it directly defines the interconnects, which are now the bottleneck in circuit and system performance in deep submicron technologies. The placement problem has been studied extensively in the past 30 years. However, a study from UCLA shows that existing placement solutions are surprisingly far from optimal. Using a set of cleverly constructed circuit placement examples with known optima (PEKO) that match many industrial circuit characteristics, the study shows that the results of leading placement tools from...
|V4R - Multilayer MCM Router|
|TRIO - Tree, Repeater and Interconnect Optimization Package|
|mGP - A Multilevel Global Placement Tool||