Peipei Zhou
Ph.D. Student in Computer Science at UCLA

VAST Lab
4651 Boelter Hall
Computer Science Department
University of California, Los Angeles, CA 90095

Email: memoryzpp [at] cs.ucla.edu

Google Scholar

Microsoft Academic

LinkedIn



About Me

I am a fourth-year Computer Science Ph.D. student (sixth-year graduate student researcher) at the University of California, Los Angeles, advised by Prof. Jason Cong. I am a member of the VLSI Architecture, Synthesis & Technology (VAST) Laboratory at UCLA. I received my M.S. degree under the supervision of Prof. Cong in June 2014 and my B.S. degree from the Chien-Shiung Wu Honors College of Southeast University in June 2012.

My research interests lie in parallel/distributed architectures and programming, and in performance and energy modeling for computer architecture design. I am involved in research projects on customized computing for precision medicine, deep learning, and accelerator-rich architectures.



What's New

Sept 2017
Presented the Genomic Workload Characterization poster at the Annual Review of CFAR (Center for Future Architectures Research).
June 2017
Started my second internship at Microsoft, on the OXO Machine Learning Team.
November 2016
Presented Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks at ICCAD 2016.
November 2016
Received the ICCAD 2016 HALO travel award.
November 2016
Presented two posters at ICCAD 2016 HALO (Hardware and Algorithms for Learning On-a-chip): Caffeine (ICCAD '16) and one for ISLPED '16 (presenter only).
June 2012
I graduated from Chien-Shiung Wu Honors College of Southeast University.


Research Projects

2016 Jun-Now

Customized Computing for Spark-based GATK.

2016 Jun-Now

Impact of I/O for in-memory computing framework Apache Spark.

2015 Jun-Now

Analytic Model for Energy Efficiency of Pipelining.

2014 Jun-Now

Customized Computing for Precision Medicine [CDSC '14].

2013 Apr-Now

Fully Pipelined and Dynamically Composable Architecture (FPCA) of CGRA.



Publications

Conference Papers

C6

Bandwidth Optimization Through On-Chip Memory Restructuring for HLS DAC '17

Jason Cong, Peng Wei, Cody Hao Yu, Peipei Zhou
54th Annual Design Automation Conference (ACM DAC '17), acceptance rate: x/x = x.x%

High-level synthesis (HLS) is getting increasing attention from both academia and industry for high-quality and high-productivity designs. However, when inferring primitive-type arrays in HLS designs into on-chip memory buffers, commercial HLS tools fail to effectively organize FPGAs’ on-chip BRAM building blocks to realize high-bandwidth data communication; this often leads to suboptimal quality of results. This paper addresses the issue via automated on-chip buffer restructuring. Specifically, we present three buffer restructuring approaches and develop an analytical model for each approach to capture its impact on performance and resource consumption. With the proposed model, we formulate the process of identifying the optimal design choice as an integer non-linear programming (INLP) problem and demonstrate that it can be solved efficiently with the help of a one-time C-to-HDL (hardware description language) synthesis. The experimental results show that our automated source-to-source code transformation tool improves the performance of a broad class of HLS designs by an average of 4.8x.
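The restructuring idea can be illustrated with a toy sketch (my own simplification for intuition, not the paper's actual transformation, model, or code; all names below are hypothetical): cyclically partitioning one buffer into B banks lets B consecutive elements be fetched in the same cycle from distinct banks, which is the kind of on-chip bandwidth increase the paper's tool automates.

```python
# Hypothetical sketch: cyclic partitioning of a buffer into `banks` sub-buffers
# so that several consecutive elements can be read simultaneously, each from a
# distinct bank (one BRAM port per bank).
def cyclic_partition(data, banks):
    """Element i of `data` goes to bank i % banks, at offset i // banks."""
    return [data[b::banks] for b in range(banks)]

def read_parallel(parts, start, width):
    """Read `width` consecutive elements starting at `start`; each access
    lands in a different bank, so all reads can complete in one cycle."""
    banks = len(parts)
    assert width <= banks, "more simultaneous reads than banks"
    return [parts[(start + k) % banks][(start + k) // banks]
            for k in range(width)]

data = list(range(16))
parts = cyclic_partition(data, 4)     # 4 banks -> up to 4 reads per cycle
print(parts[0])                       # bank 0 holds [0, 4, 8, 12]
print(read_parallel(parts, 5, 4))     # [5, 6, 7, 8], one element per bank
```

In an HLS flow this corresponds to splitting one large BRAM-backed array into several smaller ones; the paper's contribution is choosing among restructuring approaches and their factors automatically via the analytical models and the INLP formulation.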
@inproceedings{dac2017BandwidthOptimization,
title={Bandwidth Optimization Through On-Chip Memory Restructuring for HLS},
author={Cong, Jason and Wei, Peng and Yu, Cody Hao and Zhou, Peipei},
booktitle={2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC)},
year={2017}
}
C5

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks ICCAD '16

Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, Jason Cong
36th International Conference on Computer-Aided Design (ACM ICCAD '16), acceptance rate: 97/408 = 23.8%

With the recent advancement of multilayer convolutional neural networks (CNN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine with the goal of maximizing the underlying FPGA computing and bandwidth resource utilization, with a key focus on bandwidth optimization through memory access reorganization not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configurations. Finally, we also integrate Caffeine into the industry-standard deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on a Xilinx KU060 FPGA and 636 GOPS on a Virtex7 690t FPGA; to the best of our knowledge, these are the best published results. We achieve more than 100x speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3x and 43.5x performance and energy gains over Caffe on a 12-core Xeon server, and 1.5x better energy efficiency over a GPU implementation on a medium-sized FPGA (KU060). Performance projections for a system with a high-end FPGA (Virtex7 690t) show even higher gains.
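The "uniformed convolutional matrix-multiplication representation" is in the spirit of the standard im2col lowering: unroll each KxK input window into a row, so convolution becomes a plain matrix multiply with the flattened filter. A minimal pure-Python sketch (hypothetical names, single channel, single filter; not Caffeine's actual implementation):

```python
# im2col lowering: each KxK sliding window of the image becomes one row.
def im2col(image, k):
    h, w = len(image), len(image[0])
    return [[image[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(h - k + 1) for c in range(w - k + 1)]

# Matrix-vector product: each row (one window) dotted with the flat filter.
def matmul_vec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [1, 0,
          0, 1]                  # flattened 2x2 filter (identity diagonal)
cols = im2col(image, 2)          # 4 windows -> 4 rows of length 4
print(matmul_vec(cols, kernel))  # [6, 8, 12, 14]
```

Mapping both convolutional and FCN layers onto one matrix-multiply engine is what lets a single FPGA accelerator serve the entire network.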
@inproceedings{chen2016caffeine,
title={Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks},
author={Zhang, Chen and Fang, Zhenman and Zhou, Peipei and Pan, Peichen and Cong, Jason},
booktitle={International Conference on Computer-Aided Design},
year={2016}
}
C4

Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication FCCM '16

Peipei Zhou, Hyunseok Park, Zhenman Fang, Jason Cong, André DeHon
24th IEEE International Symposium on Field-Programmable Custom Computing Machines (IEEE FCCM '16), acceptance rate: 32/133 = 24%

Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1 due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.
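For intuition, a simple latency model (my own back-of-the-envelope sketch, not the paper's analytic energy model): a pipeline of depth D with initiation interval II completes N iterations in about D + (N - 1) * II cycles, so II directly scales runtime, while the energy side depends on the interconnect cost of sustaining that issue rate.

```python
# Back-of-the-envelope pipeline latency: fill the pipeline once (depth D),
# then issue one new iteration every II cycles.
def pipeline_cycles(n_iters, depth, ii):
    return depth + (n_iters - 1) * ii

N, D = 1024, 10
for ii in (1, 2, 4):
    print(ii, pipeline_cycles(N, D, ii))   # 1033, 2056, 4102 cycles
```

Runtime grows roughly in proportion to II, which is consistent with the paper's observation that II=1 always lands close to minimum energy even though II>1 occasionally saves dynamic energy through interconnect savings.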
@INPROCEEDINGS{ZhouFCCM2016EnergyPipeline,
author={P. Zhou and H. Park and Z. Fang and J. Cong and A. DeHon},
booktitle={2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
title={Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication},
year={2016},
pages={172-175},
keywords={field programmable gate arrays;high level synthesis;matrix multiplication;pipeline arithmetic;power aware computing;FPGA accelerators;HLS;commercial tool flow;dynamic energy reduction;energy efficiency;full pipelining;high-level synthesis;interconnect savings;mapping sources;matrix multiplication;matrix-multiply accelerator;pipeline designs;pipeline initiation interval (II);Analytical models;Energy consumption;Field programmable gate arrays;Kernel;Pipelines;Registers;Wires;Analytic Model;Energy;High-level Synthesis;Initiation Interval;Pipeline},
doi={10.1109/FCCM.2016.50},
month={May},
}
C3

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Poster) FPGA '16

Yu-Ting Chen, Jason Cong, Zhenman Fang, Peipei Zhou
24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ACM/SIGDA FPGA '16)

Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains. In this paper we design and implement the ARAPrototyper to enable rapid design space explorations for ARAs in real silicons and reduce the tedious prototyping efforts. First, ARAPrototyper provides a reusable baseline prototype with a highly customizable memory system, including interconnect between accelerators and buffers, interconnect between buffers and last-level cache (LLC) or DRAM, coherency choice at LLC or DRAM, and address translation support. To provide more insights into performance analysis, ARAPrototyper adds several performance counters on the accelerator side and leverages existing performance counters on the CPU side. Second, ARAPrototyper provides a clean interface to quickly integrate a user's own accelerators written in high-level synthesis (HLS) code. Then, an ARA prototype can be automatically generated and mapped to a Xilinx Zynq SoC. To quickly develop applications that run seamlessly on the ARA prototype, ARAPrototyper provides a system software stack and abstracts the accelerators as software libraries for application developers. Our results demonstrate that ARAPrototyper enables a wide range of design space explorations for ARAs at manageable prototyping efforts and 4,000 to 10,000X faster evaluation time than full-system simulations. We believe that ARAPrototyper can be an attractive alternative for ARA design and evaluation.
@inproceedings{Chen:2016:AER:2847263.2847302,
author = {Chen, Yu-Ting and Cong, Jason and Fang, Zhenman and Zhou, Peipei},
title = {ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Poster)},
booktitle = {Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
series = {FPGA '16},
year = {2016},
isbn = {978-1-4503-3856-1},
location = {Monterey, California, USA},
pages = {281--281},
numpages = {1},
url = {http://doi.acm.org/10.1145/2847263.2847302},
doi = {10.1145/2847263.2847302},
acmid = {2847302},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {accelerator integration, accelerator-rich architecture, customized memory system, fpga prototyping, performance evaluation},
}
C2

CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing (Poster) HITSEQ '15

Yu-Ting Chen, Jason Cong, Jie Lei, Sen Li, Myron Peto, Paul Spellman, Peng Wei, and Peipei Zhou
High Throughput Sequencing, Algorithms & Applications, A SIG of IMSB/ECCB 2015
Best Poster Award
NEWS

Deep-coverage whole-genome sequencing (WGS) can generate billions of reads to be sequenced. It is time consuming for state-of-the-art aligners, such as BWA-MEM, to align this tremendous number of reads onto the reference genome. Inherently, the reads can be aligned using a massively parallel approach, and the alignment process need not be bounded by the limited number of computing cores of a single server. We present cloud-scale BWA-MEM (CS-BWAMEM), an ultrafast and highly scalable aligner built on top of cloud infrastructures. It leverages the abundant computing resources in a public or private cloud to fully exploit the parallelism obtained from the enormous number of reads. With CS-BWAMEM, paired-end whole-genome reads (30x coverage) can be aligned within 80 minutes on a 25-node cluster with 300 cores.
@article{chen2015CSBWAMEM,
title={CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing},
author={Chen, YT and Cong, J and Lei, J and Li, S and Peto, M and Spellman, P and Wei, P and Zhou, P},
journal={High Throughput Sequencing Algorithms and Applications (HITSEQ)},
year={2015}
}
C1

A Fully Pipelined and Dynamically Composable Architecture of CGRA FCCM '14

Jason Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, Peipei Zhou
22nd IEEE International Symposium on Field-Programmable Custom Computing Machines (IEEE FCCM '14)

Future processor chips will not be limited by the transistor resources, but will be mainly constrained by energy efficiency. Reconfigurable fabrics bring higher energy efficiency than CPUs via customized hardware that adapts to user applications. Among different reconfigurable fabrics, coarse-grained reconfigurable arrays (CGRAs) can be even more efficient than fine-grained FPGAs when bit-level customization is not necessary in target applications. CGRAs were originally developed in the era when transistor resources were more critical than energy efficiency. Previous work shares hardware among different operations via modulo scheduling and time multiplexing of processing elements. In this work, we focus on an emerging scenario where transistor resources are rich. We develop a novel CGRA architecture that enables full pipelining and dynamic composition to improve energy efficiency by taking full advantage of abundant transistors. Several new design challenges are solved. We implement a prototype of the proposed architecture in a commodity FPGA chip for verification. Experiments show that our architecture can fully exploit the energy benefits of customization for user applications in the scenario of rich transistor resources.
@inproceedings{Cong2014FCCM,
author = {Cong, Jason and Huang, Hui and Ma, Chiyuan and Xiao, Bingjun and Zhou, Peipei},
title = {A Fully Pipelined and Dynamically Composable Architecture of CGRA},
booktitle = {Proceedings of the 2014 IEEE 22Nd International Symposium on Field-Programmable Custom Computing Machines},
series = {FCCM '14},
year = {2014},
isbn = {978-1-4799-5111-6},
pages = {9--16},
numpages = {8},
url = {http://dx.doi.org/10.1109/.10},
doi = {10.1109/.10},
acmid = {2650333},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
keywords = {CGRA, pipeline, dynamic composition, reconfigurable architecture},
}

Journal Articles

J2

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures arxiv

Yu-Ting Chen, Jason Cong, Zhenman Fang, Bingjun Xiao, Peipei Zhou
arXiv:1610.09761.

Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains and are emerging as one of the most promising solutions in the age of dark silicon. However, many design issues related to the complex interaction between general-purpose cores, accelerators, customized on-chip interconnects, and memory systems remain unclear and difficult to evaluate. In this paper we design and implement the ARAPrototyper to enable rapid design space explorations for ARAs in real silicons and reduce the tedious prototyping efforts far down to manageable efforts. First, ARAPrototyper provides a reusable baseline prototype with a highly customizable memory system, including interconnect between accelerators and buffers, interconnect between buffers and last-level cache (LLC) or DRAM, coherency choice at LLC or DRAM, and address translation support. Second, ARAPrototyper provides a clean interface to quickly integrate users' own accelerators written in high-level synthesis (HLS) code. The whole design flow is highly automated to generate a prototype of ARA on an FPGA system-on-chip (SoC). Third, to quickly develop applications that run seamlessly on the ARA prototype, ARAPrototyper provides a system software stack, abstracts the accelerators as software libraries, and provides APIs for software developers. Our experimental results demonstrate that ARAPrototyper enables a wide range of design space explorations for ARAs at manageable prototyping efforts, which has 4,000X to 10,000X faster evaluation time than full-system simulations. We believe that ARAPrototyper can be an attractive alternative for ARA design and evaluation.

@article{chen2016araprototyper,
title={ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures},
author={Chen, Yu-Ting and Cong, Jason and Fang, Zhenman and Xiao, Bingjun and Zhou, Peipei},
journal={arXiv preprint arXiv:1610.09761},
year={2016}
}

J1

A Fully Pipelined and Dynamically Composable Architecture of CGRA (Coarse Grained Reconfigurable Architecture) Master Thesis

Peipei Zhou
Master's Thesis Committee: Prof. Jason Cong, Prof. Milos Ercegovac, Prof. Dejan Markovic

Future processors will not be limited by transistor resources, but will be mainly constrained by energy efficiency. Reconfigurable architectures offer higher energy efficiency than CPUs through customized hardware, and more flexibility than ASICs. FPGAs allow configurability at the bit level to retain both efficiency and flexibility. However, many computation-intensive applications only need word-level customization, which inspires coarse-grained reconfigurable arrays (CGRAs) to raise configurability to the word level, reduce configuration information, and enable on-the-fly customization. Traditional CGRAs were designed in an era when transistor resources were scarce. Previous work in CGRAs shares hardware resources among different operations via modulo scheduling and time-multiplexing of processing elements. In the emerging scenario where transistor resources are rich, we develop a novel CGRA architecture that features full pipelining and dynamic composition to improve energy efficiency, and we implement a prototype on a Xilinx Virtex-6 FPGA board. Experiments show that the fully pipelined and dynamically composable architecture (FPCA) can exploit the energy benefits of customization for user applications when transistor resources are rich.

@mastersthesis{zhou2014fully,
title={A Fully Pipelined and Dynamically Composable Architecture of CGRA (Coarse Grained Reconfigurable Architecture)},
url = {https://escholarship.org/uc/item/9446s3nx},
author={Zhou, Peipei},
school={University of California, Los Angeles},
year={2014}
}



Career

June-Sept 2017
Software Engineer Intern at Microsoft, Microsoft headquarters, Redmond, Washington.

I worked on the Microsoft Office Machine Learning Team, directed by Robert Rounthwaite; my mentor was Vincent Etter. I mined large-scale data from Wikipedia edits and built a neural sequence-to-sequence (seq2seq) model. Reference.

June-Sept 2014
Research Intern at Microsoft Research, Microsoft headquarters, Redmond, Washington.

I worked in the Microsoft computer architecture group directed by Dr. Doug Burger; my mentor was Dr. Joo-Young Kim, and together we worked on an image compression pipeline. I implemented the advanced compression algorithm in C++ (software reference code) and in RTL Verilog for Catapult FPGAs in data centers. Reference

July-Sept 2011
Research Intern at Honeywell Automation & Control, Nanjing, China.



Teaching

EE110L
Circuit Measurements Laboratory 2013 Summer (Discussion 1A, 1B)
EE110
Circuit Analysis II 2013 Fall, 2014 Spring (Discussion 1A)



Paper Reading

Blog
PhD Paper Reading


Awards

November 2016
ICCAD HALO 2016 Travel Award
June 2011
Honeywell Innovator Scholarship, one of five recipients in Mainland China.
December 2009
Principal Scholarship, Southeast University
