FPGA Implementation of Reduced Precision Convolutional Neural Networks

FPGA Implementation of Reduced Precision Convolutional Neural Networks PDF Author: Muhammad Mohid Nabil
Publisher:
ISBN:
Category : Convolutions (Mathematics)
Languages : en
Pages :

Book Description
With the improvement in processing systems, machine learning applications are finding widespread use in almost all sectors of technology. Image recognition is one application of machine learning that has become widely popular, with various architectures and systems aimed at improving recognition performance. With classification accuracy now approaching saturation, many researchers are focusing on resource and energy efficiency. With the increased demand for learning applications in embedded devices, it is of paramount importance to optimize power and energy consumption to increase their utility in low-power embedded systems. Recently, reduced-precision neural networks have caught the attention of researchers. Reduced-data-width deep nets offer the potential of saving valuable resources on hardware platforms. In turn, hardware platforms such as Field Programmable Gate Arrays (FPGAs) offer the potential of a low-power system whose massive parallelism increases throughput and performance. In this research, we explore implementations of a deep learning architecture on FPGAs in the presence of resource and energy constraints. We study reduced-precision neural networks and implement one such architecture as a proof of concept, focusing on binarized convolutional neural networks and their implementation on FPGAs. Binarized convolutional nets have displayed classification accuracies of up to 88% on smaller image sets such as CIFAR-10, and this number is rising with newer architectures. We study the tradeoff between architecture depth and accuracy to better understand the convolutional layers and their impact on overall performance. This is done from a hardware perspective, giving us insight that enables better resource allocation on the FPGA fabric. A Zynq ZCU102 board is used for the accelerator implementation, and Xilinx's high-level synthesis tool (Vivado HLS) is used to define the CNN on the FPGA fabric.
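
For context on why binarization saves fabric resources: with weights and activations constrained to ±1 and packed into machine words, each multiply-accumulate collapses to an XNOR plus a population count. The C++ sketch below is a plain software model of that reduction (not synthesizable HLS code); the word width, packing scheme, and folded batch-norm threshold are illustrative assumptions, not details taken from the thesis.

```cpp
#include <cstdint>

// Dot product of two 64-element {-1,+1} vectors, each packed one bit per
// element (bit = 1 encodes +1). Products are +1 exactly where bits match,
// so a multiply becomes XNOR and the accumulation becomes a popcount.
inline int binary_dot64(uint64_t a, uint64_t w) {
    int matches = __builtin_popcountll(~(a ^ w)); // bits where a == w
    return 2 * matches - 64;                      // matches - (64 - matches)
}

// One output pixel of a binarized convolution over a packed input window.
// The batch-norm and bias are assumed folded into a single integer threshold.
int binarized_conv_pixel(const uint64_t *window, const uint64_t *weights,
                         int words, int threshold) {
    int acc = 0;
    for (int i = 0; i < words; ++i)
        acc += binary_dot64(window[i], weights[i]);
    return acc >= threshold ? +1 : -1;            // binarized activation
}
```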

Caffeinated FPGAs

Caffeinated FPGAs PDF Author: Roberto DiCecco
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
This thesis presents a framework for performing training and inference of Convolutional Neural Networks (CNNs) with reduced-precision floating-point arithmetic. This work aims to provide a means for FPGA and machine learning researchers to use the customizability of FPGAs to explore the precision requirements of training CNNs with an open-source framework. This is accomplished through the creation of a High-Level Synthesis library with a Custom Precision Floating-Point data type that is configurable in both exponent and mantissa widths, with several standard operators and rounding modes supported. With this library, an FPGA CNN Training Engine (FCTE) has been created, along with an FPGA CNN framework, FPGA Caffe, which is built on Caffe. FCTE has a peak performance of approximately 350 GFLOPS and has been used to show that a mantissa width of 5 and an exponent width of 6 are sufficient for training several models targeting the MNIST and CIFAR-10 datasets.
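
As a rough software model of what such a configurable type does to a value, here is a hedged C++ sketch of round-to-nearest quantization under a given exponent/mantissa bit budget. It is an illustrative assumption for exposition, not FPGA Caffe's actual HLS data type (which also supports multiple rounding modes and operators); subnormals are flushed to zero for brevity.

```cpp
#include <cmath>

// Quantize an IEEE float to a custom-precision float with exp_bits exponent
// bits and mant_bits mantissa bits, rounding to nearest. Overflow saturates
// to infinity; values below the smallest normal are flushed to zero.
float quantize_custom_fp(float x, int exp_bits, int mant_bits) {
    if (x == 0.0f || !std::isfinite(x)) return x;
    const int bias = (1 << (exp_bits - 1)) - 1;
    int e;
    float m = std::frexp(std::fabs(x), &e);   // |x| = m * 2^e, m in [0.5, 1)
    e -= 1; m *= 2.0f;                         // normalize so m is in [1, 2)
    if (e < 1 - bias) return std::copysign(0.0f, x);     // underflow
    const float scale = std::ldexp(1.0f, mant_bits);     // 2^mant_bits
    m = std::nearbyint(m * scale) / scale;               // round mantissa
    if (m >= 2.0f) { m = 1.0f; ++e; }          // rounding carried into exponent
    if (e > bias) return std::copysign(INFINITY, x);     // overflow
    return std::copysign(std::ldexp(m, e), x);
}
```

With exp_bits = 6 and mant_bits = 5, the widths the thesis reports as sufficient for MNIST and CIFAR-10 training, quantize_custom_fp(x, 6, 5) returns the nearest representable value of x under that format.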

In-situ Implementation and Training of Convolutional Neural Network on FPGAs

In-situ Implementation and Training of Convolutional Neural Network on FPGAs PDF Author: Akshay Raju Krishnani
Publisher:
ISBN:
Category : Field programmable gate arrays
Languages : en
Pages :

Book Description
The main objective of this thesis is to investigate the efficiency of in-situ trainable Convolutional Neural Networks (CNNs) on modern programmable System-on-Chip (SoC) Field Programmable Gate Arrays (FPGAs) composed of embedded processors and reconfigurable fabric, and to study the robustness of the system when faults occur. One particular characteristic of this work is that the CNN is developed exclusively using High-Level Synthesis (HLS), specifically in SystemC, from which Verilog code is generated. In this thesis, the feature maps are also trained on the FPGA, a step that is traditionally done offline. The CNN architecture is instantiated on the FPGA, and the weights are trained through a software model on the ARM processor embedded in the FPGA and updated in the architecture through the AXI bus interface. Moreover, since the CNN is implemented in hardware, the resources used need to be minimized. This allows a smaller and cheaper FPGA to be chosen and reduces total power consumption. To address this, the effect of bitwidth reduction on the CNN is investigated with respect to handwritten character recognition accuracy. Finally, the robustness of the CNN is analyzed by breaking internal connections of different neurons and studying how the accuracy drops when faults occur at different layers. If the accuracy is reduced, the CNN is re-trained in-situ to restore it.
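
A bitwidth-reduction study of this kind amounts to re-quantizing the trained weights at successively narrower signed fixed-point formats and re-measuring recognition accuracy. The C++ sketch below shows one such quantizer; the format split and function name are illustrative assumptions, not the thesis's actual code.

```cpp
#include <algorithm>
#include <cmath>

// Quantize a weight to signed fixed-point with `total_bits` bits in total,
// `frac_bits` of them fractional, using round-to-nearest with saturation.
// Returns the dequantized float so accuracy can be re-measured in software.
float to_fixed_point(float w, int total_bits, int frac_bits) {
    const float scale = std::ldexp(1.0f, frac_bits);     // 2^frac_bits
    long long q = std::llround(w * scale);
    const long long lo = -(1LL << (total_bits - 1));     // e.g. -128 for 8 bits
    const long long hi =  (1LL << (total_bits - 1)) - 1; // e.g. +127 for 8 bits
    return static_cast<float>(std::clamp(q, lo, hi)) / scale;
}
```

Sweeping total_bits downward, say from 16 to 4, while re-measuring accuracy reproduces the kind of bitwidth-vs-accuracy curve described above; in the in-situ loop, the ARM core would apply such a quantizer before writing updated weights to the fabric over the AXI bus.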

FPGA Implementations of Neural Networks

FPGA Implementations of Neural Networks PDF Author: Amos R. Omondi
Publisher: Springer Science & Business Media
ISBN: 0387284877
Category : Technology & Engineering
Languages : en
Pages : 365

Book Description
During the 1980s and early 1990s there was significant work in the design and implementation of hardware neurocomputers. Nevertheless, most of these efforts may be judged to have been unsuccessful: at no time have hardware neurocomputers been in wide use. This lack of success may be largely attributed to the fact that earlier work was almost entirely aimed at developing custom neurocomputers, based on ASIC technology, but for such niche areas this technology was never sufficiently developed or competitive enough to justify large-scale adoption. On the other hand, gate-arrays of the period mentioned were never large enough nor fast enough for serious artificial-neural-network (ANN) applications. But technology has now improved: the capacity and performance of current FPGAs are such that they present a much more realistic alternative. Consequently neurocomputers based on FPGAs are now a much more practical proposition than they have been in the past. This book summarizes some work towards this goal and consists of 12 papers that were selected, after review, from a number of submissions. The book is nominally divided into three parts: Chapters 1 through 4 deal with foundational issues; Chapters 5 through 11 deal with a variety of implementations; and Chapter 12 looks at the lessons learned from a large-scale project and also reconsiders design issues in light of current and future technology.

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays PDF Author: Jonathan Greene
Publisher:
ISBN: 9781450343541
Category :
Languages : en
Pages :

Book Description
FPGA '17: The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb 22-24, 2017, Monterey, USA. You can view more information about this proceeding and all of ACM's other published conference proceedings in the ACM Digital Library: http://www.acm.org/dl.

FPGA Implementations of Neural Networks

FPGA Implementations of Neural Networks PDF Author: Amos R. Omondi
Publisher: Springer
ISBN: 9781441939425
Category : Technology & Engineering
Languages : en
Pages : 0

Book Description
During the 1980s and early 1990s there was significant work in the design and implementation of hardware neurocomputers. Nevertheless, most of these efforts may be judged to have been unsuccessful: at no time have hardware neurocomputers been in wide use. This lack of success may be largely attributed to the fact that earlier work was almost entirely aimed at developing custom neurocomputers, based on ASIC technology, but for such niche areas this technology was never sufficiently developed or competitive enough to justify large-scale adoption. On the other hand, gate-arrays of the period mentioned were never large enough nor fast enough for serious artificial-neural-network (ANN) applications. But technology has now improved: the capacity and performance of current FPGAs are such that they present a much more realistic alternative. Consequently neurocomputers based on FPGAs are now a much more practical proposition than they have been in the past. This book summarizes some work towards this goal and consists of 12 papers that were selected, after review, from a number of submissions. The book is nominally divided into three parts: Chapters 1 through 4 deal with foundational issues; Chapters 5 through 11 deal with a variety of implementations; and Chapter 12 looks at the lessons learned from a large-scale project and also reconsiders design issues in light of current and future technology.

Software and Hardware Co-optimization for Deep Learning Algorithms on FPGA

Software and Hardware Co-optimization for Deep Learning Algorithms on FPGA PDF Author: Chen Wu
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Over recent years, deep learning paradigms such as convolutional neural networks (CNNs) have shown great success in various families of tasks, including object detection and autonomous driving. To extend such success to non-Euclidean data, graph convolutional networks (GCNs) have been introduced and have quickly attracted industrial and academic attention as a popular solution to real-world problems. However, both CNNs and GCNs often have huge computation and memory complexity, which calls for specific hardware architectures to accelerate these algorithms. In this dissertation, we propose several architectures to accelerate CNNs and GCNs on FPGA platforms.

We start from the domain-specific FPGA-overlay processor (OPU) for commonly used CNNs such as VGG, Inception, ResNet, and YoloV2. The data is first quantized to 8-bit fixed-point with little accuracy loss to reduce computation complexity and memory requirements. A fully pipelined dataflow architecture is proposed to accelerate the typical layers in CNNs (i.e., convolutional, pooling, residual, inception, and activation layers). Experimental results show that OPU is 9.6× faster than the Jetson TX2 GPU on a cascade of three CNNs used for a curbside parking system.

However, 8-bit fixed-point representations always need re-training to maintain accuracy for deep CNNs. We therefore propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome this limitation. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder (MAC) and one 3-bit adder. We can therefore implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx UltraScale/UltraScale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput by 1.5× over existing FPGA accelerators. In particular, for VGG16 and Yolo, compared with seven FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.

CNNs quantized with mixed precision, on the other hand, benefit from low precision while maintaining accuracy. To better leverage the advantages of mixed precision, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) for both conventional and lightweight CNNs. The micro-architecture of MP-OPU shares its computation core between mixed-precision weights and activations to improve computation efficiency. In addition, run-time scheduling of external memory access and data arrangement are optimized to further leverage the advantages of mixed-precision data representation. Our experimental results show that MP-OPU reaches 4.92 TOPS peak throughput when implemented on a Xilinx VC709 FPGA (with all DSPs configured to support 2-bit multipliers). Moreover, MP-OPU achieves a 12.9× latency reduction and 2.2× better throughput per DSP for conventional CNNs, and a 7.6× latency reduction and 2.9× better throughput per DSP for lightweight CNNs, on average compared with existing FPGA accelerators/processors.
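
Returning to the LPFP scheme above: its key step is deciding how to split the 7 non-sign bits of an 8-bit float between exponent and mantissa so that quantization error on the already-trained weights is minimized, with no re-training. The C++ sketch below brute-forces that choice; it is an illustrative assumption about the idea, not the dissertation's implementation, and it ignores subnormals.

```cpp
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Round x to a float with eb exponent bits and mb mantissa bits
// (round-to-nearest; saturate on overflow, flush to zero on underflow).
static float lpfp_quantize(float x, int eb, int mb) {
    if (x == 0.0f) return 0.0f;
    const int bias = (1 << (eb - 1)) - 1;
    int e;
    float m = std::frexp(std::fabs(x), &e);   // |x| = m * 2^e, m in [0.5, 1)
    e -= 1; m *= 2.0f;                         // normalize so m is in [1, 2)
    if (e < 1 - bias) return 0.0f;             // underflow: flush to zero
    const float s = std::ldexp(1.0f, mb);
    m = std::nearbyint(m * s) / s;             // round the mantissa
    if (m >= 2.0f) { m = 1.0f; ++e; }          // rounding carried into exponent
    if (e > bias) { e = bias; m = 2.0f - 1.0f / s; } // saturate to max finite
    return std::copysign(std::ldexp(m, e), x);
}

// Try every exponent/mantissa split of the 7 non-sign bits and return the
// (exponent_bits, mantissa_bits) pair with the lowest squared error.
std::pair<int, int> best_lpfp_split(const std::vector<float>& weights) {
    std::pair<int, int> best{5, 2};
    double best_err = std::numeric_limits<double>::max();
    for (int eb = 1; eb <= 7; ++eb) {
        const int mb = 7 - eb;
        double err = 0.0;
        for (float w : weights) {
            const double d = w - lpfp_quantize(w, eb, mb);
            err += d * d;
        }
        if (err < best_err) { best_err = err; best = {eb, mb}; }
    }
    return best;
}
```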
Graph convolutional networks (GCNs) have been introduced to effectively process non-Euclidean graph data. However, GCNs incur a large amount of irregularity in computation and memory access, which prevents efficient use of previous CNN accelerators/processors. We therefore propose a lightweight FPGA-based accelerator, named LW-GCN, to tackle the irregular computation and memory access of GCN inference. We first decompose the main GCN operations into Sparse Matrix-Matrix Multiplication (SpMM) and Matrix-Matrix Multiplication (MM). Thereafter, we propose a novel compression format to balance workload across PEs and prevent data hazards. In addition, we quantize the data to 16-bit fixed-point, apply workload tiling, and map both SpMM and MM onto a uniform architecture on resource-limited devices. Evaluations of GCN and GraphSAGE are performed on a Xilinx Kintex-7 FPGA with three popular datasets. Compared with existing CPU and GPU implementations and a state-of-the-art FPGA-based accelerator, LW-GCN reduces latency by up to 60×, 12× and 1.7× and increases power efficiency by up to 912×, 511× and 3.87×, respectively. Moreover, compared with Nvidia's latest edge GPU, the Jetson Xavier NX, LW-GCN achieves a speedup and energy savings of 32× and 84×, respectively.

Finally, we extend our GCN inference accelerator to a GCN training accelerator, called SkeletonGCN. To better fit the properties of GCN training, we add more software-hardware co-optimizations. First, we simplify the non-linear operations in GCN training to better fit FPGA computation and identify reusable intermediate results to eliminate redundant computation. Second, we optimize the previous compression format to further reduce memory bandwidth while allowing efficient decompression in hardware. Third, we propose a unified architecture to support SpMM, MM, and MM with transpose, all on the same group of PEs, to increase DSP utilization on the FPGA. Evaluations are performed on a Xilinx Alveo U200 board. Compared with an existing FPGA-based accelerator on the same network architecture, SkeletonGCN achieves up to 11.3× speedup while maintaining the same training accuracy with a 16-bit fixed-point data representation. In addition, SkeletonGCN is 178× and 13.1× faster than state-of-the-art CPU and GPU implementations on popular datasets, respectively.

To summarize, we have been working on FPGA-based acceleration of deep learning algorithms, covering both inference and training for CNNs and GCNs. All the accelerators/processors were hand-coded and have been fully verified, and the related tool chains for generating golden results and running instructions for the accelerators/processors have also been completed.
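
The GCN accelerators above hinge on decomposing inference into SpMM and MM. As a point of reference, here is a plain C++ SpMM over a CSR matrix; CSR stands in for the dissertation's custom compression format (which additionally balances PE workload and prevents data hazards), so treat this as an illustrative baseline rather than LW-GCN's scheme.

```cpp
#include <vector>

// Compressed Sparse Row storage for the sparse (adjacency-like) operand.
struct CSR {
    std::vector<int>   row_ptr;  // size rows + 1; nonzero extent of each row
    std::vector<int>   col_idx;  // column index of each nonzero
    std::vector<float> val;      // value of each nonzero
};

// C (rows x n) = A_sparse (rows x k) * B_dense (k x n); B and C row-major.
void spmm(const CSR& A, const std::vector<float>& B,
          std::vector<float>& C, int rows, int n) {
    C.assign(static_cast<size_t>(rows) * n, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int p = A.row_ptr[r]; p < A.row_ptr[r + 1]; ++p) {
            const float a = A.val[p];                 // nonzero of A
            const float* b = &B[static_cast<size_t>(A.col_idx[p]) * n];
            for (int c = 0; c < n; ++c)               // scale-and-add B row
                C[static_cast<size_t>(r) * n + c] += a * b[c];
        }
}
```

Row-major access to B keeps the inner loop sequential, which is also what makes the dense half of the decomposition (MM) mappable onto the same group of PEs.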

Efficient Processing of Deep Neural Networks

Efficient Processing of Deep Neural Networks PDF Author: Vivienne Sze
Publisher: Springer Nature
ISBN: 3031017668
Category : Technology & Engineering
Languages : en
Pages : 254

Book Description
This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics—such as energy-efficiency, throughput, and latency—without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.

FPGA Logic Block Architectures for Efficient Deep Learning Inference

FPGA Logic Block Architectures for Efficient Deep Learning Inference PDF Author: Mohamed Eldafrawy
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Reducing the precision of deep neural networks can yield large efficiency gains with little or no accuracy degradation compared to single-precision floating-point representation. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, making the variable-precision capabilities of field-programmable gate arrays (FPGAs) very valuable. This thesis proposes six FPGA logic block architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the look-up table fracturability and adding two adders to the adaptive logic module leads to a 1.5x area reduction for machine learning (ML) kernels and increases their speed, while simultaneously reducing the area of general applications by 6%. On the other hand, adding a 9-bit shadow multiplier to the logic blocks reduces ML kernels' area by 2.4x and their critical path delay by 1.4x, but increases the area of general applications by 15%.
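
Whether a shadow multiplier pays off therefore depends on how much of a design is ML: the 2.4x kernel shrink must outweigh the 15% general-logic growth. A back-of-the-envelope C++ model, under an assumed 50/50 workload mix (the mix is an assumption, not a figure from the thesis):

```cpp
#include <cstdio>

int main() {
    const double ml_fraction = 0.5;                  // assumed workload mix
    const double ml_area  = ml_fraction / 2.4;       // ML kernels shrink 2.4x
    const double gen_area = (1.0 - ml_fraction) * 1.15; // general logic +15%
    std::printf("relative area: %.2f (vs 1.00 baseline)\n", ml_area + gen_area);
    return 0;
}
```

Under this assumed mix the model gives roughly 0.78 of the baseline area, while a design with no ML content pays the full 15% penalty.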

Design and Implementation of Binarized and Ternarized Convolutional Neural Networks on FPGA.

Design and Implementation of Binarized and Ternarized Convolutional Neural Networks on FPGA. PDF Author:
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description