FPGA Logic Block Architectures for Efficient Deep Learning Inference

FPGA Logic Block Architectures for Efficient Deep Learning Inference PDF Author: Mohamed Eldafrawy
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Reducing the precision of deep neural networks can yield large efficiency gains with little or no accuracy degradation compared to single-precision floating-point representation. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, making the variable-precision capabilities of field-programmable gate arrays (FPGAs) very valuable. This thesis proposes six FPGA logic block architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the look-up table fracturability and adding two adders to the adaptive logic module leads to a 1.5x area reduction for machine learning (ML) kernels and increases their speed, while simultaneously reducing the area of general applications by 6%. On the other hand, adding a 9-bit shadow multiplier to the logic blocks reduces ML kernels' area by 2.4x and critical path delay by 1.4x, but increases the area of general applications by 15%.
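
To make these area figures concrete, the sketch below estimates the total soft-fabric area of a mixed design under each variant, using the factors quoted above; the 60/40 ML-versus-general workload split is an assumed illustrative number, not a result from the thesis.

    # Back-of-the-envelope estimate of total soft-fabric area under the two
    # proposed logic block variants, using the factors quoted in the abstract.
    # The 60/40 ML-vs-general split is an assumed, illustrative workload mix.

    def relative_area(ml_fraction, ml_factor, general_factor):
        """Area of a mixed design relative to the baseline architecture.

        ml_factor / general_factor are per-portion area ratios of the new
        architecture vs. the baseline (e.g. 1/1.5 means 1.5x smaller).
        """
        return ml_fraction * ml_factor + (1.0 - ml_fraction) * general_factor

    ml_fraction = 0.6  # assumed share of the design occupied by ML kernels

    # Variant 1: more fracturable LUTs plus two extra adders per ALM:
    #   ML kernels shrink 1.5x, general logic shrinks 6%.
    v1 = relative_area(ml_fraction, 1 / 1.5, 0.94)

    # Variant 2: 9-bit shadow multiplier in each logic block:
    #   ML kernels shrink 2.4x, general logic grows 15%.
    v2 = relative_area(ml_fraction, 1 / 2.4, 1.15)

    print(f"variant 1 relative area: {v1:.2f}")  # ~0.78 of baseline
    print(f"variant 2 relative area: {v2:.2f}")  # ~0.71 of baseline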
Enhancing FPGA Architecture for Efficient Deep Learning Inference

Enhancing FPGA Architecture for Efficient Deep Learning Inference PDF Author: Andrew Maher Mansour Boutros
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Deep Learning (DL) has become the best-in-class approach for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. FPGAs offer an appealing DL inference acceleration platform due to their flexibility and energy efficiency. This thesis explores FPGA architectural changes that enhance the efficiency of one class of DL models, convolutional neural networks (CNNs), on FPGAs. We first build three state-of-the-art CNN computing architectures (CAs) as benchmarks representative of the DL domain and quantify the FPGA vs. ASIC efficiency gaps for these CAs to highlight the bottlenecks of current FPGA architectures. Then, we enhance the flexibility of the digital signal processing (DSP) blocks on current FPGAs for low-precision DL. Our DSP block increases the performance of 8-bit and 4-bit CNN inference by 1.3x and 1.6x, respectively, with minimal block area overhead. Finally, we present a preliminary evaluation of logic block architectural changes, leaving their detailed evaluation to future work.
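
The low-precision flexibility described above is commonly achieved by packing several narrow multiplications into one wide hardware multiplier. The sketch below illustrates that general packing idea only; it is an assumption-level illustration, not the specific DSP block design proposed in the thesis.

    # Two 8-bit unsigned multiplications sharing one wide multiplier by packing
    # the operands with guard bits. This shows the general multi-precision idea,
    # not the thesis's DSP block design.

    def packed_dual_mul(a0: int, a1: int, b: int) -> tuple[int, int]:
        """Compute (a0 * b, a1 * b) with a single wide multiplication.

        a0, a1, b must be unsigned 8-bit values. Each product fits in 16 bits
        (255 * 255 < 2**16), so spacing the operands 16 bits apart keeps the
        partial products from overlapping.
        """
        assert 0 <= a0 < 256 and 0 <= a1 < 256 and 0 <= b < 256
        packed = (a1 << 16) | a0     # one 24-bit operand holding both inputs
        product = packed * b         # single wide multiply, as a DSP would do
        return product & 0xFFFF, (product >> 16) & 0xFFFF

    assert packed_dual_mul(200, 37, 251) == (200 * 251, 37 * 251)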

Optimizing FPGA Architecture for Deep Learning Workloads

Optimizing FPGA Architecture for Deep Learning Workloads PDF Author: Aman Arora
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Deep Learning (DL) applications have tremendous computation requirements, making running them on general-purpose processors (CPUs) very inefficient. Modern computer systems deploy hardware acceleration, which involves offloading compute-intensive and memory-intensive tasks to specialized hardware. In the space of hardware acceleration alternatives, Field Programmable Gate Arrays (FPGAs) lie in the middle of the programmability-efficiency spectrum, with Graphics Processing Units (GPUs) being more programmable and Application Specific Integrated Circuits (ASICs) being more efficient. FPGAs provide massive parallelism and are reconfigurable, which makes them very well suited to the fast-changing needs of DL applications. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved; hence, FPGAs trail ASICs by an order of magnitude in performance. So, how can the gap between ASICs and FPGAs be minimized while retaining the strength of FPGAs, their reconfigurability? This dissertation describes research that aims to answer this question by proposing new domain-optimized FPGAs for Deep Learning. The key idea is to integrate new hardware blocks into the FPGA that provide domain-specialized functionality while keeping them largely general and usable with traditional FPGA flows. Specifically, new DL-optimized FPGAs containing blocks called Tensor Slices and CoMeFa RAMs are presented. The architecture of these blocks, along with the tradeoffs involved in exploring their architectures, is explained. Results show that significant performance improvement and energy reduction can be obtained for DL applications by using DL-specialized FPGAs containing these blocks. New benchmarks, called Koios, developed to explore FPGA architectures for DL, are also described. These benchmarks are open-sourced and work with VTR (an academic open-source FPGA architecture exploration tool). The new DL-optimized FPGAs, containing Tensor Slices and CoMeFa RAMs, are significantly more efficient at accelerating DL workloads while remaining reconfigurable at a fine grain. With the abundance of DL applications, DL-optimized FPGAs are an attractive proposition.
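
Tensor Slices harden matrix-math operations into the fabric. As a rough functional picture, the sketch below models the kind of small integer tile multiply-accumulate such a block might expose; the 4x4 tile size, int8/int32 types, and interface are illustrative assumptions rather than the dissertation's specification.

    import numpy as np

    # Minimal functional model of a hardened matrix block: a small int8 x int8
    # tile multiply with int32 accumulation. Tile size and interface are
    # illustrative assumptions, not the Tensor Slice specification.

    TILE = 4

    def tensor_tile_macc(a_tile, b_tile, acc):
        """Return acc + a_tile @ b_tile with int8 inputs, int32 accumulators."""
        a = np.asarray(a_tile, dtype=np.int8).reshape(TILE, TILE)
        b = np.asarray(b_tile, dtype=np.int8).reshape(TILE, TILE)
        return acc + a.astype(np.int32) @ b.astype(np.int32)

    # A larger matrix multiply is then built by tiling and accumulating:
    rng = np.random.default_rng(0)
    A = rng.integers(-128, 127, size=(8, 8), dtype=np.int8)
    B = rng.integers(-128, 127, size=(8, 8), dtype=np.int8)
    C = np.zeros((8, 8), dtype=np.int32)
    for i in range(0, 8, TILE):
        for j in range(0, 8, TILE):
            for k in range(0, 8, TILE):
                C[i:i+TILE, j:j+TILE] = tensor_tile_macc(
                    A[i:i+TILE, k:k+TILE], B[k:k+TILE, j:j+TILE],
                    C[i:i+TILE, j:j+TILE])
    assert np.array_equal(C, A.astype(np.int32) @ B.astype(np.int32))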

Efficient Processing of Deep Neural Networks

Efficient Processing of Deep Neural Networks PDF Author: Vivienne Sze
Publisher: Springer Nature
ISBN: 3031017668
Category : Technology & Engineering
Languages : en
Pages : 254

Book Description
This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics (such as energy efficiency, throughput, and latency) without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.
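
As a quick illustration of the key metrics named above, the snippet below computes throughput, energy per inference, and energy efficiency from a latency and power figure; all numbers are made-up example values, not data from the book.

    # Illustrative calculation of common accelerator evaluation metrics.
    # All numbers are made-up example values, not measurements from the text.

    macs_per_inference = 3.9e9     # e.g. a ResNet-50-sized workload (~3.9 GMACs)
    latency_s = 0.010              # 10 ms per inference
    power_w = 5.0                  # average board power

    ops = 2 * macs_per_inference                   # 1 MAC = 2 ops (mul + add)
    throughput_tops = ops / latency_s / 1e12       # tera-operations per second
    energy_per_inf_mj = power_w * latency_s * 1e3  # millijoules per inference
    efficiency_tops_w = throughput_tops / power_w  # energy efficiency

    print(f"throughput:       {throughput_tops:.2f} TOPS")
    print(f"energy/inference: {energy_per_inf_mj:.1f} mJ")
    print(f"efficiency:       {efficiency_tops_w:.3f} TOPS/W")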

FPGA Implementations of Neural Networks

FPGA Implementations of Neural Networks PDF Author: Amos R. Omondi
Publisher: Springer Science & Business Media
ISBN: 0387284877
Category : Technology & Engineering
Languages : en
Pages : 365

Book Description
During the 1980s and early 1990s there was significant work in the design and implementation of hardware neurocomputers. Nevertheless, most of these efforts may be judged to have been unsuccessful: at no time have hardware neurocomputers been in wide use. This lack of success may be largely attributed to the fact that earlier work was almost entirely aimed at developing custom neurocomputers based on ASIC technology, but for such niche areas this technology was never sufficiently developed or competitive enough to justify large-scale adoption. On the other hand, gate-arrays of the period mentioned were never large enough nor fast enough for serious artificial-neural-network (ANN) applications. But technology has now improved: the capacity and performance of current FPGAs are such that they present a much more realistic alternative. Consequently, neurocomputers based on FPGAs are now a much more practical proposition than they have been in the past. This book summarizes some work towards this goal and consists of 12 papers that were selected, after review, from a number of submissions. The book is nominally divided into three parts: Chapters 1 through 4 deal with foundational issues; Chapters 5 through 11 deal with a variety of implementations; and Chapter 12 looks at the lessons learned from a large-scale project and also reconsiders design issues in light of current and future technology.

Embedded Deep Learning

Embedded Deep Learning PDF Author: Bert Moons
Publisher: Springer
ISBN: 3319992236
Category : Technology & Engineering
Languages : en
Pages : 216

Book Description
This book covers algorithmic and hardware implementation techniques to enable embedded deep learning. The authors describe synergetic design approaches at the application, algorithmic, computer-architecture, and circuit levels that help achieve the goal of reducing the computational cost of deep learning algorithms. The impact of these techniques is demonstrated in four silicon prototypes for embedded deep learning. Gives a wide overview of effective solutions for energy-efficient neural networks on battery-constrained wearable devices; Discusses the optimization of neural networks for embedded deployment on all levels of the design hierarchy (applications, algorithms, hardware architectures, and circuits), supported by real silicon prototypes; Elaborates on how to design efficient Convolutional Neural Network processors, exploiting parallelism, data-reuse, sparse operations, and low-precision computations; Supports the introduced theory and design concepts with four real silicon prototypes, whose implementation and achieved performance are discussed in detail to illustrate and highlight the introduced cross-layer design concepts.
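
One of the techniques listed above, exploiting sparse operations, can be sketched in a few lines: skip multiply-accumulates whose activation operand is zero. This is a simplified software analogue of zero-skipping hardware, not the book's specific processor designs.

    # Simplified software analogue of zero-skipping: multiply-accumulate only
    # where the activation is non-zero. Real hardware gates or skips these MACs
    # to save energy; this sketch just counts how much work is avoided.

    def sparse_dot(activations, weights):
        acc = 0
        macs_done = 0
        for a, w in zip(activations, weights):
            if a != 0:                 # zero activations contribute nothing
                acc += a * w
                macs_done += 1
        return acc, macs_done

    acts = [0, 3, 0, 0, -2, 0, 1, 0]   # ReLU outputs are often mostly zero
    wts  = [5, -1, 2, 7, 4, -3, 6, 1]
    result, macs = sparse_dot(acts, wts)
    print(result, f"{macs}/{len(acts)} MACs performed")   # -5, 3/8 MACs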

Communications and Networking

Communications and Networking PDF Author: Honghao Gao
Publisher: Springer Nature
ISBN: 3030677206
Category : Computers
Languages : en
Pages : 789

Book Description
This volume constitutes the refereed proceedings of the 15th EAI International Conference on Communications and Networking, ChinaCom 2020, held in November 2020 in Shanghai, China. Due to the COVID-19 pandemic, the conference was held virtually. The 54 papers presented were carefully selected from 143 submissions. The papers are organized in topical sections on Transmission Optimization in Edge Computing; Performance and Scheduling Optimization in Edge Computing; Mobile Edge Network System; Communication Routing and Control; Transmission and Load Balancing; Edge Computing and Distributed Machine Learning; and Deep Learning.

FPGA Architecture

FPGA Architecture PDF Author: Ian Kuon
Publisher: Now Publishers Inc
ISBN: 1601981260
Category : Technology & Engineering
Languages : en
Pages : 134

Book Description
Reviews the historical development of programmable logic devices, the fundamental programming technologies on which their programmability is built, and the basic understanding gleaned from research on architectures. It is an invaluable reference for engineers and computer scientists.

Hardware Architectures for Deep Learning

Hardware Architectures for Deep Learning PDF Author: Masoud Daneshtalab
Publisher: Institution of Engineering and Technology
ISBN: 1785617680
Category : Computers
Languages : en
Pages : 329

Book Description
This book presents and discusses innovative ideas in the design, modelling, implementation, and optimization of hardware platforms for neural networks. The rapid growth of server, desktop, and embedded applications based on deep learning has brought about a renaissance of interest in neural networks, with applications including image and speech processing, data analytics, robotics, healthcare monitoring, and IoT solutions. Efficient implementation of neural networks to support complex deep learning-based applications is a significant challenge for embedded and mobile computing platforms with limited computational/storage resources and a tight power budget. Even for cloud-scale systems it is critical to select the right hardware configuration based on the neural network complexity and system constraints in order to increase power and performance efficiency. Hardware Architectures for Deep Learning provides an overview of this new field, from principles to applications, for researchers, postgraduate students and engineers who work on learning-based services and hardware platforms.

Software and Hardware Co-optimization for Deep Learning Algorithms on FPGA

Software and Hardware Co-optimization for Deep Learning Algorithms on FPGA PDF Author: Chen Wu
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Over recent years, deep learning paradigms such as convolutional neural networks (CNNs) have shown great success in various families of tasks, including object detection and autonomous driving. To extend such success to non-Euclidean data, graph convolutional networks (GCNs) have been introduced, and have quickly attracted industrial and academic attention as a popular solution to real-world problems. However, both CNNs and GCNs often have huge computation and memory complexity, which calls for specific hardware architectures to accelerate these algorithms. In this dissertation, we propose several architectures to accelerate CNNs and GCNs on FPGA platforms. We start from the domain-specific FPGA-overlay processor (OPU) for commonly used CNNs, such as VGG, Inception, ResNet, and YoloV2. The data is first quantized to 8-bit fixed-point with little accuracy loss to reduce computation complexity and memory requirements. A fully pipelined dataflow architecture is proposed to accelerate the typical layers (i.e., convolutional, pooling, residual, inception, and activation layers) in CNNs. Experimental results show that OPU is 9.6x faster than the Jetson TX2 GPU on a cascade of three CNNs used for a curbside parking system. However, 8-bit fixed-point data representations usually need re-training to maintain accuracy for deep CNNs. We therefore propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome this limitation. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder (MAC) and one 3-bit adder. Therefore, we can implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx UltraScale/UltraScale+ family, whereas one DSP can only implement two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput by 1.5x over existing FPGA accelerators. In particular, for VGG16 and Yolo, compared with seven FPGA accelerators, we improve average throughput by 3.5x and 27.5x and average throughput per DSP by 4.1x and 5x, respectively. CNNs quantized with mixed precision, on the other hand, benefit from low precision while maintaining accuracy. To better leverage the advantages of mixed precision, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) for both conventional and lightweight CNNs. The micro-architecture of MP-OPU shares the computation core across mixed-precision weights and activations to improve computation efficiency. In addition, run-time scheduling of external memory access and data arrangement are optimized to further leverage the advantages of mixed-precision data representation. Our experimental results show that MP-OPU reaches 4.92 TOPS peak throughput when implemented on a Xilinx VC709 FPGA (with all DSPs configured to support 2-bit multipliers). Moreover, MP-OPU achieves 12.9x latency reduction and 2.2x better throughput per DSP for conventional CNNs, and 7.6x latency reduction and 2.9x better throughput per DSP for lightweight CNNs, all on average compared with existing FPGA accelerators/processors.
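
As a rough illustration of what an 8-bit floating-point quantizer of this general kind can look like, the sketch below rounds a value to a custom sign/exponent/mantissa format; the format split, bias, and rounding are illustrative assumptions, not the exact LPFP scheme from the dissertation.

    import math

    # Rough sketch of quantizing a value to a custom low-precision float format
    # (1 sign bit, e_bits exponent bits, m_bits mantissa bits). The format
    # parameters, bias, and rounding are illustrative assumptions only.

    def quantize_lpfp(x: float, e_bits: int = 4, m_bits: int = 3) -> float:
        if x == 0.0:
            return 0.0
        bias = 2 ** (e_bits - 1) - 1
        sign = -1.0 if x < 0 else 1.0
        exp = math.floor(math.log2(abs(x)))
        exp = max(min(exp, 2 ** e_bits - 1 - bias), -bias)       # clamp to range
        mantissa = abs(x) / 2.0 ** exp                           # in [1, 2) when in range
        mantissa = round(mantissa * 2 ** m_bits) / 2 ** m_bits   # round to m_bits
        return sign * mantissa * 2.0 ** exp

    for v in [0.0372, -1.6, 250.0]:
        q = quantize_lpfp(v)
        print(f"{v:>9.4f} -> {q:>9.4f}  (rel. err {abs(q - v) / abs(v):.3%})")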
Graph convolutional networks (GCNs) have been introduced to effectively process non-Euclidean graph data. However, GCNs incur a large amount of irregularity in computation and memory access, which prevents efficient use of previous CNN accelerators/processors. We therefore propose a lightweight FPGA-based accelerator, named LW-GCN, to tackle irregularity in computation and memory access in GCN inference. We first decompose the main GCN operations into sparse matrix-matrix multiplication (SpMM) and matrix-matrix multiplication (MM). Thereafter, we propose a novel compression format to balance workload across PEs and prevent data hazards. In addition, we quantize the data to 16-bit fixed-point, apply workload tiling, and map both SpMM and MM onto a uniform architecture on resource-limited devices. Evaluations on GCN and GraphSAGE are performed on a Xilinx Kintex-7 FPGA with three popular datasets. Compared with an existing CPU, GPU, and state-of-the-art FPGA-based accelerator, LW-GCN reduces latency by up to 60x, 12x, and 1.7x and increases power efficiency by up to 912x, 511x, and 3.87x, respectively. Moreover, compared with Nvidia's latest edge GPU, the Jetson Xavier NX, LW-GCN achieves a speedup and energy savings of 32x and 84x, respectively. Finally, we extend our GCN inference accelerator to a GCN training accelerator, called SkeletonGCN. To better fit the properties of GCN training, we add further software-hardware co-optimizations. First, we simplify the non-linear operations in GCN training to better fit FPGA computation, and identify reusable intermediate results to eliminate redundant computation. Second, we optimize the previous compression format to further reduce memory bandwidth while allowing efficient decompression in hardware. Finally, we propose a unified architecture to support SpMM, MM, and MM with transpose, all on the same group of PEs, to increase DSP utilization on the FPGA. Evaluations are performed on a Xilinx Alveo U200 board. Compared with an existing FPGA-based accelerator on the same network architecture, SkeletonGCN achieves up to 11.3x speedup while maintaining the same training accuracy with 16-bit fixed-point data representation. In addition, SkeletonGCN is 178x and 13.1x faster than state-of-the-art CPU and GPU implementations on popular datasets, respectively. To summarize, we have been working on FPGA-based acceleration for deep learning algorithms of CNNs and GCNs in both inference and training. All the accelerators/processors were hand-coded and have been fully verified. In addition, the related tool chains for generating golden results and running instructions for the accelerators/processors have also been completed.
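
The SpMM-plus-MM decomposition of a GCN layer described above can be sketched in a few lines of SciPy; the tiny graph and feature sizes are illustrative, and this shows only the arithmetic split, not LW-GCN's hardware dataflow.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Sketch of the SpMM + MM decomposition of one GCN layer,
    #   H_out = A_norm @ (H_in @ W),
    # where A_norm is the (sparse) normalized adjacency matrix. The tiny sizes
    # and uniform edge weights are illustrative assumptions.

    num_nodes, in_feats, out_feats = 4, 8, 3
    rng = np.random.default_rng(1)

    # Sparse adjacency for a simple 4-node chain with self-loops.
    edges = [(0, 0), (1, 1), (2, 2), (3, 3),
             (0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
    rows, cols = zip(*edges)
    A = csr_matrix((np.full(len(edges), 0.5), (rows, cols)),
                   shape=(num_nodes, num_nodes))

    H = rng.standard_normal((num_nodes, in_feats))   # node features
    W = rng.standard_normal((in_feats, out_feats))   # layer weights

    XW = H @ W          # dense MM  (feature transform)
    H_out = A @ XW      # SpMM      (neighborhood aggregation)
    assert H_out.shape == (num_nodes, out_feats)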