Exploiting Data Characteristics in The Design of Accelerators for Deep Learning

Author: Patrick H. Judd
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
The recent "Cambrian explosion" of Deep Learning (DL) algorithms, in concert with the end of Moore's Law and Dennard Scaling, has spurred interest in the design of custom hardware accelerators for DL algorithms. While DL has progressed quickly thanks in part to the abundance of efficient parallel computation provided by General Purpose Graphics Processing Units, newer DL algorithms demand even higher levels of compute density and efficiency. Furthermore, applications of DL in the mobile and embedded domains demand the energy efficiency of special-purpose hardware. DL algorithms are dominated by large matrix-vector product computations, making them ideal targets for wide Single Instruction Multiple Data (SIMD) architectures. For the most part, efficiently mapping the structure of these computations to hardware is straightforward. Building on such designs, this thesis examines the data characteristics of these computations and proposes hardware modifications to exploit them for performance and energy efficiency. Specifically, this thesis examines the sparsity and precision requirements of Deep Convolutional Neural Networks, which comprise multiple layers of matrix-vector product computations. We propose a profiling method to find per-layer reduced-precision configurations while maintaining high classification accuracy. Following this, we propose three accelerator designs that build on top of the state-of-the-art DaDianNao accelerator. 1) Proteus exploits the reduced precision profiles by adding a lightweight memory compression layer, saving energy in memory access and communication, and enabling larger networks in a fixed memory budget. 2) Cnvlutin exploits the presence of zero and near-zero values in the inter-layer data by applying sparse compression to the data stream while maintaining efficient utilization of the wide memory and compute structure of the SIMD accelerator. 3) Stripes exploits the reduced precision profiles for performance by processing data bit-serially, compensating for serial latency by exploiting the abundant parallelism in the convolution operation. All three designs exploit approximation, in terms of reduced precision and computation skipping, to improve energy efficiency and/or performance while maintaining high classification accuracy. By approximating more aggressively, these designs can also dynamically trade off accuracy for further improvements in performance and energy.
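
As a rough illustration of the bit-serial idea attributed to Stripes in the description above, the following sketch computes a dot product one activation bit at a time, so the number of serial steps equals the chosen precision. The function name `bit_serial_dot` and the restriction to unsigned activations are assumptions made for this example; it is not the hardware design from the thesis.

```python
def bit_serial_dot(activations, weights, precision_bits):
    """Dot product that consumes one activation bit per step.

    Assumes activations are unsigned integers that fit in precision_bits.
    Each step handles the same bit position of every activation, which is
    the work that parallel hardware lanes would perform at once.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    limit = 1 << precision_bits
    if any(a < 0 or a >= limit for a in activations):
        raise ValueError("activation does not fit in precision_bits")

    total = 0
    for bit in range(precision_bits):  # one serial step per bit of precision
        # Add up the weights whose activation has this bit set.
        partial = sum(w for a, w in zip(activations, weights) if (a >> bit) & 1)
        total += partial << bit  # scale the partial sum by the bit position
    return total


acts = [3, 5, 0, 7]   # all fit in 3 bits
wts = [2, -1, 4, 3]
result = bit_serial_dot(acts, wts, precision_bits=3)
reference = sum(a * w for a, w in zip(acts, wts))
print(result == reference)
```

In this sketch, lowering `precision_bits` from 16 to 3 cuts the loop from 16 steps to 3, which is the sense in which a reduced-precision profile can translate into fewer serial steps.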

Data Orchestration in Deep Learning Accelerators

Author: Tushar Krishna
Publisher: Springer Nature
ISBN: 3031017676
Category : Technology & Engineering
Languages : en
Pages : 158

Book Description
This Synthesis Lecture focuses on techniques for efficient data orchestration within DNN accelerators. The end of Moore's Law, coupled with the increasing growth in deep learning and other AI applications, has led to the emergence of custom Deep Neural Network (DNN) accelerators for energy-efficient inference on edge devices. Modern DNNs have millions of hyper parameters and involve billions of computations; this necessitates extensive data movement from memory to on-chip processing engines. It is well known that the cost of data movement today surpasses the cost of the actual computation; therefore, DNN accelerators require careful orchestration of data across on-chip compute, network, and memory elements to minimize the number of accesses to external DRAM. The book covers DNN dataflows, data reuse, buffer hierarchies, networks-on-chip, and automated design-space exploration. It concludes with data orchestration challenges with compressed and sparse DNNs and future trends. The target audience is students, engineers, and researchers interested in designing high-performance and low-energy accelerators for DNN inference.

Efficient Processing of Deep Neural Networks

Author: Vivienne Sze
Publisher: Springer Nature
ISBN: 3031017668
Category : Technology & Engineering
Languages : en
Pages : 254

Book Description
This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics—such as energy-efficiency, throughput, and latency—without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.

Simulating Dataflow Accelerators for Deep Learning Application in Heterogeneous System

Author: Quang Anh Hoang
Publisher:
ISBN:
Category : Computer architecture
Languages : en
Pages : 0

Book Description
Over the past few decades, deep learning has emerged as an essential discipline that broadens the horizon of the knowledge of humankind. At its core, Deep Neural Networks (DNNs) play a vital role in processing input data to generate predictions or decisions (inference step), with their accuracy improved by extensive training (training step). As the complexity of the problem increases, the number of layers in DNN models tends to rise. Such complex models require more computations and take longer to produce an output. Additionally, the large number of calculations requires a tremendous amount of power. Therefore, improving energy efficiency is a primary design consideration. To address this concern, researchers have studied domain-specific architecture to develop highly efficient hardware tailored for a given application, which performs a given set of computations at a lower energy cost. An energy-efficient yet high-performance system is created by pairing this application-specific accelerator with a General-Purpose Processor (GPP). This heterogeneity helps offload the heavy computations to the accelerator while handling less computation-intensive tasks on the GPP. In this thesis, we study the performance of dataflow accelerators integrated into a heterogeneous architecture for executing deep learning workloads. Fundamental to these accelerators is their high level of concurrency in executing computations simultaneously, making them suitable to exploit the data parallelism present in DNN operations. With the limited bandwidth of the interconnection between accelerator and main memory being one of the critical constraints of a heterogeneous system, a tradeoff between memory overhead and computational runtime is worth considering. This tradeoff is the main criterion we use in this thesis to evaluate the performance of each architecture and configuration. A model of a dataflow memristive crossbar array accelerator is first proposed to expand the scope of the heterogeneous simulation framework towards architectures with analog and mixed-signal circuits. At the core of this accelerator, an array of resistive memory cells connected in a crossbar architecture is used for computing matrix multiplications. This design aims to study the effect of memory-performance tradeoffs on systems with analog components. Therefore, a comparison between the memristive crossbar array architecture and its digital counterpart, the systolic array, is presented. While existing studies focus on heterogeneous systems with digital components, this approach is the first to consider a mixed-signal accelerator incorporated with a general-purpose processor for deep learning workloads. Finally, application interface software is designed to configure the system's architecture and map DNN layers to simulated hardware. At the core of this software is a DNN model parser-partitioner, which handles the subsequent tasks of generating a hardware configuration for the accelerator and assigning partitioned workloads to the simulated accelerator. The interface provided by this software can be developed further to incorporate scheduling and mapping algorithms. This extension will produce a synthesizer that will facilitate the following:
• Hardware configuration: generate the optimal configuration of system hardware, incorporating the key hardware characteristics such as the number of accelerators, dimension of processing array, and memory allocation for each accelerator.
• Schedule of execution: implement a mapping algorithm to decide on an efficient distribution and schedule of partitioned workloads.
For future development, this synthesizer will unite the first two stages in the system's design flow. In the first analysis stage, simulators search for optimal design aspects under a short time frame based on abstract application graphs and the system's specifications. In the architecture stage, within the optimal design region from the previous stage, simulators refine their findings by studying further details at the architectural level. This inter-stage fusion, once finished, can bring the high accuracy of architectural-level simulation tools closer to the analysis stage. In the opposite direction, mapping algorithms implemented in analysis tools can provide architectural exploration with near-optimal scheduling. Together, this stack of software can significantly reduce the time spent searching for specifications with optimal efficiency.
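
The partitioning and workload-assignment steps mentioned in the description above can be pictured with a small sketch. It splits a layer's weight matrix into tiles no larger than a processing array and hands the tiles to accelerators in round-robin order. The names `partition_layer` and `assign_round_robin`, the tile representation, and the round-robin policy are all hypothetical choices for illustration; the thesis software may work quite differently.

```python
def partition_layer(rows, cols, array_rows, array_cols):
    """Split a rows x cols weight matrix into tiles that fit the array.

    Each tile is (row_start, row_end, col_start, col_end) with exclusive
    end indices. Edge tiles may be smaller than the processing array.
    """
    if min(rows, cols, array_rows, array_cols) <= 0:
        raise ValueError("all dimensions must be positive")
    tiles = []
    for r in range(0, rows, array_rows):
        for c in range(0, cols, array_cols):
            tiles.append((r, min(r + array_rows, rows),
                          c, min(c + array_cols, cols)))
    return tiles


def assign_round_robin(tiles, num_accelerators):
    """Distribute tiles across accelerators in round-robin order."""
    if num_accelerators <= 0:
        raise ValueError("num_accelerators must be positive")
    schedule = [[] for _ in range(num_accelerators)]
    for index, tile in enumerate(tiles):
        schedule[index % num_accelerators].append(tile)
    return schedule


# A 5 x 4 layer mapped onto 2 x 3 arrays, shared by two accelerators.
tiles = partition_layer(rows=5, cols=4, array_rows=2, array_cols=3)
schedule = assign_round_robin(tiles, num_accelerators=2)
for accel_id, work in enumerate(schedule):
    print(accel_id, work)
```

A real mapping algorithm would likely weigh memory traffic against compute time when choosing tile sizes and placement, in line with the memory-performance tradeoff the description emphasizes; round-robin is used here only because it is the simplest policy to read.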

Embedded Deep Learning

Author: Bert Moons
Publisher: Springer
ISBN: 3319992236
Category : Technology & Engineering
Languages : en
Pages : 206

Book Description
This book covers algorithmic and hardware implementation techniques to enable embedded deep learning. The authors describe synergetic design approaches on the application-, algorithmic-, computer architecture-, and circuit-level that will help in achieving the goal of reducing the computational cost of deep learning algorithms. The impact of these techniques is displayed in four silicon prototypes for embedded deep learning. Gives a wide overview of a series of effective solutions for energy-efficient neural networks on battery-constrained wearable devices; Discusses the optimization of neural networks for embedded deployment on all levels of the design hierarchy – applications, algorithms, hardware architectures, and circuits – supported by real silicon prototypes; Elaborates on how to design efficient Convolutional Neural Network processors, exploiting parallelism and data-reuse, sparse operations, and low-precision computations; Supports the introduced theory and design concepts by four real silicon prototypes. The implementation of the physical realizations and their achieved performance are discussed in detail to illustrate and highlight the introduced cross-layer design concepts.

Algorithm-accelerator Co-design for High-performance and Secure Deep Learning

Author: Weizhe Hua
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Deep learning has emerged as a new engine for many of today's artificial intelligence/machine learning systems, leading to several recent breakthroughs in vision and natural language processing tasks. However, as we move into the era of deep learning with billions and even trillions of parameters, meeting the computational and memory requirements to train and serve state-of-the-art models has become extremely challenging. Optimizing the computational cost and memory footprint of deep learning models for better system performance is critical to the widespread deployment of deep learning. Moreover, a massive amount of sensitive and private user data is exposed to the deep learning system during the training or serving process. Therefore, it is essential to investigate potential vulnerabilities in existing deep learning hardware, and then design secure deep learning systems that provide strong privacy guarantees for user data and the models that learn from the data. In this dissertation, we propose to co-design the deep learning algorithms and hardware architectural techniques to improve both the performance and security/privacy of deep learning systems. On high-performance deep learning, we first introduce the channel gating neural network (CGNet), which exploits the dynamic sparsity of specific inputs to reduce the computation of convolutional neural networks. We also co-develop an ASIC accelerator for CGNet that can turn theoretical FLOP reduction into wall-clock speedup. Secondly, we present Fast Linear Attention with a Single Head (FLASH), a state-of-the-art language model specifically designed for Google's TPU that can achieve transformer-level quality with linear complexity with respect to the sequence length. Through our empirical studies on masked language modeling, auto-regressive language modeling, and fine-tuning for question answering, FLASH achieves at least similar if not better quality compared to the augmented transformer, while being significantly faster (e.g., up to 12 times faster). On the security of deep learning, we study the side-channel vulnerabilities of existing deep learning accelerators. We then introduce a secure accelerator architecture for privacy-preserving deep learning, named GuardNN. GuardNN provides a trusted execution environment (TEE) with specialized protection for deep learning, and achieves a small trusted computing base and low protection overhead at the same time. The FPGA prototype of GuardNN achieves a maximum performance overhead of 2.4% across four different modern DNN models for ImageNet.

Hardware Accelerators for Machine Learning: From 3D Manycore to Processing-in-Memory Architectures

Author: Aqeeb Iqbal Arka
Publisher:
ISBN:
Category : Machine learning
Languages : en
Pages : 0

Book Description
Big data applications such as deep learning and graph analytics require hardware platforms that are energy-efficient yet computationally powerful. 3D manycore architectures are the key to efficiently executing such compute- and data-intensive applications. A through-silicon via (TSV)-based 3D manycore system is a promising solution in this direction as it enables integration of disparate heterogeneous computing cores on a single system. Recent industry trends show the viability of 3D integration in real products (e.g., Intel Lakefield SoC Architecture, the AMD Radeon R9 Fury X graphics card, and Xilinx Virtex-7 2000T/H580T). However, the achievable performance of conventional TSV-based 3D systems is ultimately bottlenecked by the horizontal wires (wires in each planar die). Moreover, current TSV 3D architectures suffer from thermal limitations. Hence, TSV-based architectures do not realize the full potential of 3D integration. Monolithic 3D (M3D) integration, a breakthrough technology to achieve "More Moore and More Than Moore," opens up the possibility of designing cores and associated network routers using multiple layers by utilizing monolithic inter-tier vias (MIVs), hence reducing the effective wire length. Compared to TSV-based 3D ICs, M3D offers the "true" benefits of the vertical dimension for system integration: the size of an MIV used in M3D is over 100x smaller than a TSV. However, designing these new architectures often involves optimizing multiple conflicting objectives (e.g., performance, thermal, etc.) due to the presence of a mix of computing elements and communication methodologies, each with a different requirement for high performance. To overcome the difficult optimization challenges due to the large design space and complex interactions among the heterogeneous components (CPU, GPU, Last Level Cache, etc.) in an M3D-based manycore chip, machine learning algorithms can be explored as a promising solution. The first part of this dissertation focuses on the design of high-performance and energy-efficient architectures for big-data applications, enabled by M3D vertical integration and data-driven machine learning algorithms. As an example, we consider heterogeneous manycore architectures with CPUs, GPUs, and cache as the choice of hardware platform in this part of the work. The disparate nature of these processing elements introduces conflicting design requirements that need to be satisfied simultaneously. Moreover, the on-chip traffic pattern exhibited by different big-data applications (like many-to-few-to-many in CPU/GPU-based manycore architectures) needs to be incorporated in the design process for an optimal power-performance trade-off. In this dissertation, we first design an M3D-enabled heterogeneous manycore architecture and demonstrate the efficacy of machine learning algorithms for efficiently exploring a large design space. For large design space exploration problems, the proposed machine learning algorithm can find good solutions in significantly less time than existing state-of-the-art counterparts. However, the M3D-enabled heterogeneous manycore architecture is still limited by the inherent memory bandwidth bottlenecks of traditional von Neumann architectures. As a result, later in this dissertation, we focus on Processing-in-Memory (PIM) architectures tailor-made to accelerate deep learning applications such as Graph Neural Networks (GNNs), as such architectures can achieve massive data parallelism and do not suffer from memory bandwidth-related issues. We choose GNNs as an example workload because they are more complex than traditional deep learning applications: they simultaneously exhibit attributes of both deep learning and graph computations. Hence, they are both compute- and data-intensive in nature. The high amount of data movement required by GNN computation poses a challenge to conventional von Neumann architectures (such as CPUs, GPUs, and heterogeneous system-on-chips (SoCs)) as they have limited memory bandwidth. Hence, we propose the use of PIM-based non-volatile memory such as Resistive Random Access Memory (ReRAM). We leverage the efficient matrix operations enabled by ReRAMs and design manycore architectures that can facilitate the unique computation and communication needs of large-scale GNN training. We then exploit various techniques such as regularization methods to further accelerate GNN training on ReRAM-based manycore systems. Finally, we streamline the GNN training process by reducing the amount of redundant information in both the GNN model and the input graph. Overall, this work focuses on the design challenges of high-performance and energy-efficient manycore architectures for machine learning applications. We propose novel architectures that use M3D or ReRAM-based PIM architectures to accelerate such applications. Moreover, we focus on hardware/software co-design to ensure the best possible performance.

Deep Learning for Social Media Data Analytics

Author: Tzung-Pei Hong
Publisher: Springer Nature
ISBN: 3031108698
Category : Computers
Languages : en
Pages : 297

Book Description
This edited book covers ongoing research in both theory and practical applications of using deep learning for social media data. Social networking platforms are overwhelmed by different contents, and their huge amounts of data have enormous potential to influence business, politics, security, planning and other social aspects. Recently, deep learning techniques have had many successful applications in the AI field. The research presented in this book emerges from the conviction that there is still much progress to be made toward exploiting deep learning in the context of social media data analytics. It includes fifteen chapters, organized into four sections that report on original research in network structure analysis, social media text analysis, user behaviour analysis and social media security analysis. This work could serve as a good reference for researchers, as well as a compilation of innovative ideas and solutions for practitioners interested in applying deep learning techniques to social media data analytics.

Proceedings of Ninth International Congress on Information and Communication Technology

Author: Xin-She Yang
Publisher: Springer Nature
ISBN: 9819732999
Category :
Languages : en
Pages : 635

Book Description


Computer Architecture for Scientists

Author: Andrew A. Chien
Publisher: Cambridge University Press
ISBN: 1316518531
Category : Computers
Languages : en
Pages : 265

Book Description
A principled, high-level view of computer performance and how to exploit it. Ideal for software architects and data scientists.