Kernel-assisted and Topology-aware MPI Collective Communication Among Multicore Or Many-core Clusters

Kernel-assisted and Topology-aware MPI Collective Communication Among Multicore Or Many-core Clusters PDF Author: Teng Ma
Publisher:
ISBN:
Category :
Languages : en
Pages : 136

Book Description
Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only at the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the programming model most widely adopted in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication: first, a gap exists between logical collective topologies and the underlying hardware topologies; second, current MPI implementations lack efficient shared-memory message delivery approaches; last, on distributed memory machines such as multicore clusters, no single approach can encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to perform multiple copies concurrently. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes into a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms into a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide variety of platforms, but also deliver a level of performance that is immune to changes in the underlying process-core binding.
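
To make the composition idea concrete, here is a minimal sketch, not the dissertation's actual implementation (which couples the distance-aware hierarchy with KNEM copies), that builds a two-level broadcast from standard MPI calls: ranks sharing a node are grouped with MPI_Comm_split_type, node leaders form an upper-level communicator, and the broadcast is composed from an inter-node step among leaders followed by an intra-node step.

    /* Two-level broadcast composed from an inter-node and an intra-node step.
     * Illustrative sketch only: plain MPI communicator splitting stands in for
     * the distance-aware hierarchy and KNEM-based copies described above. */
    #include <mpi.h>
    #include <stdio.h>

    static void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype,
                                   int root, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* Intra-node level: group the ranks that share a node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                            &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Inter-node level: the global root plus each node's local rank 0. */
        int is_leader = (node_rank == 0 || rank == root);
        MPI_Comm leader_comm;
        MPI_Comm_split(comm, is_leader ? 0 : MPI_UNDEFINED,
                       rank == root ? 0 : rank + 1, &leader_comm);

        /* Step 1: broadcast among leaders (the root has key 0, so it is
         * rank 0 of leader_comm). */
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, dtype, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }

        /* Step 2: broadcast inside each node from its leader. */
        MPI_Bcast(buf, count, dtype, 0, node_comm);
        MPI_Comm_free(&node_comm);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, value = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            value = 42;                      /* payload to distribute */
        hierarchical_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d got %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }

Expressed this way, the inter-node and intra-node steps can be swapped independently, e.g. replacing the intra-node MPI_Bcast with a KNEM-based single-copy delivery, which is the spirit of the algorithm composition described above.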

Algorithms and Architectures for Parallel Processing

Algorithms and Architectures for Parallel Processing PDF Author: Guojun Wang
Publisher: Springer
ISBN: 3319271407
Category : Computers
Languages : en
Pages : 880

Book Description
This four-volume set, LNCS 9528, 9529, 9530 and 9531, constitutes the refereed proceedings of the 15th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2015, held in Zhangjiajie, China, in November 2015. The 219 revised full papers presented together with 77 workshop papers in these four volumes were carefully reviewed and selected from 807 submissions (602 full papers and 205 workshop papers). The first volume comprises the following topics: parallel and distributed architectures; distributed and network-based computing; and Internet of Things and cyber-physical-social computing. The second volume comprises topics such as big data and its applications, and parallel and distributed algorithms. The topics of the third volume are: applications of parallel and distributed computing, and service dependability and security in distributed and parallel systems. The topics covered in the fourth volume are: software systems and programming models, and performance modeling and evaluation.

Supercomputing

Supercomputing PDF Author: Vladimir Voevodin
Publisher: Springer Nature
ISBN: 3030646165
Category : Computers
Languages : en
Pages : 660

Book Description
This book constitutes the refereed post-conference proceedings of the 6th Russian Supercomputing Days, RuSCDays 2020, held in Moscow, Russia, in September 2020.* The 51 revised full and 4 revised short papers presented were carefully reviewed and selected from 106 submissions. The papers are organized in the following topical sections: parallel algorithms; supercomputer simulation; HPC, BigData, AI: architectures, technologies, tools; and distributed and cloud computing. * The conference was held virtually due to the COVID-19 pandemic.

Enhancement of LIMIC-based Collectives for Multi-core Clusters

Enhancement of LIMIC-based Collectives for Multi-core Clusters PDF Author: Vijay Dhanraj
Publisher:
ISBN:
Category :
Languages : en
Pages : 63

Book Description
Abstract: High Performance Computing (HPC) has made it possible for scientists and engineers to solve complex science, engineering and business problems using applications that require high bandwidth, low-latency networking, and very high compute capabilities. To satisfy this ever-increasing need for compute capability, more and more clusters are deploying multi-core processors. A majority of these parallel applications are written in MPI and employ collective operations in their communication kernels. Optimizing these collective operations on multi-core platforms is one of the key factors in obtaining good performance speed-ups. However, the challenge for these applications lies in utilizing the Non-Uniform Memory Access (NUMA) architecture and shared cache hierarchies provided by modern multi-core processors. Care must also be taken to reduce traffic to remote NUMA memory, avoid memory bottlenecks for rooted collectives, and reduce the multiple copies incurred by shared-memory staging. Existing optimizations for MPI collectives deploy a single-leader approach, using kernel-assisted techniques (LiMIC) in a point-to-point manner to mitigate the above-mentioned issues, but this still does not scale well as the number of cores per node increases. In this thesis, we present Direct LiMIC primitives for MPI collectives, such as MPI_Bcast and MPI_Gather, that use existing LiMIC APIs to perform the collective operation. We deploy these new techniques in single-leader and multi-leader approaches, which are created from a hierarchical framework based on the underlying node topology. This helps us obtain stable performance even in the case of irregular process placement. Based on this hierarchical framework and the Direct LiMIC primitives, we propose multiple hybrid schemes which utilize the existing optimized MPI point-to-point schemes and kernel-assisted techniques, and we evaluate the performance of three MPI collectives (Broadcast, Gather and Allgather) on a popular open-source MPI stack, MVAPICH2. Based on our experimental evaluation on the SDSC Trestles cluster, a representative multi-core cluster, using the OSU Micro-Benchmarks (OMB), we observed a performance improvement of around 10-28% for MPI_Bcast within a node and around 10-35% for MPI_Gather for system sizes ranging from 64 to 1,024 processes. With good improvements in MPI_Bcast and MPI_Gather, we also evaluated MPI_Allgather, which in turn uses these enhanced collectives, and observed a performance improvement of around 8-60% for small messages (1-64 bytes) at system sizes of 256 processes or more.
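
The LiMIC kernel interface itself is not shown here; as a rough illustration of the single-leader versus multi-leader distinction, the sketch below (standard MPI only, with hypothetical function and variable names) partitions the ranks of one node into a configurable number of contiguous groups, gathers within each group to its leader, and then gathers the leaders' blocks to the node root. It assumes the number of ranks on the node is divisible by nleaders.

    /* Multi-leader intra-node gather sketch (standard MPI only, no LiMIC):
     * local ranks are split into `nleaders` contiguous groups, each group
     * gathers to its leader, then the leaders gather to local rank 0.
     * Assumes the number of ranks on the node is divisible by nleaders. */
    #include <mpi.h>
    #include <stdlib.h>

    static void multi_leader_gather(const int *sendbuf, int count, int *recvbuf,
                                    int nleaders, MPI_Comm node_comm)
    {
        int lrank, lsize;
        MPI_Comm_rank(node_comm, &lrank);
        MPI_Comm_size(node_comm, &lsize);
        int group_size = lsize / nleaders;

        /* Contiguous blocks of local ranks form one group per leader. */
        MPI_Comm group_comm;
        MPI_Comm_split(node_comm, lrank / group_size, lrank, &group_comm);
        int grank;
        MPI_Comm_rank(group_comm, &grank);

        /* Stage 1: every group gathers its members' data at the group leader. */
        int *stage = NULL;
        if (grank == 0)
            stage = malloc((size_t)group_size * count * sizeof(int));
        MPI_Gather(sendbuf, count, MPI_INT, stage, count, MPI_INT, 0, group_comm);

        /* Stage 2: group leaders gather their blocks at the node root
         * (local rank 0); block order matches local-rank order because
         * the groups are contiguous. */
        MPI_Comm leader_comm;
        MPI_Comm_split(node_comm, grank == 0 ? 0 : MPI_UNDEFINED, lrank,
                       &leader_comm);
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Gather(stage, group_size * count, MPI_INT,
                       recvbuf, group_size * count, MPI_INT, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        free(stage);
        MPI_Comm_free(&group_comm);
    }

A caller would typically obtain node_comm from MPI_Comm_split_type(MPI_COMM_TYPE_SHARED, ...) and allocate recvbuf (lsize * count integers) only at local rank 0; nleaders = 1 degenerates to the single-leader scheme.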

2008 37th International Conference on Parallel Processing

2008 37th International Conference on Parallel Processing PDF Author: IEEE Staff
Publisher:
ISBN:
Category : Electronic data processing
Languages : en
Pages : 704

Book Description


Topology-aware MPI Communication and Scheduling for High Performance Computing Systems

Topology-aware MPI Communication and Scheduling for High Performance Computing Systems PDF Author: Hari Subramoni
Publisher:
ISBN:
Category :
Languages : en
Pages : 132

Book Description
Abstract: The designs proposed in this thesis have been successfully tested at up to 4,096 processes on the Stampede supercomputing system at TACC. We observe up to 14% improvement in the latency of the broadcast operation using our proposed topology-aware scheme over the default scheme at the micro-benchmark level for 1,024 processes. The topology-aware point-to-point communication and process placement scheme is able to improve the performance of the MILC application by up to 6% and 15% in total execution time on 1,024 cores of Hyperion and 2,048 cores of Ranger, respectively. We also observe that our network-topology-aware communication schedules for Alltoall significantly reduce the amount of network contention observed during Alltoall/FFT operations, and deliver up to a 12% improvement in the communication time of P3DFFT at 4,096 processes on Stampede. The proposed network-topology-aware plugin for SLURM is able to improve the throughput of a 512-core cluster by up to 8%.
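
The schemes evaluated above live inside the MPI library and the SLURM scheduler, but MPI also lets an application hand its communication pattern to the library through the distributed graph topology interface, which gives a topology-aware implementation the opportunity to reorder ranks onto nearby cores and switches. A minimal sketch for a 1-D ring pattern, assuming any standard MPI-3 library (this is an illustration, not the thesis's mechanism):

    /* Describe a 1-D ring communication pattern to the MPI library and allow
     * it to reorder ranks onto the physical topology (reorder = 1). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank talks to its left and right neighbours on a ring. */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        int neighbours[2] = { left, right };
        int weights[2]    = { 1, 1 };

        MPI_Comm ring_comm;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, neighbours, weights,   /* in-edges  */
                                       2, neighbours, weights,   /* out-edges */
                                       MPI_INFO_NULL, 1 /* reorder */,
                                       &ring_comm);

        /* Neighborhood collectives then follow the declared pattern. */
        int sendbuf[2] = { rank, rank };
        int recvbuf[2];
        MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, ring_comm);
        printf("rank %d received %d and %d\n", rank, recvbuf[0], recvbuf[1]);

        MPI_Comm_free(&ring_comm);
        MPI_Finalize();
        return 0;
    }

Whether ranks are actually migrated when reorder is set to 1 is implementation-dependent.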

Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters

Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters PDF Author: Nithin Senthil Kumar
Publisher:
ISBN:
Category : Graphics processing units
Languages : en
Pages : 74

Book Description
CUDA-aware Message Passing Interface (MPI) libraries like MVAPICH2-GDR have rapidly evolved to keep up with the demand for efficient GPU buffer-based communication by incorporating the latest technological advances to drive down communication latency significantly. However, with the advent of Deep Learning (DL), vendors have started to introduce libraries that are DL-focused but not MPI-compliant, like the NVIDIA Collective Communications Library (NCCL). Furthermore, there is no single, standardized benchmarking tool to evaluate the performance of both MPI and NCCL operations. In this work, we introduce a new set of collective benchmarks within the OSU Micro-Benchmarks (OMB) to evaluate the performance of NCCL operations in a manner that is semantically equivalent to the MPI benchmarks. We then investigate whether modern CUDA-aware MPI libraries like MVAPICH2-GDR can take advantage of advances in collective communication libraries like NCCL to provide high-performance, MPI-compliant collective communication primitives for High-Performance Computing (HPC) and DL applications. We incorporate the ability to invoke the NCCL API into MVAPICH2-GDR's tuning framework in order to select the best algorithm for any given message size. Finally, we evaluate the performance of our designs by investigating the improvement in latency at different message sizes and scales on the Lassen supercomputing system using OMB. The designs developed as a part of this thesis will be made available in future releases of MVAPICH2-GDR and OMB.
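
As a rough illustration of message-size-based dispatch between an MPI path and an NCCL path (not the actual MVAPICH2-GDR tuning framework; the 64 KiB cutoff and the function name are placeholders), the sketch below performs an allreduce on a GPU-resident buffer with either MPI_Allreduce or ncclAllReduce. It assumes a CUDA-aware MPI library, one GPU per process, and an ncclComm_t and CUDA stream created beforehand.

    /* Choose between a CUDA-aware MPI_Allreduce and ncclAllReduce based on the
     * message size. The threshold is a placeholder; a real tuning framework
     * would select it per platform, scale, and collective. */
    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <stddef.h>

    #define NCCL_THRESHOLD_BYTES (64 * 1024)   /* arbitrary example cutoff */

    /* Allreduce (sum) on `count` floats resident in GPU memory.
     * Assumes a CUDA-aware MPI library when the MPI path is taken. */
    static void hybrid_allreduce_float(const float *d_send, float *d_recv,
                                       size_t count, MPI_Comm mpi_comm,
                                       ncclComm_t nccl_comm, cudaStream_t stream)
    {
        size_t bytes = count * sizeof(float);
        if (bytes < NCCL_THRESHOLD_BYTES) {
            /* Small messages: latency-oriented MPI path on device buffers. */
            MPI_Allreduce(d_send, d_recv, (int)count, MPI_FLOAT, MPI_SUM,
                          mpi_comm);
        } else {
            /* Large messages: bandwidth-oriented NCCL ring/tree algorithms. */
            ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum,
                          nccl_comm, stream);
            cudaStreamSynchronize(stream);   /* match MPI blocking semantics */
        }
    }

An ncclComm_t is typically created by generating an ncclUniqueId on one rank, broadcasting it with MPI_Bcast, and calling ncclCommInitRank on every rank; a real tuning framework would also vary the threshold per collective and per system scale.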

Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operations Performance

Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operations Performance PDF Author:
Publisher:
ISBN:
Category :
Languages : en
Pages : 8

Book Description
The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced optimal trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area computational grids, and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. The authors present a strategy based upon a multilayer view of the network. By creating multilevel topology trees they take advantage of communication cost differences at every level in the network. They used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH implementation of the MPI standard. Using information about topology discovered by Globus, they construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. They present results demonstrating the advantages of their multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
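
The multilevel idea generalizes the usual two-layer split: at every layer the data first moves between the leaders of that layer's domains, and the scheme then recurses inside each domain. A minimal recursive sketch, assuming the payload starts at rank 0 of the communicator and that each process already knows its domain color at every level (e.g. site, then cluster, then node); this is an illustration, not MPICH-G's implementation, which builds comparable trees automatically from Globus topology data:

    /* Recursive multilevel broadcast: colors[l] identifies the domain a
     * process belongs to at level l (level 0 = widest, e.g. site; last
     * level = narrowest, e.g. node). Data starts at rank 0 of `comm`. */
    #include <mpi.h>

    static void multilevel_bcast(void *buf, int count, MPI_Datatype dtype,
                                 MPI_Comm comm, const int *colors, int nlevels)
    {
        if (nlevels == 0) {                 /* innermost level: flat broadcast */
            MPI_Bcast(buf, count, dtype, 0, comm);
            return;
        }
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* Group the processes that share a domain at the current level. */
        MPI_Comm domain;
        MPI_Comm_split(comm, colors[0], rank, &domain);
        int drank;
        MPI_Comm_rank(domain, &drank);

        /* Domain leaders (local rank 0; rank 0 of comm is always one of them)
         * move the data across domains over the wider, slower channel first. */
        MPI_Comm leaders;
        MPI_Comm_split(comm, drank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
        if (leaders != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, dtype, 0, leaders);
            MPI_Comm_free(&leaders);
        }

        /* Recurse inside each domain with the remaining, finer levels. */
        multilevel_bcast(buf, count, dtype, domain, colors + 1, nlevels - 1);
        MPI_Comm_free(&domain);
    }

With one level of colors this reduces to the familiar two-layer leader scheme; adding levels lets the tree respect communication-cost differences at every layer, which is the multilevel strategy described above.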

Scalable and High Performance Collective Communication for Next Generation Multicore Infiniband Clusters

Scalable and High Performance Collective Communication for Next Generation Multicore Infiniband Clusters PDF Author: Amith Rajith Mamidala
Publisher:
ISBN:
Category : Computer networks
Languages : en
Pages : 144

Book Description
Abstract: High Performance Computing is enabling rapid innovations spanning several key areas, ranging from science, technology and manufacturing disciplines to entertainment and financial markets. One computing paradigm contributing significantly to the outreach of such capabilities is cluster computing, which involves the use of multiple commodity PCs interconnected by a network to provide the required computational resource in a cost-effective manner. Recently, commodity clusters have been rapidly transforming into capability-class machines, with several of them featuring in the Top 10 list of supercomputers. The two primary drivers of this trend are: a) the advent of multicore technology and b) the performance and scalability of InfiniBand, an open-standard interconnection network. These two factors are ushering in an era of ultra-scale InfiniBand multicore clusters comprising tens of thousands of compute cores. Using the Message Passing Interface (MPI) is the most popular method of programming parallel applications. In this model, communication occurs via explicit exchange of data messages. MPI provides a plethora of communication primitives, of which collective primitives are especially significant. These are extensively used in a variety of scientific and engineering applications (for example, to compute fast Fourier transforms and multiply large matrices). It is imperative that these collectives be designed efficiently to ensure good performance and scalability. MPI collectives pose several challenges and requirements in terms of guaranteeing data reliability, enabling efficient and scalable data transfer, and providing process-skew tolerance mechanisms. Moreover, the characteristics of the underlying network and multicore systems directly impact the behavior of collective operations and need to be taken into consideration when optimizing performance and resource usage. In this dissertation, we take on these challenges to design a scalable and high-performance collective communication subsystem for MPI over InfiniBand multicore clusters. The central theme of our approach is to develop an in-depth understanding of the capabilities of the underlying network and system architecture and to leverage these to provide optimal design alternatives. Specifically, the dissertation describes novel communication protocols and algorithms utilizing a) InfiniBand's hardware multicast and RDMA capabilities and b) the system's shared memory to meet the stated requirements and challenges. The collective optimizations discussed in the dissertation also take into account the different transport methods of InfiniBand and the architectural attributes of multicore systems. The designs proposed in the dissertation have been incorporated into the open-source MVAPICH software used by more than 680 organizations worldwide. It is deployed in several cluster installations and is currently used by the world's third-fastest supercomputer.
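
Hardware multicast and RDMA cannot be shown compactly here, but the shared-memory side of such designs can be sketched with MPI-3 shared windows: the root writes the payload once into a window that the other ranks on the node map directly, so intra-node delivery avoids per-pair intermediate copies (kernel-assisted schemes go further and avoid even the staging copy that this sketch still pays). The function name is hypothetical, and node_comm is assumed to contain only ranks that share a node.

    /* Intra-node broadcast through an MPI-3 shared-memory window: the root
     * copies the data into the shared segment once and every other rank
     * reads it directly. Illustrative sketch only. */
    #include <mpi.h>
    #include <string.h>

    static void shm_window_bcast(double *buf, int count, MPI_Comm node_comm)
    {
        int lrank;
        MPI_Comm_rank(node_comm, &lrank);

        /* Only the root (local rank 0) backs the window with memory. */
        MPI_Aint bytes = (lrank == 0) ? (MPI_Aint)count * sizeof(double) : 0;
        double *base;
        MPI_Win win;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                node_comm, &base, &win);

        /* Every rank obtains a direct pointer to the root's segment. */
        MPI_Aint qsize;
        int qdisp;
        double *shared;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &shared);

        MPI_Win_fence(0, win);
        if (lrank == 0)
            memcpy(shared, buf, (size_t)count * sizeof(double)); /* one write */
        MPI_Win_fence(0, win);              /* make the write visible to all */
        if (lrank != 0)
            memcpy(buf, shared, (size_t)count * sizeof(double)); /* direct read */
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
    }

node_comm is usually obtained with MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, ...).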

Distributed and Cloud Computing

Distributed and Cloud Computing PDF Author: Kai Hwang
Publisher: Morgan Kaufmann
ISBN: 0128002042
Category : Computers
Languages : en
Pages : 671

Book Description
Distributed and Cloud Computing: From Parallel Processing to the Internet of Things offers complete coverage of modern distributed computing technology including clusters, the grid, service-oriented architecture, massively parallel processors, peer-to-peer networking, and cloud computing. It is the first modern, up-to-date distributed systems textbook; it explains how to create high-performance, scalable, reliable systems, exposing the design principles, architecture, and innovative applications of parallel, distributed, and cloud computing systems. Topics covered by this book include: facilitating management, debugging, migration, and disaster recovery through virtualization; clustered systems for research or ecommerce applications; designing systems as web services; and social networking systems using peer-to-peer computing. The principles of cloud computing are discussed using examples from open-source and commercial applications, along with case studies from the leading distributed computing vendors such as Amazon, Microsoft, and Google. Each chapter includes exercises and further reading, with lecture slides and more available online. This book will be ideal for students taking a distributed systems or distributed computing class, as well as for professional system designers and engineers looking for a reference to the latest distributed technologies including cloud, P2P and grid computing. Complete coverage of modern distributed computing technology including clusters, the grid, service-oriented architecture, massively parallel processors, peer-to-peer networking, and cloud computing. Includes case studies from the leading distributed computing vendors: Amazon, Microsoft, Google, and more. Explains how to use virtualization to facilitate management, debugging, migration, and disaster recovery. Designed for undergraduate or graduate students taking a distributed systems course; each chapter includes exercises and further reading, with lecture slides and more available online.