Distributed Fault-tolerance Techniques for Local Computations

Distributed Fault-tolerance Techniques for Local Computations PDF Author: Brahim Hamid
Publisher:
ISBN:
Category :
Languages : en
Pages : 157

Book Description
This work is a theoretical and practical contribution to the study of fault tolerance in distributed systems, and it aims to specify the impact of graph theory in this area. More precisely, we focus on process failures, considering both transient failures and crash failures. Our goal is to study these systems in the context of locality: the only computations allowed are those based on local information. This amounts to capturing the notion of detecting and correcting "illegal" configurations that result from an arbitrary system state. As models, we use local computations and message passing. Our constructions are based on the most widespread techniques in this field: failure detection and self-stabilization. By combining these two techniques and keeping locality as the ultimate goal, we build a new tool for transforming a non-fault-tolerant algorithm into an equivalent fault-tolerant one, whose proof of correctness is derived from that of the original. In addition, we extended the Visidia platform to simulate failures. Transient failures are simulated simply through views that allow the state of nodes to be changed. For crash failures, first, the failure detector is integrated into Visidia, together with an interface for measuring its performance so as to reach the expected behaviours. Second, through views, the user can stop the work of a node and thereby simulate the failure of a process. To conclude this thesis, we call on graph theory to support fault tolerance in two respects. First, we present a new formalization of graph connectivity testing using the notion of succorer sons ("fils de secours"). A distributed, local algorithm for testing 2-connectivity is proposed in both models. This result sheds light on how a bridge can be built between these two models. Second, the computation of succorer sons is extended to study the maintenance of forests of spanning trees in the presence of crash failures.

Fault-Tolerance Techniques for High-Performance Computing

Fault-Tolerance Techniques for High-Performance Computing PDF Author: Thomas Herault
Publisher: Springer
ISBN: 3319209434
Category : Computers
Languages : en
Pages : 325

Book Description
This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.

Methods, Models and Tools for Fault Tolerance

Methods, Models and Tools for Fault Tolerance PDF Author: Michael Butler
Publisher: Springer
ISBN: 3642008674
Category : Computers
Languages : en
Pages : 350

Book Description
The growing complexity of modern software systems increases the difficulty of ensuring the overall dependability of software-intensive systems. Complexity of environments, in which systems operate, high dependability requirements that systems have to meet, as well as the complexity of infrastructures on which they rely make system design a true engineering challenge. Mastering system complexity requires design techniques that support clear thinking and rigorous validation and verification. Formal design methods help to achieve this. Coping with complexity also requires architectures that are tolerant of faults and of unpredictable changes in environment. This issue can be addressed by fault-tolerant design techniques. Therefore, there is a clear need of methods enabling rigorous modelling and development of complex fault-tolerant systems. This book addresses such acute issues in developing fault-tolerant systems as:
- Verification and refinement of fault-tolerant systems
- Integrated approaches to developing fault-tolerant systems
- Formal foundations for error detection, error recovery, exception and fault handling
- Abstractions, styles and patterns for rigorous development of fault tolerance
- Fault-tolerant software architectures
- Development and application of tools supporting rigorous design of dependable systems
- Integrated platforms for developing dependable systems
- Rigorous approaches to specification and design of fault tolerance in novel computing systems
The editors of this book were involved in the EU (FP-6) project RODIN (Rigorous Open Development Environment for Complex Systems), which brought together researchers from the fault tolerance and formal methods communities. In 2007 RODIN organized the MeMoT workshop held in conjunction with the Integrated Formal Methods 2007 Conference at Oxford University.

Fault Tolerance

Fault Tolerance PDF Author: Peter A. Lee
Publisher: Springer Science & Business Media
ISBN: 370918990X
Category : Computers
Languages : en
Pages : 326

Book Description
The production of a new version of any book is a daunting task, as many authors will recognise. In the field of computer science, the task is made even more daunting by the speed with which the subject and its supporting technology move forward. Since the publication of the first edition of this book in 1981 much research has been conducted, and many papers have been written, on the subject of fault tolerance. Our aim then was to present for the first time the principles of fault tolerance together with current practice to illustrate those principles. We believe that the principles have (so far) stood the test of time and are as appropriate today as they were in 1981. Much work on the practical applications of fault tolerance has been undertaken, and techniques have been developed for ever more complex situations, such as those required for distributed systems. Nevertheless, the basic principles remain the same.

Fault Tolerance in Distributed Systems

Fault Tolerance in Distributed Systems PDF Author: Pankaj Jalote
Publisher: Prentice Hall
ISBN:
Category : Computers
Languages : en
Pages : 456

Book Description
Fault tolerance is an approach by which the reliability of a computer system can be increased beyond what can be achieved by traditional methods. Comprehensive and self-contained, this book explores the information available on software-supported fault tolerance techniques, with a focus on fault tolerance in distributed systems.

Recoverable Computations

Recoverable Computations PDF Author: Samantha Jayaba W. Edirisooriya
Publisher:
ISBN:
Category :
Languages : en
Pages : 254

Book Description


Communication-efficient and Fault-tolerant Algorithms for Distributed Machine Learning

Communication-efficient and Fault-tolerant Algorithms for Distributed Machine Learning PDF Author: Farzin Haddadpour
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Distributed computing over multiple nodes has been emerging in practical systems. Compared with classical single-node computation, distributed computing offers higher computing speeds over large data. However, the computation delay of the overall distributed system is controlled by its slowest nodes, i.e., straggler nodes. Furthermore, when running iterative algorithms such as gradient-descent-based algorithms, communication cost becomes a bottleneck. It is therefore important to design coded strategies that are robust to straggler nodes and, at the same time, communication-efficient. Recent work has developed coding-theoretic approaches that add redundancy to distributed matrix-vector multiplications with the goal of speeding up the computation by mitigating the straggler effect in distributed computing.
First, we consider the case where the matrix comes from a small (e.g., binary) alphabet, where a variant of a popular method called the "Four-Russians method" is known to have significantly lower computational complexity than the usual matrix-vector multiplication algorithm. We develop novel code constructions that are applicable to binary matrix-vector multiplication via a variant of the Four-Russians method called the Mailman algorithm. Specifically, in our constructions, the encoded matrices have a small alphabet, which ensures lower computational complexity as well as good straggler tolerance. We also present a trade-off between the communication and computation costs of distributed coded matrix-vector multiplication for general, possibly non-binary, matrices.
Second, we provide novel coded computation strategies, called MatDot, for distributed matrix-matrix products that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers, at the cost of higher computation cost per worker and higher communication cost from each worker to the fusion node. We also demonstrate a novel coding technique for multiplying n matrices (n ≥ 3) using ideas from MatDot codes.
Third, we introduce the idea of cross-iteration coded computing, an approach to reducing communication costs for a large class of distributed iterative algorithms involving linear operations, including gradient descent and accelerated gradient descent for quadratic loss functions. The state-of-the-art approach for these iterative algorithms involves performing one iteration of the algorithm per round of communication among the nodes. In contrast, our approach performs multiple iterations of the underlying algorithm in a single round of communication by incorporating some redundancy in storage and computation. Our algorithm works in the master-worker setting, with the workers storing carefully constructed linear transformations of input matrices and using these matrices in an iterative algorithm, and with the master node inverting the effect of these linear transformations. In addition to reduced communication costs, a trivial generalization of our algorithm also provides resilience to stragglers and failures as well as to Byzantine worker nodes. We also show a special case of our algorithm that trades off between communication and computation. The degree of redundancy of our algorithm can be tuned based on the amount of communication and straggler resilience required. Moreover, we describe a variant of our algorithm that can flexibly recover the results based on the degree of straggling in the worker nodes. The variant allows the performance to degrade gracefully as the number of successful (non-straggling) workers is lowered.
Communication overhead is one of the key challenges that hinder the scalability of distributed optimization algorithms for training large neural networks. In recent years, there has been a great deal of research on alleviating communication cost by compressing the gradient vector or by using local updates and periodic model averaging. The next direction in this thesis advocates the use of redundancy towards communication-efficient distributed stochastic algorithms for non-convex optimization. In particular, we show, both theoretically and practically, that by properly infusing redundancy into the training data with model averaging, it is possible to significantly reduce the number of communication rounds. To be more precise, we show that redundancy reduces residual error in local averaging, thereby reaching the same level of accuracy with fewer rounds of communication than previous algorithms. Empirical studies on the CIFAR10, CIFAR100 and ImageNet datasets in a distributed environment complement our theoretical results; they show that our algorithms have additional beneficial aspects, including tolerance to failures as well as greater gradient diversity.
Next, we study local distributed SGD, where data is partitioned among computation nodes and the computation nodes perform local updates while periodically exchanging the model among the workers to perform averaging. While local SGD has been shown empirically to provide promising results, a theoretical understanding of its performance remains open. We strengthen the convergence analysis for local SGD and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Łojasiewicz (PL) condition, O((pT)^(1/3)) rounds of communication suffice to achieve a linear speed-up, that is, an error of O(1/(pT)), where p is the number of workers and T is the total number of model updates at each worker. This contrasts with previous work, which required a higher number of communication rounds and was limited to strongly convex loss functions for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed-up. We validate the theory with experimental results, running over AWS EC2 clouds and an internal GPU cluster.
In the final section, we focus on federated learning, where communication cost is often a critical bottleneck in scaling up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends for dealing with the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodic compressed (quantized or sparsified) communication and by analyzing their convergence properties in both homogeneous and heterogeneous local data distribution settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and non-convex settings. We complement our theoretical results by demonstrating the effectiveness of our proposed methods on real-world datasets.
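
The general idea of adding redundancy to a distributed matrix-vector product so that a straggler can be ignored can be illustrated with a toy example. The sketch below is not the construction from this thesis (it is neither the Mailman-based code nor MatDot); it is the simplest textbook parity scheme, assuming three workers, a matrix with an even number of rows, and exact arithmetic. The names `encode`, `decode` and `matvec` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product; this is the work one worker performs."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def encode(a: Matrix) -> Dict[int, Matrix]:
    """Split A into two row blocks A1, A2 and add a parity block A1 + A2.

    Worker 0 stores A1, worker 1 stores A2, worker 2 stores A1 + A2.
    Assumption for this sketch: A has an even number of rows.
    """
    if len(a) % 2 != 0:
        raise ValueError("this sketch assumes an even number of rows")
    half = len(a) // 2
    a1, a2 = a[:half], a[half:]
    parity = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return {0: a1, 1: a2, 2: parity}


def decode(results: Dict[int, Vector]) -> Vector:
    """Recover A @ x from the outputs of any two of the three workers."""
    if len(results) < 2:
        raise ValueError("at least two worker results are needed")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        # Worker 1 straggled: A2 x = (A1 + A2) x - A1 x
        y1 = results[0]
        y2 = [p - u for p, u in zip(results[2], y1)]
    else:
        # Worker 0 straggled: A1 x = (A1 + A2) x - A2 x
        y2 = results[1]
        y1 = [p - v for p, v in zip(results[2], y2)]
    return y1 + y2
```

In use, the master would call `encode` once, send one block to each worker, and call `decode` as soon as any two results arrive. The price of tolerating one straggler here is that three workers each process half of the matrix instead of two.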

Introduction To Quantum Computation And Information

Introduction To Quantum Computation And Information PDF Author: Adriano Barenco
Publisher: World Scientific
ISBN: 9814496359
Category : Science
Languages : en
Pages : 364

Book Description
This book aims to provide a pedagogical introduction to the subjects of quantum information and quantum computation. Topics include non-locality of quantum mechanics, quantum computation, quantum cryptography, quantum error correction, fault-tolerant quantum computation as well as some experimental aspects of quantum computation and quantum cryptography. Only knowledge of basic quantum mechanics is assumed. Whenever more advanced concepts and techniques are used, they are introduced carefully. This book is meant to be a self-contained overview. While basic concepts are discussed in detail, unnecessary technical details are excluded. It is well-suited for a wide audience ranging from physics graduate students to advanced researchers. This book is based on a lecture series held at Hewlett-Packard Labs, Basic Research Institute in the Mathematical Sciences (BRIMS), Bristol from November 1996 to April 1997, and also includes other contributions.

Fault-Tolerant Parallel and Distributed Systems

Fault-Tolerant Parallel and Distributed Systems PDF Author: Dimiter R Avresky
Publisher:
ISBN: 9781461554509
Category :
Languages : en
Pages : 420

Book Description


Design And Analysis Of Reliable And Fault-tolerant Computer Systems

Design And Analysis Of Reliable And Fault-tolerant Computer Systems PDF Author: Mostafa I Abd-el-barr
Publisher: World Scientific
ISBN: 190897978X
Category : Computers
Languages : en
Pages : 463

Book Description
Covering both the theoretical and practical aspects of fault-tolerant mobile systems, and fault tolerance and analysis, this book tackles the current issues of reliability-based optimization of computer networks, fault-tolerant mobile systems, and fault tolerance and reliability of high-speed and hierarchical networks. The book is divided into six parts to facilitate coverage of the material by course instructors and computer systems professionals. The sequence of chapters in each part ensures the gradual coverage of issues from the basics to the most recent developments. A useful set of references, including electronic sources, is listed at the end of each chapter.