A Declarative Framework for Big Graph Analytics and Their Provenance

Author: Vasiliki Papavasileiou
Publisher:
ISBN:
Category :
Languages : en
Pages : 127

Book Description
Recent years have witnessed an explosion in the size of graph data and the complexity of graph analytics in fields such as social and mobile networks, science, and advertising. Analyzing and extracting knowledge from Big Graphs (in analogy to Big Data) is hard. The size of Big Graphs necessitates the use of distributed infrastructures and parallel programming. Moreover, implementing performant and correct analytics requires in-depth knowledge of both the algorithm and the input data. Developers of graph analytics face two major challenges: i) There is a myriad of Big Graph processing frameworks, each using a different imperative programming language and implementing different low-level optimizations. Developers are burdened with understanding the low-level characteristics of the execution framework that best suits their algorithms and data. ii) Assessing the quality of both data and analytics is a tedious and manual task. Devising new graph analytics is an iterative process in which developers incrementally refine their algorithms and clean their data by analyzing results, correcting errors, and running again until the end results are satisfactory. In this dissertation we offer a declarative framework that addresses the entire life-cycle, from designing to executing, of Big Graph analytics. Our approach uses a single language for both authoring graph analytics and fine-tuning them. Specifically, this dissertation makes the following two main contributions: We design and demonstrate Datalography, the first approach for declarative graph analytics on Vertex-Centric graph processing engines. To accommodate different programming models, we design and implement a compiler that takes general Datalog queries and rewrites them into distribution-aware queries that can be efficiently evaluated on any Vertex-Centric framework.
Moreover, our compiler implements automatic optimizations, transparent to the user, in the form of logical query rewritings; these optimizations are thus portable to any Vertex-Centric system. We demonstrate the effectiveness of our approach with an experimental evaluation on real-world graphs that indicates Datalography offers superior performance when compared to native, imperative implementations. Our second contribution is a novel provenance management approach that enables developers to customize provenance capturing and analysis with twofold benefits: the amount of captured provenance is minimized to include only the necessary information and analysis is extended beyond the traditional tracing queries. We present formal semantics of our provenance query language, based on Datalog, and identify an important class of queries that can be evaluated online, simultaneously with the graph analytic. We showcase our approach with Ariadne, a provenance management system that supports efficient debugging, auditing and fine-tuning of graph analytics.

Systems for Big Graph Analytics

Author: Da Yan
Publisher: Springer
ISBN: 3319582178
Category : Computers
Languages : en
Pages : 93

Book Description
There has been a surging interest in developing systems for analyzing big graphs generated by real applications, such as online social networks and knowledge graphs. This book aims to help readers get familiar with the computation models of various graph processing systems with minimal time investment. This book is organized into three parts, addressing three popular computation models for big graph analytics: think-like-a-vertex, think-like-a-graph, and think-like-a-matrix. While vertex-centric systems have gained great popularity, the latter two models are currently being actively studied to solve graph problems that cannot be efficiently solved in the vertex-centric model, and are the promising next-generation models for big graph analytics. For each part, the authors introduce the state-of-the-art systems, emphasizing both their technical novelties and hands-on experiences of using them. The systems introduced include Giraph, Pregel+, Blogel, GraphLab, GraphChi, X-Stream, Quegel, SystemML, etc. Readers will learn how to design graph algorithms in various graph analytics systems, and how to choose the most appropriate system for a particular application at hand. The target audience for this book includes beginners who are interested in using a big graph analytics system, and students, researchers and practitioners who would like to build their own graph analytics systems with new features.

Practical Graph Analytics with Apache Giraph

Author: Roman Shaposhnik
Publisher: Apress
ISBN: 1484212517
Category : Computers
Languages : en
Pages : 320

Book Description
Practical Graph Analytics with Apache Giraph helps you build data mining and machine learning applications using the Apache Foundation’s Giraph framework for graph processing. This is the same framework as used by Facebook, Google, and other social media analytics operations to derive business value from vast amounts of interconnected data points. Graphs arise in a wealth of data scenarios and describe the connections that are naturally formed in both digital and real worlds. Examples of such connections abound in online social networks such as Facebook and Twitter, among users who rate movies from services like Netflix and Amazon Prime, and are useful even in the context of biological networks for scientific research. Whether in the context of business or science, viewing data as connected adds value by increasing the amount of information available to be drawn from that data and put to use in generating new revenue or scientific opportunities. Apache Giraph offers a simple yet flexible programming model targeted to graph algorithms and designed to scale easily to accommodate massive amounts of data. Originally developed at Yahoo!, Giraph is now a top-level project at the Apache Foundation, and it enlists contributors from companies such as Facebook, LinkedIn, and Twitter. Practical Graph Analytics with Apache Giraph brings the power of Apache Giraph to you, showing how to harness the power of graph processing for your own data by building sophisticated graph analytics applications using the very same framework that is relied upon by some of the largest players in the industry today.

Distributed Graph Analytics

Author: Unnikrishnan Cheramangalath
Publisher: Springer Nature
ISBN: 3030418863
Category : Computers
Languages : en
Pages : 207

Book Description
This book brings together two important trends: graph algorithms and high-performance computing. Efficient and scalable execution of graph processing applications in data or network analysis requires innovations at multiple levels: algorithms, associated data structures, their implementation and tuning to a particular hardware. Further, programming languages and the associated compilers play a crucial role when it comes to automating efficient code generation for various architectures. This book discusses the essentials of all these aspects. The book is divided into three parts: programming, languages, and their compilation. The first part examines the manual parallelization of graph algorithms, revealing various parallelization patterns encountered, especially when dealing with graphs. The second part uses these patterns to provide language constructs that allow a graph algorithm to be specified. Programmers can work with these language constructs without worrying about their implementation, which is the focus of the third part. Implementation is handled by a compiler, which can specialize code generation for a backend device. The book also includes suggestive results on different platforms, which illustrate and justify the theory and practice covered. Together, the three parts provide the essential ingredients for creating a high-performance graph application. The book ends with a section on future directions, which offers several pointers to promising topics for future research. This book is intended for new researchers as well as graduate and advanced undergraduate students. Most of the chapters can be read independently by those familiar with the basics of parallel programming and graph algorithms. However, to make the material more accessible, the book includes a brief background on elementary graph algorithms, parallel computing and GPUs. 
Moreover, it presents a case study using Falcon, a domain-specific language for graph algorithms, to illustrate the concepts.

Big Graph Analytics Platforms

Author: Da Yan
Publisher:
ISBN:
Category :
Languages : en
Pages : 195

Book Description


On Software Infrastructure for Scalable Graph Analytics

Author: Yingyi Bu
Publisher:
ISBN: 9781339124087
Category :
Languages : en
Pages : 129

Book Description
Recently, there has been a growing need for distributed graph processing systems that are capable of gracefully scaling to very large datasets. In the meantime, in real-world applications, it is highly desirable to reduce the tedious, inefficient ETL (extract, transform, load) gap between tabular data processing systems and graph processing systems. Unfortunately, those challenges have not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow, as well as the separation of tabular data processing runtimes and graph processing runtimes. In this thesis, we explore the application of programming techniques and algorithms from the database systems world to the problem of scalable graph analysis. We first propose a bloat-aware design paradigm towards the development of efficient and scalable Big Data applications in object-oriented, GC-enabled languages and demonstrate that programming under this paradigm does not incur significant programming burden but obtains remarkable performance gains (e.g., 2.5X). Based on the design paradigm, we then build Pregelix, an open source distributed graph processing system which is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15X speedup compared to Apache Giraph and up to 35X speedup compared to distributed GraphLab). Finally, we integrate Pregelix with the open source Big Data management system AsterixDB to offer users a mix of a vertex-oriented programming model and a declarative query language for richer forms of Big Graph analytics with reduced ETL pains.

Big Graph Analytics Platforms

Author: Da Yan
Publisher:
ISBN: 9781680832426
Category : Computers
Languages : en
Pages : 218

Book Description
A comprehensive survey that clearly summarizes the key features and techniques developed in existing big graph systems. It aims to help readers get a systematic picture of the landscape of recent big graph systems, focusing not just on the systems themselves, but also on the key innovations and design philosophies underlying them.

Big Graph Analytics on Just A Single PC

Author: Kai Wang
Publisher:
ISBN:
Category :
Languages : en
Pages : 146

Book Description
As graph data becomes ubiquitous in modern computing, developing systems to efficiently process large graphs has gained increasing popularity. There are two major types of analytical problems over large graphs: graph computation and graph mining. Graph computation includes a set of problems that can be represented through linear algebra over an adjacency-matrix-based representation of the graph. Graph mining aims to discover complex structural patterns of a graph, for example, finding relationship patterns in social media networks or detecting link spam in web data. Due to their importance in machine learning, web applications and social media, graph analytical problems have been extensively studied in the past decade. Practical solutions have been implemented in a wide variety of graph analytical systems. However, most of the existing systems for graph analytics are distributed frameworks, which suffer from one or more of the following drawbacks: (1) many of the (current and future) users performing graph analytics will be domain experts with limited computer science background. They are faced with the challenge of managing a cluster, which involves tasks such as data partitioning and fault tolerance they are not familiar with; (2) not all users have access to an enterprise cluster in their daily development tasks; (3) distributed graph systems commonly suffer from large startup and communication overhead; and (4) load balancing in a distributed system is another major challenge. Some graph algorithms have dynamic working sets and it is thus hard to distribute the workload appropriately before the execution.
In this dissertation, we identify three categories of graph workloads for which single-machine systems are more suitable than distributed systems: (1) analytical queries that do not need exact answers; (2) program analysis tasks that are widely used to find bugs in real-world software; and (3) graph mining algorithms that are important for many information-retrieval tasks. Based on these observations, we have developed a set of single-machine graph systems to deliver efficiency and scalability specifically for these workloads. In particular, this dissertation makes the following contributions. The first contribution is the design and implementation of a single-machine graph query system named GraphQ, which divides a large graph into partitions and merges them with guidance from an abstraction graph. By using multiple levels of abstraction, it can quickly rule out infeasible solutions and identify mergeable partitions. GraphQ uses the memory capacity as a budget and tries its best to find solutions before exhausting the memory, making it possible to answer analytical queries over very large graphs with resources affordable to a single PC. The second contribution is the design and implementation of Graspan, a single-machine, disk-based graph processing system tailored for interprocedural static analyses. Given a program graph and a grammar specification of an analysis, Graspan uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. With the help of novel graph processing techniques, we turn sophisticated code analyses into scalable Big Graph analytics. The third contribution of this dissertation is a single-machine, out-of-core graph mining system, called RStream, which leverages disk support to enable efficient edge streaming for mining very large graphs.
RStream employs a rich programming model that exposes relational algebra for developers to express a wide variety of mining tasks and implements a runtime engine that delivers efficiency with tuple streaming. In conclusion, this dissertation attempts to explore the opportunities of building single-machine graph systems for scenarios where distributed systems do not work well. Our experimental results demonstrate that the techniques proposed in this dissertation can efficiently solve big graph analytical problems on a single consumer PC. We hope that these promising results will encourage future work to continue building affordable single-machine systems for a rich set of datasets and analytical tasks.

Declarative Languages and Scalable Systems for Graph Analytics and Knowledge Discovery

Author: Mohan Yang
Publisher:
ISBN:
Category :
Languages : en
Pages : 157

Book Description
The growing importance of data science applications has motivated great research interest in powerful languages and scalable systems for supporting advanced analytics on massive data sets. Languages such as R and Scala are used to develop advanced analytical applications that are not supported by SQL, the traditional query language used for decades to search the database and analyze its data. An interesting research question that arises in this scenario is whether it is possible to design an efficient query language that simplifies the writing of advanced analytical applications and provides a unified environment for their development and deployment on multiple platforms, including massively parallel ones. In this thesis, we provide a positive answer to this question by demonstrating extensions of the logic-based query language Datalog and their implementation techniques to enable (i) scalable support for graph analytics and knowledge discovery applications, and (ii) portability between multicore machines and clusters. A first set of extensions discussed in this thesis is based on monotonic aggregates and led to the implementation of our Deductive Application Language (DeAL) system which (i) achieves superior performance for graph analytics applications compared with other Datalog systems on multicore machines, and (ii) outperforms other distributed Datalog systems, as well as both GraphX and native Apache Spark. We then tackle the difficult problem of supporting knowledge discovery applications, by introducing non-monotonic extensions to support generic user-defined aggregates, for which we provide a formal logic-based semantics. The Knowledge Discovery in Datalog (KDDlog) language so derived can express efficiently both descriptive analytics, such as rollups and data cubes, and predictive analytics, such as association rule mining, classification, regression analysis, and cluster analysis.

Provenance in Databases

Author: James Cheney
Publisher: Now Publishers Inc
ISBN: 1601982321
Category : Computers
Languages : en
Pages : 111

Book Description
Reviews research over the past ten years on why, how, and where provenance, clarifies the relationships among these notions of provenance, and describes some of their applications in confidence computation, view maintenance and update, debugging, and annotation propagation.