Declarative Languages and Scalable Systems for Graph Analytics and Knowledge Discovery

Author: Mohan Yang
Publisher:
ISBN:
Category :
Languages : en
Pages : 157

Book Description
The growing importance of data science applications has motivated great research interest in powerful languages and scalable systems for supporting advanced analytics on massive data sets. Languages such as R and Scala are used to develop advanced analytical applications that are not supported by SQL, the traditional query language used for decades to search the database and analyze its data. An interesting research question that arises in this scenario is whether it is possible to design an efficient query language that simplifies the writing of advanced analytical applications and provides a unified environment for their development and deployment on multiple platforms, including massively parallel ones. In this thesis, we provide a positive answer to this question by demonstrating extensions of the logic-based query language Datalog and their implementation techniques to enable (i) scalable support for graph analytics and knowledge discovery applications, and (ii) portability between multicore machines and clusters. A first set of extensions discussed in this thesis is based on monotonic aggregates and led to the implementation of our Deductive Application Language (DeAL) system which (i) achieves superior performance for graph analytics applications compared with other Datalog systems on multicore machines, and (ii) outperforms other distributed Datalog systems, as well as both GraphX and native Apache Spark. We then tackle the difficult problem of supporting knowledge discovery applications, by introducing non-monotonic extensions to support generic user-defined aggregates, for which we provide a formal logic-based semantics. The Knowledge Discovery in Datalog (KDDlog) language so derived can express efficiently both descriptive analytics, such as rollups and data cubes, and predictive analytics, such as association rule mining, classification, regression analysis, and cluster analysis.
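To make the idea of a monotonic aggregate inside recursion concrete, here is a minimal Python sketch of single-source shortest paths evaluated in a semi-naive style. It is an illustration only, not the DeAL system's implementation. The Datalog-like rules in the docstring use an assumed, simplified syntax, and the function name `shortest_paths` is hypothetical. The sketch assumes non-negative edge weights.

```python
from collections import defaultdict


def shortest_paths(edges, source):
    """Semi-naive evaluation of the (informally written) rules

         sp(source, 0).
         sp(Y, min<C + W>) <- sp(X, C), edge(X, Y, W).

    Only the current minimum cost per node is kept, and each round
    joins the edges with the facts that improved in the previous round.
    """
    adj = defaultdict(list)
    for x, y, w in edges:
        adj[x].append((y, w))

    best = {source: 0}   # current minimum cost per node
    delta = {source: 0}  # facts that changed in the last round
    while delta:
        new_delta = {}
        for x, c in delta.items():
            for y, w in adj[x]:
                cand = c + w
                # Keep the candidate only if it improves on everything seen so far.
                if cand < best.get(y, float("inf")) and cand < new_delta.get(y, float("inf")):
                    new_delta[y] = cand
        best.update(new_delta)
        delta = new_delta
    return best


if __name__ == "__main__":
    demo_edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
    print(shortest_paths(demo_edges, "a"))
```

Because the minimum can only decrease from round to round, the loop stops once no node improves, which is the fixpoint of the recursive rule.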

Declarative Languages and Systems for Transparency, Performance and Scalability in Database Analytics

Author: Youfu Li
Publisher:
ISBN:
Category :
Languages : en
Pages : 135

Book Description
Demand for powerful, high-performance analytics on Big Data is ever growing. Developing tools and methodologies for advanced database analytics, such as data mining applications, has long been an active area of research that has posed elusive challenges to both academia and industry, on topics that include: 1) design of expressive high-level languages with declarative semantics for data analytics, 2) optimization and parallelization for efficient and scalable execution, and 3) transparency of analytics dataflow for error tracking and debugging. This thesis proposes methods and tools for developing powerful data analytics systems based on declarative languages, dataflow inspection and query optimization. By leveraging and integrating these tools we obtain i) a scalable data analytics framework for knowledge discovery through concise, declarative queries, ii) a unified solution that enables analytics dataflow inspection and further supports provenance and debugging for data analytics applications, and iii) an integrated runtime query optimizer that generates optimal execution plans for data analytics queries and achieves superior performance in application areas that had posed major challenges for traditional database technology. In particular, our KDDLog system enables users to build or customize knowledge discovery models in a concise and expressive language, via recursive queries with aggregates and our newly proposed chain aggregates. We further provide specialized compilation techniques for semi-naive fixpoint computation in the presence of aggregates, optimizations for complex recursive queries on distributed data platforms, KDDLib to build knowledge discovery tasks, and advanced interfaces to help users port new knowledge discovery models. Following KDDLog, we present SEIZE, a unified framework that enables dataflow inspection, that is, wiretapping the data path of data analytics applications with listening logic. We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. Finally, we propose RIOS, a runtime integrated query optimizer for data analytics that lazily binds to execution plans at runtime, after collecting the statistics needed to make better decisions. A specific focus in our design is to obtain accurate estimates of predicate (including UDF) selectivities for determining an optimal join order and physical join implementation, without incurring significant runtime overheads.

On Software Infrastructure for Scalable Graph Analytics

Author: Yingyi Bu
Publisher:
ISBN: 9781339124087
Category :
Languages : en
Pages : 129

Book Description
Recently, there has been a growing need for distributed graph processing systems that are capable of gracefully scaling to very large datasets. Meanwhile, in real-world applications, it is highly desirable to reduce the tedious, inefficient ETL (extract, transform, load) gap between tabular data processing systems and graph processing systems. Unfortunately, those challenges have not been easily met due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow, as well as the separation of tabular data processing runtimes and graph processing runtimes. In this thesis, we explore the application of programming techniques and algorithms from the database systems world to the problem of scalable graph analysis. We first propose a bloat-aware design paradigm for the development of efficient and scalable Big Data applications in object-oriented, GC-enabled languages and demonstrate that programming under this paradigm does not incur a significant programming burden but obtains remarkable performance gains (e.g., 2.5X). Based on the design paradigm, we then build Pregelix, an open source distributed graph processing system which is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15X speedup compared to Apache Giraph and up to 35X speedup compared to distributed GraphLab). Finally, we integrate Pregelix with the open source Big Data management system AsterixDB to offer users a mix of a vertex-oriented programming model and a declarative query language for richer forms of Big Graph analytics with reduced ETL pains.

Big Data Analytics and Knowledge Discovery

Author: Carlos Ordonez
Publisher: Springer
ISBN: 3030275205
Category : Computers
Languages : en
Pages : 323

Book Description
This book constitutes the refereed proceedings of the 21st International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2019, held in Linz, Austria, in September 2019. The 12 full papers and 10 short papers presented were carefully reviewed and selected from 61 submissions. The papers are organized in the following topical sections: Applications; patterns; RDF and streams; big data systems; graphs and machine learning; databases.

Declarative Frameworks and Optimization Techniques for Developing Scalable Advanced Analytics Over Databases and Data Streams

Author: Ariyam Das
Publisher:
ISBN:
Category :
Languages : en
Pages : 127

Book Description
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-Mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore, we can preserve the declarative formal semantics of such programs, while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms. In this work, we show that using PreM, a wide spectrum of classical algorithms, ranging from graph analytics and dynamic programming based optimization problems to data mining, machine learning and online streaming applications, can be concisely expressed in declarative languages by using aggregates in recursion. We present a concise analysis of this very general property and characterize its different manifestations for different constraints and rules. Next, we prove that PreM-optimized plans are easily parallelizable and produce the same results as the single executor programs. Thus, PreM can be trivially assimilated into the data-parallel computation plans of different distributed systems, irrespective of whether these follow bulk synchronous parallel (BSP) or asynchronous computing models. This makes it possible for many advanced Big Data applications to now be expressed declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while enabling their execution with superior performance and scalability as compared to other specialized systems. Furthermore, we show that under PreM nonlinear recursive queries can be evaluated using a hybrid stale synchronous parallel (SSP) model with relaxed synchronization on distributed environments. We present empirical evidence of its benefits. We also compare the usability, expressivity and performance of PreM-optimized queries with queries written in quasi-declarative programming methodologies inspired by procedural languages like XY-stratification to showcase the different trade-offs and ramifications associated with each. Lastly, we present robust online optimization techniques using two popular case studies, namely online lossless frequent pattern mining and online decision tree construction, to show how compact representations and statistical approximations can deliver superior performance in real time for several streaming data mining and machine learning applications.
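The PreM property can be checked mechanically on small inputs. The sketch below assumes the usual formulation, in which a constraint γ is pre-mappable to an immediate consequence operator T when γ(T(I)) = γ(T(γ(I))) for every set of facts I. The abstract does not spell out the definition, so this formulation is an assumption of the sketch. The names `gamma`, `make_T` and `prem_holds` are hypothetical, and the rules in the docstring use an informal Datalog-like syntax. A single passing check is evidence for one input, not a proof of the property.

```python
def gamma(facts):
    """Min constraint: keep only the lowest cost recorded for each node."""
    best = {}
    for node, cost in facts:
        if cost < best.get(node, float("inf")):
            best[node] = cost
    return set(best.items())


def make_T(edges, source):
    """Build the immediate consequence operator for the rules

         path(source, 0).
         path(Y, C + W) <- path(X, C), edge(X, Y, W).
    """
    def T(facts):
        out = {(source, 0)}
        for x, c in facts:
            for ex, y, w in edges:
                if ex == x:
                    out.add((y, c + w))
        return out
    return T


def prem_holds(T, facts):
    """Check gamma(T(I)) == gamma(T(gamma(I))) for one interpretation I."""
    return gamma(T(facts)) == gamma(T(gamma(facts)))


if __name__ == "__main__":
    edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]
    T = make_T(edges, "a")
    I = {("a", 0), ("a", 3), ("b", 1), ("b", 4)}
    print(prem_holds(T, I))
```

When the check holds for all inputs, the min can be applied at every step of the recursion instead of once at the end, so non-minimal path facts never need to be materialized.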

Business Intelligence

Author: Patrick Marcel
Publisher: Springer
ISBN: 331961164X
Category : Business & Economics
Languages : en
Pages : 148

Book Description
This book constitutes the tutorial lectures of the 6th European Business Intelligence and Big Data Summer School, eBISS 2016, held in Tours, France, in July 2016. Tutorials were given by renowned experts and covered recent and various aspects of Business Intelligence and Big Data processing, including analytics on graph data, machine translation, pattern mining, scalability, and energy consumption. This volume contains the corresponding lecture notes of the summer school.

A Declarative Framework for Big Graph Analytics and Their Provenance

Author: Vasiliki Papavasileiou
Publisher:
ISBN:
Category :
Languages : en
Pages : 127

Book Description
Recent years have witnessed an explosion in the size of graph data and the complexity of graph analytics in fields such as social and mobile networks, science and advertising. Analyzing and extracting knowledge from Big Graphs (in analogy to Big Data) is hard. The size of Big Graphs necessitates the use of distributed infrastructures and parallel programming. Moreover, implementing performant and correct analytics requires in-depth knowledge of both the algorithm and the input data. Developers of graph analytics face two major challenges: i) There is a myriad of Big Graph processing frameworks, each of which uses a different imperative programming language and implements different low-level optimizations. Developers are burdened with understanding the low-level characteristics of the execution framework that best suits their algorithms and data. ii) Assessing the quality of both data and analytics is a tedious and manual task. Devising new graph analytics is an iterative process, where developers incrementally refine their algorithms and clean their data by analyzing results, correcting errors and rerunning until the end results are satisfactory. In this dissertation we offer a declarative framework that addresses the entire life-cycle, from designing to executing, of Big Graph analytics. Our approach uses a single language for both authoring graph analytics and fine-tuning them. Specifically, this dissertation makes the following two main contributions: We design and demonstrate Datalography, the first approach for declarative graph analytics on Vertex-Centric graph processing engines. To accommodate different programming models, we design and implement a compiler that takes general Datalog queries and rewrites them into distribution-aware queries that can be efficiently evaluated on any Vertex-Centric framework. Moreover, our compiler implements automatic optimizations that are transparent to the user, in the form of logical query rewritings, and are thus portable to any Vertex-Centric system. We demonstrate the effectiveness of our approach with an experimental evaluation on real-world graphs that indicates Datalography offers superior performance when compared to native, imperative implementations. Our second contribution is a novel provenance management approach that enables developers to customize provenance capturing and analysis with twofold benefits: the amount of captured provenance is minimized to include only the necessary information, and analysis is extended beyond the traditional tracing queries. We present formal semantics of our provenance query language, based on Datalog, and identify an important class of queries that can be evaluated online, simultaneously with the graph analytic. We showcase our approach with Ariadne, a provenance management system that supports efficient debugging, auditing and fine-tuning of graph analytics.
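As a rough picture of what evaluating a recursive Datalog query on a Vertex-Centric engine involves, the following Python sketch runs connected components as synchronous supersteps with message passing. It is a hand-written illustration, not output of the Datalography compiler. The single-process loop stands in for a distributed engine, the rules in the comments use an informal Datalog-like syntax, and the function name `connected_components` is hypothetical.

```python
def connected_components(vertices, edges):
    """Vertex-centric evaluation (synchronous supersteps) of the rules

         cc(X, min<X>) <- vertex(X).
         cc(Y, min<L>) <- cc(X, L), edge(X, Y).

    Edges are treated as undirected. Each vertex holds one label and
    sends it to its neighbors only when the label has just decreased.
    """
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in vertices}

    # Superstep 0: every vertex announces its initial label.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        inbox = outbox
    return label


if __name__ == "__main__":
    print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

Sending a message only when a label changes mirrors the delta-driven style of semi-naive evaluation: a vertex that learned nothing new in a superstep stays silent in the next one.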

Knowledge Graphs and Big Data Processing

Author: Valentina Janev
Publisher: Springer Nature
ISBN: 3030531996
Category : Computers
Languages : en
Pages : 212

Book Description
This open access book is part of the LAMBDA Project (Learning, Applying, Multiplying Big Data Analytics), funded by the European Union, GA No. 809965. Data Analytics involves applying algorithmic processes to derive insights. Nowadays it is used in many industries to allow organizations and companies to make better decisions as well as to verify or disprove existing theories or models. The term data analytics is often used interchangeably with intelligence, statistics, reasoning, data mining, knowledge discovery, and others. The goal of this book is to introduce some of the definitions, methods, tools, frameworks, and solutions for big data processing, starting from the process of information extraction and knowledge representation, via knowledge processing and analytics to visualization, sense-making, and practical applications. Each chapter in this book addresses some pertinent aspect of the data processing chain, with a specific focus on understanding Enterprise Knowledge Graphs, Semantic Big Data Architectures, and Smart Data Analytics solutions. This book is addressed to graduate students from technical disciplines, to professional audiences following continuous education short courses, and to researchers from diverse areas following self-study courses. Basic skills in computer science, mathematics, and statistics are required.

Learning Neo4j

Author: Rik Van Bruggen
Publisher: Packt Publishing Ltd
ISBN: 1849517177
Category : Computers
Languages : en
Pages : 296

Book Description
This book is for developers who want an alternative way to store and process data within their applications. No previous graph database experience is required; however, some basic database knowledge will help you understand the concepts more easily.

Handbook of Research on Cloud Infrastructures for Big Data Analytics

Author: Raj, Pethuru
Publisher: IGI Global
ISBN: 1466658657
Category : Computers
Languages : en
Pages : 592

Book Description
Clouds are being positioned as the next-generation consolidated, centralized, yet federated IT infrastructure for hosting all kinds of IT platforms and for deploying, maintaining, and managing a wider variety of personal, as well as professional applications and services. Handbook of Research on Cloud Infrastructures for Big Data Analytics focuses exclusively on the topic of cloud-sponsored big data analytics for creating flexible and futuristic organizations. This book helps researchers and practitioners, as well as business entrepreneurs, to make informed decisions and consider appropriate action to simplify and streamline the arduous journey towards smarter enterprises.