Declarative Frameworks and Optimization Techniques for Developing Scalable Advanced Analytics Over Databases and Data Streams

Author: Ariyam Das
Publisher:
ISBN:
Category :
Languages : en
Pages : 127

Book Description
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-Mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore, we can preserve the declarative formal semantics of such programs while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms. In this work, we show that, using PreM, a wide spectrum of classical algorithms, ranging from graph analytics and dynamic-programming-based optimization problems to data mining, machine learning, and online streaming applications, can be concisely expressed in declarative languages by using aggregates in recursion. We present a concise analysis of this very general property and characterize its different manifestations for different constraints and rules. Next, we prove that PreM-optimized plans are easily parallelizable and produce the same results as the single-executor programs. Thus, PreM can be trivially assimilated into the data-parallel computation plans of different distributed systems, irrespective of whether these follow bulk synchronous parallel (BSP) or asynchronous computing models. This makes it possible to express many advanced Big Data applications declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while executing them with superior performance and scalability compared to other specialized systems. Furthermore, we show that under PreM nonlinear recursive queries can be evaluated using a hybrid stale synchronous parallel (SSP) model with relaxed synchronization in distributed environments. We present empirical evidence of its benefits.
We also compare the usability, expressivity, and performance of PreM-optimized queries with queries written in quasi-declarative programming methodologies inspired by procedural languages, such as XY-stratification, to showcase the trade-offs and ramifications associated with each. Lastly, we present robust online optimization techniques using two popular case studies, namely online lossless frequent pattern mining and online decision tree construction, to show how compact representations and statistical approximations can deliver superior real-time performance for several streaming data mining and machine learning applications.
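The idea of using an aggregate inside recursion can be illustrated with all-pairs shortest paths. The sketch below is only an illustration under stated assumptions, not the author's implementation: the function name `shortest_paths`, the edge-triple input format, and the non-negative-cost requirement are all hypothetical choices. Instead of enumerating every path cost and taking the minimum afterwards (which would not terminate on a cyclic graph), the `min` is applied at every step of the fixpoint, so only the best cost per node pair is kept and extended.

```python
from collections import defaultdict

def shortest_paths(edges):
    """All-pairs shortest path costs with `min` applied inside the recursion.

    edges: iterable of (src, dst, cost) triples; costs are assumed non-negative.
    Returns a dict mapping (src, dst) to the cheapest known path cost.
    """
    best = {}                      # current minimum cost per (src, dst)
    out = defaultdict(list)        # adjacency list: node -> [(dst, cost)]
    for s, d, c in edges:
        out[s].append((d, c))
        if c < best.get((s, d), float("inf")):
            best[(s, d)] = c

    delta = dict(best)             # facts that improved in the last round
    while delta:
        new_delta = {}
        for (s, m), c1 in delta.items():
            for d, c2 in out[m]:
                cand = c1 + c2
                # The aggregate is applied here, inside the fixpoint:
                # a longer path is kept only if it beats the current minimum.
                if cand < best.get((s, d), float("inf")):
                    best[(s, d)] = cand
                    new_delta[(s, d)] = cand
        delta = new_delta
    return best
```

Because dominated paths are discarded as soon as they are derived, the loop terminates even when the graph contains cycles.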

Declarative Languages and Systems for Transparency, Performance and Scalability in Database Analytics

Author: Youfu Li
Publisher:
ISBN:
Category :
Languages : en
Pages : 135

Book Description
Demand for powerful, high-performance analytics on Big Data is ever growing. Developing tools and methodologies for advanced database analytics, such as data mining applications, has long been an active area of research that has posed elusive challenges to both academia and industry, on topics that include: 1) design of expressive high-level languages with declarative semantics for data analytics, 2) optimization and parallelization for efficient and scalable execution, and 3) transparency of analytics dataflow for error tracking and debugging. This thesis proposes methods and tools for developing powerful data analytics systems based on declarative languages, dataflow inspection, and query optimization. By leveraging and integrating these tools we obtain i) a scalable data analytics framework for knowledge discovery through concise and declarative queries, ii) a unified solution that enables analytics dataflow inspection and further supports provenance and debugging for data analytics applications, and iii) an integrated runtime query optimizer that generates optimal execution plans for data analytics queries and achieves superior performance in application areas that had posed major challenges for traditional database technology. In particular, our KDDLog system enables users to build or customize knowledge discovery models in a concise and expressive language, via recursive queries with aggregates and our newly proposed chain aggregates. We further provide specialized compilation techniques for semi-naive fixpoint computation in the presence of aggregates, optimizations for complex recursive queries on distributed data platforms, KDDLib for building knowledge discovery tasks, and advanced interfaces that help users port new knowledge discovery models. Following KDDLog, we present SEIZE, a unified framework that enables dataflow inspection: wiretapping the data path of data analytics applications with listening logic.
We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. Finally, we propose RIOS, a runtime integrated query optimizer for data analytics that lazily binds to execution plans at runtime, after collecting the statistics needed to make better decisions. A specific focus in our design is to obtain accurate estimates of predicate (including UDF) selectivities for determining an optimal join order and physical join implementation, without incurring significant runtime overheads.
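For readers unfamiliar with semi-naive fixpoint computation, the following is a minimal sketch of the plain technique on a transitive-closure query, without aggregates. It is not taken from KDDLog: the function name `transitive_closure` and the pair-based edge representation are assumptions made purely for illustration. The point is that each round joins only the facts derived in the previous round (`delta`) with the base relation, rather than rejoining everything.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program
         tc(X, Y) <- edge(X, Y).
         tc(X, Y) <- tc(X, Z), edge(Z, Y).
    edges: iterable of (x, y) pairs. Returns the set of all tc facts.
    """
    edges = set(edges)
    succ = {}                                  # index on edge: z -> {y}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    total = set(edges)                         # all facts derived so far
    delta = set(edges)                         # facts new in the last round
    while delta:
        # Join only the new facts with edge, not the whole tc relation.
        derived = {(x, y) for (x, z) in delta for y in succ.get(z, ())}
        delta = derived - total                # keep the genuinely new facts
        total |= delta
    return total
```

The loop stops once a round produces no new facts, which is the fixpoint of the recursive rule.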

Declarative Languages and Scalable Systems for Graph Analytics and Knowledge Discovery

Author: Mohan Yang
Publisher:
ISBN:
Category :
Languages : en
Pages : 157

Book Description
The growing importance of data science applications has motivated great research interest in powerful languages and scalable systems for supporting advanced analytics on massive data sets. Languages such as R and Scala are used to develop advanced analytical applications that are not supported by SQL, the traditional query language used for decades to search the database and analyze its data. An interesting research question that arises in this scenario is whether it is possible to design an efficient query language that simplifies the writing of advanced analytical applications and provides a unified environment for their development and deployment on multiple platforms, including massively parallel ones. In this thesis, we provide a positive answer to this question by demonstrating extensions of the logic-based query language Datalog and their implementation techniques to enable (i) scalable support for graph analytics and knowledge discovery applications, and (ii) portability between multicore machines and clusters. A first set of extensions discussed in this thesis is based on monotonic aggregates and led to the implementation of our Deductive Application Language (DeAL) system which (i) achieves superior performance for graph analytics applications compared with other Datalog systems on multicore machines, and (ii) outperforms other distributed Datalog systems, as well as both GraphX and native Apache Spark. We then tackle the difficult problem of supporting knowledge discovery applications, by introducing non-monotonic extensions to support generic user-defined aggregates, for which we provide a formal logic-based semantics. The Knowledge Discovery in Datalog (KDDlog) language so derived can express efficiently both descriptive analytics, such as rollups and data cubes, and predictive analytics, such as association rule mining, classification, regression analysis, and cluster analysis.

Database Systems for Advanced Applications

Author: Christian S. Jensen
Publisher: Springer Nature
ISBN: 3030731944
Category : Computers
Languages : en
Pages : 683

Book Description
The three-volume set LNCS 12681-12683 constitutes the proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA 2021, held in Taipei, Taiwan, in April 2021. The 156 papers presented in this three-volume set were carefully reviewed and selected from 490 submissions. The topic areas for the selected papers include information retrieval, search and recommendation techniques; RDF, knowledge graphs, semantic web, and knowledge management; and spatial, temporal, sequence, and streaming data management, while the dominant keywords are network, recommendation, graph, learning, and model. These topic areas and keywords shed light on the direction in which DASFAA research is moving. Due to the COVID-19 pandemic, the event was held virtually.

Support for Scalable Analytics Over Databases and Data-streams

Author: Nikolay Pavlovich Laptev
Publisher:
ISBN:
Category :
Languages : en
Pages : 157

Book Description
The world's information is doubling every two years, largely due to a tremendous growth of data from blogs, social media, and Internet searches. 'Big Data Analytics' is now recognized as an emerging technology area of great opportunities and technical challenges. Parallel systems, such as those inspired by MapReduce architectures, provide a key technology to cope with those challenges; however, they often cannot keep up with the fast-growing size of data and application complexity, nor can they deliver the response times required by data stream applications. In this thesis, therefore, we show that many of these limitations can be overcome by building on classical approximation techniques from statistics to estimate (i) the sample quality and (ii) the required sample size given the user-prescribed accuracy. To achieve (i) we look into bootstrap theory. The bootstrap approach, based on resampling, provides a simple way of assessing the quality of an estimate. The bootstrap technique, however, is computationally expensive, so our first contribution involves making the bootstrap estimation efficient. Following our initial results, we realized that in a distributed environment the cost of transferring the data to independent processors, as well as the cost of computing a single resample, can be high for large samples. Furthermore, the lack of scalable support for the popular time-series data was also a problem. For these reasons, we provide an improved bootstrap approach that uses the Bag of Little Bootstraps (BLB), along with other recent advances in bootstrap and time-series theory, to provide an effective Hadoop-based implementation for assessing the quality of a time-series sample. To achieve (ii) we look into data complexity and learning theory. Recently it has been shown that the performance of a classifier can be analyzed in terms of the data complexity. We start by analyzing how model complexity can be used to create a scalable pattern matching automaton.
We then extend our findings to other algorithms, explaining how problem complexity affects the required sample size for a given machine-learning algorithm and accuracy requirement. We also use learning theory to estimate the error convergence rate needed for sample size estimation. Our experimental results provide the motivation for further exploring these ideas. A spectrum of classical data mining tasks and newly developed mining applications are used to validate the effectiveness of the proposed approaches. For example, extensive empirical results on a Twitter dataset show that the proposed techniques provide substantial improvements in processing speeds while placing the user in control of the result accuracy.
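The resampling idea behind the bootstrap can be sketched in a few lines. This is the plain percentile bootstrap on a single machine, shown only as an illustration: it is not the Bag of Little Bootstraps variant nor the Hadoop-based implementation described above, and the function name `bootstrap_ci` and its parameters are hypothetical.

```python
import random
import statistics

def bootstrap_ci(sample, estimator=statistics.mean,
                 n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `estimator` on `sample`.

    Draws `n_resamples` resamples of the same size as `sample`, with
    replacement, evaluates the estimator on each, and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resulting estimates.
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    n = len(sample)
    estimates = sorted(
        estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A narrow interval suggests the sample already supports a stable estimate, while a wide one suggests more data is needed. Every resample requires a full pass over the sample, which is the computational cost the description refers to.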

Building Datalog Systems for Scalable and Efficient Data Analytics

Author: Zhiwei Fan (Ph.D.)
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
The ability to perform advanced data analytics efficiently is becoming increasingly important for a wide spectrum of data-driven applications in the big data era. The efficiency of data analysis is generally considered from two aspects: (i) the ability to quickly prototype and express the corresponding analysis tasks (here referred to as development efficiency) and (ii) the ability to process the large volume of data involved in the analysis with high performance and good scalability (here referred to as computational efficiency). Datalog, a declarative programming language, has seen a resurgence of interest in recent years and has found new applications in multiple domains such as data integration, graph analytics, security, program analysis, networking, and decision-making, largely because of its development efficiency. To better support computational efficiency when using Datalog for a wide variety of data-driven tasks, and especially to take advantage of its superior ability to express applications involving recursive computations concisely, several research efforts across multiple communities have explored techniques for building efficient Datalog systems. However, our experience with the resulting systems indicates that their performance does not translate across different workloads (i.e., a system that performs well on one Datalog program and a particular dataset does not show comparable performance on others). Furthermore, the lack of understanding of the properties of varying Datalog workloads makes it challenging to analyze the performance differences observed on different systems, and further impedes progress in improving existing systems and building more efficient new ones. In this dissertation, we explore techniques for building a general-purpose Datalog system for scalable and efficient data analytics.
The exploration has led to two prototype Datalog systems, RecStep and FlowLog, which are implemented on top of a parallel single-node relational system and a modern stream processor, respectively. We first show that by leveraging years of advances in database techniques such as query optimization and efficient parallel query execution, RecStep is able to outperform a few state-of-the-art specialized Datalog engines on complex and large-scale Datalog evaluation. Next, we present the important profiling components of a general-purpose recursive computation profiling framework, which provide insights into the performance behavior of different systems on varying workloads, guiding our design and implementation of FlowLog. Then, we present the prototype system FlowLog in detail, discussing the philosophy behind its design and its implementation, and showing the high performance it delivers. Finally, we show how we can leverage the development efficiency provided by Datalog to concisely express better algorithms for a specific application called consistent query answering (CQA), and how FlowLog efficiently evaluates the corresponding Datalog programs, often matching and sometimes surpassing state-of-the-art performance numbers that other existing Datalog systems cannot achieve.

Large-Scale Data Analytics

Author: Aris Gkoulalas-Divanis
Publisher: Springer Science & Business Media
ISBN: 1461492424
Category : Computers
Languages : en
Pages : 276

Book Description
This edited book collects state-of-the-art research on large-scale data analytics from the last few years. It is among the first books devoted to this important area, based on contributions from diverse scientific areas such as databases, data mining, supercomputing, hardware architecture, data visualization, statistics, and privacy. There is an increasing need for new approaches and technologies that can analyze and synthesize very large amounts of data, on the order of petabytes, that are generated by massively distributed data sources. This requires new distributed architectures for data analysis. Additionally, the heterogeneity of such sources imposes significant challenges for the efficient analysis of the data under numerous constraints, including consistent data integration, data homogenization and scaling, and privacy and security preservation. The authors also broaden reader understanding of emerging real-world applications in domains such as customer behavior modeling, graph mining, telecommunications, cyber-security, and social network analysis, all of which impose extra requirements for large-scale data analysis. Large-Scale Data Analytics is organized in 8 chapters, each providing a survey of an important direction of large-scale data analytics or individual results of the emerging research in the field. The book presents key recent research that will help shape the future of large-scale data analytics, leading the way to the design of new approaches and technologies that can analyze and synthesize very large amounts of heterogeneous data. Students, researchers, professionals, and practitioners will find this book an authoritative and comprehensive resource.

Readings in Database Systems

Author: Joseph M. Hellerstein
Publisher: MIT Press
ISBN: 9780262693141
Category : Computers
Languages : en
Pages : 884

Book Description
The latest edition of a popular text and reference on database research, with substantial new material and revision; covers classical literature and recent hot topics. Lessons from database research have been applied in academic fields ranging from bioinformatics to next-generation Internet architecture and in industrial uses including Web-based e-commerce and search engines. The core ideas in the field have become increasingly influential. This text provides both students and professionals with a grounding in database research and a technical context for understanding recent innovations in the field. The readings included treat the most important issues in the database area--the basic material for any DBMS professional. This fourth edition has been substantially updated and revised, with 21 of the 48 papers new to the edition, four of them published for the first time. Many of the sections have been newly organized, and each section includes a new or substantially revised introduction that discusses the context, motivation, and controversies in a particular area, placing it in the broader perspective of database research. Two introductory articles, never before published, provide an organized, current introduction to basic knowledge of the field; one discusses the history of data models and query languages and the other offers an architectural overview of a database system. The remaining articles range from the classical literature on database research to treatments of current hot topics, including a paper on search engine architecture and a paper on application servers, both written expressly for this edition. The result is a collection of papers that are seminal and also accessible to a reader who has a basic familiarity with database systems.

Streaming Architecture

Author: Ted Dunning
Publisher: "O'Reilly Media, Inc."
ISBN: 149195390X
Category : Computers
Languages : en
Pages : 119

Book Description
More and more data-driven companies are looking to adopt stream processing and streaming analytics. With this concise ebook, you’ll learn best practices for designing a reliable architecture that supports this emerging big-data paradigm. Authors Ted Dunning and Ellen Friedman (Real World Hadoop) help you explore some of the best technologies to handle stream processing and analytics, with a focus on the upstream queuing or message-passing layer. To illustrate the effectiveness of these technologies, this book also includes specific use cases. Ideal for developers and non-technical people alike, this book describes: key elements in good design for streaming analytics, focusing on the essential characteristics of the messaging layer; new messaging technologies, including Apache Kafka and MapR Streams, with links to sample code; technology choices for streaming analytics: Apache Spark Streaming, Apache Flink, Apache Storm, and Apache Apex; how stream-based architectures are helpful to support microservices; and specific use cases such as fraud detection and geo-distributed data streams.
Ted Dunning is Chief Applications Architect at MapR Technologies, and active in the open source community. He currently serves as VP for Incubator at the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. Ted is on Twitter as @ted_dunning.
Ellen Friedman, a committer for the Apache Drill and Apache Mahout projects, is a solutions consultant and well-known speaker and author, currently writing mainly about big data topics. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics. Ellen is on Twitter as @Ellen_Friedman.

Big Data 2.0 Processing Systems

Author: Sherif Sakr
Publisher: Springer
ISBN: 3319387766
Category : Computers
Languages : en
Pages : 111

Book Description
This book gives readers the “big picture” and a comprehensive survey of the domain of big data processing systems. For the past decade, the Hadoop framework has dominated the world of big data processing, yet recently academia and industry have started to recognize its limitations in several application domains and big data processing scenarios, such as the large-scale processing of structured data, graph data, and streaming data. Thus, it is now gradually being replaced by a collection of engines that are dedicated to specific verticals (e.g., structured data, graph data, and streaming data). The book explores this new wave of systems, which it refers to as Big Data 2.0 processing systems. After Chapter 1 presents the general background of the big data phenomenon, Chapter 2 provides an overview of various general-purpose big data processing systems that allow their users to develop big data processing jobs for different application domains. In turn, Chapter 3 examines various systems that have been introduced to support the SQL flavor on top of the Hadoop infrastructure and provide competitive and scalable performance in the processing of large-scale structured data. Chapter 4 discusses several systems that have been designed to tackle the problem of large-scale graph processing, while the main focus of Chapter 5 is on several systems that have been designed to provide scalable solutions for processing big data streams, and on other sets of systems that have been introduced to support the development of data pipelines between various types of big data processing jobs and systems. Lastly, Chapter 6 shares conclusions and an outlook on future research challenges. Overall, the book offers a valuable reference guide for students, researchers, and professionals in the domain of big data processing systems. Further, its comprehensive content will hopefully encourage readers to pursue further research on the subject.