Highly Efficient String Similarity Search and Join Over Compressed Indexes

Highly Efficient String Similarity Search and Join Over Compressed Indexes PDF Author: Guorui Xiao
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
String similarity search and string similarity join are essential operations in many fields. Existing solutions adopt a filter-and-verification framework and build inverted indexes over generated signatures to prune dissimilar candidates. While existing solutions focus mainly on improving query processing performance, little attention has been paid to reducing the memory consumption of the inverted indexes. When the index is larger than the available memory, users must fall back on more expensive disk-based algorithms instead of in-memory ones. In this thesis, we propose a flexible framework, CSS, that reduces the index size while keeping high query performance for string search and join applications. We give improved solutions for offline inverted list construction and introduce a new approach for the online construction of compressed inverted lists. Experimental results on large-scale datasets demonstrate that CSS can reduce memory consumption by up to 5 times while achieving similar, or even better, query processing performance.
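
The filter-and-verification idea mentioned in the description can be illustrated with a minimal Python sketch. It is not the CSS framework from the thesis and does no compression: it assumes q-grams as the signatures, edit distance as the similarity function, and the classic count and length filters. All function names and the sample strings are hypothetical.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Overlapping q-grams of s, used here as the signatures (an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Plain dynamic-programming edit distance, used for verification."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(strings, q=2):
    """Inverted index: q-gram -> {string id: occurrences of that q-gram}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for g, n in Counter(qgrams(s, q)).items():
            index[g][sid] = n
    return index

def search(query, strings, index, tau, q=2):
    """Return the strings whose edit distance to query is at most tau."""
    # Filter step 1: count shared q-grams by scanning the inverted lists.
    shared = Counter()
    for g, n in Counter(qgrams(query, q)).items():
        for sid, m in index.get(g, {}).items():
            shared[sid] += min(n, m)
    results = []
    # For simplicity this sketch loops over all ids; only survivors are verified.
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > tau:          # length filter
            continue
        # Count filter: strings within tau edits share at least this many q-grams.
        need = max(len(s), len(query)) - q + 1 - q * tau
        if shared[sid] < need:
            continue
        if edit_distance(query, s) <= tau:          # verification step
            results.append(s)
    return results

strings = ["kitten", "sitting", "mitten", "bitter", "kitchen"]
index = build_index(strings)
print(search("kitten", strings, index, tau=1))
```

The inverted lists built by `build_index` are the part whose memory footprint a compressed index would target; here they are stored uncompressed as ordinary dictionaries.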

Similarity Search

Similarity Search PDF Author: Pavel Zezula
Publisher: Springer Science & Business Media
ISBN: 0387291512
Category : Computers
Languages : en
Pages : 227

Book Description
The area of similarity searching is a very hot topic for both research and commercial applications. Current data processing applications use data with considerably less structure and much less precise queries than traditional database systems. Examples are multimedia data like images or videos that offer query-by-example search, product catalogs that provide users with preference-based search, scientific data records from observations or experimental analyses such as biochemical and medical data, or XML documents that come from heterogeneous data sources on the Web or in intranets and thus do not exhibit a global schema. Such data can neither be ordered in a canonical manner nor meaningfully searched by precise database queries that would return exact matches. This novel situation is what has given rise to similarity searching, also referred to as content-based or similarity retrieval. The most general approach to similarity search, still allowing construction of index structures, is modeled in metric space. In this book, Prof. Zezula and his co-authors provide the first monograph on this topic, describing its theoretical background as well as the practical search tools of this innovative technology.

Database Systems for Advanced Applications

Database Systems for Advanced Applications PDF Author: Shamkant B. Navathe
Publisher: Springer
ISBN: 3319320254
Category : Computers
Languages : en
Pages : 560

Book Description
This two-volume set, LNCS 9642 and LNCS 9643, constitutes the refereed proceedings of the 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016, held in Dallas, TX, USA, in April 2016. The 61 full papers presented were carefully reviewed and selected from a total of 183 submissions. The papers cover the following topics: crowdsourcing, data quality, entity identification, data mining and machine learning, recommendation, semantics computing and knowledge base, textual data, social networks, complex queries, similarity computing, graph databases, and miscellaneous, advanced applications.

Algorithms for Next-Generation Sequencing Data

Algorithms for Next-Generation Sequencing Data PDF Author: Mourad Elloumi
Publisher: Springer
ISBN: 3319598260
Category : Computers
Languages : en
Pages : 356

Book Description
The 14 contributed chapters in this book survey the most recent developments in high-performance algorithms for NGS data, offering fundamental insights and technical information specifically on indexing, compression and storage; error correction; alignment; and assembly. The book will be of value to researchers, practitioners and students engaged with bioinformatics, computer science, mathematics, statistics and life sciences.

Query-efficient Algorithm for String Similarity Search

Query-efficient Algorithm for String Similarity Search PDF Author:
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description


Scientific and Statistical Database Management

Scientific and Statistical Database Management PDF Author: Michael Gertz
Publisher: Springer Science & Business Media
ISBN: 3642138179
Category : Computers
Languages : en
Pages : 673

Book Description
This book constitutes the proceedings of the 22nd International Conference on Scientific and Statistical Database Management, SSDBM 2010, held in Heidelberg, Germany in June/July 2010. The 30 long and 11 short papers presented were carefully reviewed and selected from 94 submissions. The topics covered are query processing; scientific data management and analysis; data mining; indexes and data representation; scientific workflow and provenance; and data stream processing.

String Similarity Joins and Search Under Edit Distance

String Similarity Joins and Search Under Edit Distance PDF Author: Haoyu Zhang
Publisher:
ISBN:
Category : Bioinformatics
Languages : en
Pages : 132

Book Description
As one of the most important distance metrics, edit distance can reflect noise and errors in sequence data and thus has various applications in data cleaning and integration, databases, bioinformatics, collaborative filtering, and natural language processing. On the other hand, edit distance is also difficult to compute and estimate, which has drawn the interest of both theorists and practitioners for decades. In this thesis, we investigate several of the most critical problems related to edit distance and propose algorithms that are efficient in practice and have provable theoretical guarantees. We summarize our contributions as follows. We propose a randomized algorithm, EmbedJoin, for the edit similarity join problem. Our algorithm achieves significant advantages in running time and memory usage for datasets with long strings and large edit thresholds compared to all the state-of-the-art deterministic algorithms. We introduce another randomized algorithm, MinJoin, for edit similarity joins. The algorithm further improves the accuracy of EmbedJoin while maintaining EmbedJoin's advantages in running time and memory usage. We believe it is the best scalable algorithm for edit similarity joins. We propose the MinSearch algorithm for edit similarity search, which is a problem closely related to edit similarity joins. Our algorithm works for both threshold queries and top-k queries, and obtains orders of magnitude improvements in query time compared to the best competitor.
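
For readers unfamiliar with threshold queries under edit distance, the following Python sketch shows the basic check that such a query relies on: deciding whether two strings are within a given number of edits. It is a textbook banded dynamic program, not EmbedJoin, MinJoin, or MinSearch; the function name `within_edit_distance` and the parameter `tau` are hypothetical.

```python
def within_edit_distance(a, b, tau):
    """Return True if the edit distance between a and b is at most tau."""
    # Strings whose lengths differ by more than tau cannot be within tau edits.
    if abs(len(a) - len(b)) > tau:
        return False
    INF = tau + 1  # any value above tau is treated as "too far"
    prev = [j if j <= tau else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= tau:
            cur[0] = i
        # Only cells within tau of the diagonal can hold a value <= tau.
        lo = max(1, i - tau)
        hi = min(len(b), i + tau)
        for j in range(lo, hi + 1):
            cost = prev[j - 1] + (a[i - 1] != b[j - 1])   # match or substitute
            cost = min(cost, prev[j] + 1, cur[j - 1] + 1)  # delete or insert
            cur[j] = min(cost, INF)
        # Early termination: every cell in this row already exceeds tau.
        if min(cur) >= INF:
            return False
        prev = cur
    return prev[len(b)] <= tau

print(within_edit_distance("kitten", "sitting", 3))
print(within_edit_distance("kitten", "sitting", 2))
```

Restricting the computation to the band around the diagonal is what makes small thresholds cheap to check. This sketch still allocates full-length rows for readability, so a tuned version would store only the band.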

Euro-Par 2011: Parallel Processing Workshops

Euro-Par 2011: Parallel Processing Workshops PDF Author: Michael Alexander
Publisher: Springer
ISBN: 3642297404
Category : Computers
Languages : en
Pages : 502

Book Description
This book constitutes the thoroughly refereed post-conference proceedings of the workshops of the 17th International Conference on Parallel Computing, Euro-Par 2011, held in Bordeaux, France, in August 2011. The papers of these 12 workshops CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS HPCF, PROPER, CCPI, and VHPC focus on promotion and advancement of all aspects of parallel and distributed computing.

High-Dimensional Indexing

High-Dimensional Indexing PDF Author: Cui Yu
Publisher: Springer
ISBN: 9783540441991
Category : Computers
Languages : en
Pages : 156

Book Description
In this monograph, we study the problem of high-dimensional indexing and systematically introduce two efficient index structures: one for range queries and the other for similarity queries. Extensive experiments and comparison studies are conducted to demonstrate the superiority of the proposed indexing methods. Many new database applications, such as multimedia databases or stock price information systems, transform important features or properties of data objects into high-dimensional points. Searching for objects based on these features is thus a search of points in this feature space. To support efficient retrieval in such high-dimensional databases, indexes are required to prune the search space. Indexes for low-dimensional databases are well studied, whereas most of these application-specific indexes are not scalable with the number of dimensions, and they are not designed to support similarity searches and high-dimensional joins.

Modern B-Tree Techniques

Modern B-Tree Techniques PDF Author: Goetz Graefe
Publisher: Now Publishers Inc
ISBN: 1601984820
Category : Computers
Languages : en
Pages : 216

Book Description
Invented about 40 years ago and called ubiquitous less than 10 years later, B-tree indexes have been used in a wide variety of computing systems from handheld devices to mainframes and server farms. Over the years, many techniques have been added to the basic design in order to improve efficiency or to add functionality. Examples include separation of updates to structure or contents, utility operations such as non-logged yet transactional index creation, and robust query processing such as graceful degradation during index-to-index navigation. Modern B-Tree Techniques reviews the basics of B-trees and of B-tree indexes in databases, transactional techniques and query processing techniques related to B-trees, B-tree utilities essential for database operations, and many optimizations and improvements. It is intended both as a tutorial and as a reference, enabling researchers to compare index innovations with advanced B-tree techniques and enabling professionals to select features, functions, and tradeoffs most appropriate for their data management challenges.