Estimating Frequency Distributions in Data Streams

Author: Justin Y. Chen
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Streaming algorithms allow for space-efficient processing of massive datasets. The distribution of the frequencies of items in a large dataset is often used to characterize that data: e.g., the data is heavy-tailed, the data follows a power law, or there are many elements that appear only once or twice. In this thesis, we focus on the problem of estimating the profile (a vector representation of the frequency distribution). Given a sequence of m elements from a universe of size n, its profile is a vector φ whose i-th entry φ_i represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry φ_i up to an additive error of ±εD using O(1/ε² log(nm)) bits of space, where D is the number of distinct elements in the stream. We considerably improve on this result by designing an algorithm which estimates the whole profile vector φ, up to overall error ±εm, using O(1/ε² log(1/ε) + log(nm)) bits. More formally, we give an algorithm that computes an approximate profile φ̂ such that the L1 distance ‖φ − φ̂‖₁ is at most εm. In addition to bounding the error across all coordinates, our space bound separates the terms that depend on 1/ε from those that depend on n and m. Furthermore, we give a lower bound showing that our bound is optimal up to constant factors.
To achieve these results, we introduce two new techniques. First, we develop hashing-based sketches that keep very limited information about the identities of the hashed elements. As a result, elements with different frequencies are mixed together, and need to be unmixed using an iterative "deconvolution" process. Second, we reduce the randomness used by the algorithms in a somewhat subtle way: we first use Nisan's generator to ensure that the random variables of interest are O(1)-wise independent, and then we analyze those variables by calculating their moments. (In our setting, using Nisan's generator alone would not yield the desired space bound.) The latter technique seems quite versatile, and has already been used for other streaming problems [Ano23].
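To make the definition of the profile concrete, here is a minimal Python sketch that computes it exactly by storing every element's frequency. This is only an illustration of the quantity being estimated, not the thesis's algorithm: it uses space proportional to the number of distinct elements rather than the sublinear space described above. The function names `profile` and `l1_distance` and the sample stream are assumptions made for this example.

```python
from collections import Counter

def profile(stream):
    """Exact profile: maps i to the number of distinct elements seen exactly i times."""
    freq = Counter(stream)          # element -> number of occurrences
    return Counter(freq.values())   # occurrence count i -> number of distinct elements

def l1_distance(phi, phi_hat):
    """L1 distance between two profiles stored as sparse dict-like objects."""
    keys = set(phi) | set(phi_hat)
    return sum(abs(phi.get(i, 0) - phi_hat.get(i, 0)) for i in keys)

# Hypothetical stream of m = 7 elements with D = 4 distinct elements.
stream = ["a", "b", "a", "c", "a", "b", "d"]
phi = profile(stream)

m = len(stream)
distinct = sum(phi.values())                     # D, the number of distinct elements
total = sum(i * count for i, count in phi.items())  # always equals m

for i in sorted(phi):
    print(f"phi_{i} = {phi[i]}")
print("D =", distinct, "m =", m, "sum of i * phi_i =", total)
```

In this notation, an approximate profile meets the guarantee stated above when `l1_distance(phi, phi_hat)` is at most ε times `m`.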

Techniques for Estimating Magnitude and Frequency of Floods on Streams in Indiana

Author: Dale R. Glatfelter
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 120

Estimating the Magnitude and Frequency of Peak Streamflows for Ungaged Sites on Streams in Alaska and Conterminous Basins in Canada

Author: Janet H. Curran
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 116

Techniques for Estimating Peak-streamflow Frequency for Unregulated Streams and Streams Regulated by Small Floodwater Retarding Structures in Oklahoma

Author: Robert L. Tortorelli
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 50

Methods for Estimating the Magnitude and Frequency of Peak Discharges of Rural, Unregulated Streams in Virginia

Author: James A. Bisese
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 86

Technique for Estimating Magnitude and Frequency of Floods in Illinois

Author: George W. Curtis
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 82

Automata, Languages and Programming

Author: Peter Widmayer
Publisher: Springer Science & Business Media
ISBN: 9783540438649
Category : Computers
Languages : en
Pages : 1100

Book Description
This book constitutes the refereed proceedings of the 29th International Colloquium on Automata, Languages and Programming, ICALP 2002, held in Malaga, Spain, in July 2002. The 83 revised full papers presented together with 7 invited papers were carefully reviewed and selected from a total of 269 submissions. All current aspects of theoretical computer science are addressed and major new results are presented.

Estimation of Peak-discharge Frequency of Urban Streams in Jefferson County, Kentucky

Author:
Publisher:
ISBN:
Category : Urban runoff
Languages : en
Pages : 54

A Method of Estimating Flood-frequency Parameters for Streams in Idaho

Author: L. C. Kjelstrom
Publisher:
ISBN:
Category : Flood forecasting
Languages : en
Pages : 112

Estimating the Magnitude and Frequency of Low Flows of Streams in Massachusetts

Author: John C. Risley
Publisher:
ISBN:
Category : Stream measurements
Languages : en
Pages : 42

Book Description
...Presents techniques for estimating 7-day, 2-year and 7-day, 10-year flows at continuous-record and partial-record streamflow gaging stations, and techniques for estimating these values at ungaged stream sites...