Research Highlights, *Communications of the ACM* 66, 9 (September 2023)

## Model Counting Meets Distinct Elements

Constraint Satisfaction Problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and the computation of the number of distinct elements in a data stream, also known as the zeroth frequency moment (*F*_{0}) of a data stream.

Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and distinct elements computation. We design a recipe for the translation of algorithms developed for distinct elements estimation to those for model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming through the lens of counting and show that framing distinct elements estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works.

*Constraint Satisfaction Problems* (CSPs) and the *data stream model* are two core themes in computer science with a diverse set of applications in topics including probabilistic reasoning, networks, databases, and verification. *Model counting* and computation of the *zeroth frequency moment* (*F*_{0}) are fundamental problems for CSPs and the data stream model respectively. This paper is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has a paper from one community been cited by the other.

Given a set of constraints ϕ over a set of variables in a finite domain *D*, the problem of model counting is to estimate the number of solutions of ϕ. We are often interested in the case where ϕ is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain [*N*] is represented by **a** = 〈*a*_{1}, *a*_{2}, ···, *a*_{m}〉 wherein each item *a*_{i} ∈ [*N*]. The zeroth frequency moment *F*_{0} of **a** is the number of distinct elements appearing in the stream.

The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of the solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of *F*_{0} estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed design new algorithms for model counting inspired by algorithms for estimating *F*_{0} in data streams. By exploring this connection further, we design new algorithms to estimate *F*_{0} for streaming sets that are succinctly represented by constraints. It is worth noting that the two communities focus on seemingly different efficiency objectives: in streaming algorithms, space complexity is of major concern while in the context of model counting, time (especially NP query complexity in the context of CNF formulas) is of primary concern. Therefore, it is striking to observe that our transformation recipe leads to the design of *efficient* algorithms for *F*_{0} estimation as well as model counting wherein *efficient* is measured by the concern of the corresponding community. We further investigate this observation and demonstrate that the space complexity of streaming algorithms provides an upper bound on the query complexity of model counting algorithms.

To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and *F*_{0} estimation and point out the similarities.

**1.1. Model counting**

The complexity-theoretic study of model counting was initiated by Valiant who showed that this problem, in general, is #P-complete.^{32} This motivated researchers to investigate approximate model counting and in particular to design (ε, δ)-approximation schemes. The complexity of approximate model counting depends on the representation. When ϕ is represented as a CNF formula, designing an efficient (ε, δ)-approximation is NP-hard.^{30} In contrast, when it is represented as a DNF formula, model counting admits a fully polynomial-time randomized approximation scheme (FPRAS).^{21, 22} We will use #CNF to refer to the case when ϕ is a CNF formula and #DNF to refer to the case when ϕ is a DNF formula.

For #CNF, Stockmeyer^{30} provided a hashing-based randomized procedure that can compute an (ε, δ)-approximation with running time poly(|ϕ|, 1/ε, 1/δ), given access to an NP oracle. Building on Stockmeyer's approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed the hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables. The practical implementations substitute the NP oracle with a SAT solver. In the context of model counting, we are primarily interested in time complexity and therefore, the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls.^{29}

Karp and Luby^{21} proposed the first FPRAS scheme for #DNF, which was improved in subsequent works.^{9, 22} Chakraborty et al.^{5} demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel et al.^{24, 25} subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than Monte Carlo techniques.

**1.2. Zeroth frequency moment estimation**

Computing an (ε, δ)-approximation of the *k*^{th} frequency moment (*F*_{k}) of a stream is a central problem in the data streaming model.

The first algorithm for computing *F*_{0} with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties, resulting in an algorithm with undesirable space complexity.^{15} In their seminal work, Alon, Matias, and Szegedy designed an *O*(log *N*) space algorithm for *F*_{0} with a constant approximation ratio that employs 2-universal hash functions.^{1} Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura^{16} and Bar-Yossef et al.^{3} provided (ε, δ)-approximation algorithms with space and time complexity *Õ*(1/ε^{2} · log *N*). Later, Bar-Yossef et al. proposed *three algorithms* with improved space and time complexity.^{2} While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of *F*_{0}. This line of work resulted in the development of an algorithm with optimal space complexity *O*(1/ε^{2} + log *N*) and *O*(log *N*) update time to estimate the *F*_{0} of a stream.^{20}

The aforementioned works are in the setting where each data item *a*_{i} is an element of the universe. Subsequently, there has been a series of results on estimating *F*_{0} in settings where each data item is itself a succinctly represented set of elements from the universe; we return to this *structured set streaming* model in Section 4.

**1.3. The road to a unifying framework**

As mentioned above, the algorithmic developments for model counting and *F*_{0} estimation have largely relied on the usage of hashing-based techniques, and yet these developments have, surprisingly, been separate, and rarely has a work from one community been cited by the other. In this context, we ask: is it possible to bridge this gap, and would such an exercise contribute new algorithms for model counting as well as for *F*_{0} estimation? The main conceptual contribution of this work is an affirmative answer to the above question. First, we point out that two well-known algorithms, Stockmeyer's #CNF algorithm^{30} (further refined by Chakraborty et al.^{5}) and Gibbons and Tirthapura's *F*_{0} estimation algorithm,^{16} are essentially the same.

The core idea of the hashing-based technique of Stockmeyer's and Chakraborty et al.'s schemes is to use pair-wise independent hash functions to partition the solution space (satisfying assignments of a CNF formula) into *roughly equal and small* cells, wherein a cell is *small* if the number of solutions in it is less than a pre-computed threshold, denoted by Thresh. Then a good estimate for the number of solutions is the *number of solutions in an arbitrary cell* × *the number of cells.* To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the *m*-th iteration, a hash function with range {0, 1}^{m} is chosen, partitioning the solution space into 2^{m} cells; the number of solutions in one cell (say *h*^{−1}(0^{m})) is computed, and if it is at most Thresh, it is multiplied by 2^{m} to obtain the estimate, otherwise *m* is incremented.

We now describe Gibbons and Tirthapura's algorithm for *F*_{0} estimation which we call the Bucketing algorithm. Without loss of generality, we assume that *N* is a power of two and thus identify [*N*] with {0, 1}^{n}. The algorithm maintains a bucket of size Thresh and starts by picking a hash function *h* : {0, 1}^{n} → {0, 1}^{n}. It iterates over sampling levels. At level *m*, when a data item *x* comes, if *h*(*x*) starts with 0^{m}, then *x* is added to the bucket. If the bucket overflows, then the sampling level is increased to *m* + 1 and all elements *x* in the bucket other than the ones whose *h*(*x*) starts with 0^{m+1} are deleted. At the end of the stream, the value *t* × 2^{m} is output as the estimate where *t* is the number of elements in the bucket and *m* is the sampling level.

These two algorithms are conceptually the same. In the Bucketing algorithm, at sampling level *m*, only the first *m* bits of the hashed value are examined; this is equivalent to considering a hash function with range {0, 1}^{m}. Thus the bucket is nothing but the set of elements in the stream that belong to the cell *h*^{−1}(0^{m}). The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.'s algorithm. In both algorithms, to obtain an (ε, δ)-approximation, the Thresh value is chosen as *O*(1/ε^{2}). To reduce the error probability to δ, the median of *O*(log(1/δ)) independent estimations is output.
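The Bucketing estimator described above is easy to sketch in code. Below is a minimal single-estimator version (no median amplification) in Python; the function name `bucketing_f0` is ours, and a memoized pseudorandom table stands in for the 2-universal hash function.

```python
import random

def bucketing_f0(stream, n, thresh):
    """One Bucketing estimator: keep items x whose hash starts with 0^m;
    on overflow, raise the sampling level m and re-filter the bucket.
    A memoized random table simulates a hash h : {0,1}^n -> {0,1}^n."""
    rng = random.Random(42)
    table = {}
    def h(x):
        if x not in table:
            table[x] = rng.getrandbits(n)
        return table[x]
    m, bucket = 0, set()
    for x in stream:
        if h(x) >> (n - m) == 0:        # h(x) starts with 0^m
            bucket.add(x)
            while len(bucket) > thresh:  # overflow: increase sampling level
                m += 1
                bucket = {y for y in bucket if h(y) >> (n - m) == 0}
    return len(bucket) * (2 ** m)
```

With a threshold larger than the true number of distinct elements, the sampling level stays at 0 and the estimate is exact; the interesting regime is when the bucket overflows and subsampling kicks in.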

**1.4. Our contributions**

Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and *F*_{0} estimation.

First, we formalize a recipe to transform streaming algorithms for *F*_{0} estimation into algorithms for model counting. Such a transformation yields new (ε, δ)-approximate algorithms for model counting, which are different from currently known algorithms. Our transformation recipe from *F*_{0} estimation to model counting allows us to view the problem of the design of distributed #DNF algorithms through the lens of *distributed functional monitoring*, which is well-studied in the data streaming literature.

Building on the connection between model counting and *F*_{0} estimation algorithms, we design new algorithms to estimate *F*_{0} over *structured set streams* where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is *S*_{1}, *S*_{2}, ··· where each *S*_{i} ⊆ [*N*], and the goal is to estimate the number of distinct elements in the union of the *S*_{i}.

We observe that several structured sets can be represented as small DNF formulae and thus *F*_{0} counting over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that include DNF sets, affine spaces, and multidimensional ranges. Prior work on structured sets had to rely on involved analysis for each of the specific instances, while our work provides a general recipe for both analysis and implementation.

A natural question that arises from the transformation recipe is the relationship between the space complexity of the streaming algorithms and the query complexity of the obtained model counting algorithms. We establish a relationship between these two quantities by showing that the space complexity is an upper bound on the query complexity.

We will assume the universe [*N*] = {0, 1}^{n}.

*F*_{0} *Estimation:* A data stream **a** over domain [*N*] can be represented as **a** = *a*_{1}, *a*_{2}, ···, *a*_{m} wherein each item *a*_{i} ∈ [*N*]. Let **a**_{u} denote the set of distinct elements appearing in **a**; then *F*_{0}(**a**) = |**a**_{u}|.

*Model Counting:* Let *X* = {*X*_{1}, *X*_{2}, …, *X*_{n}} be a set of Boolean variables. For a Boolean formula ϕ over variables *X*, let Sol(ϕ) denote the set of satisfying assignments of ϕ. The model counting problem is to compute |Sol(ϕ)| for a given formula ϕ.

*k-wise independent hash functions:* Let *n, m* ∈ ℕ and let *H*(*n, m*) be a family of hash functions mapping {0, 1}^{n} to {0, 1}^{m}.

DEFINITION 1. *A family of hash functions H*(*n, m*) *is k-wise independent, denoted* *H*_{k-wise}(*n, m*), *if for all* α_{1}, α_{2}, ..., α_{k} ∈ {0, 1}^{m} *and all distinct* *x*_{1}, *x*_{2}, ..., *x*_{k} ∈ {0, 1}^{n}, Pr[*h*(*x*_{1}) = α_{1} ∧ *h*(*x*_{2}) = α_{2} ∧ ··· ∧ *h*(*x*_{k}) = α_{k}] = 2^{−km}, *where h is chosen uniformly at random from* *H*_{k-wise}(*n, m*).

*Explicit families.* An explicit hash family that we use is *H*_{Toeplitz}(*n, m*), which is known to be 2-wise independent.^{4} The family is defined as follows: *H*_{Toeplitz}(*n, m*) is the family of functions of the form *h*(*x*) = *Ax* + *b*, where *A* ∈ F_{2}^{m×n} is a Toeplitz matrix and *b* ∈ F_{2}^{m×1}, and F_{2} is the finite field of size 2. For a hash function *h*: {0, 1}^{n} → {0, 1}^{m} and ℓ ∈ {1, …, *m*}, *h*_{ℓ}: {0, 1}^{n} → {0, 1}^{ℓ} is the function where *h*_{ℓ}(*x*) is the first ℓ bits of *h*(*x*).
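To make the construction concrete, here is a small Python sketch of sampling from *H*_{Toeplitz}(*n, m*) and evaluating the prefix function *h*_{ℓ}; the bit-list representation and function names are our choices, not notation from the paper.

```python
import random

def sample_toeplitz_hash(n, m, rng):
    """Sample h(x) = Ax + b over F_2, where A is an m x n Toeplitz matrix
    (constant along each diagonal, hence determined by n + m - 1 random bits)
    and b is a uniform m-bit vector.  Bit-vectors are lists of 0/1."""
    diag = [rng.randrange(2) for _ in range(n + m - 1)]
    b = [rng.randrange(2) for _ in range(m)]
    def h(x, ell=m):
        # A[i][j] = diag[i - j + n - 1]; returning the first ell bits gives h_ell
        return [(b[i] + sum(diag[i - j + n - 1] & x[j] for j in range(n))) % 2
                for i in range(ell)]
    return h

rng = random.Random(0)
h = sample_toeplitz_hash(4, 3, rng)
full = h([1, 0, 1, 1])        # h(x), all 3 output bits
prefix = h([1, 0, 1, 1], 2)   # h_2(x), the first 2 bits of h(x)
```

Since *h* is affine, h(x ⊕ y) = h(x) ⊕ h(y) ⊕ h(0), a handy sanity check on any implementation of this family.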

As a first step, we present a unified view of the three hashing-based algorithms proposed in Bar-Yossef et al.^{2} Their first algorithm, the Bucketing algorithm discussed above, is a modification of an *F*_{0} estimation algorithm due to Gibbons and Tirthapura.^{16} The second algorithm, which we call Minimum, is based on the idea that if we hash all the items of the stream, then the *O*(1/ε^{2})-th minimum of the hash values can be used to compute a good estimate of *F*_{0}. The third algorithm, which we call Estimation, chooses a set of *k* functions, {*h*_{1}, *h*_{2}, …, *h*_{k}}, such that each *h*_{j} is drawn from an *s*-wise independent hash family, and uses the number of trailing zeros in the hashed values to estimate *F*_{0}.

Algorithm 1, called ComputeF0, presents the overarching architecture of the three proposed algorithms. The architecture of ComputeF0 is fairly simple: it chooses a collection of hash functions using ChooseHashFunctions, calls the subroutine ProcessUpdate for every incoming element of the stream and invokes ComputeEst at the end of the stream to return the *F*_{0} approximation.

ChooseHashFunctions. As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine PickHashFunctions(*H*, *t*) returns a collection of *t* independently chosen hash functions from the family *H.* We use *H* to denote the collection of hash functions returned; this collection is viewed as either a 1D array or as a 2-dimensional (2D) array. When *H* is a 1D array, *H*[*i*] denotes the *i*th hash function of the collection, and when *H* is a 2D array, *H*[*i*][*j*] is the [*i, j*]th hash function.

ProcessUpdate. For a new item *x*, the update of *S*, as shown in Algorithm 3, is as follows:

- Bucketing: For a new item *x*, if *H*[*i*]_{mi}(*x*) = 0^{mi}, then we add it to *S*[*i*] if *x* is not already present in *S*[*i*]. If the size of *S*[*i*] is greater than Thresh (which is set to be *O*(1/ε^{2})), then we increment *m*_{i} as in Algorithm 3.
- Minimum: For a new item *x*, if *H*[*i*](*x*) is smaller than max *S*[*i*], then we replace max *S*[*i*] with *H*[*i*](*x*).
- Estimation: For a new item *x*, compute *z* = TrailZero(*H*[*i, j*](*x*)), that is, the number of trailing zeros in *H*[*i, j*](*x*), and replace *S*[*i, j*] with *z* if *z* is larger than *S*[*i, j*].

ComputeEst. Finally, for each of the algorithms, we estimate *F*_{0} based on the sketch *S* as described in the sub-routine ComputeEst presented as Algorithm 4. It is crucial to note that the estimation of *F*_{0} is performed solely using the sketch *S* for the Bucketing and Minimum algorithms. The Estimation algorithm requires an additional parameter *r* that depends on a loose estimate of *F*_{0}.
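As a concrete illustration of the trailing-zeros idea behind the Estimation strategy, the following is a simplified single-estimator Flajolet-Martin-style sketch in Python. The actual Estimation algorithm aggregates *t* × Thresh trailing-zero counts and takes a median; here a single memoized random table stands in for the *s*-wise independent hash, and all names are ours.

```python
import random

def trail_zero(v, width):
    """Number of trailing zeros in the width-bit value v (width if v == 0)."""
    z = 0
    while z < width and (v >> z) & 1 == 0:
        z += 1
    return z

def fm_estimate(stream, n, seed=0):
    """Simplified single-estimator sketch: track the maximum trailing-zero
    count z_max over hashed items and output 2^z_max.  Duplicates hash to the
    same value, so only distinct items can raise z_max."""
    rng = random.Random(seed)
    table = {}
    z_max = 0
    for x in stream:
        if x not in table:
            table[x] = rng.getrandbits(n)
        z_max = max(z_max, trail_zero(table[x], n))
    return 2 ** z_max
```

A single such estimate is only correct up to a constant factor in expectation, which is exactly why the full algorithm keeps many counts and medians them.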

**Algorithm 1** ComputeF0(*n*, ε, δ)

- Thresh ← 96/ε^{2}
- *t* ← 35 log(1/δ)
- *H* ← ChooseHashFunctions(*n*, Thresh, *t*)
- *S* ← {}
- **while** true **do**
  - **if** EndStream **then** exit
  - *x* ← input()
  - ProcessUpdate(*S*, *H*, *x*, Thresh)
- *Est* ← ComputeEst(*S*, Thresh)
- **return** *Est*

*Sketch Properties.* For each of the three algorithms, their corresponding sketches can be viewed as arrays of size 35 log(1/δ). The parameter Thresh is set to 96/ε^{2}.

- Bucketing: The element *S*[*i*] is a tuple 〈ℓ_{i}, *m*_{i}〉 where ℓ_{i} is a list of size at most Thresh, with ℓ_{i} = {*x* ∈ **a** | *H*[*i*]_{mi}(*x*) = 0^{mi}}. We use *S*[*i*](0) to denote ℓ_{i} and *S*[*i*](1) to denote *m*_{i}.
- Minimum: Each *S*[*i*] holds the Thresh lexicographically smallest elements of {*H*[*i*](*x*) | *x* ∈ **a**}.
- Estimation: Each *S*[*i*] holds a tuple of size Thresh. The *j*'th entry of this tuple is the largest number of trailing zeros in any element of *H*[*i, j*](**a**).

**Algorithm 2** ChooseHashFunctions(*n*, Thresh, *t*)

- **switch** AlgorithmType **do**
  - **case** AlgorithmType == Bucketing
    - *H* ← PickHashFunctions(*H*_{Toeplitz}(*n, n*), *t*)
  - **case** AlgorithmType == Minimum
    - *H* ← PickHashFunctions(*H*_{Toeplitz}(*n*, 3*n*), *t*)
  - **case** AlgorithmType == Estimation
    - *s* ← 10 log(1/ε)
    - *H* ← PickHashFunctions(*H*_{s-wise}(*n, n*), *t* × Thresh)
- **return** *H*

**Algorithm 3** ProcessUpdate(*S, H, x*, Thresh)

- **for** *i* ∈ [1, |*H*|] **do**
  - **switch** AlgorithmType **do**
    - **case** Bucketing
      - *m*_{i} ← *S*[*i*](1)
      - **if** *H*[*i*]_{mi}(*x*) == 0^{mi} **then**
        - *S*[*i*](0) ← *S*[*i*](0) ∪ {*x*}
        - **if** size(*S*[*i*](0)) > Thresh **then**
          - *S*[*i*](1) ← *S*[*i*](1) + 1
          - **for** *y* ∈ *S*[*i*](0) **do**
            - **if** *H*[*i*]_{mi+1}(*y*) ≠ 0^{mi+1} **then** Remove(*S*[*i*](0), *y*)
    - **case** Minimum
      - **if** size(*S*[*i*]) < Thresh **then** *S*[*i*].Append(*H*[*i*](*x*))
      - **else**
        - *j* ← arg max(*S*[*i*])
        - **if** *S*[*i*](*j*) > *H*[*i*](*x*) **then** *S*[*i*](*j*) ← *H*[*i*](*x*)
    - **case** Estimation
      - **for** *j* ∈ [1, Thresh] **do** *S*[*i, j*] ← max(*S*[*i, j*], TrailZero(*H*[*i, j*](*x*)))
- **return** *S*

**3.1. A recipe for transformation**

Observe that for each of the algorithms, the final *F*_{0} estimate depends only on the sketch *S.* Therefore, if for two streams **a** and **â** the corresponding sketches (and the chosen hash functions) match, then the three schemes presented above return the same estimates. The recipe for transforming streaming algorithms into model counting algorithms is based on the following insight:

- Capture the relationship *P*(*S*, *H*, **a**_{u}) between the sketch *S*, the set of hash functions *H*, and the set **a**_{u} at the end of the stream.
- View the formula ϕ as a symbolic representation of the unique set **a**_{u} represented by the stream **a**, such that Sol(ϕ) = **a**_{u}.
- Given a formula ϕ and a set of hash functions *H*, design an algorithm to construct a sketch *S* such that the property *P*(*S*, *H*, Sol(ϕ)) holds. Using the sketch *S*, estimate |Sol(ϕ)|.

**Algorithm 4** ComputeEst(*S*, Thresh)

- **switch** AlgorithmType **do**
  - **case** Bucketing
    - **return** Median_{i} (size(*S*[*i*](0)) × 2^{S[i](1)})
  - **case** Minimum
    - **return** Median_{i} (Thresh × 2^{3n} / max(*S*[*i*]))
  - **case** Estimation(*r*)
    - **return** Median_{i} (ln(1 − ρ_{i}) / ln(1 − 2^{−r})), where ρ_{i} is the fraction of entries of *S*[*i*, ·] that are at least *r*

By applying the above recipe to the three *F*_{0} estimation algorithms, we can derive corresponding model counting algorithms. In particular, applying the above recipe to the Bucketing algorithm leads us to the state-of-the-art hashing-based model counting algorithm, ApproxMC, proposed by Chakraborty et al.^{5} Applying the above recipe to Minimum and Estimation allows us to obtain different model counting schemes. In this extended abstract, we illustrate this transformation for the Minimum-based algorithm.

**3.2. Example application of recipe: Minimum-based algorithm**

We showcase the application of the recipe in the context of the minimum-based algorithm. For a given multiset **a** (e.g., a data stream or solutions to a model), we now specify the property *P*(*S*, *H*, **a**_{u}) as follows: the sketch *S* is an array of sets indexed by members of *H*, where *S*[*i*] holds the *p* lexicographically minimum elements of *H*[*i*](**a**_{u}), with *p* = *O*(1/ε^{2}) (the value Thresh above). *P* is the property that specifies this relationship.

The following lemma, due to Bar-Yossef et al.,^{2} establishes the relationship between the property *P* and the number of distinct elements of a multiset. Let max(*S*[*i*]) denote the largest element of the set *S*[*i*].

LEMMA 1. *Let* **a** ⊆ {0, 1}^{n} *be a multiset. Let H* ⊆ *H*_{Toeplitz}(*n*, 3*n*) *where each H*[*i*] *is independently drawn from* *H*_{Toeplitz}(*n*, 3*n*) *and* |*H*| = *O*(log 1/δ). *Let S be such that* *P*(*S*, *H*, **a**_{u}) *holds, and let Est* = Median_{i}(Thresh × 2^{3n}/max(*S*[*i*])). *Then* Pr[(1 − ε)|**a**_{u}| ≤ *Est* ≤ (1 + ε)|**a**_{u}|] ≥ 1 − δ.

Therefore, we can transform the minimum algorithm for *F*_{0} estimation to that of model counting given access to a sub-routine that can compute *S* such that *P*(*S*, *H*, Sol(*ϕ*)) holds. The following proposition establishes the existence and complexity of such a subroutine, called FindMin.

PROPOSITION 1. *There is an algorithm* FindMin *that, given ϕ over n variables, h* ∈ *H*_{Toeplitz} (*n, m*), *and p as input, returns a set*, *B* ⊆ *h* (Sol(*ϕ*)) *so that if* | *h* (Sol(*ϕ*))| ≤ *p, then B = h* (Sol(ϕ)), *otherwise B is the p lexicographically minimum elements of h* (Sol(ϕ)). *Moreover, if ϕ is a CNF formula, then* FindMin *makes* *O*(*p* · *m*) *calls to an NP oracle, and if ϕ is a DNF formula with k terms, then* FindMin *takes* *O*(*m*^{3} · *n* · *k* · *p*) *time.*
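The input/output behavior of FindMin can be pinned down with a brute-force reference implementation for small *n* (our naming and representation; the real subroutine replaces the exponential enumeration with NP queries for CNF and prefix search for DNF, as Proposition 1 states).

```python
from itertools import product

def find_min(phi, h, p, n):
    """Brute-force reference FindMin for small n: return the p
    lexicographically smallest values in h(Sol(phi)), or all of h(Sol(phi))
    if it has fewer than p elements.  phi is a predicate on n-bit tuples
    and h maps an n-bit tuple to a bit tuple."""
    values = {tuple(h(x)) for x in product((0, 1), repeat=n) if phi(x)}
    return sorted(values)[:p]
```

For example, with the identity hash and the formula "first variable is true" over three variables, the two smallest hashed solutions are (1, 0, 0) and (1, 0, 1).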

Equipped with Proposition 1, we are now ready to present the algorithm, called ApproxModelCountMin, for model counting. Since the complexity of FindMin is PTIME when *ϕ* is in DNF, ApproxModelCountMin is an FPRAS for DNF formulas.

**Algorithm 5** ApproxModelCountMin(*ϕ, ε, δ*)

- *t* ← 35 log(1/*δ*)
- *H* ← PickHashFunctions(*H*_{Toeplitz}(*n*, 3*n*), *t*)
- *S* ← {}
- Thresh ← 96/ε^{2}
- **for** *i* ∈ [1, *t*] **do**
  - *S*[*i*] ← FindMin(*ϕ*, *H*[*i*], Thresh)
- *Est* ← Median_{i} (Thresh × 2^{3n} / max(*S*[*i*]))
- **return** *Est*

THEOREM 1. *Given ϕ, ε, δ*, ApproxModelCountMin *returns Est such that*

Pr[(1 − ε)|Sol(ϕ)| ≤ *Est* ≤ (1 + ε)|Sol(ϕ)|] ≥ 1 − δ.

*If ϕ is a CNF formula, then* ApproxModelCountMin *is a polynomial-time algorithm that makes* *O*((*n*/ε^{2}) log(1/δ)) *calls to an NP oracle. If ϕ is a DNF formula, then* ApproxModelCountMin *is an FPRAS.*

We now give a proof of Proposition 1 by describing the subroutine FindMin.

PROOF. We first present the algorithm when the formula *ϕ* is a DNF formula; adapting the algorithm to the CNF case can be done using similar ideas. Let ϕ = *T*_{1} ∨ *T*_{2} ∨ ··· ∨ *T*_{k} be a DNF formula over *n* variables. Since *h*(Sol(ϕ)) = ∪_{j} *h*(Sol(*T*_{j})), it suffices to compute, for each term, the set *C*_{p} of the *p* lexicographically minimum elements of *h*(Sol(*T*_{j})) and then return the *p* minimum elements of their union.

We illustrate an algorithm with running time *O*(*m*^{3}*np*) to compute *C*_{p} when the formula is a single term *T*.

We will compute *C*_{p} iteratively as follows: assuming we have computed the (*q* − 1) lexicographically smallest elements of *h*(Sol(*T*)), we show how to compute the *q*-th smallest element.

Let *y* = *y*_{1} ··· *y*_{m} be the (*q* − 1)-th smallest element of *h*(Sol(*T*)). The *q*-th smallest element is the lexicographically least element of *h*(Sol(*T*)) that is strictly greater than *y*. It can be found by iterating over candidate prefixes of the hash value: checking whether a given prefix can be extended to an element of *h*(Sol(*T*)) amounts to solving a system of linear equations over F_{2}, since *h*(*x*) = *Ax* + *b* and the term *T* fixes some of the bits of *x*. This yields the claimed *O*(*m*^{3}*np*) running time.

We now discuss the distributed DNF counting problem.

**3.3. Distributed DNF counting**

Consider the problem of *distributed DNF counting.* In this setting, there are *k* sites that can each communicate with a central coordinator. The input DNF formula ϕ is partitioned into *k* DNF subformulas *ϕ*_{1}, ···, *ϕ*_{k}, where each *ϕ*_{i} is a subset of the terms of the original *ϕ*, with the *j*'th site receiving only *ϕ*_{j}. The goal is for the coordinator to obtain an (*ε*, *δ*)-approximation of the number of solutions to *ϕ*, while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSPs have been studied recently in other models of distributed computation.^{11, 12, 13, 14} From a practical perspective, given the centrality of #DNF in the context of probabilistic databases,^{27, 28} a distributed DNF counting algorithm would entail applications in distributed probabilistic databases.

From our perspective, distributed DNF counting falls within the *distributed functional monitoring* framework formalized by Cormode et al.^{7} Here, the input is a stream **a** which is partitioned arbitrarily into sub-streams **a**_{1}, …, **a**_{k} that arrive at each of *k* sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream **a** while minimizing the total communication. This general framework has several direct applications and has been studied extensively (see Cormode et al.,^{8} Huang et al.,^{18} Woodruff and Zhang^{33} and the references therein). In the distributed DNF counting problem, each sub-stream **a**_{i} corresponds to the set of satisfying assignments to the subformula ϕ_{i}, while the function to be computed is *F*_{0}.

The algorithms discussed in Section 3 can be extended to the distributed setting. We briefly describe the distributed implementation of the minimum-based algorithm. *Distributed implementation of the minimum-based algorithm.* The coordinator chooses hash functions *H*[1], …, *H*[*t*] from *H*_{Toeplitz} (*n*, 3*n*) and sends them to the *k* sites. Each site runs the FindMin algorithm for each hash function and sends the outputs to the coordinator. So, the coordinator receives sets *S*[*i, j*], consisting of the Thresh lexicographically smallest hash values of the solutions to ϕ_{j}. The coordinator then extracts *S*[*i*], the Thresh lexicographically smallest elements of *S*[*i*, 1] ∪ ··· *S*[*i*, *k*], and proceeds with the rest of the algorithm ApproxModelCountMin. The communication cost is *O*(*kn*/ε^{2} · log(1/*δ*)) to account for the *k* sites sending the outputs of their FindMin invocations. The time complexity for each site is polynomial in *n*, ε^{−1}, and log(*δ*^{−1}).
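The coordinator's merge step described above can be sketched in a few lines (our naming; hash values are modeled as integers here):

```python
import heapq

def coordinator_merge(site_outputs, thresh):
    """Coordinator step of the distributed minimum-based algorithm: each of
    the k sites sends the thresh smallest hash values of its local solution
    set; the coordinator keeps the thresh smallest values of the union,
    deduplicated, since two sites may share satisfying assignments."""
    merged = set()
    for values in site_outputs:
        merged.update(values)
    return heapq.nsmallest(thresh, merged)
```

The key point is that the thresh smallest values of the union are determined by the per-site thresh smallest values, which is what makes the per-site communication bound of Thresh hash values suffice.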

A straightforward implementation of the Bucketing algorithm leads to a distributed DNF counting algorithm whose communication cost is *Õ*(*k*(*n* + 1/ε^{2}) · log(1/δ)) and time complexity per site is polynomial in *n*, ε^{−1}, and log(δ^{−1}). Similarly, the estimation-based algorithm leads to a distributed algorithm with *Õ*(*k*(*n* + 1/ε^{2}) log(1/δ)) communication cost. However, we do not know a polynomial time algorithm to implement the last algorithm on DNF terms.

**3.4. Lower bound**

The communication cost for the Bucketing and Estimation-based algorithms is nearly optimal in its dependence on *k* and ε. Woodruff and Zhang^{33} showed that the randomized communication complexity of estimating *F*_{0} up to a 1 + ε factor in the distributed functional monitoring setting is Ω(*k*/ε^{2}). We can reduce the *F*_{0} estimation problem to distributed DNF counting. Namely, if for the *F*_{0} estimation problem the *j*'th site receives items *a*_{1}, ···, *a*_{m} ∈ [*N*], then it can interpret each item as a DNF term over *n* variables whose unique satisfying assignment is the binary encoding of that item, so that the number of solutions of the resulting DNF formula equals the *F*_{0} of the joint stream. Hence, the Ω(*k*/ε^{2}) lower bound applies to distributed DNF counting as well.
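The reduction is mechanical: each item becomes a DNF term whose unique satisfying assignment is the item's binary encoding. A small Python sketch (function names are ours; the brute-force count is for illustration on small *n* only):

```python
from itertools import product

def item_to_term(a, n):
    """Encode item a in [2^n] as a DNF term, given as a list of
    (variable index, polarity) literals, whose unique satisfying
    assignment is the n-bit binary encoding of a."""
    bits = [(a >> (n - 1 - i)) & 1 for i in range(n)]
    return [(i, bool(bit)) for i, bit in enumerate(bits)]

def satisfies(assignment, term):
    return all(bool(assignment[i]) == polarity for i, polarity in term)

def distinct_count_via_dnf(items, n):
    """|Sol(T_1 v ... v T_m)| equals the F_0 of the items: duplicate items
    produce identical terms, which contribute no extra solutions."""
    terms = [item_to_term(a, n) for a in items]
    return sum(1 for x in product((0, 1), repeat=n)
               if any(satisfies(x, t) for t in terms))
```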

In this section, we consider the *structured set streaming model* where each item *S*_{i} of the stream is a succinct representation of a set over the universe [*N*] = {0, 1}^{n}. The goal is to estimate *F*_{0} of the stream, that is, the size of the union of the sets appearing in it.

**4.1. DNF sets**

A particular representation we are interested in is where each set is presented as the set of satisfying assignments to a DNF formula. Let ϕ be a DNF formula over *n* variables. Then the DNF set corresponding to ϕ is the set of satisfying assignments of ϕ. The *size* of this representation is the number of terms in the formula ϕ. A stream over DNF sets is a stream of DNF formulas ϕ_{1}, ϕ_{2}, …. Given such a DNF stream, the goal is to estimate |∪_{i} *S*_{i}| where *S*_{i} is the DNF set corresponding to ϕ_{i}.

THEOREM 2. *There is a streaming algorithm to compute an* (ε, δ)-*approximation of F*_{0} *over DNF sets. This algorithm takes space* *O*((*n*/ε^{2}) · log(1/δ)) *and processing time* *O*(*n*^{4} · *k* · (1/ε^{2}) · log(1/δ)) *per item, where k is the size (number of terms) of the corresponding DNF formula.*

PROOF. We show how to adapt the Minimum-value based algorithm from Section 3.2 to this setting. The algorithm picks a hash function *h* ∈ *H*_{Toeplitz}(*n*, 3*n*) and maintains the set *B* consisting of the *t* lexicographically minimum elements of the set *h*(Sol(*ϕ*_{1} ∨ ··· ∨ *ϕ*_{i−1})) after processing *i* − 1 items. When ϕ_{i} arrives, it computes the set *B*′ consisting of the *t* lexicographically minimum values of the set *h*(Sol(*ϕ*_{i})) using FindMin, and subsequently updates *B* to the *t* lexicographically minimum elements of *B* ∪ *B*′. Correctness follows from Lemma 1, since the final *B* depends only on the set ∪_{i} Sol(ϕ_{i}).

Instead of the Minimum-value based algorithm, we could also adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities.

**4.2. Affine spaces**

Another example of a structured stream is where each item of the stream is an affine space represented by *Ax* = *B*, where *A* is a Boolean matrix and *B* is a zero-one vector. Without loss of generality, we may assume that *A* is an *n* × *n* matrix. Thus an affine stream consists of 〈*A*_{1}, *B*_{1}〉, 〈*A*_{2}, *B*_{2}〉, ···, where each 〈*A*_{i}, *B*_{i}〉 succinctly represents the set of solutions {*x* | *A*_{i}*x* = *B*_{i}}.

PROPOSITION 2. *Given* (*A, B*), *h* ∈ *H*_{Toeplitz}(*n*, 3*n*), *and t as input, there is an algorithm*, AffineFindMin, *that returns a set* *C* ⊆ *h*(Sol(〈*A, B*〉)) *so that if* |*h*(Sol(〈*A, B*〉))| ≤ *t*, *then C* = *h*(Sol(〈*A, B*〉)), *and otherwise C is the t lexicographically minimum elements of h*(Sol(〈*A*, *B*〉)). *The time taken by this algorithm is O*(*n*^{4}*t*) *and the space taken by the algorithm is O*(*tn*).
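The core of such a subroutine is linear algebra over F_{2}. Below is a sketch, under our representation choices, of enumerating Sol(*Ax* = *B*) via Gaussian elimination (a particular solution plus the span of free variables); this brute-force enumeration is feasible only for small instances, whereas AffineFindMin itself needs only the *t* smallest hash values and *O*(*tn*) space.

```python
from itertools import product

def solve_affine(A, b):
    """All solutions of Ax = b over F_2, where A is a list of rows (0/1 lists)
    and b is a 0/1 list.  Gaussian elimination to reduced row echelon form
    identifies pivot and free columns; solutions are enumerated by assigning
    the free variables.  Returns [] if the system is inconsistent."""
    rows = [r[:] + [v] for r, v in zip(A, b)]   # augmented matrix
    n = len(A[0])
    pivots, r = {}, 0
    for c in range(n):
        pr = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pr is None:
            continue
        rows[r], rows[pr] = rows[pr], rows[r]
        for i in range(len(rows)):              # eliminate column c elsewhere
            if i != r and rows[i][c]:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[r])]
        pivots[c] = r
        r += 1
    if any(not any(row[:n]) and row[n] for row in rows):
        return []                               # 0 = 1: inconsistent system
    free = [c for c in range(n) if c not in pivots]
    sols = []
    for assign in product((0, 1), repeat=len(free)):
        x = [0] * n
        for c, v in zip(free, assign):
            x[c] = v
        for c, pr in pivots.items():            # back-substitute pivot vars
            x[c] = (rows[pr][n] + sum(rows[pr][j] & x[j] for j in free)) % 2
        sols.append(tuple(x))
    return sols
```

For instance, the single equation x_{1} + x_{2} = 0 over two variables has exactly the two solutions (0, 0) and (1, 1).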

The above proposition together with the minimum-based algorithm gives the following theorem.

THEOREM 3. *There is a streaming algorithm that computes an* (ε, δ)-*approximation of F*_{0} *over affine spaces. This algorithm takes space* *O*((*n*/ε^{2}) · log(1/δ)) *and processing time* *O*((*n*^{4}/ε^{2}) · log(1/δ)) *per item.*

Our investigations reveal surprising connections between algorithms for *F*_{0} estimation and model counting that are of interest to two different communities. It is noteworthy that the two communities often have different concerns: in the context of model counting, one is focused on the NP-query complexity while in the context of streaming, the focus is on the space complexity. This raises the question of whether the connections are a matter of happenstance or whether there is an inherent relationship between the space complexity in the context of streaming and the query complexity for model counting. We detail our investigations on the existence of such a relationship.

In the following, we will fold the hash function *h* also in the sketch *S.* With this simplification, instead of writing *P*(*S, h,* Sol(*ϕ*)) we write *P*(*S*, Sol(*ϕ*)).

We first introduce some complexity-theoretic notation. For a complexity class *C*, a language *L* belongs to the complexity class ∃·*C* if there is a polynomial *q*(·) and a language *L'* ∈ *C* such that for every *x*, *x* ∈ *L* ⇔ ∃*y*, |*y*| ≤ *q*(|*x*|), 〈*x, y*〉 ∈ *L*'.

Consider a streaming algorithm for *F*_{0} that constructs a sketch *S* such that *P*(*S*, **a**_{u}) holds for some property *P*, from which *F*_{0} is estimated. Define the sketch language *L*_{sketch} = {〈ϕ, *S*〉 | *P*(*S*, Sol(ϕ)) holds}.

THEOREM 4. *If L*_{sketch} *belongs to the complexity class C, then there exists an* FP^{∃·C} *model counting algorithm whose number of oracle queries is bounded by the space complexity of the corresponding streaming algorithm.*

Let us apply the above theorem to the minimum-based algorithm. The sketch language consists of tuples of the form 〈*ϕ*, 〈*h*, *v*_{1}, ···, *v*_{t}〉〉 where {*v*_{1}, ···, *v*_{t}} is the set of *t* lexicographically smallest elements of *h*(Sol(*ϕ*)).

Interestingly, all the model counting algorithms that were obtained following our recipe are probabilistic polynomial-time algorithms that make queries to languages in NP. The above generic transformation gives a deterministic polynomial-time algorithm that makes queries to a language in ∃·*C*. Precisely characterizing the properties of the sketch that lead to probabilistic algorithms making only NP queries is an interesting direction to explore.

Our investigation led to a diverse set of results that unify over two decades of work in model counting and *F*_{0} estimation. The viewpoint presented in this work has the potential to spur several new interesting research directions.

**Higher Moments.** There has been a long line of work on the estimation of higher moments, that is, *F*_{k}, over data streams. A natural direction of future research is to adapt the notion of *F*_{k} to the context of constraint satisfaction problems and to investigate whether the connections observed for *F*_{0} extend to higher moments.

**Sparse XORs.** In the context of model counting, the performance of underlying SAT solvers strongly depends on the size of XORs. The standard constructions lead to XORs of size Θ(*n*), and an interesting line of research has focused on the design of sparse XOR-based hash functions,^{10,17,19} culminating in showing that one can use hash functions of the form *h*(*x*) = *Ax* + *b* wherein each entry of the *m*-th row of *A* is 1 with probability *O*(log *m*/*m*).^{23} Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the usage of sparse XORs in the context of *F*_{0} estimation.

**Acknowledgments.** We thank the anonymous reviewers of PODS 2021 for their valuable comments. We are grateful to Phokion Kolaitis for suggesting exploration beyond the transformation recipe that led to the results in Section 5. We thank Wim Martens for providing valuable suggestions on an earlier version of the manuscript. Bhattacharyya was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0002] and an Amazon Research Award. Meel was supported in part by the NRF Fellowship Programme [NRF-NRFFAI1-2019-0004], the AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. Vinod was supported in part by NSF CCF-2130608, NSF CCF-184908, and NSF HDR:TRIPODS-1934884 awards. Pavan was supported in part by NSF CCF-2130536, NSF CCF-1849053, and NSF HDR:TRIPODS-1934884 awards.

**References**

1. Alon, N., Matias, Y., Szegedy, M. The space complexity of approximating the frequency moments. *J. Comput. Syst. Sci. 58*, 1 (1999), 137–147.

2. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L. Counting distinct elements in a data stream. In *Volume 2483 of Proceedings of RANDOM* (2002), Springer, Cambridge, USA, 1–10.

3. Bar-Yossef, Z., Kumar, R., Sivakumar, D. Reductions in streaming algorithms, with an application to counting triangles in graphs. In *Proceedings of SODA* (2002), ACM/SIAM, NY, 623–632.

4. Carter, J.L., Wegman, M.N. Universal classes of hash functions. In *Proceedings of the 9 ^{th} Annual ACM Symposium on Theory of Computing* (1977), ACM, NY, 106–112.

5. Chakraborty, S., Meel, K.S., Vardi, M.Y. Algorithmic improvements in approximate counting for probabilistic inference: From linear to logarithmic SAT calls. In *Proceedings of IJCAI* (2016), IJCAI/AAAI Press, New York, USA.

6. Cormode, G., Muthukrishnan, S. Estimating dominance norms of multiple data streams. In *Proceedings of ESA, Volume 2832 of Lecture Notes in Computer Science.* G.D. Battista and U. Zwick, eds. Springer, Budapest, Hungary, 2003, 148–160.

7. Cormode, G., Muthukrishnan, S., Yi, K. Algorithms for distributed functional monitoring. *ACM Trans. Algorithms (TALG) 7*, 2 (2011), 1–20.

8. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q. Continuous sampling from distributed streams. *J. ACM (JACM) 59*, 2 (2012), 1–25.

9. Dagum, P., Karp, R., Luby, M., Ross, S. An optimal algorithm for monte carlo estimation. *SIAM J. Comput. 29*, 5 (2000), 1484–1496.

10. Ermon, S., Gomes, C.P., Sabharwal, A., Selman, B. Low-density parity constraints for hashing-based discrete integration. In *Proceedings of ICML* (2014), JMLR, Beijing, China, 271–279.

11. Feng, W., Hayes, T.P., Yin, Y. Distributed symmetry breaking in sampling (optimal distributed randomly coloring with fewer colors). *arXiv preprint arXiv:1802.06953* (2018).

12. Feng, W., Sun, Y., Yin, Y. What can be sampled locally? *Distrib. Comput. 33* (2018), 1–27.

13. Feng, W., Yin, Y. On local distributed sampling and counting. In *Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing* (2018), ACM, NY, 189–198.

14. Fischer, M., Ghaffari, M. A simple parallel and distributed sampling technique: Local glauber dynamics. In *32 ^{nd} International Symposium on Distributed Computing* (2018) Schloss Dagstuhl - Leibniz-Zentrum für Informatik, New Orleans, USA.

15. Flajolet, P., Martin, G.N. Probabilistic counting algorithms for data base applications. *J. Comput. Syst. Sci. 31*, 2 (1985), 182–209.

16. Gibbons, P.B., Tirthapura, S. Estimating simple functions on the union of data streams. In *Proceedings of SPAA.* A. L. Rosenberg, ed. ACM, NY, 2001, 281–291.

17. Gomes, C.P., Hoffmann, J., Sabharwal, A., Selman, B. From sampling to model counting. In *Proceedings of IJCAI* (2007), IJCAI/AAAI Press, Hyderabad, India, 2293–2299.

18. Huang, Z., Yi, K., Zhang, Q. Randomized algorithms for tracking distributed count, frequencies, and ranks. In *Proceedings of PODS* (2012), ACM, Scottsdale, USA, 295–306.

19. Ivrii, A., Malik, S., Meel, K.S., Vardi, M.Y. On computing minimal independent support and its applications to sampling and counting. *Constraints An Int. J. 21*, 1 (2016), 41–58.

20. Kane, D.M., Nelson, J., Woodruff, D.P. An optimal algorithm for the distinct elements problem. In *Proceedings of PODS* (2010), ACM, NY, 41–52.

21. Karp, R., Luby, M. Monte-carlo algorithms for enumeration and reliability problems. In *Proceedings of FOCS* (1983), IEEE Computer Society, Arizona, USA.

22. Karp, R.M., Luby, M., Madras, N. Monte-carlo approximation algorithms for enumeration problems. *J. Algorithms 10*, 3 (1989), 429–448.

23. Meel, K.S., Akshay, S. Sparse hashing for scalable approximate model counting: Theory and practice. In *Proceedings of LICS* (2020) ACM, Saarbrücken, Germany.

24. Meel, K.S., Shrotri, A.A., Vardi, M.Y. On hashing-based approaches to approximate dnf-counting. In *Proceedings of FSTTCS* (2017) Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Kanpur, India.

25. Meel, K.S., Shrotri, A.A., Vardi, M.Y. Not all fprass are equal: Demystifying fprass for dnf-counting (extended abstract). In *Volume 8 of Proceedings of IJCAI* (2019), IJCAI, Macau, China.

26. Pavan, A., Tirthapura, S. Range-efficient counting of distinct elements in a massive data stream. *SIAM J. Comput. 37*, 2 (2007), 359–379.

27. Ré, C., Suciu, D. Approximate lineage for probabilistic databases. *Proc. VLDB Endowment 1*, 1 (2008), 797–808.

28. Senellart, P. Provenance and probabilities in relational databases. *ACM SIGMOD Rec. 46*, 4 (2018), 5–15.

29. Soos, M., Meel, K.S. Bird: Engineering an efficient cnf-xor sat solver and its applications to approximate model counting. In *Proceedings of AAAI Conference on Artificial Intelligence (AAAI)* (2019) AAAI Press, Honolulu, USA.

30. Stockmeyer, L. The complexity of approximate counting. In *Proceedings of STOC* (1983), ACM, Boston, 118–126.

31. Tirthapura, S., Woodruff, D.P. Rectangle-efficient aggregation in spatial data streams. In *Proceedings of PODS* (2012), ACM, NY, 283–294.

32. Valiant, L. The complexity of enumeration and reliability problems. *SIAM J. Comput. 8*, 3 (1979), 410–421.

33. Woodruff, D.P., Zhang, Q. Tight bounds for distributed functional monitoring. In *Proceedings of the 44 ^{th} Annual ACM Symposium on Theory of Computing* (2012), ACM, New York, USA, 941–960.

To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3607825

The original version of this paper is entitled "Model Counting Meets *F*_{0} Estimation," and was published in the *Proceedings of the 40 ^{th} ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event*, China, June 20–25, 2021; https://dl.acm.org/doi/10.1145/3452021.3458311

This work is licensed under a Creative Commons Attribution International 4.0 License. https://creativecommons.org/licenses/by/4.0/

The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.
