What is cardinality?
The number of distinct elements in a (finite) set S is called the cardinality of the set (sometimes also called its "power") and is written |S|. For example, if S = {tomato, potato, carrot, potato, onion, tomato}, then |S| = 4.
In our daily work, we often run into situations that call for cardinality statistics. The most common is counting daily active users (DAU): for example, the total number of unique visitors (UVs) who log in to an app in one day, or the total number of UVs who open a product detail page in one day, and so on. DAU is the most direct measure of how active an Internet product is, and anyone working on Internet products has to deal with it. If all you need is the simplest form of cardinality counting, the problem is easy to solve:
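As a rough illustration, here is a minimal sketch of exact counting, assuming the two straightforward approaches in question are a hash set of IDs and a bitmap indexed by ID (and assuming user IDs are non-negative integers):

```python
# Exact cardinality counting: two straightforward approaches.
# Assumption: user IDs are non-negative integers (e.g. taken from login logs).

def count_with_set(user_ids):
    """Keep every distinct ID in a hash set and return its size."""
    seen = set()
    for uid in user_ids:
        seen.add(uid)
    return len(seen)

def count_with_bitmap(user_ids, max_id):
    """Mark bit `uid` in a bitmap and count the bits that are set."""
    bitmap = bytearray((max_id >> 3) + 1)   # one bit per possible ID
    for uid in user_ids:
        bitmap[uid >> 3] |= 1 << (uid & 7)
    return sum(bin(b).count("1") for b in bitmap)

if __name__ == "__main__":
    ids = [3, 7, 3, 42, 7, 7, 100]
    print(count_with_set(ids))          # 4
    print(count_with_bitmap(ids, 127))  # 4
```

Both are exact: the set costs memory proportional to the number of distinct IDs, while the bitmap costs one bit per possible ID.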
Both kinds of exact counting work fine when the data volume is moderate, but in the face of massive data their memory footprint and running time become hard to control. In the spirit of big data: can we trade a little accuracy for a large gain in efficiency? The answer is yes. Cardinality estimation has many mature implementations; HyperLogLog is the most widely used, and readers familiar with Redis will certainly know it. This article, however, looks at its ancestors: two early cardinality estimation algorithms, Linear Counting and Flajolet-Martin.
Linear Counting was proposed by Kyu-Young Whang et al. in the 1990 paper "A Linear-Time Probabilistic Counting Algorithm for Database Applications". It is not the earliest cardinality estimation algorithm, but its idea is fairly direct and involves nothing too deep, so I can try to describe it in detail.
The algorithm flow is as follows and is not hard to understand; a sketch in code is given right after.
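Here is a minimal sketch of that flow in Python, my own rendering rather than the paper's pseudo-code; the hash function (MD5) and the bitmap size used in the example are illustrative choices:

```python
import hashlib
import math

def linear_counting(elements, m):
    """Estimate the number of distinct elements using a bitmap of m bits."""
    bitmap = bytearray((m + 7) // 8)

    for x in elements:
        # Hash the element and map it uniformly onto one of the m buckets.
        h = int.from_bytes(hashlib.md5(str(x).encode()).digest(), "big") % m
        bitmap[h >> 3] |= 1 << (h & 7)

    # u = number of buckets (bits) that are still empty.
    ones = sum(bin(b).count("1") for b in bitmap)
    u = m - ones
    if u == 0:
        raise ValueError("bitmap is full; m is too small for this cardinality")

    # Linear Counting estimator: n_hat = -m * ln(u / m).
    return -m * math.log(u / m)

if __name__ == "__main__":
    data = [f"user_{i % 5000}" for i in range(20000)]  # 5000 distinct values
    print(round(linear_counting(data, m=1 << 14)))     # close to 5000
```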
The following figure is an example of the hashing process given in the paper.
Because the hash space is divided into m buckets (one per bit) and the n hash values are uniformly distributed, the probability that any particular bucket stays empty can be written down directly. The number of empty buckets u is a random variable, so we can compute its expected value, and when both n and m tend to infinity (with the ratio n/m held fixed) that expectation takes a simple exponential form. Inverting the relationship then expresses the cardinality n in terms of m and E(u); the chain of equations is spelled out below.
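Written out (with the load factor t = n/m; these are the standard Linear Counting equations):

$$
P(\text{a given bucket is empty}) = \left(1 - \frac{1}{m}\right)^{n}
$$

$$
E(u) = m\left(1 - \frac{1}{m}\right)^{n} \;\xrightarrow{\;n,\,m\to\infty\;}\; m\,e^{-n/m} = m\,e^{-t}
$$

$$
n = -m \ln \frac{E(u)}{m}
$$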
Because each bit takes the value 0 or 1 (a Bernoulli, i.e. 0-1, distribution), u follows a binomial distribution; and since both n and m tend to infinity, u is asymptotically normal with mean E(u). For a normal distribution, the maximum likelihood estimate (MLE) of the mean μ is the sample mean (the proof is a standard exercise). Here u itself is a single observation drawn from that asymptotically normal population, so u is the MLE of E(u). By the invariance property of the MLE, and because the function f(x) = -m ln(x/m) is invertible, the MLE of the cardinality n follows immediately, as written below.
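In symbols (the hat denotes an MLE; this is just the previous relation with E(u) replaced by its estimate u):

$$
u \;\sim\; \mathcal{N}\big(E(u),\ \operatorname{Var}(u)\big) \ \ (\text{asymptotically}), \qquad \widehat{E(u)} = u
$$

$$
\hat{n} = f\big(\widehat{E(u)}\big) = -m \ln \frac{u}{m}
$$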
This completes the derivation of the algorithm.
Because the full calculation is rather long, the paper's conclusions about the estimator's error are quoted directly below. Here Bias is the expected relative deviation of the estimate from the true value, and StdError is the standard error, that is, the standard deviation of the ratio of the estimate to the true cardinality n.
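With the load factor t = n/m, the commonly quoted forms of these results are roughly:

$$
\mathrm{Bias}\!\left(\frac{\hat{n}}{n}\right) = E\!\left(\frac{\hat{n}}{n}\right) - 1 \;\approx\; \frac{e^{t} - t - 1}{2n}
$$

$$
\mathrm{StdError}\!\left(\frac{\hat{n}}{n}\right) \;\approx\; \frac{\sqrt{m\,(e^{t} - t - 1)}}{n}
$$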
If we cap the standard error at some ε, that is, require StdError ≤ ε, we can solve the expression above for the smallest m that achieves it. But even this is not enough. From the derivation above we know that once u = 0 (that is, every bucket is occupied), the estimator breaks down, so we also have to make sure that u stays above 0 with high probability, i.e. that E(u) is kept safely away from 0.
When n and m tend to infinity, the standard deviation of u can also be derived (the calculation is omitted here). Combining the two requirements, namely StdError ≤ ε and E(u) staying at least some factor α of standard deviations above 0, and solving for m, we find that m must ultimately satisfy the bound below.
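My reconstruction of that bound, with t = n/m, the target standard error ε, and the safety factor α fixed in the next section (the paper's exact presentation may differ slightly):

$$
\sqrt{\operatorname{Var}(u)} \;\approx\; \sqrt{m\,e^{-t}\big(1 - (1+t)\,e^{-t}\big)}
$$

$$
m \;>\; \beta\,(e^{t} - t - 1), \qquad \beta = \max\!\left(\alpha^{2},\ \frac{1}{(\varepsilon t)^{2}}\right)
$$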
In the previous section we said that u is asymptotically normal, which is the continuous limit of the binomial distribution. If we stay with the discrete view instead: when the number of trials in a binomial distribution is large and the success probability p is small, u approximately follows a Poisson distribution. Taking k = 0 gives the probability that the bit array is completely filled, namely e^(-λ). Now assign a value to α; the paper chooses α = √5. Since a Poisson distribution has both expectation and variance equal to λ, it is easy to obtain the following.
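Spelling this out, with λ = E(u) as the Poisson parameter:

$$
P(u = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad P(u = 0) = e^{-\lambda}
$$

$$
E(u) \;\ge\; \alpha\,\sqrt{\operatorname{Var}(u)} \;\Longleftrightarrow\; \lambda \ge \alpha^{2} = 5 \;\Longrightarrow\; P(u = 0) \le e^{-5} \approx 0.0067
$$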
In other words, even ignoring ε, as long as we keep the expected value of u at least √5 standard deviations away from 0, the probability that the algorithm fails stays below 0.7%. The paper provides a table of how m grows with n for α = √5 and ε = 0.01 or 0.10.
As the table in the previous section shows, once n reaches a fairly large scale, the space complexity of Linear Counting is O(n/c) for some constant c. Taking n = 10^8 as an example, the bit array is smaller than 10^7 bits (a little over 1 MB), roughly 1/12 of what the naive bitmap approach would need. And if you want the cardinality of the union of two sets, a single bitwise OR of the two bitmaps is all it takes, which is simple and convenient.
However, this algorithm only achieves a constant-factor reduction in space; the footprint still grows linearly with n, so it is mainly suited to moderately sized data sets and still falls short for truly big data. Let's look at the "cleverer" Flajolet-Martin algorithm.
This algorithm was proposed by Philippe Flajolet and G. Nigel Martin in the paper "Probabilistic Counting Algorithms for Data Base Applications" (the journal version appeared in 1985), which is where its name comes from, and it is the true ancestor of cardinality estimation algorithms. The paper takes more effort to chew through. I did not, after all, graduate from a mathematics department, so the mathematical details will be rough, but I will make sure they stay true to the original idea.
First define a hash function H(x) whose outputs are as uniformly distributed as possible. Its result space is the set of binary strings of fixed length L (equivalently, integers in the range [0, 2^L)).
For any nonnegative integer y, write bit(y, k) for the value of bit k (k ≥ 0) in the binary representation of y, so that y decomposes as the sum of its bits weighted by powers of two (see the symbols below).
Then define ρ(y) as the position of the first 1, that is, the least significant set bit, in the binary representation of y. In other words, the low end of y's binary representation has exactly ρ(y) consecutive zeros.
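Putting the definitions into symbols (the convention for y = 0 is added here for completeness):

$$
y = \sum_{k \ge 0} \mathrm{bit}(y, k)\, 2^{k}
$$

$$
\rho(y) = \min\{\, k \ge 0 : \mathrm{bit}(y, k) \ne 0 \,\}, \qquad \rho(0) = L
$$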
The flow of the Flajolet-Martin algorithm is as follows; a sketch in code comes right after.
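Here is a minimal sketch of the basic flow, again my own rendering rather than the paper's; MD5 truncated to 32 bits and L = 32 are illustrative choices:

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis
L = 32         # length of the hash bit strings (illustrative choice)

def rho(y):
    """Index of the lowest set bit of y, i.e. the number of trailing zeros."""
    if y == 0:
        return L
    r = 0
    while y & 1 == 0:
        y >>= 1
        r += 1
    return r

def flajolet_martin(elements):
    """Basic (single-bitmap) Flajolet-Martin cardinality estimate."""
    bitmap = [0] * (L + 1)   # one extra slot so rho(0) == L still fits
    for x in elements:
        h = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:4], "big")
        bitmap[rho(h)] = 1

    # R = position of the first 0 bit, scanning from the low end.
    R = 0
    while R < L and bitmap[R] == 1:
        R += 1

    return (1 << R) / PHI    # n_hat = 2^R / phi

if __name__ == "__main__":
    data = [f"user_{i % 5000}" for i in range(20000)]
    print(round(flajolet_martin(data)))  # a rough, power-of-two-grained estimate
```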
Does this feel a little foggy? Let's give a brief proof.
If bitmap[j] = 1, it means that some value in the multiset M has been hashed to a binary string ending in exactly j consecutive zeros. Because the results of H(x) are as close to uniformly distributed as possible, each bit of the hashed binary string follows a 0-1 distribution with p = 1/2, independently of the other bits.
If we view the binary string as the record of a sequence of coin tosses, with 0 for tails and 1 for heads, then "scanning from the low end of the string until the first 1" is exactly "tossing a coin repeatedly until heads appears". This is a Bernoulli experiment with parameter p = 1/2, so we can draw the following conclusions.
Write n for the cardinality |M| of the set; it is then easy to work out the probability that bitmap[j] remains 0 after all n distinct values have been hashed (see below).
That is to say, if j ≫ log₂n, the probability that bitmap[j] = 0 is extremely high; conversely, if j ≪ log₂n, the probability that bitmap[j] = 0 is extremely low.
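In formulas (this is the standard Flajolet-Martin argument; a single hash has exactly j trailing zeros with probability 2^-(j+1)):

$$
P\big(\mathrm{bitmap}[j] = 0\big) = \left(1 - \frac{1}{2^{\,j+1}}\right)^{n} \approx \exp\!\left(-\frac{n}{2^{\,j+1}}\right)
$$

$$
j \gg \log_{2} n \;\Rightarrow\; P\big(\mathrm{bitmap}[j]=0\big) \to 1, \qquad j \ll \log_{2} n \;\Rightarrow\; P\big(\mathrm{bitmap}[j]=0\big) \to 0
$$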
The constant φ ≈ 0.77351 that appears in the estimator is a correction coefficient obtained through a fairly involved calculation, which I won't go into. Under the assumption that H(x) is uniformly distributed, we can draw the following conclusion.
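As a formula (R being the position of the first 0 bit in the bitmap, counted from the low end, as in the sketch above):

$$
E(R) \approx \log_{2}(\varphi\, n), \qquad \text{hence} \quad \hat{n} = \frac{2^{R}}{\varphi}, \quad \varphi \approx 0.77351
$$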
Because the estimate can only change in steps of powers of 2, it is obviously quite coarse. The paper therefore proposes a scheme called PCSA (Probabilistic Counting with Stochastic Averaging). The idea is as follows:
On top of the original algorithm, the single bitmap is extended to m groups. Each hashed value is assigned to group H(x) mod m, and the bit index inside that group is determined from floor(H(x)/m). This way, roughly n/m elements fall into each group, and instead of a single R value we get m of them, one per group; the estimator for n averages them, as expressed below.
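In other words (R_j being the R value observed in group j):

$$
\hat{n} = \frac{m}{\varphi}\; 2^{\frac{1}{m}\sum_{j=1}^{m} R_{j}}
$$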
The paper gives a detailed pseudo-code description; a sketch in code follows.
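A runnable sketch of PCSA under the same illustrative assumptions as before (MD5 as the hash, this time keeping 64 bits of it; not the paper's exact pseudo-code):

```python
import hashlib

PHI = 0.77351
L = 32  # bitmap length per group (illustrative)

def trailing_zeros(y, limit=L):
    """rho(y): number of trailing zero bits of y (limit if y == 0)."""
    if y == 0:
        return limit
    r = 0
    while y & 1 == 0:
        y >>= 1
        r += 1
    return r

def pcsa(elements, m=64):
    """PCSA estimate of the number of distinct elements, using m groups."""
    bitmaps = [[0] * (L + 1) for _ in range(m)]

    for x in elements:
        h = int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")
        group = h % m                     # which group this value belongs to
        idx = trailing_zeros(h // m)      # bit index inside that group
        bitmaps[group][min(idx, L)] = 1

    # R_j = position of the first 0 bit in group j; average them.
    total_R = 0
    for bm in bitmaps:
        r = 0
        while r < L and bm[r] == 1:
            r += 1
        total_R += r

    return (m / PHI) * 2 ** (total_R / m)

if __name__ == "__main__":
    data = [f"user_{i % 50000}" for i in range(200000)]
    print(round(pcsa(data, m=64)))  # should land reasonably close to 50000
```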
The paper also gives a table of the bias and standard error for the corresponding values of m.
Besides PCSA, there are other approaches, such as using multiple hash functions, or taking the median instead of the mean. It is easy to see that the multiple-hash-function route is not very practical: designing many hash functions that are uniform and have few collisions is hard, and computing all those hash values costs CPU time.
Computing the bias and standard error of PCSA exactly is extremely involved; the paper gives numerically computed approximations, roughly as below.
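The approximations commonly quoted from the paper (consistent with the m > 256 remark in the next sentence) are roughly:

$$
\mathrm{Bias} \approx 1 + \frac{0.31}{m}, \qquad \mathrm{StdError} \approx \frac{0.78}{\sqrt{m}}
$$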
It can be seen that both depend only on the choice of m; once m > 256, the standard error drops below 5%.
The bit-array length L does not need to be large; it only needs to comfortably exceed log₂(n/m), as sketched below. For m = 64: with L = 16 the estimable cardinality reaches the order of 10^5, and with L = 24 it reaches the order of tens of millions.
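A commonly quoted rule of thumb that matches these figures (the slack of 4 bits is approximate):

$$
L \;>\; \log_{2}\frac{n}{m} + 4
$$

For example, with m = 64 and L = 16 this allows n up to roughly 64 · 2^12 ≈ 2.6 × 10^5; with L = 24, up to roughly 64 · 2^20 ≈ 6.7 × 10^7.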