However, we find that the two definitions have a rather close connection. Both definitions regard the support of an itemset as a random variable following the Poisson Binomial distribution [2]; that is, the expected support of an itemset equals the expectation of this random variable. Consequently, computing the frequent probability of an itemset is equivalent to evaluating the cumulative distribution function of this random variable. In addition, existing mathematical theory shows that the Poisson distribution and the Normal distribution can approximate the Poisson Binomial distribution with high confidence [31, 10]. In particular, by the Lyapunov Central Limit Theorem [25], the Poisson Binomial distribution converges to the Normal distribution. Moreover, the Poisson Binomial distribution has a useful property: computing its expectation and computing its variance have the same computational complexity. Therefore, the frequent probability of an itemset can be computed directly once we know the expected value and the variance of its support, provided the number of transactions in the uncertain database is large enough [10] (as required by the Lyapunov Central Limit Theorem). In other words, the second definition becomes identical to the first one if the first definition also takes the variance of the support into account. Another interesting consequence is that existing algorithms for mining expected support-based frequent itemsets are applicable to the problem of mining probabilistic frequent itemsets, as long as they also calculate the variance of the support of each itemset when computing its expected support. Thus, the efficiency of mining probabilistic frequent itemsets can be greatly improved, because many efficient expected support-based frequent itemset mining algorithms already exist. In this paper, we verify this conclusion through extensive experimental comparisons.
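As a minimal sketch of this relationship (assuming the per-transaction containment probabilities of an itemset are already available; the function name, inputs, and the use of a continuity correction are illustrative choices, not part of any cited algorithm), the frequent probability can be approximated from just the expectation and the variance of the support:

```python
import math

def frequent_probability_normal(per_tx_probs, minsup):
    """Approximate Pr(sup(X) >= minsup) for an itemset X, where
    per_tx_probs[i] is the probability that X appears in the i-th
    transaction, so sup(X) follows a Poisson Binomial distribution."""
    # Expectation and variance of the Poisson Binomial distribution;
    # both require only a single pass over the probabilities.
    mu = sum(per_tx_probs)
    var = sum(p * (1.0 - p) for p in per_tx_probs)
    if var == 0.0:
        return 1.0 if mu >= minsup else 0.0
    sigma = math.sqrt(var)
    # Normal approximation with a continuity correction of 0.5:
    # Pr(sup(X) >= minsup) ~ 1 - Phi((minsup - 0.5 - mu) / sigma)
    z = (minsup - 0.5 - mu) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Toy usage: 1000 transactions, each containing X with probability 0.3;
# the frequent probability at minsup = 280 comes out around 0.92.
print(frequent_probability_normal([0.3] * 1000, 280))
```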
Besides overlooking the hidden relationship between the two definitions above, existing research under the same definition also reports contradictory conclusions. For example, in the research on mining expected support-based frequent itemsets, [22] shows that the UFP-growth algorithm always outperforms the UApriori algorithm with respect to running time, whereas [4] reports that the UFP-growth algorithm is always slower than the UApriori algorithm. These inconsistent conclusions leave later researchers confused about which result is correct.
The lack of uniform baseline implementations is one of the factors causing such inconsistent conclusions. Different experimental results then stem from discrepancies among implementation details, which blurs the actual contributions of the algorithms. For instance, the implementation of the UFP-growth algorithm uses the "float" type to store each probability, while the implementation of the UH-Mine algorithm adopts the "double" type; the difference in their memory cost therefore cannot objectively reflect the effectiveness of the two algorithms. Uniform baseline implementations eliminate such interference from implementation details and reveal the true contribution of each algorithm.
Besides uniform baseline implementations, the selection of objective and scientific measures is another key factor in a fair experimental comparison. Because uncertain data mining algorithms need to process large amounts of data, running time, memory cost, and scalability are the basic measures once the correctness of the algorithms is guaranteed. In addition, approximate probabilistic frequent itemset mining algorithms have been proposed to trade accuracy for efficiency [10, 31]. To compare the relationship between the two frequent itemset definitions, we use precision and recall as measures of approximation effectiveness. Moreover, since the inconsistent conclusions above may be caused by dependence on datasets, in this work we choose six different datasets, three dense ones and three sparse ones, with different probability distributions (e.g., Normal distribution vs. Zipf distribution, or high probability vs. low probability).
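As an illustration of these two measures, the sketch below computes precision and recall of an approximate result set against the exact one; the itemset representation is only an assumed, illustrative choice:

```python
def precision_recall(approx_itemsets, exact_itemsets):
    """Compare the itemsets reported by an approximate algorithm against
    those reported by an exact probabilistic frequent itemset miner.
    Both arguments are collections of itemsets, each given as a frozenset."""
    approx = set(approx_itemsets)
    exact = set(exact_itemsets)
    tp = len(approx & exact)  # itemsets found by both
    precision = tp / len(approx) if approx else 1.0
    recall = tp / len(exact) if exact else 1.0
    return precision, recall

# Toy usage: exact result {a}, {b}, {a,b}; approximate result {a}, {a,b}, {c}.
exact = [frozenset("a"), frozenset("b"), frozenset("ab")]
approx = [frozenset("a"), frozenset("ab"), frozenset("c")]
print(precision_recall(approx, exact))  # (0.666..., 0.666...)
```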
To sum up, we try to achieve the following goals:
• Clarify the relationship between the two existing definitions of frequent itemsets over uncertain databases. In fact, there is a mathematical correlation between them, so the two definitions can be unified. Based on this relationship, instead of spending expensive computation to mine probabilistic frequent itemsets, we can directly use the solutions for mining expected support-based frequent itemsets as long as the data size is large enough.
• Verify the contradictory conclusions in the existing research and summarize a series of fair results.
• Provide uniform baseline implementations for all existing representative algorithms under the two definitions. These implementations adopt common basic operations and offer a baseline for comparison with future work in this area. In addition, we also propose a novel approximate probabilistic frequent itemset mining algorithm, NDUH-Mine, which combines two existing classic algorithms: the UH-Mine algorithm and the Normal distribution-based frequent itemset mining algorithm.
• Propose an objective and sufficient experimental evaluation and test the performance of the existing representative algorithms over extensive benchmarks.
The rest of the paper is organized as follows. In Section 2, we give basic definitions for mining frequent itemsets over uncertain databases. Eight representative algorithms are reviewed in Section 3. Section 4 presents all the experimental comparisons and performance evaluations. We conclude in Section 5.
2. DEFINITIONS
In this section, we give several basic definitions about mining frequent itemsets over uncertain databases.
Let I = {i1, i2, . . . , in} be a set of distinct items. We call a non-empty subset X of I an itemset. For brevity, we write X = x1x2 . . . xl to denote the itemset X = {x1, x2, . . . , xl}; X is an l-itemset if it has l items. Given an uncertain transaction database UDB, each transaction is denoted as a tuple <tid, Y>, where tid is the transaction identifier and Y = {y1(p1), y2(p2), . . . , ym(pm)} contains m units. Each unit consists of an item yi and a probability pi denoting the probability that item yi appears in transaction tid. The number of transactions containing X in UDB is a random variable, denoted sup(X). Given UDB, expected support-based frequent itemsets and probabilistic frequent itemsets are defined as follows.
Definition 1. (Expected Support) Given an uncertain transaction database UDB which includes N transactions, and an itemset X, the expected support of X is:
esup(X) = Σ_{j=1}^{N} Pr(X ⊆ Tj), where Tj is the j-th transaction in UDB and Pr(X ⊆ Tj) is the probability that X appears in Tj.
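A minimal sketch of this definition, assuming the attribute-independence model commonly adopted in this line of work (the probability that X is contained in a transaction is the product of the per-item probabilities); the toy data and the dictionary representation are purely illustrative:

```python
from functools import reduce

# A toy uncertain database in the tuple format above: each transaction maps
# an item to its appearance probability (the data are made up).
udb = [
    {"a": 0.8, "b": 0.6, "c": 0.1},
    {"a": 0.5, "b": 0.9},
    {"b": 0.7, "c": 0.4},
]

def containment_prob(itemset, transaction):
    """Pr(X subset of T): under the independence assumption this is the
    product of the per-item probabilities (0 if any item is absent)."""
    return reduce(lambda acc, x: acc * transaction.get(x, 0.0), itemset, 1.0)

def expected_support(itemset, udb):
    """Definition 1: sum of containment probabilities over all transactions."""
    return sum(containment_prob(itemset, t) for t in udb)

print(expected_support({"a", "b"}, udb))  # 0.8*0.6 + 0.5*0.9 + 0 = 0.93
```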
