However, we find that the two definitions have a rather close connection. Both definitions regard the support of an itemset as a random variable following the Poisson Binomial distribution [2]; that is, the expected support of an itemset equals the expectation of this random variable. Consequently, computing the frequent probability of an itemset is equivalent to evaluating the cumulative distribution function of this random variable. In addition, existing mathematical theory shows that the Poisson distribution and the Normal distribution can approximate the Poisson Binomial distribution with high confidence [31, 10]. In particular, by the Lyapunov Central Limit Theorem [25], the Poisson Binomial distribution converges to the Normal distribution. Moreover, the Poisson Binomial distribution has a useful property: computing its expectation and computing its variance have the same computational complexity. Therefore, the frequent probability of an itemset can be computed directly once we know the expected value and the variance of its support, provided the number of transactions in the uncertain database is large enough [10] (as required by the Lyapunov Central Limit Theorem). In other words, the second definition becomes identical to the first one if the first definition also takes the variance of the support into account. Another interesting consequence is that existing algorithms for mining expected support-based frequent itemsets are applicable to the problem of mining probabilistic frequent itemsets, as long as they also calculate the variance of the support of each itemset when computing its expected support. Thus, the efficiency of mining probabilistic frequent itemsets can be greatly improved, because many efficient expected support-based frequent itemset mining algorithms already exist. In this paper, we verify this conclusion through extensive experimental comparisons.
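As a minimal sketch of this relationship (assuming the per-transaction containment probabilities of an itemset are already available; the function name, inputs, and the use of a continuity correction are illustrative choices, not part of any cited algorithm), the frequent probability can be approximated from just the expectation and the variance of the support:

```python
import math

def frequent_probability_normal(per_tx_probs, minsup):
    """Approximate Pr(sup(X) >= minsup) for an itemset X, where
    per_tx_probs[i] is the probability that X appears in the i-th
    transaction, so sup(X) follows a Poisson Binomial distribution."""
    # Expectation and variance of the Poisson Binomial distribution;
    # both require only a single pass over the probabilities.
    mu = sum(per_tx_probs)
    var = sum(p * (1.0 - p) for p in per_tx_probs)
    if var == 0.0:
        return 1.0 if mu >= minsup else 0.0
    sigma = math.sqrt(var)
    # Normal approximation with a continuity correction of 0.5:
    # Pr(sup(X) >= minsup) ~ 1 - Phi((minsup - 0.5 - mu) / sigma)
    z = (minsup - 0.5 - mu) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Toy usage: 1000 transactions, each containing X with probability 0.3;
# the frequent probability at minsup = 280 comes out around 0.92.
print(frequent_probability_normal([0.3] * 1000, 280))
```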
Besides overlooking the hidden relationship between the two definitions above, existing research under the same definition also reports contradictory conclusions. For example, in the research on mining expected support-based frequent itemsets, [22] shows that the UFP-growth algorithm always outperforms the UApriori algorithm with respect to running time, whereas [4] reports that the UFP-growth algorithm is always slower than the UApriori algorithm. These inconsistent conclusions leave later researchers confused about which result is correct.
The lack of uniform baseline implementations is one of the factors causing such inconsistent conclusions. Different experimental results then stem from discrepancies among implementation details, which blurs the actual contributions of the algorithms. For instance, the implementation of the UFP-growth algorithm uses the "float" type to store each probability, while the implementation of the UH-Mine algorithm adopts the "double" type; the difference in their memory cost therefore cannot objectively reflect the effectiveness of the two algorithms. Uniform baseline implementations eliminate such interference from implementation details and reveal the true contribution of each algorithm.
Besides uniform baseline implementations, the selection of objective and scientific measures is another key factor in a fair experimental comparison. Because uncertain data mining algorithms need to process large amounts of data, running time, memory cost, and scalability are the basic measures once the correctness of the algorithms is guaranteed. In addition, approximate probabilistic frequent itemset mining algorithms have been proposed to trade accuracy for efficiency [10, 31]. To compare the relationship between the two frequent itemset definitions, we use precision and recall as measures of approximation effectiveness. Moreover, since the inconsistent conclusions above may be caused by dependence on datasets, in this work we choose six different datasets, three dense ones and three sparse ones, with different probability distributions (e.g., Normal distribution vs. Zipf distribution, or high probability vs. low probability).
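As an illustration of these two measures, the sketch below computes precision and recall of an approximate result set against the exact one; the itemset representation is only an assumed, illustrative choice:

```python
def precision_recall(approx_itemsets, exact_itemsets):
    """Compare the itemsets reported by an approximate algorithm against
    those reported by an exact probabilistic frequent itemset miner.
    Both arguments are collections of itemsets, each given as a frozenset."""
    approx = set(approx_itemsets)
    exact = set(exact_itemsets)
    tp = len(approx & exact)  # itemsets found by both
    precision = tp / len(approx) if approx else 1.0
    recall = tp / len(exact) if exact else 1.0
    return precision, recall

# Toy usage: exact result {a}, {b}, {a,b}; approximate result {a}, {a,b}, {c}.
exact = [frozenset("a"), frozenset("b"), frozenset("ab")]
approx = [frozenset("a"), frozenset("ab"), frozenset("c")]
print(precision_recall(approx, exact))  # (0.666..., 0.666...)
```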
To sum up, we try to achieve the following goals:
• Clarify the relationship between the two existing definitions of frequent itemsets over uncertain databases. In fact, there is a mathematical correlation between them, so the two definitions can be unified. Based on this relationship, instead of spending expensive computation to mine probabilistic frequent itemsets, we can directly use the solutions for mining expected support-based frequent itemsets as long as the data size is large enough.
• Verify the contradictory conclusions in the existing research and summarize a series of fair results.
• Provide uniform baseline implementations for all existing representative algorithms under the two definitions. These implementations adopt common basic operations and offer a baseline for comparison with future work in this area. In addition, we also propose a novel approximate probabilistic frequent itemset mining algorithm, NDUH-Mine, which combines two existing classic algorithms: the UH-Mine algorithm and the Normal distribution-based frequent itemset mining algorithm.
• Propose an objective and sufficient experimental evaluation and test the performance of the existing representative algorithms over extensive benchmarks.
The rest of the paper is organized as follows. In Section 2, we give basic definitions for mining frequent itemsets over uncertain databases. Eight representative algorithms are reviewed in Section 3. Section 4 presents all the experimental comparisons and performance evaluations. We conclude in Section 5.
2. DEFINITIONS
In this section, we give several basic definitions about mining frequent itemsets over uncertain databases.
Let I = {i1, i2, . . . , in} be a set of distinct items. We call a non-empty subset X of I an itemset. For brevity, we write X = x1x2 . . . xl to denote the itemset X = {x1, x2, . . . , xl}; X is an l-itemset if it has l items. Given an uncertain transaction database UDB, each transaction is denoted as a tuple <tid, Y>, where tid is the transaction identifier and Y = {y1(p1), y2(p2), . . . , ym(pm)} contains m units. Each unit consists of an item yi and a probability pi denoting the probability that item yi appears in transaction tid. The number of transactions containing X in UDB is a random variable, denoted sup(X). Given UDB, expected support-based frequent itemsets and probabilistic frequent itemsets are defined as follows.
Definition 1. (Expected Support) Given an uncertain transaction database UDB which includes N transactions, and an itemset X, the expected support of X is:
esup(X) = Σ_{j=1}^{N} Pr(X ⊆ Tj), where Tj is the j-th transaction in UDB and Pr(X ⊆ Tj) is the probability that X appears in Tj.
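A minimal sketch of this definition, assuming the attribute-independence model commonly adopted in this line of work (the probability that X is contained in a transaction is the product of the per-item probabilities); the toy data and the dictionary representation are purely illustrative:

```python
from functools import reduce

# A toy uncertain database in the tuple format above: each transaction maps
# an item to its appearance probability (the data are made up).
udb = [
    {"a": 0.8, "b": 0.6, "c": 0.1},
    {"a": 0.5, "b": 0.9},
    {"b": 0.7, "c": 0.4},
]

def containment_prob(itemset, transaction):
    """Pr(X subset of T): under the independence assumption this is the
    product of the per-item probabilities (0 if any item is absent)."""
    return reduce(lambda acc, x: acc * transaction.get(x, 0.0), itemset, 1.0)

def expected_support(itemset, udb):
    """Definition 1: sum of containment probabilities over all transactions."""
    return sum(containment_prob(itemset, t) for t in udb)

print(expected_support({"a", "b"}, udb))  # 0.8*0.6 + 0.5*0.9 + 0 = 0.93
```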
