Mining Frequent Itemsets over Uncertain Databases
Yongxin Tong † Lei Chen † Yurong Cheng ‡ Philip S. Yu §
†Hong Kong University of Science & Technology, Hong Kong, China
‡Northeastern University, China
§University of Illinois at Chicago, USA
†{yxtong, leichen}@cse.ust.hk, ‡cyrneu@gmail.com, §psyu@cs.uic.edu
ABSTRACT
In recent years, due to the wide applications of uncertain data,
mining frequent itemsets over uncertain databases has attracted
much attention. In uncertain databases, the support
of an itemset is a random variable instead of a fixed occurrence
count of this itemset. Thus, unlike the corresponding
problem in deterministic databases where the frequent
itemset has a unique definition, the frequent itemset under
uncertain environments has two different definitions so far.
The first definition, referred to as the expected support-based
frequent itemset, employs the expectation of the support
of an itemset to measure whether this itemset is frequent.
The second definition, referred to as the probabilistic frequent
itemset, uses the probability that the support of an itemset
reaches a minimum support to measure its frequency. Thus, existing work on mining
frequent itemsets over uncertain databases is divided into
two different groups and no study is conducted to comprehensively
compare the two different definitions. In addition,
since no uniform experimental platform exists, current solutions
for the same definition even generate inconsistent
results. In this paper, we first aim to clarify the relationship
between the two different definitions. Through extensive
experiments, we verify that the two definitions have a
tight connection and can be unified when the size of
data is large enough. Secondly, we provide baseline implementations
of eight existing representative algorithms and
test their performance fairly with uniform measures. Finally,
based on these fair tests over many different benchmark
data sets, we clarify several existing inconsistent conclusions
and discuss some new findings.
1. INTRODUCTION
Recently, with many new applications, such as sensor network
monitoring [23, 24, 26], moving object search [13, 14,
15] and protein-protein interaction (PPI) network analysis
[29], uncertain data mining has become a hot topic in the data
mining community [3, 4, 5, 6, 20, 21]. Since the problem of
frequent itemset mining is fundamental in the data mining area,
mining frequent itemsets over uncertain databases has also
attracted much attention [4, 9, 10, 11, 17, 18, 22, 28, 30, 31,
33]. For example, with the popularization of wireless sensor
networks, wireless sensor network systems collect huge
amounts of data. However, due to the inherent uncertainty
of sensors, the collected data are often inaccurate. For
such probabilistic uncertain data, how can we discover
frequent patterns (itemsets) so that the users can understand
the hidden rules in data? The inherent probabilistic
nature of the data is ignored if we simply apply the traditional
methods of frequent itemset mining for deterministic data
to uncertain data. Thus, it is necessary to design specialized
algorithms for mining frequent itemsets over uncertain
databases.
Before finding frequent itemsets over uncertain databases,
the definition of the frequent itemset is the most essential
issue. In deterministic data, it is clear that an itemset is frequent
if and only if the support (frequency) of this itemset
is not smaller than a specified minimum support, min_sup
[7, 8, 19, 32]. However, different from the deterministic case,
the definition of a frequent itemset over uncertain data has
two different semantic explanations: expected support-based
frequent itemset [4, 18] and probabilistic frequent itemset [9].
Both consider the support of an itemset as a discrete
random variable. However, the two definitions
differ in how they use this random variable to define frequent
itemsets. In the definition of the expected support-based
frequent itemset, the expectation of the support of an itemset
is used as the measurement, called the expected
support of this itemset. In this definition [4, 17, 18, 22], an
itemset is frequent if and only if its expected support
is no less than a specified minimum expected support
threshold, min_esup. In the definition of the probabilistic
frequent itemset [9, 28, 31], the probability that an itemset
appears at least the minimum support (min_sup) times is
used as the measurement, called the frequent probability
of an itemset, and an itemset is frequent if and only if
its frequent probability is larger than a given
probabilistic threshold.
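
To make the two measurements concrete, the following is a minimal sketch rather than any of the implementations evaluated in this paper. It assumes a simple attribute-uncertainty model in which each transaction maps an item to an existence probability and items are independent; the names `Transaction`, `containment_probs`, `expected_support` and `frequent_probability` are hypothetical. The frequent probability is computed exactly with a straightforward dynamic program over the transactions.

```python
from typing import Dict, FrozenSet, List

# Hypothetical data model: item -> existence probability within one transaction.
Transaction = Dict[str, float]


def containment_probs(db: List[Transaction], itemset: FrozenSet[str]) -> List[float]:
    """Probability that each transaction contains every item of `itemset`.

    Assumes items inside a transaction are independent (an assumption of this sketch).
    """
    probs = []
    for t in db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)  # an absent item has probability 0
        probs.append(p)
    return probs


def expected_support(probs: List[float]) -> float:
    """Expectation of the support: the sum of the per-transaction probabilities."""
    return sum(probs)


def frequent_probability(probs: List[float], min_sup: int) -> float:
    """Pr[support >= min_sup], computed exactly by dynamic programming."""
    # dist[k] = probability that exactly k of the transactions processed so far
    # contain the itemset.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += q * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[min_sup:])


db = [{"a": 0.8, "b": 0.5}, {"a": 0.6}, {"a": 0.9, "b": 0.7}]
probs = containment_probs(db, frozenset({"a"}))
esup = expected_support(probs)
fprob = frequent_probability(probs, min_sup=2)
```

Under the first definition, {a} is frequent if `esup` is no less than min_esup; under the second, it is frequent if `fprob` exceeds the probabilistic threshold. The dynamic program shown here takes time quadratic in the number of transactions, which is one reason cheaper approximations are attractive.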
The definition of the expected support-based frequent itemset
uses the expectation to measure the uncertainty, which is a
simple extension of the definition of the frequent itemset in
deterministic data. The definition of the probabilistic frequent
itemset includes the complete probability distribution of the
support of an itemset. Although the expectation is known
as an important statistic, it cannot show the complete probability
distribution. Most prior research holds that the
two definitions should be studied separately [9, 28, 31].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 38th International Conference on Very Large Data Bases,
August 27th - 31st, 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 11
Copyright 2012 VLDB Endowment 2150-8097/12/07... $ 10.00.
However, we find that the two definitions have a rather
close connection. Both definitions consider the support of an
itemset as a random variable following the Poisson Binomial
distribution [2]; that is, the expected support of an itemset
equals the expectation of this random variable. Consequently,
computing the frequent probability of an itemset is
equivalent to evaluating the cumulative distribution function
of this random variable. In addition, existing mathematical
theory shows that the Poisson distribution and the Normal
distribution can approximate the Poisson Binomial distribution
with high confidence [10, 31]. By the Lyapunov Central
Limit Theorem [25], the Poisson Binomial distribution converges
to the Normal distribution as the number of transactions grows. Moreover,
the Poisson Binomial distribution has a useful property:
its expectation and its variance can be computed with
the same computational complexity. Therefore,
the frequent probability of an itemset can be directly computed
as long as we know the expected value and variance
of the support of this itemset, provided that the number of transactions
in the uncertain database is large enough [10] (due
to the requirement of the Lyapunov Central Limit Theorem).
In other words, the second definition coincides with the
first definition if the first definition also takes the variance
of the support into account. Moreover, another
interesting result is that existing algorithms for mining expected
support-based frequent itemsets are applicable to the
problem of mining probabilistic frequent itemsets as long as
they also calculate the variance of the support of each itemset
when they calculate each expected support. Thus, the
efficiency of mining probabilistic frequent itemsets can be
greatly improved thanks to the many efficient expected
support-based frequent itemset mining algorithms that already exist.
In this paper, we verify this conclusion through extensive
experimental comparisons.
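
The Normal approximation described above can be sketched as follows. This is an illustration under stated assumptions, not the code of the approximate algorithms in [10, 31]: the function names are hypothetical, the input is assumed to be the list of per-transaction probabilities that the itemset appears, and the continuity correction of 0.5 is a choice made for this sketch. Note that the expectation and the variance are both obtained in a single linear pass.

```python
import math
from typing import List, Tuple


def support_moments(probs: List[float]) -> Tuple[float, float]:
    """Expectation and variance of a Poisson Binomial support, in one linear pass."""
    mean = 0.0
    var = 0.0
    for p in probs:
        mean += p
        var += p * (1.0 - p)
    return mean, var


def normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_frequent_probability(probs: List[float], min_sup: int) -> float:
    """Normal approximation of Pr[support >= min_sup].

    Only meaningful when the number of transactions is large; the 0.5
    continuity correction is an assumption of this sketch.
    """
    mean, var = support_moments(probs)
    if var == 0.0:
        # Degenerate case: the support is deterministic.
        return 1.0 if mean >= min_sup else 0.0
    z = (min_sup - 0.5 - mean) / math.sqrt(var)
    return 1.0 - normal_cdf(z)


# 1000 transactions, each containing the itemset with probability 0.5.
probs = [0.5] * 1000
approx = approx_frequent_probability(probs, min_sup=500)
```

Because only the two moments are needed, an expected support-based miner that already accumulates the sum of the probabilities can accumulate the sum of p(1 - p) in the same pass and then decide probabilistic frequentness with one evaluation of the Normal cumulative distribution function.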
Besides overlooking the hidden relationship between
the two above definitions, existing research on the same definition
also shows contradictory conclusions. For example,
in the research on mining expected support-based frequent
itemsets, [22] shows that the UFP-growth algorithm always outperforms
the UApriori algorithm with respect to running
time. However, [4] reports that the UFP-growth algorithm is
always slower than the UApriori algorithm. These inconsistent
conclusions leave later researchers unsure which
result is correct.
The lack of uniform baseline implementations is one
of the factors causing the inconsistent conclusions. Without them,
differing experimental results can stem from discrepancies
in implementation details, obscuring
the actual contributions of the algorithms. For instance, the implementation
of the UFP-growth algorithm uses the "float" type to
store each probability, while the implementation of the UHMine
algorithm adopts the "double" type. The difference
in their memory cost therefore cannot objectively reflect the effectiveness of the
two algorithms. Thus, uniform baseline implementations
can eliminate interference from implementation
details and report the true contribution of each algorithm.
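
The float versus double discrepancy can be illustrated with a small sketch. It is not taken from either implementation; it only uses Python's standard `struct` module to show that a probability stored in single precision occupies half the space of one stored in double precision and is rounded differently, so memory figures measured on such implementations are not directly comparable.

```python
import struct

p = 0.1  # an example existence probability

# Round-trip the value through 32-bit and 64-bit IEEE 754 encodings.
as_float = struct.unpack("<f", struct.pack("<f", p))[0]
as_double = struct.unpack("<d", struct.pack("<d", p))[0]

float_bytes = struct.calcsize("<f")    # size of one single-precision value
double_bytes = struct.calcsize("<d")   # size of one double-precision value

print(float_bytes, double_bytes)
print(as_float == p, as_double == p)
```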
Besides uniform baseline implementations, the selection of
objective and scientific measures is also one of the most important
factors in a fair experimental comparison. Because
uncertain data mining algorithms need to process a large
amount of data, running time, memory cost and scalability
are the basic measures once the correctness of the algorithms
is guaranteed. In addition, to trade accuracy for efficiency,
approximate probabilistic frequent itemset mining
algorithms have also been proposed [10, 31]. To compare the relationship
between the two frequent itemset definitions, we
use precision and recall as measures to evaluate the approximation
effectiveness. Moreover, since the above inconsistent
conclusions may be caused by dependence on the datasets,
in this work we choose six different datasets, three dense
ones and three sparse ones, with different probability distributions
(e.g. Normal distribution vs. Z
