FD_Mine:Discovering
Functional Dependencies in a Database
Using Equivalences
Hong Yao, Howard J. Hamilton and Cory J. Butz
Technical Report TR 2002-04
August, 2002
Copyright © 2002 Hong Yao, Howard J. Hamilton and Cory J. Butz
Department of Computer Science
University of Regina
Regina, Saskatchewan
Canada S4S 0A2
ISBN 0-7731-0441-0
FD_Mine: Discovering Functional Dependencies
in a Database Using Equivalences
Hong Yao, Howard J. Hamilton, and Cory Butz
Department of Computer Science, University of Regina
Regina, SK, Canada, S4S 0A2
{yao2hong, hamilton, butz}@cs.uregina.ca
Abstract
Functional dependency (FD) traditionally plays an important role in the design of
relational databases, and the study of FDs has produced a rich and elegant theory. The
discovery of FDs from databases has recently become a significant research problem. In
this paper, we propose a new algorithm, called FD_Mine, for the discovery of all minimal
FDs from a database. FD_Mine takes advantage of the rich theory of FDs to guide the
search for FDs. More specifically, the use of FD theory can reduce both the size of the
dataset and the number of FDs to be checked by pruning redundant data and skipping the
search for FDs that follow logically from the FDs already discovered. We show that our
method is sound, that is, the pruning does not lead to loss of information. Experiments on
15 UCI datasets show that FD_Mine can prune more candidates than previous methods.
Keywords
Data mining, functional dependency, relational database theory
1. Introduction
This paper proposes a new method for finding functional dependencies in data. A
functional dependency (FD) expresses a value constraint between attributes in a relation
[7]. Formally, for a relational schema R with X ⊆ R and A ∈ R, a functional dependency
X → A holds if, for all pairs of tuples t1 and t2 over R, t1[B] = t2[B] for every B ∈ X
implies t1[A] = t2[A]. If A is not functionally dependent on any proper subset of X, then
X → A is minimal [5]. Henceforth, all FDs mentioned in this paper can be assumed to be
minimal.
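The definition can be checked directly by comparing pairs of tuples. The following is a minimal sketch rather than part of FD_Mine; the function names `fd_holds` and `is_minimal` and the three-tuple relation `rows` are assumptions introduced only for illustration.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """True if lhs -> rhs holds in rows (a list of dicts), by the pairwise definition."""
    for t1, t2 in combinations(rows, 2):
        # A violation is a pair that agrees on every attribute of lhs but differs on rhs.
        if all(t1[b] == t2[b] for b in lhs) and t1[rhs] != t2[rhs]:
            return False
    return True

def is_minimal(rows, lhs, rhs):
    """True if lhs -> rhs holds and rhs depends on no proper subset of lhs."""
    if not fd_holds(rows, lhs, rhs):
        return False
    for k in range(len(lhs)):
        for subset in combinations(lhs, k):
            if fd_holds(rows, subset, rhs):
                return False
    return True

# Hypothetical relation over attributes X, Y, A (not taken from the paper).
rows = [
    {"X": 1, "Y": 1, "A": "p"},
    {"X": 1, "Y": 2, "A": "p"},
    {"X": 2, "Y": 1, "A": "q"},
]

print(fd_holds(rows, ("X",), "A"))        # True
print(is_minimal(rows, ("X", "Y"), "A"))  # False, because X -> A already holds
```

The pairwise check is quadratic in the number of tuples, which is one reason the methods discussed in Section 2 avoid it.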
FDs play an important role in relational theory and relational database design [7]. In
early work, the study of FDs focused on the fact that data consistency could be
guaranteed by using FDs to reduce the amount of redundant data. FDs were usually
obtained from semantic models rather than from data. Current research is based on the
fact that a dataset may contain FDs that are independent of the relational model of the
dataset. It is useful to discover these FDs. For example, from a database of chemical
compounds, it is valuable to discover compounds that are functionally dependent on a
certain structure attribute [5]. In addition, because FDs are a kind of data dependency [3, 12], a large
dataset can be losslessly decomposed into a set of smaller datasets using the discovered
FDs. As a result, the discovery of FDs from databases has recently become a popular
research problem [2, 4, 5, 6, 8, 11, 12, 13].
Research on FDs was an important aspect of relational database design [7] in the
1980s, and led to achievements such as the Armstrong Axioms, minimal cover, closure
and an algorithm for decomposing a schema into third normal form while preserving
dependencies. The Armstrong axioms can be stated as three rules [7]: Reflexivity rule: if
Y ⊆ X, then X → Y; Augmentation rule: if X → Y, then XZ → YZ; Transitivity rule: if
X → Y and Y → Z, then X → Z. These rules can be applied repeatedly to infer all FDs
implied by a set of FDs.
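In practice, repeated application of these rules is usually carried out by computing an attribute closure. The sketch below shows the standard textbook closure procedure, not a routine from FD_Mine; the names `closure` and `implies` and the example FDs are assumptions for illustration.

```python
def closure(attrs, fds):
    """Set of attributes determined by attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything on the left is already determined, so is the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds by the Armstrong axioms."""
    return set(rhs) <= closure(lhs, fds)

fds = [("X", "Y"), ("Y", "Z")]
print(implies(fds, "X", "Z"))    # True (transitivity)
print(implies(fds, "XW", "YW"))  # True (augmentation)
print(implies(fds, "Z", "X"))    # False
```

An FD for which `implies` returns True never needs to be checked against the data.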
Relational database theory forms the underlying basis for the FD_Mine algorithm.
Combining domain knowledge with a data mining algorithm is better than using either
alone. The Apriori algorithm [1] for discovering frequent itemsets can be used to find
FDs from a dataset by setting the minimum support to 2/n, where n is the number of
instances in the dataset. In this case, the discovered FDs are the association rules with
100% confidence that can be formed from the itemsets. Typically, only a small fraction
of the rules have 100% confidence. This constraint can be embedded into the algorithm to
increase efficiency, as we do in FD_Mine.
The FD_Mine algorithm improves efficiency by pruning redundant candidates. A
candidate is a combination of attributes over a dataset. To delete redundant candidates
from the database, we use four pruning rules. For example, if the FDs A → B and B → A
are discovered, then no further candidates containing B need be considered, since
attributes A and B are equivalent. This pruning is valuable because the number of
candidates increases exponentially with the number of attributes.
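The effect of dropping an equivalent attribute on the size of the search space can be illustrated with a short sketch. It handles only equivalences between single attributes and is not the paper's pruning rule itself; the helper names `candidates` and `drop_equivalent` are assumptions.

```python
from itertools import combinations

def candidates(attributes):
    """All non-empty attribute combinations, smallest first."""
    return ["".join(c)
            for k in range(1, len(attributes) + 1)
            for c in combinations(attributes, k)]

def drop_equivalent(attributes, fds):
    """Keep one representative of each pair a, b with both a -> b and b -> a in fds."""
    fd_set = set(fds)
    kept = []
    for a in attributes:
        if any((a, k) in fd_set and (k, a) in fd_set for k in kept):
            continue  # a is equivalent to an attribute that is already kept
        kept.append(a)
    return kept

discovered = [("A", "B"), ("B", "A")]
remaining = drop_equivalent("ABCDE", discovered)
print(remaining)                       # ['A', 'C', 'D', 'E']
print(len(candidates("ABCDE")))        # 31
print(len(candidates(remaining)))      # 15
```

Removing one of five attributes roughly halves the number of combinations, since n attributes give 2^n - 1 non-empty combinations.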
To prune redundant candidates, relationships among the FDs are analyzed. The set of
FDs can be divided into two parts: FDs that can be inferred from the discovered FDs
using the theory of relational databases, and those that cannot. The FD_Mine algorithm
only examines the database to find the second type of FDs. Relevant aspects of relational
database theory are collected here as lemmas, theorems, properties, and pruning rules.
For example, if A → B and C → D are discovered to hold in a database, then AC → BD
must also hold, so it does not need to be checked in the database. By eliminating redundant
data and pruning candidates, the FD_Mine algorithm improves mining performance.
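The inference of AC → BD in the example above can be written out with the Armstrong axioms. This is one possible derivation, given here only to make the reasoning step explicit:

```latex
\begin{align*}
A \to B \;&\Longrightarrow\; AC \to BC  && \text{(augmentation with } C\text{)} \\
C \to D \;&\Longrightarrow\; BC \to BD  && \text{(augmentation with } B\text{)} \\
AC \to BC,\; BC \to BD \;&\Longrightarrow\; AC \to BD && \text{(transitivity)}
\end{align*}
```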
The remainder of this paper is organized as follows. A statement of the problem and an
example of it are given in section 2. In section 3, the relationship among FDs is analyzed,
and the formal definitions, lemmas, theorems, and properties are given. Pruning rules and
the FD_Mine algorithm are presented in section 4. A detailed example is also discussed in
this section. Next, the experimental results are shown in section 5. Finally, conclusions
are drawn in section 6.
2. Problem Statement
The problem addressed in this paper is to find all functional dependencies among
attributes in a database relation. Specifically, we want to improve on previously proposed
methods for this problem.
Early methods for discovering FDs were based on repeatedly sorting and
comparing tuples to determine whether or not these tuples meet the FD definition. For
example, in Table 2.1, the tuples are first sorted on attribute A, then each pair of tuples
that have the same value on attribute A is compared on attributes B, C, D, and E, in turn,
to decide whether or not A → B, A → C, A → D, or A → E holds. Then the tuples are sorted
on attribute B and examined to decide whether or not B → A, B → C, B → D, or B → E holds.
This process is repeated for C, D, E, AB, AC, AD, and so on. After the last candidate
BCDE has been checked, all FDs will have been discovered. All candidates of five
attributes are represented in Figure 2.1.
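The sort-and-compare procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not code from any of the cited methods; the names `holds_by_sorting` and `naive_discovery` and the small relation over P, Q, R are hypothetical.

```python
from itertools import combinations, groupby

def holds_by_sorting(rows, lhs, rhs):
    """Sort on lhs, then require a single rhs value within each run of equal lhs values."""
    def key(t):
        return tuple(t[a] for a in lhs)
    for _, group in groupby(sorted(rows, key=key), key=key):
        if len({t[rhs] for t in group}) > 1:
            return False
    return True

def naive_discovery(rows, attributes):
    """Visit every candidate of size 1..n-1, level by level, and test every other attribute."""
    found = []
    for k in range(1, len(attributes)):
        for lhs in combinations(attributes, k):
            for rhs in attributes:
                if rhs not in lhs and holds_by_sorting(rows, lhs, rhs):
                    found.append(("".join(lhs), rhs))
    return found

# Hypothetical relation (not Table 2.1).
rows = [
    {"P": 1, "Q": "x", "R": 0},
    {"P": 1, "Q": "x", "R": 1},
    {"P": 2, "Q": "y", "R": 0},
]
print(naive_discovery(rows, ["P", "Q", "R"]))
# [('P', 'Q'), ('Q', 'P'), ('PR', 'Q'), ('QR', 'P')]
```

Every candidate triggers a fresh sort, and the output contains PR → Q and QR → P even though they follow from P → Q and Q → P.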
The disadvantage of this approach is that it does not utilize the discovered FDs as
knowledge to obtain new knowledge. If A → B has been discovered, a check is still made
to determine whether or not AC → B holds, by sorting on attributes AC and comparing on
attribute B. Instead, AC → B can be directly inferred from the previously obtained A → B
without sorting and comparing tuples again. This approach is inefficient because of this
extra sorting and because it needs to examine every value of the candidate attributes to
decide whether or not a FD holds. As a result, this approach is highly sensitive to the
number of tuples and attributes. It is impracticable for a large dataset.
      A  B  C  D  E
t1    0  0  0  1  0
t2    0  1  0  1  0
t3    0  2  0  1  2
t4    0  3  1  1  0
t5    4  1  1  2  4
t6    4  3  1  2  2
t7    0  0  1  1  0
Table 2.1 An example dataset.
Figure 2.1 Lattice for 5 attributes.
Recent papers have proposed algorithms that do not sort on any attribute or compare
any values. Mannila et al. [8, 9, 10] introduced the concept of a partition, which places
tuples that have the same values for an attribute into the same group. The problem of
determining whether or not a FD holds on a given dataset can be addressed by comparing
the number of the groups among partitions for various attributes. For the dataset r, shown
in Table 2.1, the partition for attribute A can be denoted as Π_A(r) = {{t1, t2, t3, t4, t7}, {t5, t6}}.
Because the values of tuples t1, t2, t3, t4, and t7 on attribute A are all the same, they
are assigned to the same group. Similarly, because the values of t5 and t6 are the same,
they are placed in another group. The partition for the attribute combination AD for Table
2.1 is Π_AD(r) = {{t1, t2, t3, t4, t7}, {t5, t6}}. The cardinality of the partition |Π_A(r)|, which is
the number of groups in partition Π_A, is 2, and |Π_AD(r)| is 2 too. Because |Π_A(r)| is equal
to |Π_AD(r)|, A → D can be obtained [5].
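The partition test can be reproduced with a few lines of code. The sketch below is an assumption-laden illustration, not the implementation used in [5] or in FD_Mine: the names `partition` and `holds_by_partition` are hypothetical, and the value of D for t7 is taken as 1, the value implied by the partition Π_AD(r) given in the text.

```python
from collections import defaultdict

# Rows of Table 2.1; D for t7 is taken as 1, as implied by the stated partition for AD.
ROWS = {
    "t1": dict(A=0, B=0, C=0, D=1, E=0),
    "t2": dict(A=0, B=1, C=0, D=1, E=0),
    "t3": dict(A=0, B=2, C=0, D=1, E=2),
    "t4": dict(A=0, B=3, C=1, D=1, E=0),
    "t5": dict(A=4, B=1, C=1, D=2, E=4),
    "t6": dict(A=4, B=3, C=1, D=2, E=2),
    "t7": dict(A=0, B=0, C=1, D=1, E=0),
}

def partition(rows, attrs):
    """Group tuple ids by their values on attrs (a string of attribute names)."""
    groups = defaultdict(list)
    for tid, t in rows.items():
        groups[tuple(t[a] for a in attrs)].append(tid)
    return list(groups.values())

def holds_by_partition(rows, lhs, rhs):
    """lhs -> rhs holds exactly when adding rhs to lhs does not split any group."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

print(partition(ROWS, "A"))   # [['t1', 't2', 't3', 't4', 't7'], ['t5', 't6']]
print(partition(ROWS, "AD"))  # [['t1', 't2', 't3', 't4', 't7'], ['t5', 't6']]
print(holds_by_partition(ROWS, "A", "D"))  # True
print(holds_by_partition(ROWS, "A", "B"))  # False
```

Only group counts are compared, so no sorting or pairwise comparison of tuples is required.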
Algorithm TANE [5] uses the partition concept to discover FDs. In addition, the set of
candidates is pruned based on the discovered FDs. For instance, if AC → B has been
discovered, then ACD → B and ACDE → B can be inferred from AC → B without checking
the data, so candidates ACD and ACDE are redundant. According to the dependencies in
a dataset, only a portion of the lattice shown in Figure 2.1 may need to be traversed to
find all FDs in the relation.
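The superset pruning described above can be sketched as a filter over candidates. This covers only the rule mentioned in the text and should not be read as TANE's full pruning strategy; the name `redundant_for` and the candidate list are assumptions for illustration.

```python
def redundant_for(candidate, rhs, found):
    """True if a discovered FD lhs -> rhs has lhs as a proper subset of candidate."""
    return any(r == rhs and set(lhs) < set(candidate) for lhs, r in found)

found = [("AC", "B")]                          # AC -> B has already been discovered
level = ["ACD", "ACE", "ADE", "CDE", "ACDE"]   # candidate left-hand sides for B

to_check = [c for c in level if not redundant_for(c, "B", found)]
print(to_check)  # ['ADE', 'CDE']
```

Candidates ACD, ACE, and ACDE all contain AC, so their dependency on B is implied and the data need not be consulted for them.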
Algorithm FUN [11, 12] uses a procedure that checks for embedded FDs, which are
FDs that hold on the projection of the dataset. By using embedded FDs, other candidates
can be pruned. For example, suppose that A → B holds over ABC. If |Π_AB(r)| > |Π_BC(r)|,
then BC → A does not hold over ABC. On synthetic datasets with correlation rates of 30%
to 70%, FUN is faster than TANE. Apparently, FUN was not tested on any UCI datasets.
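One way to see why the cardinality comparison in the example rules out BC → A is to combine it with the partition criterion stated earlier. The following derivation is an explanatory sketch and is not quoted from [11, 12]:

```latex
\begin{align*}
BC \to A \text{ holds} \;&\iff\; |\Pi_{BC}(r)| = |\Pi_{ABC}(r)|, \\
|\Pi_{ABC}(r)| \;&\ge\; |\Pi_{AB}(r)| && \text{(adding } C \text{ can only split groups)}, \\
|\Pi_{AB}(r)| > |\Pi_{BC}(r)| \;&\Longrightarrow\; |\Pi_{ABC}(r)| > |\Pi_{BC}(r)| \;\Longrightarrow\; BC \not\to A .
\end{align*}
```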
Our research addresses two related questions. First, can other information from
discovered FDs be used to prune more candidates than previous approaches? Second,
can this pruning be done so that the overall efficiency of the algorithm is improved? We
address both of these questions by further considering the theoretical properties of FDs,
formulating the FD_Mine algorithm to take advantage of these properties, and testing the
algorithm on a variety of datasets.
3. Theoretical A


0/5000

Từ: -

Sang: -

Kết quả (Việt) 1: [Sao chép]

Sao chép!

FD_Mine: khám pháChức năng phụ thuộc trong cơ sở dữ liệuBằng cách sử dụng EquivalencesHong Yao, Howard J.Hamilton và Cory J. ButzBáo cáo kỹ thuật TR 2002-04Tháng 8 năm 2002Bản quyền Ó 2002 Hong Yao, Howard J.Hamilton và Cory J. ButzVùng máy tính khoa họcTrường đại học của ReginaRegina, SaskatchewanCanada S4S 0A2ISBN 0-7731-0441-02FD_Mine: Khám phá chức năng phụ thuộctrong cơ sở dữ liệu bằng cách sử dụng EquivalencesHong Yao, Howard J.Hamilton và Cory ButzVùng máy tính khoa học, trường đại học của ReginaRegina, SK, Canada, S4S 0A2{yao2hong, hamilton, butz}@cs.uregina.caTóm tắtChức năng phụ thuộc (FD) truyền thống đóng một vai trò quan trọng trong việc thiết kếcơ sở dữ liệu quan hệ, và nghiên cứu của FDs đã sản xuất một lý thuyết giàu và thanh lịch. Cáckhám phá FDs từ cơ sở dữ liệu mới đã trở thành một vấn đề nghiên cứu quan trọng. Ởbài báo này, chúng tôi đề xuất một thuật toán mới, được gọi là FD_Mine, cho việc phát hiện tất cả tối thiểuFDs từ cơ sở dữ liệu. FD_Mine lợi dụng lý thuyết phong phú của FDs để hướng dẫn cácTìm kiếm các FDs. Cụ thể hơn, việc sử dụng các FD lý thuyết có thể giảm cả hai kích thước của cácsố liệu và số lượng các FDs để được kiểm tra bởi cắt tỉa dữ liệu dự phòng và bỏ qua cácTìm kiếm các FDs mà theo một cách hợp lý từ các FDs đã phát hiện ra. Chúng tôi thấy rằng chúng tôiphương pháp là âm thanh, có nghĩa là, cắt tỉa không dẫn đến mất thông tin. Các thí nghiệm trên15 UCI datasets Hiển thị FD_Mine có thể prune ứng cử viên thêm hơn phương pháp trước đó.Từ khóaKhai thác dữ liệu, chức năng phụ thuộc, lý thuyết cơ sở dữ liệu quan hệ1. giới thiệuBài báo này đề xuất một phương pháp mới cho việc tìm kiếm chức năng phụ thuộc vào dữ liệu. Achức năng phụ thuộc (FD) thể hiện một hạn chế giá trị giữa các thuộc tính trong một mối quan hệ[7]. chính thức, đối với một đồ điểm quan hệ R với X ÍR và AÎR, một phụ thuộc chức năngXA có thể được xác định nếu cho tất cả các cặp tuples t1 và t2 trên R, cho tất cả BÎX nếu t1 [B] = t2 [B]sau đó t1 [A] = t2 [A]. 
Nếu A không phải là chức năng phụ thuộc vào bất kỳ tập hợp con thích hợp của X, sau đóXA là tối thiểu [5]. Từ đó, tất cả FDs được đề cập trong bài báo này có thể được giả định làtối thiểu.FDs đóng một vai trò quan trọng trong quan hệ lý thuyết và thiết kế cơ sở dữ liệu quan hệ [7]. Ởcông việc đầu, nghiên cứu của FDs tập trung vào thực tế là sự đồng bộ dữ liệu có thểbảo đảm bằng cách sử dụng FDs để giảm số lượng dữ liệu dự phòng. FDs thườngthu được từ ngữ nghĩa mô hình chứ không phải là từ dữ liệu. Nghiên cứu hiện nay dựa trên cácthực tế rằng FDs có thể tồn tại trong một bộ dữ liệu độc lập của các mô hình quan hệ của cácbộ dữ liệu. Nó là hữu ích để khám phá các FDs. Ví dụ, từ một cơ sở dữ liệu hóa chấthợp chất, nó có giá trị để khám phá các hợp chất có chức năng phụ thuộc vào mộtmột số thuộc tính cấu trúc [5]. Ngoài ra, như là một loại dữ liệu phụ thuộc [3, 12], một lớnbộ dữ liệu có thể được losslessly phân hủy thành một tập hợp các datasets nhỏ hơn bằng cách sử dụng các phát hiệnFDs. Do đó, việc phát hiện ra FDs từ cơ sở dữ liệu mới trở thành một phổ biếnresearch problem [2, 4, 5, 6, 8, 11, 12, 13].3Research on FDs was an important aspect of relational database design [7] in the1980s, and led to achievements such as the Armstrong Axioms, minimal cover, closureand an algorithm for decomposing a schema into third normal form while preservingdependency. The Armstrong axioms can be stated as three rules [7]: Reflexivity rule: ifYÍX, then XY; Augmentation rule: if XY, then XZYZ; Transitivity rule: ifXY and YZ, then XY. These rules can be applied repeatedly to infer all FDsimplied by a set of FDs.Relational database theory forms the underlying basis for the FD_Mine algorithm.Combining domain knowledge and data mining algorithm is better than using eitheralone. The Apriori algorithm [1] for discovering frequent itemsets can be used to findFDs from a dataset by setting the minimum support to 2/n, where n is the number ofinstances in the dataset. 
In this case, the discovered FDs are the association rules with100% confidence that can be formed from the itemsets. Typically, only a small fractionof the rules have 100% confidence. This constraint can be embedded into algorithm toincrease efficiency, as we do in FD_Mine.The FD_Mine algorithm improves efficiency by pruning redundant candidates. Acandidate is a combinations of attributes over a dataset. To delete redundant candidatesfrom the database, we use four pruning rules. For example, if the FDs A
B and B

A
are discovered, then no further candidates containing B need be considered, since
attributes A and B are equivalent. This pruning is valuable because the number of
candidates increases exponentially with the number of attributes.
To prune redundant candidates, relationships among the FDs are analyzed. The set of
FDs can be divided into two parts: FDs that can be inferred from the discovered FDs
using the theory of relational databases, and those that cannot. The FD_Mine algorithm
only examines the database to find the second type of FDs. Relevant aspect of relational
database theory are collected here as lemmas, theorems, properties, and pruning rules.
For example, if A

B and C

D are discovered to hold in a database, then AC

BD
must also hold, so it does not need be checked in the database. By eliminating redundant
data and pruning candidates, the FD_Mine algorithm improves mining performance.
The remainder of this paper organized as follows. A statement of the problem and an
example of it are given in section 2. In section 3, the relationship among FDs is analyzed,
and the formal definitions, lemmas, theorems, and properties are given. Pruning rules and
FD_Mine algorithm are presented in section 4. A detailed example is also discussed in
this section. Next, the experimental results are shown in section 5. Finally, conclusions
are drawn in section 6.
2. Problem Statement
The problem addressed in this paper is to find all functional dependencies among
attributes in a database relation. Specifically, we want to improve on previous proposed
methods for this problem.
Early methods for discovering of FDs were based on repeatedly sorting and
comparing tuples to determine whether or not these tuples meet the FD definition. For
example, in Table 2.1, the tuples are first sorted on attribute A, then each pair of tuples
that have the same value on attribute A is compared on attribute B, C, D, and E, in turn,
to decide whether or not A

B, A

C, A

D, or A

E holds. Then the tuples are sorted
on attribute B and examined to decide whether or not B
A, B

C, B

D or B

E holds.
4
This process is repeated for C, D, E, AB, AC, AD, and so on. After the last candidate
BCDE has been checked, all FDs will have been discovered. All candidates of five
attributes are represented in Figure 2.1.
The disadvantage of this approach is that it does not utilize the discovered FDs as
knowledge to obtain new knowledge. If A

B has been discovered, a check is still made
to determine whether or not AC

B holds, by sorting on attributes AC and comparing on
attribute B. Instead, AC

B can be directly inferred from the previously obtained A

B
without sorting and comparing tuples again. This approach is inefficient because of this
extra sorting and because it needs to examine every value of the candidate attributes to
decide whether or not a FD holds. As a result, this approach is highly sensitive to the
number of tuples and attributes. It is impracticable for a large dataset.
A B C D E
t1 0 0 0 1 0
t2 0 1 0 1 0
t3 0 2 0 1 2
t4 0 3 1 1 0
t5 4 1 1 2 4
t6 4 3 1 2 2
t7 0 0 1 0 0
Table 2.1 An example dataset.
Figure 2.1 Lattice for 5 attributes.
Recent papers have proposed algorithms that do not sort on any attribute or compare
any values. Mannila et al. [8, 9, 10] introduced the concept of a partition, which places
tuples that have the same values for an attribute into the same group. The problem of
determining whether or not a FD holds on a given dataset can be addressed by comparing
the number of the groups among partitions for various attributes. For the dataset r, shown
in Table 2.1, the partition for attribute A can be denoted as
П
A(r) ={{t1, t2, t3, t4, t7}, {t5,
5
t6}}. Because the values of tuples t1, t2, t3, t4, and t7 on attribute A are all the same, they
are assigned to the same group. Similarly, because the values of t5 and t6 are the same,
they are placed in another group. The partition for the attribute combination AD for Table
2.1 is
П
AD(r) ={{t1, t2, t3, t4, t7}, {t5, t6}}. The cardinality of the partition |
П
A(r)|, which is
the number of groups in partition
П
A, is 2, and |
П
AD(r)| is 2 too. Because |
A(r)| is equal
to |
AD(r)|, A

D can be obtained [5].
Algorithm TANE [5] uses the partition concept to discover FDs. In addition, the set of
the candidates are pruned based on the discovered FDs. For instance, if AC

B has been
discovered, then ACD

B and ACDE

B can be inferred from AC

B without checking
the data, so candidates ACD and ACDE are redundant. According to the dependencies in
a dataset, only a portion of the lattice shown in Figure 2.1 may need to be traversed to
find all FDs in the relation.
Algorithm FUN [11, 12] uses a procedure that checks for embedded FDs, which are
FDs that hold on the projection of the dataset. By using embedded FDs, other candidates
can be pruned. For example, suppose that A

B holds over ABC. If |
П
AB(r)| > |
П
BC(r)|,
then BC

A does not hold over ABC. On synthetic datasets with correlation rates of 30%
to 70%, FUN is faster than TANE. Apparently, FUN was not tested on any UCI datasets.
Our research addresses two related questions. First, can other information from
discovered FDs be used to prune more candidates than previous approaches? Secondly,
can this pruning be done so that the overall efficiency of the algorithm is improved? We
address both these problems by further considering the theoretical properties of FDs,
formulating the FD_Mine algorithm to take advantage of these properties, and testing the
algorithm on a variety of datasets.
3. Theoretical A

đang được dịch, vui lòng đợi..

Kết quả (Việt) 2:[Sao chép]

Sao chép!

FD_Mine: Khám phá
chức năng phụ thuộc vào một cơ sở dữ liệu
Sử dụng tương đồng
Hồng Yao, Howard J.Hamilton và Cory J. Butz
Báo cáo kỹ thuật TR 2002-04
tháng tám năm 2002
Copyright 2002 Ó Hồng Yao, Howard J.Hamilton và Cory J. Butz
Sở Khoa học Máy tính
Đại học Regina
Regina, Saskatchewan
Canada S4S 0A2
ISBN 0-7731-0441-0
2
FD_Mine: Khám phá chức năng phụ thuộc
vào một cơ sở dữ liệu Sử dụng tương đồng
Hồng Yao, Howard J.Hamilton, và Cory Butz
Sở Khoa học máy tính, Đại học Regina
Regina, SK , Canada, S4S 0A2
{yao2hong, hamilton, butz}@cs.uregina.ca
Tóm tắt
chức năng phụ thuộc (FD) theo truyền thống đóng một vai trò quan trọng trong việc thiết kế
cơ sở dữ liệu quan hệ, và các nghiên cứu của FD đã sản xuất một lý thuyết phong phú và thanh lịch. Các
phát hiện của FD từ cơ sở dữ liệu gần đây đã trở thành một vấn đề nghiên cứu quan trọng. Trong
bài báo này, chúng tôi đề xuất một thuật toán mới, được gọi là FD_Mine, cho việc phát hiện ra tất cả các tối thiểu
FD từ một cơ sở dữ liệu. FD_Mine lợi dụng các lý thuyết phong phú của FD để hướng dẫn
tìm kiếm cho FD. Cụ thể hơn, việc sử dụng các lý thuyết FD có thể giảm kích thước của các
tập dữ liệu và số lượng FD để được kiểm tra bằng cách cắt tỉa dữ liệu dư thừa và bỏ qua
tìm kiếm cho FD mà theo logic từ FD đã phát hiện ra. Chúng tôi thấy rằng chúng tôi
là phương pháp âm thanh, đó là, cắt tỉa không dẫn đến mất thông tin. Các thí nghiệm trên
15 bộ dữ liệu UCI cho thấy FD_Mine có thể tỉa các ứng cử viên nhiều hơn các phương pháp trước đó.
Keywords
khai thác dữ liệu, phụ thuộc chức năng, lý thuyết cơ sở dữ liệu quan hệ
1. Giới thiệu
bài báo này đề xuất một phương pháp mới cho việc tìm kiếm phụ thuộc chức năng trong dữ liệu. Một
phụ thuộc hàm (FD) thể hiện một giá trị ràng buộc giữa các thuộc tính trong một mối quan hệ
[7]. Chính thức, cho một giản đồ quan hệ R với X IR và không khí, một phụ thuộc hàm
X
?
A có thể được xác định nếu cho tất cả các cặp tuples t1 và t2 trên R, cho tất cả Bix nếu t1 [B] = t2 [B]
sau đó t1 [ A] = t2 [A]. Nếu A không phải là chức năng phụ thuộc vào bất kỳ tập hợp của X, sau đó
X
A là tối thiểu [5]. Từ nay trở đi, tất cả FD được đề cập trong báo cáo này có thể được giả định là
tối thiểu.
FD đóng một vai trò quan trọng trong lý thuyết quan hệ và thiết kế cơ sở dữ liệu quan hệ [7]. Trong
công việc sớm, các nghiên cứu của FD tập trung vào thực tế là tính nhất quán dữ liệu có thể được
đảm bảo bằng cách sử dụng FD để giảm số lượng dữ liệu dư thừa. FD được thường
thu được từ các mô hình ngữ nghĩa hơn là từ dữ liệu. Nghiên cứu hiện nay là dựa trên
thực tế rằng FD có thể tồn tại trong một bộ dữ liệu độc lập của mô hình quan hệ của các
bộ dữ liệu. Nó rất hữu ích để khám phá những FD. Ví dụ, từ một cơ sở dữ liệu của hóa học
các hợp chất, nó là giá trị để phát hiện ra những hợp chất có chức năng phụ thuộc vào một
thuộc tính cấu trúc nào đó [5]. Ngoài ra, như một loại dữ liệu phụ thuộc [3, 12], một lượng lớn
dữ liệu có thể được chia ra thành một losslessly bộ tập dữ liệu nhỏ hơn sử dụng các phát hiện
FD. Kết quả là, sự phát hiện của FD từ cơ sở dữ liệu gần đây đã trở nên phổ biến một
vấn đề nghiên cứu [2, 4, 5, 6, 8, 11, 12, 13].
3
The study of FDs was an important aspect of relational database design [7] in the 1980s,
and led to results such as Armstrong's axioms, minimal covers, closures, and an algorithm
for decomposing a schema into third normal form while preserving dependencies.
Armstrong's axioms can be stated as three rules [7]: the reflexivity rule: if
$Y \subseteq X$, then $X \rightarrow Y$; the augmentation rule: if $X \rightarrow Y$, then
$XZ \rightarrow YZ$; and the transitivity rule: if $X \rightarrow Y$ and
$Y \rightarrow Z$, then $X \rightarrow Z$. These rules can be applied repeatedly to infer
all FDs implied by a set of FDs.
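Repeated application of these rules is commonly implemented as an attribute-closure computation. The sketch below uses that standard textbook technique rather than anything specific to FD_Mine; the names `closure` and `implies` are assumptions, and attributes are written as single characters so that a string such as "XZ" stands for the set {X, Z}.

```python
def closure(attrs, fds):
    """Return every attribute derivable from attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is already derived, the right-hand side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds."""
    return set(rhs) <= closure(lhs, fds)

# Hypothetical FD set: X -> Y and Y -> Z.
fds = [("X", "Y"), ("Y", "Z")]

print(sorted(closure("X", fds)))
print(implies(fds, "X", "Z"))   # follows by transitivity
print(implies(fds, "Z", "X"))   # not implied
```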
Relational database theory is the fundamental basis for the FD_Mine algorithm. Combining
domain knowledge and data mining algorithms is better than using either alone. The
Apriori algorithm [1] for discovering frequent itemsets can be used to find FDs from a
dataset by setting the minimum support to 2/n, where n is the number of instances in the
dataset. In this case, the discovered FDs are the association rules with 100% confidence
that can be formed from the frequent itemsets. Typically, only a small fraction of the
rules have 100% confidence. This restriction can be embedded in the algorithm to increase
efficiency, as we do in FD_Mine.
The FD_Mine algorithm improves efficiency by pruning unnecessary candidates. A candidate
is a combination of attributes over a dataset. To remove redundant candidates from the
database, we use four pruning rules. For example, if the FDs $A \rightarrow B$ and
$B \rightarrow A$ are discovered, then no further candidates containing B need to be
considered, because attributes A and B are equivalent. This pruning is valuable because
the number of candidates increases exponentially with the number of attributes.
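The effect of the equivalence example can be illustrated on one level of candidates. This is only a sketch of that single example, not the paper's four pruning rules; `level_candidates` and `prune_equivalent` are hypothetical helper names.

```python
from itertools import combinations

def level_candidates(attributes, size):
    """All attribute combinations of the given size, written as strings."""
    return ["".join(c) for c in combinations(attributes, size)]

def prune_equivalent(candidates, redundant):
    """Drop candidates containing an attribute already found equivalent to another."""
    return [c for c in candidates if redundant not in c]

# Suppose A -> B and B -> A were discovered, so B is treated as redundant.
level2 = level_candidates("ABCDE", 2)
kept = prune_equivalent(level2, "B")

print(level2)
print(kept)
```

On five attributes, the ten two-attribute candidates shrink to six once every candidate containing B is dropped, and the saving compounds at higher levels.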
To prune redundant candidates, the relationships among FDs are analyzed. The set of FDs
can be divided into two parts: FDs that can be inferred from discovered FDs using
relational database theory, and those that cannot. The FD_Mine algorithm only checks the
database to find the second type of FD. The relevant aspects of relational database
theory are collected here as lemmas, theorems, properties, and pruning rules. For
example, if $A \rightarrow B$ and $C \rightarrow D$ are found to hold in the database,
then $AC \rightarrow BD$ must also hold, so it does not need to be checked in the
database. By eliminating redundant data and pruning candidates, the FD_Mine algorithm
improves mining performance.
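One way to see why $AC \rightarrow BD$ needs no database check is to derive it from Armstrong's axioms, using two augmentation steps followed by one transitivity step:

$$
\begin{aligned}
A \rightarrow B \;&\Rightarrow\; AC \rightarrow BC && \text{(augmentation with } C\text{)} \\
C \rightarrow D \;&\Rightarrow\; BC \rightarrow BD && \text{(augmentation with } B\text{)} \\
AC \rightarrow BC,\; BC \rightarrow BD \;&\Rightarrow\; AC \rightarrow BD && \text{(transitivity)}
\end{aligned}
$$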
The remainder of this paper is organized as follows. A statement of the problem and an
example of it are given in Section 2. In Section 3, the relationships among FDs are
analyzed, and formal definitions, lemmas, theorems, and properties are given. The pruning
rules and the FD_Mine algorithm are presented in Section 4. A detailed example is also
discussed in that section. Next, experimental results are presented in Section 5.
Finally, conclusions are drawn in Section 6.
2. Problem Statement
The problem addressed in this paper is to find all functional dependencies among the
attributes in a database relation. Specifically, we want to improve on previously
proposed methods for this problem.
Early methods for discovering FDs were based on repeatedly sorting and comparing tuples
to determine whether or not these tuples meet the definition of an FD. For example, in
Table 2.1, the tuples are first sorted on attribute A, then each pair of tuples that have
the same value on attribute A is compared on attributes B, C, D, and E, in turn, to
decide whether or not $A \rightarrow B$, $A \rightarrow C$, $A \rightarrow D$, or
$A \rightarrow E$ holds. Then the tuples are sorted on attribute B and examined to decide
whether or not $B \rightarrow A$, $B \rightarrow C$, $B \rightarrow D$, or
$B \rightarrow E$ holds. This process is repeated for C, D, E, AB, AC, AD, and so on.
After the last candidate BCDE has been checked, all FDs will have been discovered. All
candidates over the five attributes are represented in Figure 2.1.
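The order in which this exhaustive approach visits candidates can be sketched by enumerating attribute combinations level by level. The helper name `lattice_levels` is an assumption, and the sketch covers only the candidates named in the text, from A up to BCDE.

```python
from itertools import combinations

def lattice_levels(attributes):
    """Yield (size, candidates) for every non-empty proper subset size."""
    for size in range(1, len(attributes)):
        yield size, ["".join(c) for c in combinations(attributes, size)]

levels = dict(lattice_levels("ABCDE"))
# Flatten the levels into the visiting order: single attributes first, then pairs, and so on.
order = [c for size in sorted(levels) for c in levels[size]]

print(order[:8])
print(order[-1], len(order))
```

With five attributes this gives 5 + 10 + 10 + 5 = 30 candidates, each of which the early methods would sort and compare separately.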
The disadvantage of this approach is that it does not use discovered FDs as knowledge to
obtain new knowledge. If $A \rightarrow B$ has been discovered, a check is still made to
determine whether or not $AC \rightarrow B$ holds, by sorting on attributes AC and
comparing on attribute B. Instead, $AC \rightarrow B$ could be inferred directly from the
previously obtained $A \rightarrow B$ without sorting and comparing tuples again. This
approach is inefficient because of this extra sorting and because it needs to examine all
values of the candidate attributes to decide whether or not an FD holds. As a result,
this approach is highly sensitive to the number of tuples and attributes. It is
impractical for a large dataset.
      A  B  C  D  E
t1    0  0  0  1  0
t2    0  1  0  1  0
t3    0  2  0  1  2
t4    0  3  1  1  0
t5    4  1  1  2  4
t6    4  3  2  1  2
t7    0  0  1  0  0
Table 2.1 An example dataset.
Figure 2.1 Lattice for 5 attributes.
Recent papers have proposed algorithms that do not sort on any attribute or compare any
values. Mannila et al. [8, 9, 10] introduced the concept of a partition, which places
tuples that have the same value for an attribute into the same group. The problem of
determining whether or not an FD holds on a given dataset can be solved by comparing the
number of groups in the partitions for different attributes. For the dataset r shown in
Table 2.1, the partition for attribute A can be denoted as
$\Pi_A(r) = \{\{t_1, t_2, t_3, t_4, t_7\}, \{t_5, t_6\}\}$. Because the values of tuples
$t_1$, $t_2$, $t_3$, $t_4$, and $t_7$ on attribute A are all the same, they are assigned
to the same group. Similarly, because the values of $t_5$ and $t_6$ are the same, they
are placed in another group. The partition for the attribute combination AD for Table 2.1
is $\Pi_{AD}(r) = \{\{t_1, t_2, t_3, t_4, t_7\}, \{t_5, t_6\}\}$. The cardinality of the
partition $|\Pi_A(r)|$, which is the number of groups in partition $\Pi_A$, is 2, and
$|\Pi_{AD}(r)|$ is 2 too. Because $|\Pi_A(r)|$ is equal to $|\Pi_{AD}(r)|$,
$A \rightarrow D$ can be obtained [5].
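The partition test can be sketched in a few lines. The example below uses a small hypothetical relation with attributes P, Q, and S (deliberately not Table 2.1), and the names `partition` and `fd_holds` are assumptions. Attribute sets are written as strings of single-character attribute names.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group tuple ids that share the same values on attrs."""
    groups = defaultdict(set)
    for tid, row in rows.items():
        groups[tuple(row[a] for a in attrs)].add(tid)
    return sorted(sorted(g) for g in groups.values())

def fd_holds(rows, lhs, rhs):
    """lhs -> rhs holds when adding rhs does not split any group of the lhs partition."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

# Hypothetical relation with tuple ids u1..u4.
rows = {
    "u1": {"P": 0, "Q": 1, "S": 0},
    "u2": {"P": 0, "Q": 1, "S": 1},
    "u3": {"P": 4, "Q": 2, "S": 0},
    "u4": {"P": 4, "Q": 2, "S": 1},
}

print(partition(rows, "P"))
print(fd_holds(rows, "P", "Q"))  # |Pi_P| equals |Pi_PQ|
print(fd_holds(rows, "P", "S"))  # |Pi_P| is smaller than |Pi_PS|
```

Only group counts are compared, so no sorting of the tuples and no pairwise comparison of values is needed.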
Algorithm Tane [5] uses the partition concept to discover FDs. In addition, the set of
candidates is pruned based on the discovered FDs. For example, if $AC \rightarrow B$ has
been discovered, then $ACD \rightarrow B$ and $ACDE \rightarrow B$ can be inferred from
$AC \rightarrow B$ without checking the data, so candidates ACD and ACDE are redundant.
Depending on the dependencies in a dataset, only part of the lattice shown in Figure 2.1
may need to be traversed to find all FDs in the relation.
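The superset pruning in this example can be sketched as a simple containment test. This is a simplified illustration, assuming discovered FDs are stored as (left-hand side, right-hand side) pairs; it is not Tane's actual bookkeeping, and `needs_check` is a hypothetical name.

```python
def needs_check(candidate, rhs, discovered):
    """False if some discovered lhs -> rhs has its lhs contained in the candidate."""
    return not any(
        r == rhs and set(lhs) <= set(candidate) for lhs, r in discovered
    )

# Suppose AC -> B has already been discovered.
discovered = [("AC", "B")]

print(needs_check("ACD", "B", discovered))   # implied by AC -> B
print(needs_check("ACDE", "B", discovered))  # implied by AC -> B
print(needs_check("AD", "B", discovered))    # still has to be checked
```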
Algorithm FUN [11, 12] uses a procedure to check for embedded FDs, which are FDs that
hold on projections of the dataset. By using embedded FDs, other candidates can be
pruned. For example, suppose that $A \rightarrow B$ holds on ABC. If
$|\Pi_{AB}(r)| > |\Pi_{BC}(r)|$, then $BC \rightarrow A$ does not hold on ABC. On
synthetic datasets with correlation rates of 30% to 70%, FUN is faster than Tane.
Apparently, FUN has not been tested on any UCI datasets.
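One way to see why the cardinality comparison rules out $BC \rightarrow A$ is to argue by contradiction, using the partition criterion that $X \rightarrow A$ holds exactly when $|\Pi_X(r)| = |\Pi_{XA}(r)|$, together with the fact that adding attributes can only split groups:

$$
\begin{aligned}
BC \rightarrow A \;&\Rightarrow\; |\Pi_{BC}(r)| = |\Pi_{ABC}(r)| \\
AB \subseteq ABC \;&\Rightarrow\; |\Pi_{ABC}(r)| \geq |\Pi_{AB}(r)| \\
&\Rightarrow\; |\Pi_{BC}(r)| \geq |\Pi_{AB}(r)|
\end{aligned}
$$

The last line contradicts $|\Pi_{AB}(r)| > |\Pi_{BC}(r)|$, so $BC \rightarrow A$ cannot hold.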
Our research addresses two related questions. First, can other information from
discovered FDs be used to prune more candidates than previous methods? Second, can this
pruning be done so that the overall efficiency of the algorithm is improved? We address
both of these questions by further considering the theoretical properties of FDs,
formulating the FD_Mine algorithm to take advantage of these properties, and testing the
algorithm on a variety of datasets.
3. Theory
