6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
As we have seen, in many cases the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gains. However, it can suffer from two nontrivial costs:

It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching. It is costly to go over each transaction in the database to determine the support of the candidate itemsets.
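
The figure in the first item can be checked with a short calculation. The sketch below uses Python, chosen here only for illustration, and its variable names are not from the text. It counts the unordered pairs that can be formed from 10^4 frequent items, since every such pair has only frequent 1-item subsets and therefore survives as a candidate 2-itemset.

```python
from math import comb

# Number of frequent 1-itemsets, as in the example in the text.
n_frequent_items = 10**4

# Every unordered pair of frequent items is a candidate 2-itemset,
# because both of its 1-item subsets are frequent.
n_candidate_pairs = comb(n_frequent_items, 2)

print(n_candidate_pairs)           # 49995000
print(n_candidate_pairs > 10**7)   # True
```

The count is roughly 5 x 10^7, which is consistent with the "more than 10^7" stated above.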

“Can we design a method that mines the complete set of frequent itemsets without such a costly candidate generation process?” One method that does so is frequent pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy. First, it compresses the database of frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each conditional database separately. For each pattern fragment, only its associated data sets need to be examined. This approach may therefore substantially reduce the size of the data sets to be searched, along with the “growth” of patterns being examined. Example 6.5 shows how it works.

Example 6.5 FP-growth (finding frequent itemsets without candidate generation). We reexamine the mining of the transaction database, D, of Table 6.1 in Example 6.3 using the frequent pattern growth approach.
The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of frequent items is sorted in order of descending support count. The resulting list is denoted by L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
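
The first scan can be sketched in a few lines of Python. Table 6.1 is not reproduced in this section, so the nine transactions below are an assumption about its contents (only T100 and T200 are quoted in the text); they were chosen because they yield exactly the support counts listed in L. The names `transactions`, `min_support_count`, and `l_order` are illustrative, and breaking ties by item name is also an assumption made to match the order shown.

```python
from collections import Counter

# Assumed contents of Table 6.1 (not shown in this section).
transactions = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}
min_support_count = 2

# One pass over the database: count how many transactions contain each item.
support = Counter(item for items in transactions.values() for item in items)

# Keep the frequent items and sort by descending support count.
# Ties are broken by item name (an assumption, to match the text's order).
l_order = sorted(
    ((item, count) for item, count in support.items()
     if count >= min_support_count),
    key=lambda pair: (-pair[1], pair[0]),
)

print(l_order)
# [('I2', 7), ('I1', 6), ('I3', 6), ('I4', 2), ('I5', 2)]
```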
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, (I2: 1), (I1: 1), and (I5: 1), where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1, and create a new node, (I4: 1), which is linked as a child to (I2: 2). In general,
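
The two insertions described above can be traced with a minimal sketch in Python. The class `FPNode` and the function `insert_transaction` are hypothetical names introduced here for illustration, not part of the text, and the sketch covers only prefix sharing and count updates for T100 and T200; it is not a full FP-tree implementation.

```python
class FPNode:
    """One tree node: an item label, a count, and children keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def insert_transaction(root, ordered_items):
    """Insert one transaction whose items are already in L order."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            # No shared prefix from here on: start a new branch.
            child = FPNode(item, parent=node)
            node.children[item] = child
        # Shared or new, the node on the path gains one occurrence.
        child.count += 1
        node = child


# L order as given in the text.
rank = {item: i for i, item in enumerate(["I2", "I1", "I3", "I4", "I5"])}

root = FPNode(None)  # the root labeled "null"
for items in (["I1", "I2", "I5"], ["I2", "I4"]):  # T100, then T200
    ordered = sorted((i for i in items if i in rank), key=rank.__getitem__)
    insert_transaction(root, ordered)

i2 = root.children["I2"]
print(i2.count)                 # 2
print(sorted(i2.children))      # ['I1', 'I4']
print(i2.children["I4"].count)  # 1
```

After both insertions the I2 node has count 2 and two children, (I1: 1) and (I4: 1), which matches the description of T200 sharing the prefix I2 with the path for T100.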
