The STC method appears to achieve the quality of a “complete,” i.e., O(N^2), method (see discussion of Cluster Validation below), while running in linear time, i.e., O(N). The “secret” lies in the nature of the “similarity” measure STC uses, and in the efficient data structure and algorithm it uses to index the documents and compute that similarity. Practically all other clustering methods use a measure such that if document D1 is similar to document D2, and D2 is similar to D3, one cannot assume that D1 is similar to D3. In a word, these measures, e.g., cosine similarity, are nontransitive. As a result, the similarity of every pair of documents must be computed and accessed for “completeness.” By contrast, STC forms its base clusters on the basis of shared phrases. If D1 and D2 share a phrase, and D2 and D3 share the same phrase, then D1 and D3 certainly share that phrase too. Hence, STC can perform complete clustering at the base cluster level without incurring the O(N^2) penalty. STC achieves O(N) time and space by employing a suffix tree to index the document collection, and an efficient algorithm due to Ukkonen [Algorith, 1995] [Nelson, 1996] to build and update the suffix tree. The similarity measure used in the second stage is not transitive, but that stage clusters base clusters, not documents. Moreover (and this is the most “heuristic” element of the method), during the incremental reclustering of base clusters, only the q “best” existing clusters are revisited, as noted above. This keeps the time (actually the maximum time) required for stage two constant as the number of documents grows.
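To make the transitivity argument concrete, here is a minimal Python sketch of base-cluster formation by shared phrases. It is not the STC implementation: in place of a suffix tree built with Ukkonen's algorithm it uses a plain dictionary keyed by word n-grams, and the names base_clusters, max_len, and min_docs are hypothetical, introduced only for this illustration. With phrase length capped at max_len, each word contributes a bounded number of index entries, so a single pass over the collection suffices and no pair of documents is ever compared.

```python
from collections import defaultdict


def base_clusters(docs, max_len=3, min_docs=2):
    """Group documents by shared phrases (contiguous word n-grams).

    Illustrative stand-in for a suffix tree: a dict maps each phrase
    to the set of document ids that contain it. max_len and min_docs
    are assumptions of this sketch, not parameters taken from STC.
    """
    index = defaultdict(set)  # phrase (tuple of words) -> document ids
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            for length in range(1, max_len + 1):
                if start + length > len(words):
                    break
                index[tuple(words[start:start + length])].add(doc_id)
    # A base cluster is any phrase shared by at least min_docs documents.
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= min_docs}


docs = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
clusters = base_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(" ".join(phrase), sorted(ids))
```

Every document that contains a given phrase lands in the same set, so membership in a base cluster is transitive by construction. A measure such as cosine similarity offers no such shortcut, which is why it requires all pairwise comparisons.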
