3.3 Term Weighting

As mentioned above, text extracted from a web page consists of boilerplate and payload text. To reduce the influence of the former and boost the impact of the latter on the document vectors, we compute idf separately for each domain in the set (rather than globally across all domains). Thus, terms that occur frequently across a particular web site will receive a low specificity score (i.e., idf) on pages from that web site, yet may receive a high score if they appear elsewhere.

3.4 Scoring functions

In our experiments, we explored and combined the following scoring functions:

3.4.1 Cosine Similarity (cos)

This is the classical measure of similarity in LSI-based Information Retrieval. It computes the cosine of the angle between the two vectors that embed two candidate documents in the joint semantic vector space.

3.4.2 “Local” cosine similarity (lcos)

The intuition behind the local cosine similarity measure is this: we perform SVD on a bilingual term-document matrix whose columns are document vectors for pages from a large collection of web sites. Pages from the same web site will therefore still appear quite similar to one another if the web site is dedicated to a particular topic area (which the vast majority of web sites are). Similarity scores will thus be dominated by the general domain of the web site rather than by the differences between individual pages within that web site. The local cosine similarity measure tries to mitigate this phenomenon by shifting the origin of the vector space to the centre of the sub-space in which the pages of

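The per-domain idf of Section 3.3 and the two scoring functions of Sections 3.4.1 and 3.4.2 can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the experiments. Several details are assumptions made only for illustration: the idf variant log(N / df), the use of the arithmetic mean of a site's document vectors as the "centre" of its sub-space, the convention of returning 0 for zero vectors, and all function and variable names.

```python
import math
from collections import Counter


def per_domain_idf(pages_by_domain):
    """Compute idf separately for each domain.

    pages_by_domain maps a domain name to a list of pages, each page being
    a list of terms. The formula log(N / df) is an assumption; the text does
    not state which idf variant is used.
    """
    idf = {}
    for domain, pages in pages_by_domain.items():
        df = Counter()
        for page in pages:
            df.update(set(page))  # count each term at most once per page
        n = len(pages)
        idf[domain] = {term: math.log(n / count) for term, count in df.items()}
    return idf


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (cos)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def local_cosine(u, v, site_vectors):
    """Cosine after shifting the origin to the centre of a site's vectors (lcos).

    The centre is taken to be the arithmetic mean of site_vectors (assumption).
    """
    n = len(site_vectors)
    centre = [sum(column) / n for column in zip(*site_vectors)]
    u_shifted = [a - c for a, c in zip(u, centre)]
    v_shifted = [b - c for b, c in zip(v, centre)]
    return cosine(u_shifted, v_shifted)


# Toy data: "home" and "contact" are boilerplate on a.example.
pages = {
    "a.example": [["home", "contact", "wine"], ["home", "contact", "cheese"]],
    "b.example": [["wine", "news"], ["sports", "news"]],
}
idf = per_domain_idf(pages)

# Toy site whose page vectors all point in roughly the same direction.
site = [[10.0, 1.0], [10.0, -1.0], [11.0, 0.0], [9.0, 0.0]]
global_score = cosine(site[0], site[1])
local_score = local_cosine(site[0], site[1], site)
```

In this toy example the boilerplate term "home" occurs on every page of a.example and so receives an idf of zero there. The first two page vectors of the toy site are nearly parallel when measured from the global origin, but point in opposite directions once the origin is moved to the site's centre, which is the effect the local measure is meant to expose.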