3.3 Term Weighting

As mentioned above, text extracted from a web page consists of boilerplate and payload text. To reduce the influence of the former and boost the impact of the latter on the document vectors, we compute idf separately for each domain in the set (rather than globally across all domains). Thus, terms that occur frequently across a particular web site will receive a low specificity score (i.e., idf) on pages from that web site, yet may receive a high score if they appear elsewhere.
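The per-domain idf computation described above might be sketched as follows. This is only an illustration under stated assumptions: the text does not say which idf variant is used, so the plain log(N/df) form is assumed here, and the function name `per_domain_idf`, the `(domain, tokens)` input layout, and the example domains are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def per_domain_idf(pages):
    """Compute idf separately for each domain.

    pages: iterable of (domain, tokens) pairs, one pair per web page.
    Returns {domain: {term: idf}}, with idf = log(N_domain / df_domain)
    (an assumed variant; the source does not specify the exact formula).
    """
    doc_freq = defaultdict(Counter)  # domain -> term -> pages containing the term
    n_pages = Counter()              # domain -> number of pages seen
    for domain, tokens in pages:
        n_pages[domain] += 1
        doc_freq[domain].update(set(tokens))  # count each term once per page
    return {
        domain: {term: math.log(n_pages[domain] / df) for term, df in counts.items()}
        for domain, counts in doc_freq.items()
    }

# Toy data: "home" is boilerplate on a.example but not on b.example.
pages = [
    ("a.example", ["home", "contact", "parsing"]),
    ("a.example", ["home", "contact", "alignment"]),
    ("b.example", ["home", "recipes"]),
    ("b.example", ["news", "recipes"]),
]
idf = per_domain_idf(pages)
```

In this toy data, "home" appears on every page of a.example and so gets an idf of zero there, while the same term keeps a positive idf on b.example, where it appears on only one of the two pages.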

3.4 Scoring functions

In our experiments, we explored and combined the following scoring functions:

3.4.1 Cosine Similarity (cos)

This is the classical measure of similarity in LSI-based Information Retrieval. It computes the cosine of the angle between the two vectors that embed two candidate documents in the joint semantic vector space.
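As a minimal sketch of this measure, the cosine of the angle between two document vectors can be computed as below. The function name `cosine_similarity` and the convention of returning 0.0 for a zero-length vector are assumptions made for illustration; the vectors stand in for document embeddings in the reduced vector space.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: the angle is undefined for a zero vector
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on the angle, two vectors pointing in the same direction score 1 regardless of their lengths, and orthogonal vectors score 0.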

3.4.2 “Local” cosine similarity (lcos)

The intuition behind the local cosine similarity measure is this: since we perform SVD on a bilingual term-document matrix that consists of document column vectors for documents from a large collection of web sites, web pages from each specific web site will still appear quite similar if the web site is dedicated to a particular topic area (which the vast majority of web sites are). Similarity scores will thus be dominated by the general domain of the web site rather than the differences between individual pages within a given web site. The local cosine similarity measure tries to mitigate this phenomenon by shifting the origin of the vector space to the centre of the sub-space in which the pages of
