Web content mining
Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information on the World Wide Web, such as hypertext documents, make automated discovery and organization of that information difficult. Search and indexing tools for the Internet and the World Wide Web, such as Lycos, Alta Vista, WebCrawler, ALIWEB, and MetaCrawler, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves developing sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

Web content mining can be viewed from two perspectives:[3] the information retrieval view and the database view. Reference [4] summarizes the research on unstructured and semi-structured data from the information retrieval view. It shows that most of this research represents unstructured text as a bag of words, which relies on statistics about single words in isolation, and takes the single words found in the training corpus as features. For semi-structured data, all of the surveyed works use the HTML structure inside documents, and some also use the hyperlink structure between documents, for document representation. In the database view, the goal is better information management and querying on the web, so the mining tries to infer the structure of a web site and transform the site into a database.
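
The bag-of-words representation described above can be illustrated with a minimal Python sketch. The tokenizer (lowercasing and splitting on non-alphanumeric characters), the helper name `bag_of_words`, and the two sample documents are assumptions made for illustration; they are not taken from the cited works.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count single words in isolation, ignoring order and grammar."""
    # Assumed tokenizer: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical two-document corpus.
docs = [
    "Web mining extracts knowledge from web pages.",
    "Data mining finds patterns in data.",
]

bags = [bag_of_words(d) for d in docs]

# The feature set is every single word found in the corpus.
vocabulary = sorted(set().union(*bags))

for bag in bags:
    # One count per vocabulary word; absent words count as zero.
    print([bag[word] for word in vocabulary])
```

Each document becomes a vector of word counts over a shared vocabulary, so word order and context are discarded. That is the limitation implied by "single words in isolation".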

There are several ways to represent documents; the vector space model is typically used. The documents together constitute the vector space, with each document represented as a vector of term counts. On its own, this representation does not capture how important a word is within a document. To address this, tf-idf (term frequency times inverse document frequency) weighting is introduced.
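
As a rough illustration of tf-idf weighting, the sketch below uses one common variant, in which the weight of term t in document d is tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other variants (smoothed or normalized) exist, and the text does not say which one is meant. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical pre-tokenized corpus.
docs = [
    ["web", "mining", "web", "pages"],
    ["data", "mining", "patterns"],
]

weights = tf_idf(docs)
for w in weights:
    print(w)
```

In this example "mining" occurs in every document, so its weight is zero, while "web" is frequent in one document and absent from the other, so it receives the largest weight there.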

Feature selection can be implemented by scanning the documents multiple times. The aim is to extract a feature subset while leaving the classification result largely unaffected. The general approach is to construct an evaluation function that scores the features. Information gain, cross entropy, mutual information, and odds ratio are commonly used as evaluation functions. The classification and pattern analysis methods of text data mining are very similar to traditional data mining techniques. The usual evaluation measures are classification accuracy, precision, recall, and information score.
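
One way to sketch an evaluation function for feature selection is information gain, computed here as the class entropy minus the expected class entropy after splitting the documents on whether a term is present. The tiny labeled corpus and the function names are assumptions for illustration; a real pipeline would score a much larger vocabulary and keep only the top-ranked terms.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, labels, term):
    """Drop in class entropy from knowing whether `term` occurs in a document."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical corpus: each document is a set of terms with a class label.
docs = [
    {"web", "crawler", "links"},
    {"web", "pages", "html"},
    {"stock", "market", "web"},
    {"stock", "prices", "market"},
]
labels = ["tech", "tech", "finance", "finance"]

vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranked)
```

Terms such as "stock" and "market" separate the two classes perfectly in this toy corpus and rank first, whereas "web" appears in both classes and scores lower. Keeping only the highest-scoring terms yields the reduced feature subset.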

Web mining is an important component of the content pipeline for web portals. It is used in data confirmation and validity verification, data integrity and taxonomy building, content management, content generation, and opinion mining.

