We assume that each web page is represented by a unique integer; the specific scheme used to assign these integers is described below. We build an adjacency table that resembles an inverted index; it has a row for each web page, with the rows ordered by the corresponding integers. The row for any page p contains a sorted list of integers, each corresponding to a web page that links to p. This table permits us to respond to queries of the form "which pages link to p?" In similar fashion, we build a table whose entries are the pages linked to by p.
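As a minimal sketch of the two tables just described, the following Python builds both from a list of links. The function name `build_link_tables`, the dictionary-of-lists layout, and the four-page graph are assumptions made for illustration; they are not part of the scheme itself.

```python
from collections import defaultdict

def build_link_tables(num_pages, links):
    """Build the in-link and out-link adjacency tables.

    `links` is an iterable of (source, destination) pairs of page integers.
    Each table has one row per page, and each row is a sorted list of integers.
    """
    in_links = defaultdict(set)   # row p: pages that link to p
    out_links = defaultdict(set)  # row p: pages linked to by p
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    pages = range(1, num_pages + 1)
    in_table = {p: sorted(in_links[p]) for p in pages}
    out_table = {p: sorted(out_links[p]) for p in pages}
    return in_table, out_table

# Hypothetical four-page graph, with pages numbered 1..4.
links = [(1, 2), (1, 3), (2, 3), (4, 3), (3, 1)]
in_table, out_table = build_link_tables(4, links)
print(in_table[3])   # answers "which pages link to page 3?"
print(out_table[1])  # answers "which pages does page 1 link to?"
```

A query for the links into or out of a page is then a single row lookup in the corresponding table.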
This table representation cuts the space taken by the naive representation (in which we explicitly represent each link by its two endpoints, each a 32-bit integer) by 50%. Our description below will focus on the table for the links from each page; it should be clear that the techniques apply just as well to the table of links to each page. To further reduce the storage for the table, we exploit several ideas:
Similarity between lists: Many rows of the table have many entries in common. Thus, if we explicitly represent a prototype row for several similar rows, the remainder can be succinctly expressed in terms of the prototypical row.
Locality: Many links from a page go to "nearby" pages - pages on the same host, for instance. This suggests that in encoding the destination of a link, we can often use small integers and thereby save space.
We use gap encodings in sorted lists: Rather than store the destination of each link, we store the offset from the previous entry in the row.
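The gap encoding idea can be sketched in a few lines of Python. The helper names `to_gaps` and `from_gaps` and the sample row are hypothetical. We also assume that the first entry of a row is stored as its offset from 0, which the text does not specify.

```python
def to_gaps(row):
    """Replace each entry of a sorted row by its offset from the previous entry.

    The first entry is kept as its offset from 0 (an assumption of this sketch).
    """
    gaps = []
    previous = 0
    for value in row:
        gaps.append(value - previous)
        previous = value
    return gaps

def from_gaps(gaps):
    """Invert to_gaps by accumulating the offsets."""
    row = []
    total = 0
    for gap in gaps:
        total += gap
        row.append(total)
    return row

row = [1000, 1003, 1004, 1010]  # hypothetical sorted link destinations
gaps = to_gaps(row)
print(gaps)
print(from_gaps(gaps) == row)
```

Because nearby destinations receive nearby integers, most gaps are small even when the destinations themselves are large.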
We now develop each of these techniques.
In a lexicographic ordering of all URLs, we treat each URL as an alphanumeric string and sort these strings. Figure 20.5 shows a segment of this sorted order. For a true lexicographic sort of web pages, the domain name part of the URL should be inverted, so that www.stanford.edu becomes edu.stanford.www, but this is not necessary here because we are mainly concerned with links local to a single host.
To each URL, we assign its position in this ordering as the unique identifying integer. Figure 20.6 shows an example of such a numbering and the resulting table. In this example sequence, www.stanford.edu/biology is assigned the integer 2 because it is second in the sequence.
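The numbering step might be sketched as follows. The function name `number_urls` and the four URLs are illustrative assumptions, not the contents of the figures. Python's default string comparison (by code point) stands in for the alphanumeric sort.

```python
def number_urls(urls):
    """Assign each distinct URL its 1-based position in sorted order."""
    ordered = sorted(set(urls))
    return {url: position for position, url in enumerate(ordered, start=1)}

# Hypothetical URLs from a single host, given in arbitrary order.
urls = [
    "www.stanford.edu/chemistry",
    "www.stanford.edu/biology/plant",
    "www.stanford.edu/alchemy",
    "www.stanford.edu/biology",
]
page_id = number_urls(urls)
print(page_id["www.stanford.edu/biology"])
```

Pages that share a URL prefix, such as www.stanford.edu/biology and www.stanford.edu/biology/plant, receive consecutive integers under this numbering.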
We next exploit a property that stems from the way most websites are structured to get similarity and locality. Most websites have a template with a set of links from each page in the site to a fixed set of pages on the site (such as its copyright notice, terms of use, and so on). In this case, the rows corresponding to pages in a website will have many table entries in common. Moreover, under the lexicographic ordering of URLs, it is very likely that the pages from a website appear as contiguous rows in the table.
We adopt the following strategy: We walk down the table, encoding each table row in terms of the seven preceding rows. In the example of Figure 20.6, we could encode the fourth row as "the same as the row at offset 2 (meaning, two rows earlier in the table), with 9 replaced by 8." This requires the specification of the offset, the integer(s) dropped (in this case 9) and the integer(s) added (in this case 8). The use of only the seven preceding rows has two advantages: (i) the offset can be expressed with only 3 bits; this choice is optimized empirically (the reason for seven and not eight preceding rows is the subject of Exercise 20.4) and (ii) fixing the maximum offset to a small value like seven avoids having to perform an expensive search among many candidate prototypes in terms of which to express the current row.
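A rough sketch of this row-by-row strategy appears below. Several details are assumptions of the sketch rather than of the scheme: the cost of a candidate prototype is taken to be the number of integers dropped plus the number added, a row with no useful prototype is marked with an offset of `None`, and the four rows are illustrative values chosen to be consistent with the example in the text.

```python
WINDOW = 7  # number of preceding rows considered as prototypes

def diff_against(row, prototype):
    """Return the integers dropped from and added to `prototype` to obtain `row`."""
    row_set, proto_set = set(row), set(prototype)
    return sorted(proto_set - row_set), sorted(row_set - proto_set)

def encode_row(index, rows):
    """Encode rows[index] as (offset, dropped, added).

    An offset of None means no preceding row helped, so `added` lists the
    whole row. Cost is counted as len(dropped) + len(added) (an assumption).
    """
    row = rows[index]
    best = (None, [], list(row))
    best_cost = len(row)
    for offset in range(1, WINDOW + 1):
        if index - offset < 0:
            break  # ran off the top of the table
        dropped, added = diff_against(row, rows[index - offset])
        cost = len(dropped) + len(added)
        if cost < best_cost:
            best, best_cost = (offset, dropped, added), cost
    return best

# Illustrative rows; the fourth differs from the second only in 9 versus 8.
rows = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
print(encode_row(3, rows))
```

For the fourth row the search settles on the row two positions earlier, dropping 9 and adding 8, which matches the encoding described in the text.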
What if none of the preceding seven rows is a good prototype for expressing the current row? This would happen, for instance, at each boundary between different websites as we walk down the rows of the table. In this case, we simply express the row as starting from the empty set and "adding in" each integer in that row. By using gap encodings to store the gaps (rather than the actual integers) in each row, and encoding these gaps tightly based on the distribution of their values, we obtain further space reduction. In experiments mentioned in Section 20.5, the series of techniques outlined here appears to use as few as 3 bits per link, on average - a dramatic reduction from the 64 required in the naive representation.
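To illustrate how small gaps translate into few bits, the sketch below encodes a row "afresh" with the Elias gamma code, in which a positive integer n is written as floor(log2 n) zeros followed by the binary form of n. The choice of gamma codes is our assumption: the text says only that gaps are encoded tightly according to their distribution. The function names and the sample row are likewise hypothetical, and the sketch ignores bookkeeping such as marking where a row ends.

```python
def gamma_code(n):
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]                      # for example, 6 -> '110'
    return "0" * (len(binary) - 1) + binary  # length prefix, then the number

def encode_fresh_row(row):
    """Encode a strictly increasing row of positive integers gap by gap."""
    bits = []
    previous = 0
    for value in row:
        bits.append(gamma_code(value - previous))
        previous = value
    return "".join(bits)

row = [5, 6, 9, 20]            # hypothetical row; its gaps are 5, 1, 3, 11
encoded = encode_fresh_row(row)
print(encoded, len(encoded))   # bit string and its length
print(64 * len(row))           # bits used by the naive two-endpoint form
```

The saving on a toy row of this kind says nothing about the averages reported for real web graphs; it only shows why small gaps are cheap to store.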
Although these ideas give us a representation of sizable web graphs that comfortably fit in memory, we still need to support connectivity queries. What is entailed in retrieving from this representation the set of links from a page? First, we need an index lookup from (a hash of) the URL to its row number in the table. Next, we need to reconstruct these entries, which may be encoded in terms of entries in other rows. This entails following the offsets to reconstruct these other rows - a process that in principle could lead through many levels of indirection. In practice, however, this does not happen very often. A heuristic for controlling this can be introduced into the construction of the table: When examining the preceding seven rows as candidates from which to model the current row, we demand a threshold of similarity between the current row and the candidate prototype. This threshold must be chosen with care. If the threshold is set too high, we seldom use prototypes and express many rows afresh. If the threshold is too low, most rows get expressed in terms of prototypes, so that at query time the reconstruction of a row leads to many levels of indirection through preceding prototypes.
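The query-time reconstruction and the similarity test can be sketched as follows. The encoded-row format `(offset, dropped, added)`, with `None` marking a row expressed afresh, is a hypothetical layout. The use of Jaccard similarity for the threshold is also our assumption, since the text does not name a particular similarity measure.

```python
def reconstruct(index, encoded):
    """Return (row, levels) for the row at 0-based position `index`.

    `levels` counts how many prototypes had to be followed to rebuild the row.
    """
    offset, dropped, added = encoded[index]
    if offset is None:               # expressed afresh: no indirection
        return sorted(added), 0
    prototype, levels = reconstruct(index - offset, encoded)
    row = (set(prototype) - set(dropped)) | set(added)
    return sorted(row), levels + 1

def similar_enough(row, prototype, threshold):
    """Accept a prototype only if its Jaccard similarity meets the threshold."""
    a, b = set(row), set(prototype)
    if not a and not b:
        return True
    return len(a & b) / len(a | b) >= threshold

# Hypothetical encoded table: row 1 builds on row 0, and row 2 on row 1.
encoded = [
    (None, [], [1, 4, 9, 16]),
    (1, [9], [8]),
    (1, [], [20]),
]
print(reconstruct(2, encoded))
print(similar_enough([1, 4, 8, 16], [1, 4, 9, 16], threshold=0.5))
```

Raising `threshold` makes fewer rows depend on a prototype, which shortens reconstruction chains at the cost of expressing more rows afresh.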
Exercise 20.4 We noted that expressing a row in terms of one of seven preceding rows allowed us to use no more than three bits to specify which of the preceding rows we are using as prototype. Why seven and not eight preceding rows? (Hint: Consider the case when none of the preceding seven rows is a good prototype.)
Exercise 20.5 We noted that for the scheme in Section 20.4, decoding the links incident on a URL could result in many levels of indirection. Construct an example in which the number of levels of indirection grows linearly with the number of URLs.
References and further reading
The first web crawler appears to be Matthew Gray's Wanderer, written in the spring of 1993. The Mercator crawler is due to Najork and Heydon (Najork and Heydon 2001, 2002); the treatment in this chapter follows their work. Other classic early descriptions of web crawling include Burner (1997), Brin and Page (1998), Cho et al. (1998), and the creators of the Webbase system at Stanford (Hirai et al. 2000). Cho and Garcia-Molina (2002) give a taxonomy and comparative study of different modes of communication between the nodes of a distributed crawler. The Robots Exclusion Protocol standard is described at www.robotstxt.org/wc/exclusion.html. Boldi et al. (2002) and Shkapenyuk and Suel (2002) provide more recent details of implementing large-scale distributed web crawlers.
Our discussion of DNS resolution (Section 20.2.2) uses the current convention for Internet addresses, known as IPv4 (for Internet Protocol version 4); each IP address is a sequence of four bytes. In the future, the convention for addresses (collectively known as the internet address space) is likely to use a new standard known as IPv6 (www.ipv6.org/).
Tomasic and Garcia-Molina (1993) and Jeong and Omiecinski (1995) are key early papers evaluating term partitioning versus document partitioning for distributed indexes. Document partitioning is found to be superior, at least when the distribution of terms is skewed, as it typically is in practice. This result has generally been confirmed in more recent work (MacFarlane et al. 2000). But the outcome depends on the details of the distributed system; at least one thread of work has reached the opposite conclusion (Ribeiro-Neto and Barbosa 1998; Badue et al. 2001). Sornil (2001) argues for a partitioning scheme that is a hybrid between term and document partitioning. Barroso et al. (2003) describe the distribution methods used at Google. The first implementation of a connectivity server was described by Bharat et al. (1998). The scheme discussed in this chapter, currently believed to be the best published scheme (achieving as few as 3 bits per link for encoding), is described in a series of papers by Boldi and Vigna (2004a, 2004b).

