We assume that each web page is represented by a unique integer; the specific scheme used to assign these integers is described below. We build an adjacency table that resembles an inverted index; it has a row for each web page, with the rows ordered by the corresponding integers. The row for any page p contains a sorted list of integers, each corresponding to a web page that links to p. This table permits us to respond to queries of the form "which pages link to p?" In similar fashion we build a table whose entries are the pages linked to by p.
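To make the two-table scheme concrete, here is a minimal Python sketch of building both tables from a list of links, assuming pages have already been mapped to integers (the numbering scheme is described below). All names here are illustrative, not from any particular implementation.

```python
from collections import defaultdict

def build_adjacency_tables(links):
    """links: iterable of (source, destination) integer pairs."""
    links_from = defaultdict(list)  # row p = pages that p links to
    links_to = defaultdict(list)    # row p = pages that link to p
    for src, dst in links:
        links_from[src].append(dst)
        links_to[dst].append(src)
    # Each row is kept as a sorted list of integers.
    for table in (links_from, links_to):
        for row in table.values():
            row.sort()
    return links_from, links_to

links = [(1, 2), (1, 4), (3, 2), (4, 2)]
links_from, links_to = build_adjacency_tables(links)
print(links_to[2])    # pages linking to page 2 -> [1, 3, 4]
print(links_from[1])  # pages that page 1 links to -> [2, 4]
```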
This table representation cuts the space taken by the naive representation (in which we explicitly represent each link by its two end points, each a 32-bit integer) by 50%. Our description below will focus on the table for the links from each page; it should be clear that the techniques apply just as well to the table of links to each page. To further reduce the storage for the table, we exploit several ideas:
Similarity between lists: Many rows of the table have many entries in common. Thus, if we explicitly represent a prototype row for several similar rows, the remainder can be succinctly expressed in terms of the prototypical row.
Locality: Many links from a page go to "nearby" pages - pages on the same host, for instance. This suggests that in encoding the destination of a link, we can often use small integers and thereby save space.
Gap encodings in sorted lists: Rather than store the destination of each link, we store the offset from the previous entry in the row.
We now develop each of these techniques.
In a lexicographic ordering of all URLs, we treat each URL as an alphanumeric string and sort these strings. Figure 20.5 shows a segment of this sorted order. For a true lexicographic sort of web pages, the domain name part of the URL should be inverted, so that www.stanford.edu becomes edu.stanford.www, but this is not necessary here because we are mainly concerned with links local to a single host.
To each URL, we assign its position in this ordering as the unique identifying integer. Figure 20.6 shows an example of such a numbering and the resulting table. In this example sequence, www.stanford.edu/biology is assigned the integer 2 because it is second in the sequence.
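A minimal sketch of this numbering: treat each URL as a string, sort, and assign each URL its (1-based) position in the sorted order. The URLs below are illustrative, echoing the text's example rather than reproducing Figure 20.6.

```python
urls = [
    "www.stanford.edu/biology",
    "www.stanford.edu/",
    "www.stanford.edu/chemistry",
]
# Position in the lexicographic ordering serves as the page's identifier.
url_to_id = {url: i + 1 for i, url in enumerate(sorted(urls))}
print(url_to_id["www.stanford.edu/biology"])  # 2: second in the sorted order
```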
We next exploit a property that stems from the way most websites are structured to get similarity and locality. Most websites have a template with a set of links from each page in the site to a fixed set of pages on the site (such as its copyright notice, terms of use, and so on). In this case, the rows corresponding to pages in a website will have many table entries in common. Moreover, under the lexicographic ordering of URLs, it is very likely that the pages from a website appear as contiguous rows in the table.
We adopt the following strategy: We walk down the table, encoding each table row in terms of the seven preceding rows. In the example of Figure 20.6, we could encode the fourth row as "the same as the row at offset 2 (meaning, two rows earlier in the table), with 9 replaced by 8." This requires the specification of the offset, the integer(s) dropped (in this case 9) and the integer(s) added (in this case 8). The use of only the seven preceding rows has two advantages: (i) the offset can be expressed with only 3 bits; this choice is optimized empirically (the reason for seven and not eight preceding rows is the subject of Exercise 20.4) and (ii) fixing the maximum offset to a small value like seven avoids having to perform an expensive search among many candidate prototypes in terms of which to express the current row.
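The sketch below applies this strategy to a few hypothetical rows standing in for Figure 20.6 (which we do not reproduce). It scans the up to seven preceding rows, counts only the integers that would be written (treating the 3-bit offset field as comparatively cheap), and falls back to expressing the row afresh when no prototype saves space. The cost model is our simplification, not the authors' specification.

```python
def encode_rows(rows, max_offset=7):
    encoded = []
    for i, row in enumerate(rows):
        cur = set(row)
        best = (0, [], sorted(cur))  # offset 0: express the row afresh
        best_cost = len(row)         # integers written when expressed afresh
        for offset in range(1, min(i, max_offset) + 1):
            proto = set(rows[i - offset])
            dropped = sorted(proto - cur)
            added = sorted(cur - proto)
            cost = len(dropped) + len(added)  # 3-bit offset field not counted
            if cost < best_cost:
                best, best_cost = (offset, dropped, added), cost
        encoded.append(best)
    return encoded

rows = [[1, 2, 4], [1, 4, 9], [1, 2, 3, 4], [1, 4, 8]]
for offset, dropped, added in encode_rows(rows):
    print(offset, dropped, added)
# The last row prints as offset 2, dropping 9 and adding 8, matching the
# "same as the row at offset 2, with 9 replaced by 8" encoding in the text.
```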
What if none of the preceding seven rows is a good prototype for expressing the current row? This would happen, for instance, at each boundary between different websites as we walk down the rows of the table. In this case, we simply express the row as starting from the empty set and "adding in" each integer in that row. By using gap encodings to store the gaps (rather than the actual integers) in each row, and encoding these gaps tightly based on the distribution of their values, we obtain further space reduction. In experiments mentioned in Section 20.5, the series of techniques outlined here appears to use as few as 3 bits per link, on average - a dramatic reduction from the 64 required in the naive representation.
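To make the gap encoding concrete, here is a minimal sketch on a single sorted row. Because of the locality property, most gaps are small integers, which compress well under a variable-length code tuned to the gap distribution. The helper names are our own.

```python
def to_gaps(row):
    """Replace each entry after the first by its offset from the previous one."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def from_gaps(gaps):
    """Invert to_gaps by accumulating the offsets."""
    row, total = [], 0
    for g in gaps:
        total += g
        row.append(total)
    return row

row = [105, 107, 118, 119, 143]
print(to_gaps(row))  # [105, 2, 11, 1, 24] -- mostly small integers
assert from_gaps(to_gaps(row)) == row
```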
Although these ideas give us a representation of sizable web graphs that comfortably fit in memory, we still need to support connectivity queries. What is entailed in retrieving from this representation the set of links from a page? First, we need an index lookup from (a hash of) the URL to its row number in the table. Next, we need to reconstruct these entries, which may be encoded in terms of entries in other rows. This entails following the offsets to reconstruct these other rows - a process that in principle could lead through many levels of indirection. In practice, however, this does not happen very often. A heuristic for controlling this can be introduced into the construction of the table: When examining the preceding seven rows as candidates from which to model the current row, we demand a threshold of similarity between the current row and the candidate prototype. This threshold must be chosen with care. If the threshold is set too high, we seldom use prototypes and express many rows afresh. If the threshold is too low, most rows get expressed in terms of prototypes, so that at query time the reconstruction of a row leads to many levels of indirection through preceding prototypes.
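A sketch of reconstruction for the encoding produced by the encode_rows sketch above: each prototype reference is followed recursively, so each hop is one level of indirection. The threshold test shown in the comment is one plausible way to realize the heuristic; the constant and the cost measure are our assumptions, not a published specification.

```python
def decode_row(encoded, i):
    """Reconstruct row i, following prototype references through any
    rows that were themselves encoded against earlier rows."""
    offset, dropped, added = encoded[i]
    if offset == 0:
        return added                             # the row was expressed afresh
    proto = decode_row(encoded, i - offset)      # one level of indirection per hop
    return sorted((set(proto) - set(dropped)) | set(added))

# One plausible realization of the similarity threshold inside encode_rows
# (the constant is our assumption):
#     if cost <= SIMILARITY_THRESHOLD * len(row):
#         best, best_cost = (offset, dropped, added), cost
# Demanding a low cost amounts to demanding high similarity, which keeps the
# chains of indirection followed by decode_row short at query time.
```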
Exercise 20.4 We noted that expressing a row in terms of one of seven preceding rows allowed us to use no more than three bits to specify which of the preceding rows we are using as prototype. Why seven and not eight preceding rows? (Hint: Consider the case when none of the preceding seven rows is a good prototype.)
Exercise 20.5 We noted that for the scheme in Section 20.4, decoding the links incident on a URL could result in many levels of indirection. Construct an example in which the number of levels of indirection grows linearly with the number of URLs.
References and further reading
The first web crawler appears to be Matthew Gray's Wanderer, written in the spring of 1993. The Mercator crawler is due to Najork and Heydon (Najork and Heydon 2001, 2002); the treatment in this chapter follows their work. Other classic early descriptions of web crawling include Burner (1997), Brin and Page (1998), Cho et al. (1998), and the creators of the Webbase system at Stanford (Hirai et al. 2000). Cho and Garcia-Molina (2002) give a taxonomy and comparative study of different modes of communication between the nodes of a distributed crawler. The Robots Exclusion Protocol standard is described at www.robotstxt.org/wc/exclusion.html. Boldi et al. (2002) and Shkapenyuk and Suel (2002) provide more recent details of implementing large-scale distributed web crawlers.
Our discussion of DNS resolution (Section 20.2.2) uses the current convention for Internet addresses, known as IPv4 (for Internet Protocol version 4); each IP address is a sequence of four bytes. In the future, the convention for addresses (collectively known as the internet address space) is likely to use a new standard known as IPv6 (www.ipv6.org/).
Tomasic and Garcia-Molina (1993) and Jeong and Omiecinski (1995) are key early papers evaluating term partitioning versus document partitioning for distributed indexes. Document partitioning is found to be superior, at least when the distribution of terms is skewed, as it typically is in practice. This result has generally been confirmed in more recent work (MacFarlane et al. 2000). But the outcome depends on the details of the distributed system;
at least one thread of work has reached the opposite conclusion (Ribeiro-Neto and Barbosa 1998; Badue et al. 2001). Sornil (2001) argues for a partitioning scheme that is a hybrid between term and document partitioning. Barroso et al. (2003) describe the distribution methods used at Google. The first implementation of a connectivity server was described by Bharat et al. (1998). The scheme discussed in this chapter, currently believed to be the best published scheme (achieving as few as 3 bits per link for encoding), is described in a series of papers by Boldi and Vigna (2004a, 2004b).
