Heaps' law [2] can also be applied to characterize natural language: the vocabulary size grows as a sublinear function of document size, say $N_t \sim t^{\lambda}$ with $\lambda < 1$, where $t$ denotes the total number of words and $N_t$ is the number of distinct words. One ingredient causing such sublinear growth may be the memory and bursty nature of human language [23]–[25]. A particularly interesting phenomenon is the coexistence of Zipf's law and Heaps' law. Gelbukh and Sidorov [26] observed both laws in English, Russian and Spanish texts, with exponents that depend on the language. Similar results were recently reported for corpora of web texts [27], including the Industry Sector database, the Open Directory and the English Wikipedia. Beyond the statistical regularities of text, the occurrences of tags for online resources [28], [29], keywords in scientific publications [30], words contained in web pages returned by web searches [31], and identifiers in modern Java, C++ and C programs [32] also display Zipf's law and Heaps' law simultaneously. Benz et al. [33] reported Zipf's law for the distribution of the features of small organic molecules, together with Heaps' law for the number of unique features. In particular, Zipf's law and Heaps' law are closely related to evolving networks. It is well known that some networks grow in an accelerating manner [34], [35] and have scale-free structures (see, for example, the WWW [36] and the Internet [37]). In fact, the former property corresponds to Heaps' law, in that the number of nodes grows sublinearly with the total degree of nodes, while the latter is equivalent to Zipf's law for the degree distribution.
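
Both quantities discussed above can be measured directly from a token sequence: the vocabulary-growth curve $(t, N_t)$ that Heaps' law describes, and the rank-ordered frequencies that Zipf's law describes. The following is a minimal sketch, assuming whitespace tokenization; the function names `heaps_curve` and `zipf_frequencies` and the toy sentence are illustrative choices, not taken from the paper.

```python
from collections import Counter

def heaps_curve(tokens):
    """Return [(t, N_t)]: number of distinct tokens among the first t tokens."""
    seen = set()
    curve = []
    for t, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((t, len(seen)))
    return curve

def zipf_frequencies(tokens):
    """Return frequencies in descending order; index i holds the rank r = i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)

# Toy input (hypothetical); a real analysis needs a much longer text.
tokens = "the cat sat on the mat and the dog sat on the log".split()
curve = heaps_curve(tokens)
freqs = zipf_frequencies(tokens)
print(curve[-1], freqs)
```

On a real corpus, the two exponents would then be estimated from the log-log slopes of `curve` and of `freqs` against rank; the fitting step is left out here to keep the sketch short.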

Baeza-Yates and Navarro [38] showed that the two laws are related: when $\alpha > 1$, where $\alpha$ denotes the Zipf's exponent, it can be derived that if both Zipf's law and Heaps' law hold, $\lambda = 1/\alpha$. Using a more sophisticated approach, Leijenhorst and Weide [39] generalized this result from Zipf's law to Mandelbrot's law [40], where $z(r) \sim (r_c + r)^{-\alpha}$ and $r_c$ is a constant. Based on a variant of the Simon model [16], Montemurro and Zanette [41], [42] showed that Zipf's law is a result of Heaps' law, with $\alpha$ depending on $\lambda$ and the modeling parameter. Also based on a stochastic model, Serrano et al. [27] claimed that Zipf's law can result in Heaps' law when $\alpha > 1$, and that the Heaps' exponent is $\lambda = 1/\alpha$. In this paper, we prove that for an evolving system with a stable Zipf's exponent, Heaps' law can be directly derived from Zipf's law without the help of any specific stochastic model. The relation $\lambda = 1/\alpha$ is only an asymptotic solution that holds for very large systems with $\alpha > 1$. We will refine this result for finite-size systems and complement it with the case $\alpha < 1$. In particular, we analyze the effects of system size on the Heaps' exponent, which have been completely ignored in the literature. Extensive empirical analysis of tens of disparate systems, ranging from keyword occurrences in scientific journals to spreading patterns of the novel influenza A (H1N1) virus, demonstrates that the refined results presented here better capture the relation between the Zipf's and Heaps' exponents. In particular, our results agree well with the evolving regularities of accelerating networks and suggest that accelerating growth is necessary to keep a stable power-law degree distribution. Whereas the majority of studies on Heaps' law are limited to linguistics, our work opens the door to a much wider horizon that includes many complex systems.
Results
Analytical Results

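For simplicity of depiction, we use the language of word statistics in text, where $z(r)$ denotes the frequency of the word with rank $r$. However, the results are not limited to language systems. Note that $r - 1$ is exactly the number of distinct words with frequency larger than $z(r)$. Denoting by $t$ the total number of word occurrences (i.e., the size of the text) and by $N_t$ the corresponding number of distinct words, then

$$ r - 1 = N_t \int_{z(r)}^{z_{\max}} p(z')\,dz', \tag{1} $$

where $z_{\max} = z(1)$ is the largest frequency and $p(k)$ is the frequency distribution. Note that $p(k) = A k^{-\beta}$ with $A$ a constant. According to the normalization condition $\int_{1}^{z_{\max}} p(k)\,dk = 1$, when $\beta > 1$ and $z_{\max} \gg 1$ (these two conditions hold for most real systems), $A \approx \beta - 1$. Substituting $p(z')$ in Eq. 1 by $(\beta - 1)\,z'^{-\beta}$, we have

$$ r - 1 = N_t \left[ z(r)^{1-\beta} - z_{\max}^{1-\beta} \right]. \tag{2} $$

According to Zipf's law $z(r) = z_{\max}\, r^{-\alpha}$ and the relation between the Zipf's and power-law exponents, $\alpha = 1/(\beta - 1)$, the right part of Eq. 2 can be expressed in terms of $\alpha$ and $r$, as

$$ N_t \left[ z(r)^{1-\beta} - z_{\max}^{1-\beta} \right] = N_t\, z_{\max}^{-1/\alpha}\,(r - 1). \tag{3} $$

Combining Eq. 1 and Eq. 3, we obtain the estimate of $z_{\max}$, as

$$ z_{\max} = N_t^{\alpha}. \tag{4} $$

Obviously, the text size $t$ is the sum of all words' occurrences, say

$$ t = \sum_{r=1}^{N_t} z(r) \approx \int_{1}^{N_t} z(r)\,dr = \frac{z_{\max}\left( N_t^{1-\alpha} - 1 \right)}{1 - \alpha}. \tag{5} $$

Notice that the summation $\sum_{r=1}^{N_t} z(r)$ is larger than the integral $\int_{1}^{N_t} z(r)\,dr$. The relative error of this approximation increases with $\alpha$ and decreases with $N_t$ (see Figure S1 for numerical results on the sensitivity of the relative error to the parameters $\alpha$ and $N_t$). Substituting $z_{\max}$ by Eq. 4, we arrive at the relation between $N_t$ and $t$:

$$ t = \frac{N_t^{\alpha}\left( N_t^{1-\alpha} - 1 \right)}{1 - \alpha}. \tag{6} $$

The direct comparison between the empirical observations and Eq. 6, as well as an improved version of Eq. 6, is shown in Materials and Methods. Clearly, Eq. 6 is not a simple power law as described by Heaps' law. We will see that Heaps' law is an approximate result that can be derived from Eq. 6. Indeed, when $\alpha$ is considerably larger than 1, $N_t^{1-\alpha} \ll 1$ and $N_t \approx (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}$; while if $\alpha$ is considerably smaller than 1, $N_t^{1-\alpha} \gg 1$ and $N_t \approx (1 - \alpha)\, t$. This approximate result can be summarized as

$$ \lambda = \begin{cases} 1/\alpha, & \alpha > 1, \\ 1, & \alpha < 1, \end{cases} \tag{7} $$

which is in accordance with the previous analytical results [29], [38], [39] for $\alpha > 1$ and complements them with the case $\alpha < 1$.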
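
Two steps of this derivation can be checked numerically. The sketch below assumes Zipf's law in the form $z(r) = z_{\max} r^{-\alpha}$ and takes Eq. 6 to be $t = N_t^{\alpha}(N_t^{1-\alpha} - 1)/(1 - \alpha)$, with the limit $t = N_t \ln N_t$ at $\alpha = 1$. It first measures the relative error of replacing the sum over ranks by an integral, then inverts Eq. 6 by bisection to compare $N_t$ with the two limiting forms. The function names, the bisection tolerance and the sample value of `t` are illustrative assumptions.

```python
import math

def relative_error(n, alpha):
    """Relative gap between sum_{r=1}^{n} r^-alpha and its integral from 1 to n.

    z_max cancels in the ratio, so z(r) = r^-alpha is used directly.
    """
    total = sum(r ** -alpha for r in range(1, n + 1))
    if alpha == 1.0:
        integral = math.log(n)
    else:
        integral = (n ** (1.0 - alpha) - 1.0) / (1.0 - alpha)
    return (total - integral) / total

def text_size(n, alpha):
    """Text size t as a function of vocabulary size n, following Eq. 6."""
    if alpha == 1.0:
        return n * math.log(n)  # limit of Eq. 6 as alpha -> 1
    return (n - n ** alpha) / (1.0 - alpha)

def distinct_words(t, alpha, tol=1e-9):
    """Invert text_size by bisection; text_size is increasing in n for n >= 1."""
    lo, hi = 1.0, 2.0
    while text_size(hi, alpha) < t:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if text_size(mid, alpha) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 1e8  # hypothetical text size
n_large_alpha = distinct_words(t, 2.0)  # compare with (alpha - 1)^(1/alpha) * t^(1/alpha)
n_small_alpha = distinct_words(t, 0.5)  # compare with (1 - alpha) * t
print(n_large_alpha / math.sqrt(t), n_small_alpha / (0.5 * t))
```

For this choice of `t`, both printed ratios are close to 1, which is the sense in which the limiting forms approximate Eq. 6 when $\alpha$ is far from 1.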

Although Eq. 6 is different from a strict power law, numerical results indicate that the relationship between $N_t$ and $t$ can be well fitted by power-law functions (the fit is usually much better than for the empirical observations of Heaps' law; see Materials and Methods for some typical examples). In Fig. 1, we report the numerical results with the total number of word occurrences $t$ fixed. When $\alpha$ is considerably larger or smaller than 1, the numerical results agree well with the known analytical solution in Eq. 7; however, a clear deviation is observed for $\alpha \approx 1$ (see Materials and Methods for how to obtain the numerical results for $\alpha = 1$).
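
The deviation near $\alpha \approx 1$ can be illustrated without fitting, by looking at the local log-log slope $d\ln N_t / d\ln t$ implied by Eq. 6. The sketch below assumes Eq. 6 in the form $t = (N_t - N_t^{\alpha})/(1 - \alpha)$, which gives the slope $(1 - N_t^{\alpha-1})/(1 - \alpha N_t^{\alpha-1})$, and compares it with the asymptotic exponent of Eq. 7 ($1/\alpha$ for $\alpha > 1$, 1 for $\alpha < 1$). Note the assumptions: the local slope is used here as a stand-in for the fitted Heaps' exponent, and it is evaluated at a fixed, hypothetical vocabulary size rather than at a fixed text size as in the text.

```python
import math

def local_heaps_exponent(n, alpha):
    """Local slope d ln(N_t) / d ln(t) implied by Eq. 6 at vocabulary size n."""
    if alpha == 1.0:
        log_n = math.log(n)
        return log_n / (1.0 + log_n)  # alpha -> 1 limit, from t = N_t ln N_t
    x = n ** (alpha - 1.0)
    return (1.0 - x) / (1.0 - alpha * x)

def asymptotic_heaps_exponent(alpha):
    """Eq. 7: 1/alpha for alpha > 1 and 1 for alpha < 1 (alpha = 1 also returns 1)."""
    return 1.0 / alpha if alpha > 1.0 else 1.0

n = 1e4  # hypothetical vocabulary size
for alpha in (0.5, 0.9, 1.0, 1.1, 2.0):
    local = local_heaps_exponent(n, alpha)
    limit = asymptotic_heaps_exponent(alpha)
    print(alpha, round(local, 3), round(limit, 3))
```

At this vocabulary size the local slope is close to the Eq. 7 value for $\alpha = 0.5$ and $\alpha = 2$, but falls visibly below it for $\alpha$ between 0.9 and 1.1.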

Figure 1. Relationship between the Heaps' expo
