A formal derivation of Heaps’ Law
D.C. van Leijenhorst, Th.P. van der Weide *
Department of Computer Science, Faculty of Mathematics and Computing Science,
Radboud University of Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, Netherlands
Received 25 September 2003; received in revised form 26 February 2004; accepted 2 March 2004
Abstract
Word frequencies in text documents can be reasonably described by the Mandelbrot
distribution, which has Zipf’s Law as a special case. Furthermore, the growth of
vocabulary size as a function of the text size (its number of words) has been described in
Heaps’ Law. It has been shown that these two experimental laws are related.
In this paper we go a step further, and provide a (formal) derivation of Heaps’ Law
from the Mandelbrot distribution. We also provide a specification of the validity area
for applying Heaps’ Law.

© 2004 Elsevier Inc. All rights reserved.
1. Introduction
In many practical situations, a connection has been shown between the
order of probability of events, and the probability itself. The most well-known
models for such connections are Zipf’s Law [12] and the Mandelbrot
distribution [8].
* Corresponding author. Fax: +31-24-3553450.
E-mail address: tvdw@cs.kun.nl (Th.P. van der Weide).
0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2004.03.006
Information Sciences 170 (2005) 263–272
www.elsevier.com/locate/ins
Let the $r$-th most probable event have probability $p_r$. Zipf’s Law then states
that $p_r \cdot r$ is (almost) equal for all events, while the Mandelbrot distribution
claims this for the expression $p_r \cdot (c + r)^{\theta}$ for some parameters $c$ and
$\theta$. In the case $c = 0$, the distribution is also referred to as the generalized
Zipf’s Law. Some authors motivate the validity of these laws from physical
phenomena; see for example [4] for Zipf’s Law in the context of cities. But it is
also possible to derive Zipf/Mandelbrot’s Law from a simple statistical model [7].
For example, Zipf’s Law can be derived for word occurrences in an artificial
language, when it is assumed that the letters that compose a word are drawn
randomly from some distribution. In practice, however, words are thoughtfully
selected by the author; yet in the long run this selection process may conform to
such a statistical description.
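As an illustration of the two laws just stated, the following Python sketch builds a Mandelbrot distribution over ranked events and checks that the product of each probability with $(c + r)^{\theta}$ is the same for every rank $r$. It assumes the explicit form $p_r = a_N / (c + r)^{\theta}$, with $a_N$ chosen so that the probabilities sum to one. The function name and the parameter values are hypothetical and are not taken from the paper.

```python
def mandelbrot_distribution(n, c, theta):
    """Return (probabilities, a_n) for ranks r = 1..n.

    Assumed form: p_r = a_n / (c + r)**theta, where a_n normalizes
    the probabilities so that they sum to 1.
    """
    weights = [(c + r) ** -theta for r in range(1, n + 1)]
    a_n = 1.0 / sum(weights)
    return [a_n * w for w in weights], a_n


# Hypothetical parameter values, chosen only for illustration.
n, c, theta = 100, 1.0, 1.2
probs, a_n = mandelbrot_distribution(n, c, theta)

# Mandelbrot: p_r * (c + r)**theta is constant (equal to a_n) over all ranks.
mandelbrot_products = [p * (c + r) ** theta for r, p in enumerate(probs, start=1)]

# Zipf's Law as the special case c = 0, theta = 1: p_r * r is constant.
zipf_probs, zipf_a_n = mandelbrot_distribution(n, 0.0, 1.0)
zipf_products = [p * r for r, p in enumerate(zipf_probs, start=1)]
```

Both product lists are constant up to floating-point rounding, which is exactly the "(almost) equal for all events" property in its idealized form.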
Another experimental law of nature is Heaps’ Law [6], which describes the
average growth in the number of unique elements (also referred to as the number
of records) when elements are drawn randomly, with replacement, from
some statistical distribution. For example, in the case of word occurrences in
natural language, Heaps’ Law predicts the vocabulary size of a document from
its text size, i.e., the number of words it contains. Heaps’ Law states that this
number of unique elements will grow according to $a k^{b}$ for some
application-dependent constants $a$ and $b$, $0 < b < 1$, where $k$ is the number
of drawings. See Table 1 for an overview of the symbols used.
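The quantity that Heaps’ Law describes can be observed directly in a small simulation. The sketch below draws elements independently (that is, with replacement) from a Mandelbrot distribution and records the number of unique elements seen after each drawing. The weights $(c + r)^{-\theta}$, the function names, the seed, and the parameter values are assumptions made for illustration; this is not the authors' procedure.

```python
import random


def mandelbrot_weights(n, c, theta):
    """Unnormalized Mandelbrot weights (c + r)**(-theta) for ranks 1..n."""
    return [(c + r) ** -theta for r in range(1, n + 1)]


def vocabulary_growth(n, c, theta, draws, seed=0):
    """Return a list whose entry k-1 is the number of unique elements after k drawings."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    weights = mandelbrot_weights(n, c, theta)
    # random.choices normalizes the weights and samples with replacement.
    sample = rng.choices(range(1, n + 1), weights=weights, k=draws)
    seen = set()
    growth = []
    for element in sample:
        seen.add(element)
        growth.append(len(seen))
    return growth


# Hypothetical parameter values, chosen only for illustration.
growth = vocabulary_growth(n=100, c=1.0, theta=1.2, draws=500)
```

A single run gives one noisy growth curve; averaging many runs with different seeds approximates the average growth that Heaps’ Law refers to.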
In this paper we focus on the relation between Zipf’s Law and the
Mandelbrot distribution on the one hand, and Heaps’ Law on the other hand.
This relation has been recognized, for example in [3], but has not
been formally justified. In this paper we assume that elements are drawn
according to the Mandelbrot distribution, and derive Heaps’ Law for the
number of unique elements drawn. As a consequence, Heaps’ Law can also be
regarded in a natural way as a complexity estimate.
Unfortunately, this analysis leads to a rather intractable recurrence relation
that has no analytical solution. By applying techniques from complexity
theory, restricting ourselves to first-order terms, Heaps’ Law is obtained. Note
that by including second-order terms, a more advanced formulation of Heaps’
Law may be obtained.
Table 1
Table of the most important symbols used in this paper
Symbol      Meaning
$N$         Vocabulary size
$c$         Constant in Mandelbrot distribution
$\theta$    Constant in Mandelbrot distribution
$a_N$       Normalization constant of Mandelbrot distribution
$a$         Constant in Heaps’ Law
$b$         Constant in Heaps’ Law
$S_k$       Probability of new word in $k$-th drawing
$M_k$       $k$-th inverse moment of probability distribution
$N_k$       Expected vocabulary size after $k$ drawings
In Fig. 1 we see how nicely average growth can be fitted by a power function
of the form $a k^{b}$ in the case of a set of 100 elements (denoted as $N = 100$ in
this figure; theta ($\theta$) and $c$ refer to the parameters of the Mandelbrot
distribution, and $a_N$ is a normalization constant for this distribution that will
be introduced in a later section).
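A fit of this kind can be reproduced in outline. The sketch below computes the expected number of unique elements after $k$ independent drawings, $\sum_r \bigl(1 - (1 - p_r)^k\bigr)$, for a Mandelbrot distribution with $N = 100$, and then fits a power function by ordinary least squares on log-log axes. The values of $c$ and $\theta$, the range of $k$, and the fitting method are assumptions; the settings used for Fig. 1 are not given in this section.

```python
import math


def expected_vocabulary(probs, k):
    """Expected number of distinct elements after k independent drawings."""
    return sum(1.0 - (1.0 - p) ** k for p in probs)


def fit_power_law(ks, values):
    """Least-squares fit of log(value) = log(a) + b * log(k); returns (a, b)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Hypothetical parameters; only N = 100 is taken from the text.
n, c, theta = 100, 1.0, 1.2
weights = [(c + r) ** -theta for r in range(1, n + 1)]
total = sum(weights)
probs = [w / total for w in weights]

ks = list(range(1, 51))
values = [expected_vocabulary(probs, k) for k in ks]
a_fit, b_fit = fit_power_law(ks, values)
```

Because the expected growth curve is increasing and concave with value 1 at $k = 1$, the fitted exponent `b_fit` falls strictly between 0 and 1, in line with the constraint on $b$ in Heaps’ Law.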
However, the approximation expressed by Heaps’ Law is not valid everywhere.
For one thing, the number of records is bounded by the total number
of events, while a power function will eventually exceed this number. In order
to express the limited validity of Heaps’ Law, we also focus on the validity area
of the approximations in our analysis. The validity area is described rather
conservatively; in practice the area will be larger.
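The boundedness argument gives a simple outer limit on where a power function can possibly apply: $a k^{b} \le N$ holds only for $k \le (N / a)^{1/b}$. The sketch below computes this crossover point. It is a crude upper limit implied by the argument above, not the validity area derived in the paper, and the values of $a$ and $b$ are hypothetical.

```python
def crossover_draws(a, b, n):
    """Return the real-valued k at which the power function a * k**b reaches n."""
    if a <= 0 or not 0 < b < 1 or n <= 0:
        raise ValueError("expected a > 0, 0 < b < 1 and n > 0")
    return (n / a) ** (1.0 / b)


# Hypothetical Heaps constants for a set of N = 100 elements.
k_max = crossover_draws(a=2.0, b=0.5, n=100)  # 2500.0
```

Beyond `k_max` drawings the power function predicts more unique elements than exist, so Heaps’ Law cannot hold there regardless of the underlying distribution.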
The structure of this paper is as follows. In Section 2 we discuss related
work. In Section 3 we present a statistical model for the vocabulary size in a
text, i.e. the average number of unique occurrences after a series of drawings.
In Section 4 we solve the resulting equation, leading to Heaps’ Law. We also
give bounds for the validity area of the approximations. In Section 5 we draw
some conclusions and discuss further research.
