Thus, for each term t_l and document D_i, we can generate an n × d matrix of probabilities in terms of these parameters, where n is the number of documents and d is the number of terms. For a given corpus, we also have the n × d term-document occurrence matrix X, which tells us which terms actually occur in each document and how many times each one occurs. In other words, X(i, l) is the number of times that term t_l occurs in document D_i. Therefore, we can use a maximum likelihood estimation algorithm, which maximizes the product of the probabilities of the terms that are observed in each document in the entire collection. The logarithm of this product can be expressed as a weighted sum of the logarithms of the terms in Equation 4.12, where the weight of the (i, l)th term is its frequency count X(i, l). This is a constrained optimization problem, which maximizes the log-likelihood ∑_{i,l} X(i, l) · log(P(t_l|D_i)) subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1:
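As a rough numerical illustration of this objective, the sketch below evaluates the weighted sum of X(i, l) · log(P(t_l|D_i)) for a toy corpus. It assumes that Equation 4.12 has the usual mixture form, in which P(t_l|D_i) is a sum over topics of a topic-document probability times a term-topic probability. The array names `doc_topic` and `topic_term`, the helper `log_likelihood`, and the toy counts are all hypothetical and serve only to make the computation concrete.

```python
import numpy as np


def log_likelihood(X, doc_topic, topic_term, eps=1e-12):
    """Return the sum over (i, l) of X[i, l] * log P(t_l | D_i).

    Assumption (not taken from the text): P(t_l | D_i) is the mixture
    sum_k doc_topic[i, k] * topic_term[k, l].
    """
    # The sum-to-1 constraints: every row of both parameter matrices
    # must be a probability distribution.
    assert np.allclose(doc_topic.sum(axis=1), 1.0)
    assert np.allclose(topic_term.sum(axis=1), 1.0)

    P = doc_topic @ topic_term  # n x d matrix of P(t_l | D_i)
    # eps guards against log(0); entries with X[i, l] == 0 contribute nothing.
    return float(np.sum(X * np.log(P + eps)))


# Toy corpus: n = 3 documents, d = 4 terms, 2 topics (all values are made up).
X = np.array([[2, 1, 0, 0],
              [0, 0, 3, 1],
              [1, 1, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
doc_topic = rng.random((3, 2))
doc_topic /= doc_topic.sum(axis=1, keepdims=True)    # topic probabilities per document sum to 1
topic_term = rng.random((2, 4))
topic_term /= topic_term.sum(axis=1, keepdims=True)  # term probabilities per topic sum to 1

ll = log_likelihood(X, doc_topic, topic_term)
print(ll)
```

An estimation procedure would adjust `doc_topic` and `topic_term` to increase this value while keeping every row normalized. The sketch only evaluates the objective for fixed parameters and does not perform that optimization.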