3.1.2 Spelling error detectionBefore performing this step, tweets must be removed noisy contents such asemotion symbols (e.g: ❤❤,..), hashtag symbols, link url @username, etc. In orderto detect errors, we synthesize and build a dictionary for all Vietnamese words.This dictionary includes more than 7,300 words. In our method, a word will beidentified error if it does not appear in the dictionary. Normally, Vietnameseincludes two kind of errors: typing error and spelling error.3.1.3 NormalizationAfter spelling error detection phase, the word with spelling error was identified.The system first uses vocabulary structure, set of syllable rules to fix this typingerror, then the result will be measured the similarity with the word in the dictionary to find word with the highest similarity degree. In the case we can notfind the result word in the dictionary, the system will use n-gram to normalizethe error word. Table 1 shows the normalization results Vietnamese tweets.
đang được dịch, vui lòng đợi..
