To build a language model, a corpus of more than 80,000 full-text open-access scientific articles was obtained from PubMed Central. The articles are provided in a simple XML format, which was parsed to produce plain text documents using only sections of the articles containing contentful prose (i.e. by excluding sections such as e.g.
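
The section-filtering step described above might look roughly like the following minimal sketch. It is not the authors' parser. It assumes JATS-style element names (`article`, `body`, `sec`, `title`, `p`), and the function `extract_prose`, its `excluded_titles` parameter, and the sample section title "Skipped" are hypothetical, since the text does not say which sections were actually excluded.

```python
import xml.etree.ElementTree as ET


def extract_prose(xml_text, excluded_titles):
    """Return paragraph text from every section whose title is not excluded.

    Element names are assumed (JATS-style); excluded_titles is a
    hypothetical parameter supplied by the caller.
    """
    root = ET.fromstring(xml_text)
    excluded = {t.lower() for t in excluded_titles}
    paragraphs = []

    def visit(sec):
        title = sec.find("title")
        name = "".join(title.itertext()).strip().lower() if title is not None else ""
        if name in excluded:
            return  # skip this section and everything nested inside it
        for child in sec:
            if child.tag == "p":
                # Flatten inline markup and collapse runs of whitespace.
                paragraphs.append(" ".join("".join(child.itertext()).split()))
            elif child.tag == "sec":
                visit(child)

    body = root.find("body")
    if body is not None:
        for sec in body.findall("sec"):
            visit(sec)
    return "\n".join(paragraphs)


# Illustrative input only; the section titles are made up.
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>Language models need   text.</p>
<sec><title>Scope</title><p>Nested <italic>prose</italic> is kept.</p></sec></sec>
<sec><title>Skipped</title><p>This is dropped.</p></sec>
</body></article>"""

text = extract_prose(SAMPLE, excluded_titles={"Skipped"})
print(text)
```

Passing the excluded titles as an argument keeps the choice of sections outside the parser, so the same function can be reused with whatever exclusion list a given corpus requires.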