The Colorado Richly Annotated Full-Text (CRAFT) Corpus, being developed at the University of Colorado Denver, was used for this work. Currently, the corpus consists of 97 full-text open-access scientific articles that have been annotated by the Mouse Genome Institute with concepts from the Gene Ontology and Mammalian Phenotype Ontology. Thirty-six of the articles have been annotated with deep syntactic structures similar to those of the Penn Treebank corpus described in (Marcus et al., 1994). As this is a work in progress, eight of the articles have been set aside for a final holdout evaluation, and results for these articles are not reported here. In addition to the standard treebank annotation, the corpus annotates the NML tag discussed in (Bies et al., 2005) and (Vadas and Curran, 2007), which marks nominal subconstituents that do not observe the right-branching structure common to many (but not all) noun phrases. This is of particular importance for coordinated noun phrases because it provides an unambiguous representation of the correct coordination structure. The coordination instances in the CRAFT data were converted to simplified coordination structures consisting of conjunctions and their conjuncts using a script that cleanly translates the vast majority of coordination structures.
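To make the conversion step concrete, the following is a minimal sketch of how a treebank bracketing could be reduced to a conjunction and its conjuncts. It is not the script used for CRAFT, which is not reproduced here. The function names (`parse`, `words`, `coordinations`), the example noun phrase, and the heuristic of treating every non-CC, non-comma sibling of a CC node as a conjunct are all assumptions made for illustration.

```python
def parse(s):
    """Parse a Penn Treebank-style bracketed string.

    Returns nested (label, children) tuples; leaf words are plain strings.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def words(tree):
    """Return the leaf words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1]:
        out.extend(words(child))
    return out


def coordinations(tree):
    """Yield (conjunction, conjuncts) for every node that has a CC child.

    Assumption: all siblings of the CC other than CC and comma nodes
    are conjuncts. Real treebank data has cases this does not handle.
    """
    if isinstance(tree, str):
        return
    label, children = tree
    ccs = [c for c in children if not isinstance(c, str) and c[0] == "CC"]
    if ccs:
        conjuncts = [" ".join(words(c)) for c in children
                     if not isinstance(c, str) and c[0] not in ("CC", ",")]
        yield (" ".join(words(ccs[0])), conjuncts)
    for child in children:
        yield from coordinations(child)


# Hypothetical noun phrase, with and without the NML bracket.
with_nml = parse("(NP (NML (NN gene) (CC and) (NN protein)) (NN expression))")
flat = parse("(NP (NN gene) (CC and) (NN protein) (NN expression))")

nml_result = list(coordinations(with_nml))
flat_result = list(coordinations(flat))
```

With the NML bracket, the sketch recovers "gene" and "protein" as the two conjuncts and leaves "expression" outside the coordination. On the flat bracketing the same heuristic also picks up "expression" as a conjunct, which illustrates why the NML annotation matters for coordinated noun phrases.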