Contemporary Amharic Corpus (CACO)
This is version 1.1 of the Contemporary Amharic Corpus (CACO). CACO was collected from various sources that have been proofread or edited. The corpus contains about 24 million tokens. Because it is partly a web corpus, we applied some automatic spelling corrections. We also modified the existing morphological analyzer, HornMorpho, to use it for automatic tagging. (The modified version, HornMorphoA version 3.1, is available at: https://github.com/hltdi/HornMorpho)
All documents in the corpus have been made publicly available on the Web. For copyright reasons, the sentences in this distribution are randomized. By downloading this corpus, you agree to use it for research purposes only.
When using this data, please cite the original publication:
Gezmu, Andargachew Mekonnen, Binyam Ephrem Seyoum, Michael Gasser, and Andreas Nürnberger. "Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus." In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pp. 65-70. 2018. Available at: http://www.aclweb.org/anthology/W18-3809
The documents are provided in plain text and XML formats. The XML documents are the tagged versions of the plain text documents. For more details about the corpus, refer to the original publication.
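As a first sanity check after downloading, it can help to count tokens in a plain text document and to list which element names occur in its XML counterpart. The following is a minimal Python sketch under stated assumptions, not part of the CACO distribution. Whitespace splitting is an assumed tokenization and may not match the one behind the 24 million token figure. The element names in the sample (doc, s, w) are hypothetical, because the XML schema is not described here; the histogram function deliberately assumes no schema at all.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens (an assumption, not CACO's own tokenizer)."""
    return sum(len(line.split()) for line in lines)


def tag_histogram(xml_text: str) -> Counter:
    """Count how often each element name occurs, without assuming any schema."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    # In-memory stand-ins for a plain text file and its tagged XML version.
    # The tag names below are hypothetical placeholders.
    sample_text = ["ሰላም ዓለም ።", "አዲስ አበባ"]
    sample_xml = "<doc><s><w>ሰላም</w><w>ዓለም</w></s></doc>"

    print(count_tokens(sample_text))
    print(tag_histogram(sample_xml))
```

For real files, pass an open file object (opened with encoding="utf-8") to count_tokens, and read the XML with ET.parse(path).getroot() instead of ET.fromstring. Inspecting the tag histogram first shows which element names the tagged documents actually use before any schema-specific code is written.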