The Daily Corpora
The Daily Corpora is a web-based platform for evaluating and exploring linguistically annotated text corpora.
For performance reasons, only the wp-2020 corpus is integrated in the publicly available instance (link above). The raw corpus data, together with some derived data files, can be downloaded here:
- Single file (AA/wiki_00, 700 KB, unzipped ca. 3.4 MB)
- One subfolder (AA, 67 MB, unzipped ca. 327 MB)
- The whole wp-2022 corpus (4.4 GB, unzipped ca. 22 GB)
- Named Entities Annotations (1.1 GB, unzipped ca. 2.7 GB)
- Lemma list (199 MB zipped, including token and POS frequencies)
- All trigrams of the corpus (4.4 GB zipped)
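
For readers unfamiliar with the term, a trigram is a sequence of three consecutive tokens. The following is a minimal sketch of how trigram frequencies could be counted. It assumes plain whitespace tokenization and a hypothetical in-memory sample; the function names are illustrative, and it does not reflect how the downloadable trigram file was actually produced or formatted.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Trigram = Tuple[str, str, str]


def trigrams(tokens: List[str]) -> Iterator[Trigram]:
    """Yield every run of three consecutive tokens."""
    return zip(tokens, tokens[1:], tokens[2:])


def count_trigrams(sentences: Iterable[str]) -> Counter:
    """Count trigrams per sentence so that no trigram spans a sentence boundary."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Assumption: naive whitespace tokenization, for illustration only.
        tokens = sentence.split()
        counts.update(trigrams(tokens))
    return counts


# Hypothetical sample data, not taken from the corpus.
sample = ["the cat sat on the mat", "the cat sat down"]
counts = count_trigrams(sample)

for trigram, frequency in counts.most_common(3):
    print(" ".join(trigram), frequency)
```

Counting per sentence keeps trigrams from crossing sentence boundaries. For a multi-gigabyte corpus, the input would need to be streamed file by file rather than held in memory.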