
Welcome


The Daily Corpora (TDC) is a platform for evaluating and exploring linguistically annotated text corpora. It consists of several parts:

  • wp-2022: A linguistically annotated version of the German Wikipedia, containing tokens, POS tags, lemmata and named entities. The corpus is stored in an easy-to-process, text-based data format. More corpora may be added in the future, depending on available hardware.
  • lateco: A library for processing the data format of the wp-2022 corpus. Iterating over the corpus elements takes only a few lines of code. The library was originally implemented in Python; a much faster Go version is also available.
  • The Web Application: A web application for answering many common linguistic research questions. If you don't know where to start, click here 😄

This is just a hobby project 😄.