WP-2022 Text Corpus

The wp-2022 text corpus is currently the only corpus available in TDC. It was compiled from a database dump of the entire German Wikipedia from January 2022, keeping only the plain text of the articles. Lemma, part-of-speech (POS), and named-entity annotations were added with the spaCy library. The corpus contains:

  • 10,947,568 distinct lemmata
  • 63,768,638 sentences
  • 4,319,566 articles
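
As a rough illustration of how the three figures above relate to the annotations, the sketch below counts distinct lemmata, sentences, and articles over a tiny in-memory sample. The storage format of the annotated corpus is not described here, so the `Token` class, the article-to-sentences mapping, and the `corpus_stats` function are all hypothetical names chosen for this example, not part of TDC or spaCy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout, not the real corpus schema)."""
    text: str
    lemma: str
    pos: str


# Assumed layout: article title -> list of sentences -> list of tokens.
Corpus = Dict[str, List[List[Token]]]


def corpus_stats(corpus: Corpus) -> Dict[str, int]:
    """Count distinct lemmata, sentences, and articles in the corpus."""
    lemmata = set()
    n_sentences = 0
    for sentences in corpus.values():
        n_sentences += len(sentences)
        for sentence in sentences:
            # A lemma is counted once, no matter how often it occurs.
            lemmata.update(token.lemma for token in sentence)
    return {
        "lemmata": len(lemmata),
        "sentences": n_sentences,
        "articles": len(corpus),
    }


# Toy sample with two articles and three sentences.
sample: Corpus = {
    "Berlin": [
        [Token("Berlin", "Berlin", "PROPN"), Token("ist", "sein", "AUX"),
         Token("groß", "groß", "ADJ")],
        [Token("Städte", "Stadt", "NOUN"), Token("sind", "sein", "AUX"),
         Token("groß", "groß", "ADJ")],
    ],
    "Hamburg": [
        [Token("Hamburg", "Hamburg", "PROPN"), Token("ist", "sein", "AUX"),
         Token("eine", "ein", "DET"), Token("Stadt", "Stadt", "NOUN")],
    ],
}

stats = corpus_stats(sample)
print(stats)
```

Inflected forms such as "ist" and "sind" share the lemma "sein", which is why the lemma count is lower than the number of distinct surface forms.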

The annotated corpus can be found here.