Introduction

Lateco

The Large Text Corpora Library (lateco) is an object-oriented Python library for processing the data format of the wp-2022 corpus. It offers easy-to-use objects for iterating over a whole corpus or parts of it, as well as helper functions for linguistic analysis. Lateco requires Python 3.8 or later and can be installed from the Python Package Index:

pip install lateco
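
To confirm that the installation succeeded and that the interpreter meets the version requirement, a small check along the following lines may help. It is only a sketch that relies on the standard library; the helper name installed_version is made up for this example and is not part of lateco.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration; it uses only the standard library.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Lateco is documented as requiring Python 3.8 or later.
python_ok = sys.version_info >= (3, 8)

print("Python version sufficient:", python_ok)
print("lateco version:", installed_version("lateco"))
```

If the second line prints None, the package is not visible to the interpreter that ran the script, which usually points to pip having installed into a different environment.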

The Modules

The library is split into several Python modules:

  • The Items module offers the basic functionality for iterating over a corpus.
  • SearchEngine offers functionality to connect to an Elasticsearch instance that stores corpus data in various indices.
  • SentiWs contains functions to work with the SentiWS dataset for sentiment analysis.
  • Tagsets.py has functions to work with POS tagsets, mainly the German STTS.
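
To give an idea of the kind of task a tagset helper addresses, the sketch below groups fine-grained STTS tags into coarse word classes and counts them for a tagged sentence. It does not use lateco and does not reflect its API: the names STTS_COARSE and coarse_pos, the coarse class labels, and the sample sentence are assumptions made for this illustration.

```python
from collections import Counter

# Ordered (prefix, coarse class) pairs for STTS tags. More specific prefixes
# come first, so "PTK" (particles) is tested before the general "P" (pronouns).
# Illustrative mapping only; this is not lateco's own data structure.
STTS_COARSE = [
    ("ADJ", "adjective"),    # ADJA, ADJD
    ("ADV", "adverb"),       # ADV
    ("AP", "adposition"),    # APPR, APPRART, APPO, APZR
    ("ART", "article"),      # ART
    ("CARD", "numeral"),     # CARD
    ("ITJ", "interjection"), # ITJ
    ("KO", "conjunction"),   # KON, KOUS, KOUI, KOKOM
    ("N", "noun"),           # NN, NE
    ("PTK", "particle"),     # PTKNEG, PTKZU, PTKVZ, ...
    ("PAV", "adverb"),       # pronominal adverb
    ("PROAV", "adverb"),     # alternative tag name for the pronominal adverb
    ("P", "pronoun"),        # PPER, PDS, PIS, PPOSAT, PRELS, ...
    ("V", "verb"),           # VVFIN, VAFIN, VMFIN, VVINF, VVPP, ...
    ("$", "punctuation"),    # $. $, $(
]


def coarse_pos(tag):
    """Map an STTS tag to a coarse word class; unknown tags yield 'other'."""
    for prefix, coarse in STTS_COARSE:
        if tag.startswith(prefix):
            return coarse
    return "other"


# A hand-tagged sample sentence as (token, STTS tag) pairs.
tagged = [("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN"),
          ("nicht", "PTKNEG"), (".", "$.")]

counts = Counter(coarse_pos(tag) for _, tag in tagged)
print(counts)
```

Matching on prefixes keeps the table short, at the cost of depending on the order of its entries; an explicit dictionary with one entry per tag would be longer but order-independent.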