The Large Text Corpora Library (lateco) is an object-oriented Python library for processing the data format of the wp-2022 corpus. It offers easy-to-use objects for iterating over a whole corpus or parts of it, as well as helper functions for linguistic analysis. Lateco requires Python 3.8 or later and can be installed from the Python Package Index.
The library is split into several Python modules:
- The Items module offers the basic functionality for iterating over a corpus.
- SearchEngine connects to an Elasticsearch instance that stores corpus data in several indices.
- SentiWs contains functions to work with the SentiWS dataset for sentiment analysis.
- Tagsets.py has functions to work with POS tagsets, mainly the German STTS.
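
To give a feel for the kind of work the SentiWs module deals with, here is a minimal sketch in plain Python. It does not use lateco's own API, which is not shown here; the function names `parse_sentiws_lines` and `score_tokens` are hypothetical. It assumes the tab-separated SentiWS line layout (`lemma|POS`, weight, comma-separated inflected forms), and the sample lines and weights are made up for illustration, not taken from the dataset.

```python
from typing import Dict, Iterable, List


def parse_sentiws_lines(lines: Iterable[str]) -> Dict[str, float]:
    """Map every word form (lowercased) to its polarity weight.

    Assumed line layout: "lemma|POS<TAB>weight<TAB>form1,form2,..."
    The third column may be missing for words without inflected forms.
    """
    scores: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        lemma = parts[0].split("|", 1)[0]  # drop the POS suffix
        weight = float(parts[1])
        forms: List[str] = [lemma]
        if len(parts) > 2 and parts[2]:
            forms.extend(parts[2].split(","))
        for form in forms:
            scores[form.lower()] = weight
    return scores


def score_tokens(tokens: Iterable[str], scores: Dict[str, float]) -> float:
    """Sum the weights of all tokens; unknown tokens count as 0.0."""
    return sum(scores.get(token.lower(), 0.0) for token in tokens)


# Illustrative lines in the assumed format (weights are invented).
SAMPLE_LINES = [
    "gut|ADJX\t0.5\tgute,guter,gutes",
    "schlecht|ADJX\t-0.5\tschlechte,schlechter",
]

sample_scores = parse_sentiws_lines(SAMPLE_LINES)
sentence_score = score_tokens(["Das", "Essen", "war", "gut"], sample_scores)
```

Summing per-token weights is only the simplest possible aggregation; it ignores negation and word sense, so treat it as a starting point rather than a description of how lateco scores text.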