Quickstart

If the library is installed and a corpus in wp-2022 format is stored somewhere on your hard disk, you can iterate over all elements of the corpus with a few lines of code:

import lateco.Items as items

# Path to the corpus:
c = items.Corpus("/data/wp-2022/annotated/") 

# subcorpora is a list of corpus files found in the path above:
subcorpora = c.getSubcorpora()

tokenCount = 0
sentenceCount = 0
articleCount = 0

# Let's iterate over the corpus structure:
for subcorpus in subcorpora:
    # This line does most of the work:
    su = items.Subcorpus(subcorpus)

    # Iterate over all articles in the current subcorpus:
    for article in su.articles:
        articleCount += 1
        # The sentences:
        for sentence in article.sentences:
            sentenceCount += 1
            # Last but not least, count the tokens:
            tokenCount += len(sentence.tokens)

print("Number of Tokens:", tokenCount)
print("Number of Sentences:", sentenceCount)
print("Number of Articles:", articleCount)
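
If you need the counts in more than one place, the nested loop can be wrapped in a small helper. The following is only a sketch: the function count_items and the classes Sentence, Article and Subcorpus are hypothetical stand-ins that merely mimic the attribute names used above (articles, sentences, tokens). They are not part of the library and exist only so that the example runs without a corpus on disk.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


# Hypothetical stand-ins mimicking the attributes used in the quickstart;
# these classes are NOT the library's own.
@dataclass
class Sentence:
    tokens: List[str] = field(default_factory=list)


@dataclass
class Article:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Subcorpus:
    articles: List[Article] = field(default_factory=list)


def count_items(subcorpora: Iterable[Subcorpus]) -> Tuple[int, int, int]:
    """Return (articles, sentences, tokens) for objects shaped like the loop above."""
    article_count = sentence_count = token_count = 0
    for subcorpus in subcorpora:
        for article in subcorpus.articles:
            article_count += 1
            for sentence in article.sentences:
                sentence_count += 1
                token_count += len(sentence.tokens)
    return article_count, sentence_count, token_count


# A tiny in-memory corpus: two subcorpora, three articles (one of them empty).
demo = [
    Subcorpus([Article([Sentence(["A", "small", "test", "."]), Sentence(["Done", "."])])]),
    Subcorpus([Article([Sentence(["One", "more", "."])]), Article([])]),
]

print(count_items(demo))
```

Assuming the real objects expose the same attributes as in the listing above, the helper could presumably be fed with a generator such as (items.Subcorpus(s) for s in c.getSubcorpora()), so that only one subcorpus is held in memory at a time.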