Quickstart
If you have the library installed and a corpus in wp-2022 format stored on your hard disk, you can iterate over all elements of the corpus with a few lines of code:
import lateco.Items as items
# Path to the corpus:
c = items.Corpus("/data/wp-2022/annotated/")
# subcorpora is a list of corpus files found in the path above:
subcorpora = c.getSubcorpora()
tokenCount = 0
sentenceCount = 0
articleCount = 0
# Let's iterate over the corpus structure:
for subcorpus in subcorpora:
    # In this line most of the work is done:
    su = items.Subcorpus(subcorpus)
    # Iterate over all articles in the current subcorpus:
    for article in su.articles:
        articleCount = articleCount + 1
        # The sentences:
        for sentence in article.sentences:
            sentenceCount = sentenceCount + 1
            # Last but not least, counting the tokens:
            tokenCount = tokenCount + len(sentence.tokens)
# Print the totals once the whole corpus has been traversed:
print("Number of Tokens: ", tokenCount)
print("Number of Sentences: ", sentenceCount)
print("Number of Articles: ", articleCount)
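The counting loop above can also be wrapped in a small reusable function. The following is only a sketch: the classes Sentence, Article and Subcorpus defined here are hypothetical stand-ins that mimic the attribute names used in the quickstart (articles, sentences, tokens) so the example runs without the library or a corpus on disk. They are not the lateco classes, and count_items is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


# Hypothetical stand-ins that only mimic the attribute names used in the
# quickstart (.articles, .sentences, .tokens); NOT the lateco classes.
@dataclass
class Sentence:
    tokens: List[str] = field(default_factory=list)


@dataclass
class Article:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Subcorpus:
    articles: List[Article] = field(default_factory=list)


def count_items(subcorpora: Iterable[Subcorpus]) -> Tuple[int, int, int]:
    """Return (articles, sentences, tokens) for already loaded subcorpora."""
    article_count = sentence_count = token_count = 0
    for subcorpus in subcorpora:
        for article in subcorpus.articles:
            article_count += 1
            for sentence in article.sentences:
                sentence_count += 1
                token_count += len(sentence.tokens)
    return article_count, sentence_count, token_count


# Tiny hand-made example: one subcorpus, two articles, three sentences.
demo = [
    Subcorpus(articles=[
        Article(sentences=[
            Sentence(["A", "short", "sentence", "."]),
            Sentence(["Another", "one", "."]),
        ]),
        Article(sentences=[Sentence(["Done", "."])]),
    ])
]

articles, sentences, tokens = count_items(demo)
print("Number of Articles:", articles)
print("Number of Sentences:", sentences)
print("Number of Tokens:", tokens)
```

With the real library, the same function could presumably be fed a generator such as (items.Subcorpus(s) for s in c.getSubcorpora()), assuming the loaded objects expose articles, sentences and tokens exactly as in the quickstart. A generator would keep only one subcorpus in memory at a time.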