As you can see in the Quickstart example, the Items module offers objects for reading in a corpus in WP-2022 format. In hierarchical order, top down, these are:
The Corpus object represents a whole corpus, stored in a folder somewhere on your storage system. It takes the path to this folder as its init argument.
Make sure that all files in the given directory are in wp-2022 format. The Corpus object is just a wrapper class for these files; it does not validate their content!
- init: takes a path to a folder as argument
- getSubcorpora: recursively reads all files in that directory into a list, interpreting every file as a subcorpus
- getName, setName: useful to name your corpus
- getFolderList: returns a list of all subfolders of the start directory. Useful for parallelizing an analysis.
- getFolders: same as getFolderList. One could be deleted :-)
- getSubcorporaOfFolder: returns a list of the files in a subfolder
- getArticleByPosition: takes the position of an article and returns the article as object
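The recursive file collection described for getSubcorpora and the folder listing described for getFolderList can be sketched with the standard library alone. This is only an illustration, not the module's actual code: the helper names `list_subcorpus_files` and `list_subfolders`, the folder names `AA`/`AB`, and the file names `wiki_00`/`wiki_01` are made up, and it is an assumption that "subfolders" means the direct children of the start directory.

```python
import os
import tempfile


def list_subcorpus_files(corpus_dir):
    """Recursively collect every file path below corpus_dir (sorted for a stable order)."""
    paths = []
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)


def list_subfolders(corpus_dir):
    """Return the direct subfolders of corpus_dir (assumption: no deeper nesting)."""
    return sorted(entry.path for entry in os.scandir(corpus_dir) if entry.is_dir())


# Build a throwaway folder layout to try the helpers on (names are invented).
with tempfile.TemporaryDirectory() as corpus_dir:
    for folder, filename in [("AA", "wiki_00"), ("AA", "wiki_01"), ("AB", "wiki_00")]:
        os.makedirs(os.path.join(corpus_dir, folder), exist_ok=True)
        with open(os.path.join(corpus_dir, folder, filename), "w", encoding="utf-8") as handle:
            handle.write("")
    files = [os.path.relpath(p, corpus_dir) for p in list_subcorpus_files(corpus_dir)]
    folders = [os.path.basename(p) for p in list_subfolders(corpus_dir)]

print(files)
print(folders)
```

Listing the folders separately is what makes parallelization straightforward: each folder can be handed to its own worker, which then only reads the files inside it.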
The Subcorpus object is the basis for reading in corpus data. Its init argument is a path to a file in wp-2022 format, for example an entry of the list returned by getSubcorpora of the Corpus object. It reads the file and builds the corpus hierarchy from articles down to tokens. Because the article is the highest corpus element in a subcorpus, the Subcorpus object provides a list of Article objects once the file has been read.
- init: takes a path to a (subcorpus) file in wp-2022 format and reads the whole subcorpus structure into objects
- getName: returns the name of the subcorpus (really necessary?)
- bagOfWords: calls the bagOfWords method of all articles in the articles list and creates one bag of words for the whole subcorpus
- getArticleByID: returns the article object with the corresponding corpus ID
- articles: a list of article objects, representing the articles in the subcorpus
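The way a subcorpus-level bag of words can be built from the article-level ones might look roughly like the sketch below. The classes `SketchArticle` and `SketchSubcorpus` are hypothetical stand-ins, not the real Items classes, and it is an assumption that a bag of words is a mapping from token text to frequency (here a `collections.Counter`). Whether the real method counts surface forms or lemmas, or filters stop words, is not stated above.

```python
from collections import Counter


class SketchArticle:
    """Hypothetical stand-in for an article: sentences are plain lists of token strings."""

    def __init__(self, sentences):
        self.sentences = sentences

    def bagOfWords(self):
        # Count every token of every sentence in this article.
        bag = Counter()
        for sentence in self.sentences:
            bag.update(sentence)
        return bag


class SketchSubcorpus:
    """Hypothetical stand-in for a subcorpus holding a list of articles."""

    def __init__(self, articles):
        self.articles = articles

    def bagOfWords(self):
        # Merge the per-article counts into one bag for the whole subcorpus.
        total = Counter()
        for article in self.articles:
            total.update(article.bagOfWords())
        return total


subcorpus = SketchSubcorpus([
    SketchArticle([["der", "Hund", "bellt"], ["der", "Hund", "schläft"]]),
    SketchArticle([["die", "Katze", "schläft"]]),
])
bag = subcorpus.bagOfWords()
print(bag.most_common(3))
```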
The Article object represents an article as part of a subcorpus.
- bagOfWords: creates a bag of words for the article
- display: prints the metadata and the text of the article
- id: the corpus ID of the article
- url: a URL; in the wp-2022 corpus, it points to the original Wikipedia article
- title: the title of the article
- sentences: a list of Sentence objects, representing the sentences of the article
The Sentence object represents a sentence as part of an Article object.
- init: creates an empty tokens list
- str: displays all tokens of the sentence
- displayAsTokens: prints the text of all tokens
- bagOfWords: creates a bag of words for the sentence
- toString: returns a string representation of all tokens
- tokens: a list of all Token objects in the sentence
The Token class represents the basic element of a corpus - the token.
- str: returns a string representation of the current Token instance
- token: the plain text representation of the token, as it is written down in the original text
- lemma: the lemmatized version of the text
- pos: the part of speech tag of the token (STTS for the wp-2022 corpus)
- isAlpha: set to true if the token is an alphanumeric string, false if not
- isStop: set to true if the token is a stop word, false if not
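To make the token attributes more concrete, here is a small sketch of how they could be used to filter a sentence down to its content lemmas. `SketchToken` is a hypothetical stand-in with the attribute names listed above, not the real Token class. The example values, including the lemmas and the STTS tags, are invented for illustration, and it is an assumption that str returns the plain token text.

```python
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Hypothetical stand-in mirroring the attributes documented for Token."""

    token: str     # surface form as written in the text
    lemma: str     # lemmatized form
    pos: str       # part of speech tag (STTS in this example)
    isAlpha: bool  # flag as described in the attribute list
    isStop: bool   # True for stop words

    def __str__(self):
        # Assumption: the string representation is the surface form.
        return self.token


tokens = [
    SketchToken("Die", "die", "ART", True, True),
    SketchToken("Hunde", "Hund", "NN", True, False),
    SketchToken("bellen", "bellen", "VVFIN", True, False),
    SketchToken(".", ".", "$.", False, False),
]

# Keep only lemmas of tokens that are flagged isAlpha and are not stop words.
content_lemmas = [t.lemma for t in tokens if t.isAlpha and not t.isStop]
text = " ".join(str(t) for t in tokens)
print(content_lemmas)
print(text)
```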
The Ner class represents the named entities of a corpus.
In the wp-2022 corpus definition, named entities are not stored alongside the syntactic structures of the corpus. Instead, they have their own format and are stored in separate folders / files.
- init: takes a path to a folder which holds the NE annotations of a wp-2022 corpus
- getNERbyArticle: takes the position of an article in a wp-2022 corpus and returns all named entities of that article. The return value is a dictionary with the keys LOC, PERS, ORG and MISC, representing the four kinds of named entity annotation. Each value is itself a dictionary with the named entities as keys and their frequencies as values.
The position passed to getNERbyArticle is not a corpus position; instead, it is just a string representing the position in the corpus ... some work to be done here :-)
- basePath: Path to a folder with named entity annotations
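The nested dictionary returned by getNERbyArticle can be pictured as in the sketch below. The entities and counts are invented, the helper functions `most_frequent` and `total_mentions` are hypothetical and not part of the Ner class, and it is an assumption that a category without any entities shows up as an empty dictionary.

```python
# Invented example of the documented shape: category -> {entity: frequency}.
ner_result = {
    "LOC": {"Berlin": 3, "Spree": 1},
    "PERS": {"Angela Merkel": 2},
    "ORG": {},
    "MISC": {"Bundesliga": 1},
}


def most_frequent(result, kind):
    """Return the (entity, frequency) pair with the highest count, or None if empty."""
    entities = result.get(kind, {})
    if not entities:
        return None
    return max(entities.items(), key=lambda item: item[1])


def total_mentions(result):
    """Sum the frequencies per category."""
    return {kind: sum(freqs.values()) for kind, freqs in result.items()}


top_location = most_frequent(ner_result, "LOC")
totals = total_mentions(ner_result)
print(top_location)
print(totals)
```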
The Position object represents the relative path to a corpus element in the hierarchy of a corpus in wp-2022 format.
- getArticle: returns the article of the current position
- getSentence: returns the sentence of the current position
- toString: returns the position of the corpus entity as string