Work plan

Written language will be analysed primarily using the Kres corpus, along with comparative data from Gigafida. Kres is a balanced reference corpus of 100 million words sampled from Gigafida. The compilation and structure of the corpora (linguistic annotation, metadata, etc.) are described in Logar Berginc et al. 2012. All corpora are segmented, tokenised, lemmatised and morpho-syntactically tagged with the Obeliks annotation tool (Grčar et al. 2012). Kres was also parsed with MSTParser (Rupnik et al. 2012) and imported into the Sketch Engine tool, which enables the use of various statistical processing methods, most notably "word sketches".

Spoken language will be analysed using the Gos corpus, which contains 1 million words of transcribed Slovene speech. The corpus is balanced in terms of demographic, regional, gender and other criteria, as described in Verdonik and Zwitter Vitez 2011. In addition, the SST corpus (Dobrovoljc 2016), which has manual syntactic dependency annotations, will be used for various tasks.

The project work will be organised according to three topics in five work packages:

TOPIC 1. Morphology and word formation - words and parts of words

TOP