Within the project, we developed the linguistic methodological foundations for a comprehensive analysis of written and spoken Slovene, as found in the new corpora developed in recent projects. In doing so, we provided a sound basis for future work on an empirically based description of Slovene. Based on the new methodology, we compiled and made publicly available extensive corpus datasets, which will be useful for the development of language technology applications for Slovene. The extracted data will be used for a linguistic analysis of real language use, which represents the first step towards the compilation of a new descriptive corpus grammar of Slovene.
The project started from the observation that over the last three decades language description has undergone a noticeable paradigm shift: from researching language as a system, at the level of phonology and (morpho)syntax, to a more empirically oriented description that aims to capture how language works in real life and is linked to fields such as psychology, neurobiology and artificial intelligence. Research within the new paradigm requires reliable empirical data about different language phenomena. Such data are provided by modern computational or corpus linguistics, which uses automatic methods to analyse extensive collections of written and spoken language data.
The project work plan was divided into several work packages whose titles reflect the types of corpus analysis: Morphology and word formation, Collocations, Multi-word expressions, Valency, and Formulaic sequences. Written language was analysed primarily using the reference corpus of written standard Slovene, Gigafida, together with comparable data from the manually annotated training corpus ssj500k. Spoken language was analysed using the reference corpus of spoken Slovene, GOS. All extracted datasets, programs and algorithms are published under open-access or open-source licences and are structured so as to be useful for language technology applications.
Written language was analysed primarily using Gigafida, the segmented, tokenised, lemmatised and morpho-syntactically tagged reference corpus of written standard Slovene (Krek et al. 2020b). Spoken language was analysed using GOS, the reference corpus of spoken Slovene (Verdonik & Zwitter Vitez 2011). The project work is organised into three topics comprising five work packages, which represent different types of corpus analysis:
TOPIC 1. Morphology and word formation - words and parts of words
TOPIC 2. Phrase- and sentence-level syntax
- Work package 2: Collocations - methodology and data
- Work package 3: Multi-word expressions (MWEs) - methodology and data
- Work package 4: Valency - methodology and data
TOPIC 3. Formulaic sequences