New grammar of contemporary standard Slovene: sources and methods

National project J6-8256

The project aims to explore the linguistic and methodological foundations of a complex analysis of written and spoken Slovene, as found in the new corpora developed in recent projects. The resulting methodology and data will provide a sound foundation for future work on an empirically based description of Slovene. Following from the methodology, we intend to compile and publish extensive collections of material extracted from the corpora, which will be useful for the development of language technology applications for Slovene. The extracted data will be used for a linguistic analysis of real language, which represents the first step towards the compilation of a new descriptive corpus grammar of Slovene.

The project proposal is based on the fact that in the last three decades language description has witnessed a noticeable paradigm shift from researching language as a system, on the level of phonology and (morpho)syntax, to a more empirically oriented language description which aims to describe the workings of language in real life, and is linked to fields such as psychology, neurobiology, artificial intelligence, etc. To enable research within the new paradigm, reliable empirical data about different language phenomena are needed. Such data are provided by modern computational or corpus linguistics, which uses automatic methods to analyse extensive collections of written and spoken language data.

The project work plan is divided into several work packages whose titles reveal the types of proposed corpus analyses: Morphology and word formation, Collocations, Multi-word expressions, Valency and Formulaic sequences. Written language will be analysed primarily by using the Kres corpus, along with comparative data from Gigafida. Kres is a reference corpus of 100 million words, sampled and balanced from Gigafida. Spoken language will be analysed by using the Gos corpus, containing 1 million words of transcribed Slovene speech, and the SST corpus with manual syntactic dependency annotations. All extracted data collections, programs and algorithms will be published under open access or open source licenses, and structured with the intent to be useful for language technology applications.