Morphology and word formation

Goals

  • Improvement of the Slovene tagset (JOS) and linguistic annotation tool
  • Development of a methodology for a grammatical description of morphology and word formation in modern standard Slovene
  • Statistical analysis of morphological and word formation processes in modern standard Slovene

Software tools for corpus data preparation

We have developed two freely available software tools for corpus data preparation. The Q-CAT tool is intended for linguistic annotation and analysis of corpus texts at the morphological and higher annotation levels, such as multiword expressions, syntax and semantics. The LIST tool enables statistical processing of large corpora and exports language data at the levels of word parts, words and word sets. We prepared a user manual for both programs and presented it to the scientific community at a project event.

 

Statistically sorted corpus data

Using the LIST tool, we prepared data exports from Gigafida 2.0, the reference corpus of written standard Slovene, and GOS, the reference corpus of spoken Slovene. The exports are openly available in the CLARIN.SI repository and provide frequency-sorted corpus data at various levels, from individual letter occurrences, word parts, word forms and lemmas to word strings/sets of various lengths. The content and formatting of the data are described in detail in the publication A Guide to Frequency Lists from the Gigafida 2.0 and GOS 1.0 Corpora, which is available in Slovene and English.

  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). A Guide to Frequency Lists from the Gigafida 2.0 and GOS 1.0 Corpora. Ljubljana: Ljubljana University Press, Faculty of Arts. https://doi.org/10.4312/9789610604006.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1365.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1363.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Consonant-vowel structures in the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1367.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Frequency lists of word parts from the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1366.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Frequency lists of words from the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1364.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Consonant-vowel structures in the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1289.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2019). Frequency lists of character-level n-grams from the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1272.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2019). Frequency lists of words from the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1273.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2019). Frequency lists of word-level n-grams from the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1274.
  • Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2019). Frequency lists of word parts from the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1275.
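To illustrate the kinds of data these lists contain, the following is a minimal Python sketch that derives frequency-sorted words, word-level n-grams, character-level n-grams and consonant-vowel structures from a toy sentence. It is not the LIST tool and does not reproduce its output format. All function names are hypothetical, and the five-letter vowel set is a simplifying assumption (it ignores, for instance, syllabic r in Slovene).

```python
from collections import Counter

# Simplifying assumption: only these letters count as vowels.
VOWELS = set("aeiou")

def char_ngrams(word, n):
    """Return all character-level n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def word_ngrams(tokens, n):
    """Return all word-level n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cv_structure(word):
    """Map each letter to C (consonant) or V (vowel), e.g. 'lep' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in word.lower() if ch.isalpha())

def frequency_list(items):
    """Count items and sort by descending frequency, then alphabetically."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy input standing in for a tokenized corpus.
tokens = "danes je lep dan in jutri bo lep dan".split()

word_freq = frequency_list(tokens)
bigram_freq = frequency_list(word_ngrams(tokens, 2))
char_freq = frequency_list(g for t in tokens for g in char_ngrams(t, 2))
cv_freq = frequency_list(cv_structure(t) for t in tokens)
```

Each resulting list pairs an item with its frequency, with the most frequent items first, which mirrors the general idea of a frequency-sorted export rather than the actual file layout described in the guide.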

 

Database of corpus-driven morphological and word-formation information

Two important results of the project are databases of morphological and word-formation information, both based on Sloleks 2.0, the lexicon of Slovene word forms. The first contains 96,290 lexicon entries (nouns, verbs, adjectives, and adverbs), each assigned a code for the morphological pattern by which it is inflected. The preparation methodology was presented in a scientific article that describes the initial set of patterns for nouns. The second database contains 66,347 pairs of lexicon entries that were identified as related by word formation, using an automated process and a manually composed ruleset. Both databases are valuable for linguistic analyses and for developing systems that support the automatic organization of new lexicon units based on corpus data.

  • Arhar Holdt, Špela; Čibej, Jaka; Laskowski, Cyprian; Krek, Simon (2020). Morphological patterns from the Sloleks 2.0 lexicon 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1411.
  • Čibej, Jaka; Arhar Holdt, Špela; Krek, Simon (2020). List of word relations from the Sloleks 2.0 lexicon 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1386.
  • Arhar Holdt, Špela; Čibej, Jaka (2018). ”Oblikoslovni vzorci v leksikonu Sloleks: izhodiščni nabor za samostalnike”. Slovnične raziskave za jezikovni opis. Slovenščina 2.0, Thematic issue, vol. 6, issue 2, pg. 33–66. https://www.dlib.si/details/URN:NBN:SI:DOC-C6R9113Q.
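The idea of pairing lexicon entries through a manually composed ruleset can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to build the database: the two suffix rules, the toy lexicon and the function name are all hypothetical, and a real ruleset would also need to handle part-of-speech constraints and stem alternations.

```python
# Hypothetical rules: (suffix to strip from the base, suffix to add).
RULES = [
    ("ati", "anje"),  # verb -> verbal noun, e.g. pisati -> pisanje
    ("", "ost"),      # adjective -> noun, e.g. mlad -> mladost
]

def find_word_formation_pairs(lemmas, rules):
    """Return (base, derived) pairs where both lemmas are in the lexicon
    and the derived form follows from the base by one of the rules."""
    lexicon = set(lemmas)
    pairs = []
    for base in sorted(lexicon):
        for strip, add in rules:
            if not base.endswith(strip):
                continue
            # Slice by length so that an empty suffix leaves the base intact.
            stem = base[:len(base) - len(strip)]
            derived = stem + add
            if derived != base and derived in lexicon:
                pairs.append((base, derived))
    return pairs

# Toy lexicon standing in for the list of lexicon entries.
lemmas = ["pisati", "pisanje", "mlad", "mladost", "brati", "miza"]
pairs = find_word_formation_pairs(lemmas, RULES)
```

Only candidates whose derived form is actually attested in the lexicon are kept, so a rule that overgenerates (for example, producing a non-existent noun from "miza") yields no pair.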

 

Training corpus for Slovene

During the project, we improved the ssj500k training corpus, which is the basis for the development of annotation tools for different levels, from tokenization, segmentation, and lemmatization, to morphosyntax, multiword expressions, named entities, dependency syntax and participant roles. The 2.2 version of the training corpus is available on the CLARIN.SI repository. We presented it in two conference papers.

  • Krek, Simon; Erjavec, Tomaž; Dobrovoljc, Kaja; Gantar, Polona; Arhar Holdt, Špela; Čibej, Jaka; Brank, Janez (2020). ”The ssj500k training corpus for Slovene language processing”. IN: Darja Fišer, Tomaž Erjavec (eds.): Language Technologies & Digital Humanities: conference proceedings, pg. 24–33. Ljubljana: Institute of Contemporary History. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_Krek-et-al_The-ssj500k-Training-Corpus-for-Slovene-Language-Processing.pdf.
  • Bon, Mija; Gantar, Polona (2019). ”Levels of annotation in the Slovene Training Corpus ssj500k 2.2”. Jazykovedný časopis, 10th International Conference NLP, Corpus Linguistics, Language Dynamics and Change, Bratislava, Slovakia, vol. 70, issue 2, pg. 390–399. https://doi.org/10.2478/jazcas-2019-0068.
  • Krek, Simon; Dobrovoljc, Kaja; Erjavec, Tomaž; Može, Sara; Ledinek, Nina; Holz, Nanika; Zupan, Katja; Gantar, Polona; Kuzman, Taja; Čibej, Jaka; Arhar Holdt, Špela; Kavčič, Teja; Škrjanec, Iza; Marko, Dafne; Jezeršek, Lucija; Zajc, Anja (2019). Training corpus ssj500k 2.2, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1210.

 

Development of automatic morphological tagging

Based on the improved training set and analyses of existing grammatical taggers, we developed a new tool for lemmatization and morphological analysis of Slovene texts (i.e. the meta-tagger), which was also used for automatic tagging of the reference corpus of written Slovene Gigafida 2.0. Given the sharp rise of neural-network-based machine-learning methods during the project, we conducted a pilot evaluation of these methods for Slovene as a starting point for further related research. We then upgraded them by integrating the Sloleks morphological lexicon and investigated their effectiveness for machine-assisted tagging of spoken-language corpora.

  • Krek, Simon; Arhar Holdt, Špela; Erjavec, Tomaž; Čibej, Jaka; Repar, Andraž; Gantar, Polona; Ljubešić, Nikola; Kosem, Iztok; Dobrovoljc, Kaja (2020). ”Gigafida 2.0: The Reference Corpus of Written Standard Slovene”. IN: Nicoletta Calzolari (ed.): LREC 2020: Twelfth International Conference on Language Resources and Evaluation: conference proceedings, pg. 3340–3345. Paris: ELRA – European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.409.
  • Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola (2019). ”Improving UD processing via satellite resources for morphology”. IN: UDW 2019, Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), conference proceedings, pg. 24–34. Paris, France. Stroudsburg: Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-8004/.
  • Dobrovoljc, Kaja; Martinc, Matej (2018). ”Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing”. IN: Marie-Catherine de Marneffe, Teresa Lynn, Sebastian Schuster (eds.): Second Workshop on Universal Dependencies (UDW 2018), conference proceedings, pg. 37–46. Brussels. Stroudsburg: Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-6005/.
  • Ljubešić, Nikola; Dobrovoljc, Kaja (2019). ”What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian”. IN: Tomaž Erjavec et al. (ed.): 7th Workshop on Balto-Slavic Natural Language Processing, conference proceedings, pg. 29–34. Florence, Italy. Stroudsburg: The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-3704/.
  • Ljubešić, Nikola (2018). Meta-tagger, programming code on GitHub, Slovenian language resource repository CLARIN.SI, https://github.com/clarinsi/meta-tagger.

 

Interproject cooperation

The knowledge and language resources obtained through the project were put to use in the Kauč project, specifically in developing and testing readability measures for Slovene texts.
