Multiword expressions (MWEs)

Goals

  • Development of a methodology for the automatic identification and analysis of multiword expressions in modern standard Slovene
  • Computational analysis of multiword expressions in modern standard Slovene

Typology of multiword units

Building on earlier analyses carried out for the Lexical Database for Slovene (Gantar 2015), and taking into account the international guidelines for identifying different types of verbal multiword units, which were developed for 20 languages within the European project PARSEME (http://typo.uni-konstanz.de/parseme/), we developed a typology of multiword units (cf. Kosem et al. 2020; Gantar 2021, in press). The typology separates lexical, or so-called conceptual, multiword units, which carry meaning, from so-called structural multiword units, which are relevant primarily for natural language processing. For the purposes of compiling the computer database, the following types of multiword units were identified: a) multiword lexical units, which include a.1) phraseological units with subtypes (pragmatic, paremiological and citational), and a.2) terminological and non-terminological set phrases; and b) lexical-grammatical multiword units, which include b.1) collocations, including extended collocations, b.2) reflexive possessive verbs, b.3) verbs with prepositional morphemes, b.4) the large group of syntactic conjunctions, and b.5) phrases with verbs used in their weakened sense.

 

Methodology for automatic recognition of set phrases

As part of the project, we developed a methodology for automatically recognizing set phrases in a computer-processed source and for the machine acquisition of data on set phrases. The methodology is based on formalized decision trees in which structural, semantic, and syntactic tests are used to identify a word combination as a potential multiword unit and to assign it to the appropriate category. To improve recognition accuracy, we used an ensemble approach, which combines different basic approaches and automatically eliminates errors. The basic methods were syntactic methods, statistical methods, information-theoretic methods, and machine learning methods based on derived features. The features were the surrounding words, n-grams, morphosyntactic descriptions, n-gram tags, and the tests mentioned above, which take into account the specifics of Slovene.
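
The idea of combining several basic methods can be illustrated with a minimal sketch. It is not the project's implementation: it assumes plain token lists, only bigram candidates, and a simple majority vote over three scorers (pointwise mutual information as an information-theoretic measure, a t-score as a statistical measure, and a raw frequency cut-off). The function names and thresholds are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pmi(pair, unigrams, bigrams, n_tokens):
    """Pointwise mutual information of a bigram (information-theoretic)."""
    w1, w2 = pair
    p_xy = bigrams[pair] / (n_tokens - 1)
    p_x = unigrams[w1] / n_tokens
    p_y = unigrams[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


def t_score(pair, unigrams, bigrams, n_tokens):
    """t-score of a bigram: observed vs. expected co-occurrence (statistical)."""
    w1, w2 = pair
    observed = bigrams[pair]
    expected = unigrams[w1] * unigrams[w2] / n_tokens
    return (observed - expected) / math.sqrt(observed)


def ensemble_candidates(tokens, pmi_min=1.0, t_min=1.0, freq_min=2, min_votes=2):
    """Keep bigrams that at least `min_votes` of the three scorers accept.

    All thresholds are illustrative assumptions, not values from the project.
    """
    unigrams, bigrams = bigram_counts(tokens)
    n_tokens = len(tokens)
    selected = []
    for pair, freq in bigrams.items():
        votes = 0
        votes += int(pmi(pair, unigrams, bigrams, n_tokens) >= pmi_min)
        votes += int(t_score(pair, unigrams, bigrams, n_tokens) >= t_min)
        votes += int(freq >= freq_min)
        if votes >= min_votes:
            selected.append(pair)
    return selected
```

A vote across scorers means that a candidate favoured by only one measure, for example a rare pair with a high PMI, is dropped unless another measure agrees. A fuller system would add the syntactic and machine learning methods described above as further voters.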

  • Krek, Simon; Gantar, Polona (2021, in press). Mehanizem za luščenje in prepoznavanje VLE v korpusu.
  • Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko (2021, in press). Strojno prepoznavanje idiomov z globokimi nevronskimi mrežami.
  • Škvorc, Tadej; Robnik-Šikonja, Marko (2019). "Prepoznavanje idiomatskih besednih zvez z uporabo besednih vložitev". Uporabna Informatika, vol. 27, issue 3. https://uporabna-informatika.si/index.php/ui/article/view/63.

 

Lexicon of multiword units

The first version of the Lexicon of Multiword Units (Krek et al. 2021) contains 5,264 multiword units with phraseme characteristics. It was built on the basis of the Lexical Database for Slovene, the Dictionary of Slovene Phrases by J. Keber, and the ssj500k 2.0 training corpus. Each phraseological unit is assigned a) an appropriate syntactic structure, which is also defined in terms of the number and type of components, and b) information on the syntactic relations between its components, formulated according to the JOS model (Dobrovoljc et al. 2012). In addition, morphological constraints at the level of part of speech and other grammatical categories are ascribed to the components of a multiword unit, e.g. numbers, verbs, etc. Each phraseological unit, which represents an independent lexicon unit, is given a description of its meaning(s); within an individual meaning, links are provided to variant and transformationally related phraseological units, which are also represented in the lexicon as lexicon units. The lexicon serves as a starting point for further structural and semantic analyses of multiword units in modern Slovene and enables their inclusion in modern language resources.
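
The information attached to each lexicon unit can be pictured as a small record. The sketch below is an assumption made for illustration only: the class names, field names, the structure string and the relation labels are hypothetical and do not reproduce the published lexicon format or the actual JOS label set.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One component of a multiword unit."""
    lemma: str
    pos: str  # part of speech
    constraints: dict = field(default_factory=dict)  # morphological constraints


@dataclass
class Relation:
    """Syntactic relation between two components, given by their indices."""
    head: int
    dependent: int
    label: str  # placeholder label, not taken from the JOS model


@dataclass
class Sense:
    """One meaning, with links to variant or transformationally related units."""
    description: str
    related_units: list = field(default_factory=list)


@dataclass
class LexiconUnit:
    canonical_form: str
    structure: str  # hypothetical notation for the syntactic structure
    components: list
    relations: list
    senses: list

    def is_consistent(self):
        """Check that every relation points at two distinct existing components."""
        n = len(self.components)
        return all(
            0 <= r.head < n and 0 <= r.dependent < n and r.head != r.dependent
            for r in self.relations
        )


# Illustrative entry; the values were filled in by hand for this sketch.
unit = LexiconUnit(
    canonical_form="vreči puško v koruzo",
    structure="verb + noun + preposition + noun",
    components=[
        Component("vreči", "verb"),
        Component("puška", "noun", {"case": "accusative"}),
        Component("v", "preposition"),
        Component("koruza", "noun", {"case": "accusative"}),
    ],
    relations=[
        Relation(head=0, dependent=1, label="object"),
        Relation(head=0, dependent=3, label="adverbial"),
        Relation(head=3, dependent=2, label="preposition"),
    ],
    senses=[Sense("to give up")],
)
```

Storing relations as index pairs keeps the structure independent of the surface order of the components, which is convenient when the same unit appears in several variants.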

  • Krek, Simon; Gantar, Polona; Kosem, Iztok; Dobrovoljc, Kaja; Laskovski, Cyprian; Krsnik, Luka; Brank, Janez; Arhar Holdt, Špela; Čibej, Jaka; Robnik Šikonja, Marko; Klemenc, Bojan; Gorjanc, Vojko (2021). Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1421.
  • Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko (2020). Dataset of Slovene idiomatic expressions SloIE, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1335.

 

Analysis of multiword units in modern standard Slovene

Based on the multiword units extracted from the Gigafida 2.1 corpus, we carried out several linguistic analyses of the morphological, syntactic and semantic aspects of their behavior in real texts of modern Slovene. The structural analysis took into account the variability of the constituent elements of a multiword unit at both the lexical and the morphological level, the possibility of non-lexicalized elements intruding into its structure, and the semantic connection between the different variant and transformational realizations of multiword units. Based on these analyses, we developed rules for the canonical form of a multiword unit in the Lexicon and a system for integrating multiword units into a computerized structure.
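
The intrusion of non-lexicalized elements can be made concrete with a short sketch. It assumes a sentence that has already been lemmatized, components that occur in their canonical order, and a hypothetical `max_gap` limit on the number of tokens allowed between two adjacent components. It is a simplified illustration and not the analysis procedure used in the project; in particular, it does not handle variation in word order.

```python
def find_unit(sentence_lemmas, unit_lemmas, max_gap=2):
    """Find occurrences of a unit whose components appear in order,
    with at most `max_gap` other tokens between adjacent components.

    Returns a list of index tuples, one per occurrence.
    """
    matches = []

    def extend(comp_idx, path):
        if comp_idx == len(unit_lemmas):
            matches.append(tuple(path))
            return
        if path:
            lo = path[-1] + 1
            hi = min(len(sentence_lemmas), lo + max_gap + 1)
        else:
            # The first component may start anywhere in the sentence.
            lo, hi = 0, len(sentence_lemmas)
        for i in range(lo, hi):
            if sentence_lemmas[i] == unit_lemmas[comp_idx]:
                extend(comp_idx + 1, path + [i])

    extend(0, [])
    return matches


def intrusions(sentence_lemmas, match):
    """Return the tokens inside the matched span that are not components."""
    return [
        sentence_lemmas[i]
        for i in range(match[0], match[-1] + 1)
        if i not in match
    ]


# Hand-lemmatized example sentence: "Ne bo vrgel svoje puške v koruzo."
sentence = ["ne", "biti", "vreči", "svoj", "puška", "v", "koruza"]
unit = ["vreči", "puška", "v", "koruza"]
found = find_unit(sentence, unit)
inserted = [intrusions(sentence, m) for m in found]
```

Here the possessive "svoj" sits between the verb and its object, so the unit is found only when at least one intervening token is allowed. Counting such intrusions over a corpus is one way to measure how fixed a unit's structure is.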

  • Gantar, Polona (2021, in press). Zapis frazeoloških enot v Leksikonu večbesednih enot za slovenščino.
  • Gantar, Polona; Krek, Simon; Kuzman, Taja (2017). "Verbal multiword expressions in Slovene". In: Ruslan Mitkov (ed.): Computational and corpus-based phraseology: conference proceedings, pp. 247–259. Cham: Springer. https://link.springer.com/chapter/10.1007/978-3-319-69805-2_18.