- Development of a methodology for the extraction and description of formulaic sequences in Slovene
- Analysis of the most frequent lexical bundles in Slovene
Formulaic-sequences extraction software
Based on the evaluation of different approaches to formulaic-sequences extraction (Dobrovoljc 2018), we implemented a module for the extraction of the most frequently recurring formulaic sequences and for their classification according to different association measures of collocability (frequency, Dice, MI, MI3, LL, t-test). The module is part of the LIST tool (Krsnik et al. 2019), a computer program for creating frequency lists from text corpora.
- Krsnik, Luka; Arhar Holdt, Špela; Čibej, Jaka; Dobrovoljc, Kaja; Ključevšek, Aleksander; Krek, Simon; Robnik-Šikonja, Marko (2019). Corpus extraction tool LIST 1.2, Slovenian language resource repository CLARIN.SI, https://www.clarin.si/repository/xmlui/handle/11356/1276.
- Dobrovoljc, Kaja (2018). “N-gram Frequency Lists for Reference Corpora of Slovenian Language”. In: Darja Fišer, Andrej Pančur (eds.): Language Technologies & Digital Humanities: conference proceedings, pg. 47–54. Ljubljana: Ljubljana University Press, Faculty of Arts. http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Dobrovoljc-K_Frekvencni-seznami-n-gramov-v-korpusih-slovenskega-jezika.pdf.
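The association measures named for the extraction module (frequency, Dice, MI, MI3, LL, t-test) can be illustrated for the simplest case of a two-word sequence. The sketch below uses the common textbook formulations based on observed and expected co-occurrence frequency; it is not taken from the LIST source code, and the function name, parameter names, and example counts are hypothetical. How LIST generalizes these measures to sequences longer than two words is not covered here.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Score a two-word sequence with common association measures.

    f_xy: frequency of the sequence "x y" (hypothetical input)
    f_x, f_y: frequencies of the individual words
    n: total number of two-word positions in the corpus
    """
    if f_xy <= 0:
        raise ValueError("the sequence must occur at least once")

    # Co-occurrence frequency expected if x and y were independent.
    expected = f_x * f_y / n

    dice = 2 * f_xy / (f_x + f_y)
    mi = math.log2(f_xy / expected)
    mi3 = math.log2(f_xy ** 3 / expected)
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Log-likelihood over the 2x2 contingency table of x / not-x, y / not-y.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    row_totals = [f_x, f_x, n - f_x, n - f_x]
    col_totals = [f_y, n - f_y, f_y, n - f_y]
    log_likelihood = 2 * sum(
        o * math.log(o * n / (r * c))
        for o, r, c in zip(observed, row_totals, col_totals)
        if o > 0  # empty cells contribute nothing to the sum
    )

    return {
        "frequency": f_xy,
        "dice": dice,
        "mi": mi,
        "mi3": mi3,
        "log_likelihood": log_likelihood,
        "t_score": t_score,
    }


# Hypothetical counts, for illustration only.
for name, value in association_measures(10, 20, 20, 1000).items():
    print(f"{name}: {value:.3f}")
```

Sorting candidate sequences by each of these scores in turn gives differently ordered lists, which is why a comparison of measures is useful: MI tends to favor rare but exclusive pairings, while frequency and the t-test favor common ones.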
List of the most frequent formulaic sequences in Slovene
Using the LIST tool, we exported lists of the most frequently recurring strings of two or more words at the level of word forms, lemmas, parts of speech, and morphosyntactic tags from two corpora: Gigafida 2.0, the reference corpus of written standard Slovene (Čibej et al. 2019), and GOS, the reference corpus of spoken Slovene (Čibej et al. 2020). These lists supplement the related initial lists from other reference corpora (Dobrovoljc 2018abcd).
- Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2020). Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1365.
- Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon (2019). Frequency lists of word-level n-grams from the Gigafida 2.0 corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1274.
- Dobrovoljc, Kaja (2018). Kres corpus n-grams 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1193.
- Dobrovoljc, Kaja (2018). Gos corpus n-grams 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1195.
- Dobrovoljc, Kaja (2018). IMP corpus n-grams 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1194.
- Dobrovoljc, Kaja (2018). Janes corpus n-grams 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1192.
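The idea behind the exported lists, counting recurring strings of two or more words at different annotation levels, can be sketched in a few lines. This is an illustration only and not the LIST implementation: the token representation (dictionaries with `form` and `lemma` keys), the function name, and the decision not to count sequences across sentence boundaries are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(sentences, level="form", n_min=2, n_max=3):
    """Count n-grams of length n_min..n_max at the given annotation level.

    sentences: list of sentences, each a list of token dictionaries
    level: which token attribute to read, e.g. "form" or "lemma"
    """
    counts = Counter()
    for sentence in sentences:
        units = [token[level] for token in sentence]
        for n in range(n_min, n_max + 1):
            # Slide a window of n units over the sentence only,
            # so no n-gram spans two sentences (an assumption).
            for start in range(len(units) - n + 1):
                counts[tuple(units[start:start + n])] += 1
    return counts


# Two toy sentences: "v tem primeru" and "v tem času".
toy_corpus = [
    [
        {"form": "v", "lemma": "v"},
        {"form": "tem", "lemma": "ta"},
        {"form": "primeru", "lemma": "primer"},
    ],
    [
        {"form": "v", "lemma": "v"},
        {"form": "tem", "lemma": "ta"},
        {"form": "času", "lemma": "čas"},
    ],
]

form_counts = ngram_counts(toy_corpus, level="form")
lemma_counts = ngram_counts(toy_corpus, level="lemma")
print(form_counts.most_common(3))
print(lemma_counts.most_common(3))
```

The same function covers parts of speech or morphosyntactic tags if the tokens carry those attributes, since only the `level` key changes.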
Linguistic analysis of the most relevant formulaic sequences and comparison of methods for their recognition
Approximately 2,000 of the statistically most prominent formulaic sequences in the Gigafida 2.0 and Gos 1.0 corpora were collected in a pilot lexicon of formulaic sequences in written or spoken Slovene (Dobrovoljc et al. 2020ab). In addition to exhaustive statistical data, we annotated them with information on syntactic structure, pragmatic function, and dictionary relevance (Dobrovoljc 2019). The resulting lexicons served as datasets for a more in-depth linguistic analysis of the types and uses of the statistically most prominent sequences in Slovene (Dobrovoljc 2018, Dobrovoljc 2021) and a comparison of different methods for their recognition in corpora (Dobrovoljc 2020).
- Dobrovoljc, Kaja (2021, in preparation). Leksikon formulaičnih besednih nizov v pisni in govorjeni slovenščini.
- Dobrovoljc, Kaja (2020). “Identifying dictionary-relevant formulaic sequences in written and spoken corpora”. International Journal of Lexicography, vol. 33, issue 4, pg. 417–442. https://doi.org/10.1093/ijl/ecaa008.
- Dobrovoljc, Kaja; Roblek, Rebeka; Vianello, Chiara; Diaci, Ajda; Vuga, Zala (2020). List of formulaic sequences in spoken Slovenian, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1279.
- Dobrovoljc, Kaja; Roblek, Rebeka; Vianello, Chiara; Diaci, Ajda; Vuga, Zala (2020). List of formulaic sequences in standard written Slovenian, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1280.
- Dobrovoljc, Kaja (2019). “Annotating formulaic sequences in spoken Slovenian: structure, function and relevance”. In: Annemarie Friedrich, Deniz Zeyrek, Jet Hoek (eds.): LAW XIII, The 13th Linguistic Annotation Workshop, conference proceedings, pg. 108–112. Firenze, Italy. Stroudsburg: The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-4013/.
- Dobrovoljc, Kaja (2018). “Formulaičnost v slovenskem jeziku”. Slovnične raziskave za jezikovni opis. Slovenščina 2.0, Thematic issue, vol. 6, issue 2, pg. 67–95. http://www.dlib.si/?URN=URN:NBN:SI:DOC-IYNQSMXC.