1. Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
In: Entropy; Volume 24; Issue 2; Pages: 280 (2022)

Abstract:
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
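
As a rough illustration of the measures named in the abstract, the following minimal Python sketch computes word frequency, word length in characters, and informativity given the previous word for a toy token list. It assumes unsmoothed maximum-likelihood bigram estimates and takes informativity to be the mean surprisal of a word over its occurrences. The function name `informativity_given_previous` and the toy sentence are illustrative only and are not taken from the paper, whose actual bigram processing may differ.

```python
import math
from collections import Counter, defaultdict


def informativity_given_previous(tokens):
    """Mean surprisal (in bits) of each word given the preceding word.

    Assumption: P(word | prev) is the unsmoothed maximum-likelihood
    estimate count(prev, word) / count(prev as a context).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    # A token counts as a context only when another token follows it.
    context_counts = Counter(tokens[:-1])

    surprisal_sum = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigrams.items():
        p = n / context_counts[prev]
        surprisal_sum[word] -= n * math.log2(p)
        occurrences[word] += n

    # The very first token has no preceding word, so that occurrence
    # does not contribute to any average.
    return {w: surprisal_sum[w] / occurrences[w] for w in occurrences}


tokens = "the cat sat on the mat".split()   # toy data, not from the paper
frequency = Counter(tokens)                  # word frequency
length = {w: len(w) for w in frequency}      # length in characters
info = informativity_given_previous(tokens)  # informativity given previous word
```

Informativity given the next word can be obtained under the same assumptions by calling the function on `tokens[::-1]`, since reversing the sequence turns each following word into a preceding one.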

Keywords:
corpora; frequency; informativity; linguistic typology; n-grams; Zipf’s law of abbreviation

URL: https://doi.org/10.3390/e24020280
BASE

3. Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach
In: Lexis: Journal in English Lexicology, Vol 19 (2022)

4. Meta-Learner for Amharic Sentiment Classification
In: Applied Sciences; Volume 11; Issue 18 (2021)

5. You are kidding right? The English present progressive as a stance marker in film dialogue ...

6. An interactive visualization of Google Books Ngrams with R and Shiny: exploring a(n) historical increase in onset strength in a(n) huge database

8. DIGITAL TECHNOLOGIES FOR GRAMMATICAL ERROR CORRECTION: DEEP LEARNING METHODS & SYNTACTIC N-GRAMS
In: Мова; No. 35 (2021); 2414-9489; 2307-4558

9. You are kidding right? The English present progressive as a stance marker in film dialogue
In: Lingue e Linguaggi; Volume 44 (2021); 183-202

10. Visualizing the development of prose styles in Horse Manuals from Early Modern English to Present-Day English
In: Journal of Data Mining and Digital Humanities, Episciences.org, 2020, Special Issue on Visualisations in Historical Linguistics, pp. 1-33; EISSN: 2416-5999; https://hal.archives-ouvertes.fr/hal-02283138

11. Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1

13. Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1

15. Visualizing the development of prose styles in Horse Manuals from Early Modern English to Present-Day English
In: Journal of Data Mining and Digital Humanities, Special issue on Visualisations in Historical Linguistics (2020)

16. An interactive visualization of Google Books Ngrams with R and Shiny: Exploring a(n) historical increase in onset strength in a(n) huge database
In: Journal of Data Mining and Digital Humanities, Special issue on Visualisations in Historical Linguistics (2020)

17. The necessity modals have to, must, need to and should: using n-grams to help identify common and distinct semantic and pragmatic aspects
In: Constructions and Frames, John Benjamins, 2019, 11 (2), pp. 220-243. ⟨10.1075/cf.00029.cap⟩; ISSN: 1876-1933; EISSN: 1876-1941; https://hal.archives-ouvertes.fr/hal-02369306

18. The necessity modals have to, must, need to and should: using n-grams to help identify common and distinct semantic and pragmatic aspects
In: Constructions and Frames, John Benjamins, 2019, 11 (2), pp. 220-243. ⟨10.1075/cf.00029.cap⟩; ISSN: 1876-1933; EISSN: 1876-1941; https://hal.archives-ouvertes.fr/hal-02501498

20. Dependency tree extraction tool STARK 1.0
Krsnik, Luka; Dobrovoljc, Kaja; Robnik-Šikonja, Marko. Centre for Language Resources and Technologies, University of Ljubljana; Faculty of Arts, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana, 2019