6 |
Modeling the Unigram Distribution ...
|
|
|
|
Abstract:
Read paper: https://www.aclanthology.org/2021.findings-acl.326 Abstract: The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (oov) word form. As a result, it produces negatively biased probabilities for any oov word form, while positively biased probabilities to in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution---claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïive use of neural character-level language models. ...
|
|
URL: https://dx.doi.org/10.48448/fx5z-4a29 https://underline.io/lecture/26417-modeling-the-unigram-distribution
|
|
BASE
|
|
Hide details
|
|
8 |
A surprisal--duration trade-off across and within the world's languages ...
|
|
|
|
BASE
|
|
Show details
|
|
10 |
What About the Precedent: An Information-Theoretic Analysis of Common Law ...
|
|
|
|
BASE
|
|
Show details
|
|
12 |
Finding Concept-specific Biases in Form–Meaning Associations ...
|
|
|
|
BASE
|
|
Show details
|
|
13 |
Modeling the Unigram Distribution
|
|
|
|
In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021)
|
|
BASE
|
|
Show details
|
|
14 |
What About the Precedent: An Information-Theoretic Analysis of Common Law
|
|
|
|
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)
|
|
BASE
|
|
Show details
|
|
15 |
Finding Concept-specific Biases in Form–Meaning Associations
|
|
|
|
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)
|
|
BASE
|
|
Show details
|
|
16 |
A Non-Linear Structural Probe
|
|
|
|
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)
|
|
BASE
|
|
Show details
|
|
17 |
Disambiguatory Signals are Stronger in Word-initial Positions
|
|
|
|
In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)
|
|
BASE
|
|
Show details
|
|
18 |
How (Non-)Optimal is the Lexicon?
|
|
|
|
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)
|
|
BASE
|
|
Show details
|
|
19 |
A Bayesian Framework for Information-Theoretic Probing
|
|
|
|
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021)
|
|
BASE
|
|
Show details
|
|
|
|