1. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
   In: https://hal.inria.fr/hal-03540069 ; 2022
2. SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection ...
3. SIGTYP 2020 Shared Task: Prediction of Typological Features ...
4. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness ...
5. Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset ...
6. Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model ...
8. Unsupervised Disambiguation of Syncretism in Inflected Lexicons ...
   Abstract: Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token's context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on 5 languages. Published at NAACL 2018.
   Keywords: Computation and Language (cs.CL); FOS: Computer and information sciences
   URL: https://arxiv.org/abs/1806.03740 ; https://dx.doi.org/10.48550/arxiv.1806.03740