
Search in the Catalogues and Directories

Hits 1 – 20 of 20

1
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
In: https://hal.inria.fr/hal-03540069 (2022)
Abstract: 15-page preprint. What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete, atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras, showing how hybrid approaches of words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications, and that thinking seriously about tokenization remains important for many applications.
Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
URL: https://hal.inria.fr/hal-03540069
BASE
2
Findings of the IWSLT 2020 Evaluation campaign
Niehues, Jan; Federico, Marcello; Ma, Xutai. - : Association for Computational Linguistics, 2022
BASE
3
KIT Lecture Translator: Multilingual Speech Translation with One-Shot Learning
Nguyen, Thai-Son; Zenkel, Thomas; Waibel, Alex. - : Association for Computational Linguistics, 2022
BASE
4
Tutorial: End-to-End Speech Translation
Negri, Matteo; Salesky, Elizabeth; Turchi, Marco. - : Association for Computational Linguistics, 2022
BASE
5
Assessing Evaluation Metrics for Speech-to-Speech Translation ...
BASE
6
Assessing Evaluation Metrics for Speech-to-Speech Translation ...
BASE
7
The Multilingual TEDx Corpus for Speech Recognition and Translation ...
BASE
8
Tutorial: End-to-End Speech Translation ...
Niehues, Jan; Salesky, Elizabeth; Turchi, Marco. - : Association for Computational Linguistics, 2021
BASE
9
A surprisal--duration trade-off across and within the world's languages ...
BASE
10
Assessing Evaluation Metrics for Speech-to-Speech Translation
In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2021)
BASE
11
Robust Open-Vocabulary Translation from Visual Text Representations ...
BASE
12
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection ...
BASE
13
SIGTYP 2020 Shared Task: Prediction of Typological Features ...
BASE
14
Findings of the IWSLT 2020 Evaluation campaign ...
Ansari, Ebrahim; Axelrod, Amittai; Bach, Nguyen. - : Association for Computational Linguistics, 2020
BASE
15
Generalized Entropy Regularization or: There’s Nothing Special about Label Smoothing ...
BASE
16
A Corpus for Large-Scale Phonetic Typology ...
BASE
17
A Corpus for Large-Scale Phonetic Typology ...
BASE
18
A corpus for large-scale phonetic typology
BASE
19
A Corpus for Large-Scale Phonetic Typology
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
BASE
20
A Language-Independent Approach to Automatic Text Difficulty Assessment for Second-Language Learners
In: DTIC (2013)
BASE

© 2013 - 2024 Lin|gu|is|tik