1 |
Evaluating Multiway Multilingual NMT in the Turkic Languages ...
BASE
2 |
Findings of the WMT 2021 Shared Task on Quality Estimation ...
3 |
Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation ...
4 |
Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters ...
5 |
Robust Open-Vocabulary Translation from Visual Text Representations ...
6 |
Contrastive Learning for Context-aware Neural Machine Translation Using Coreference Information ...
7 |
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation ...
8 |
Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings ...
9 |
Simultaneous Neural Machine Translation with Constituent Label Prediction ...
10 |
Just Ask! Evaluating Machine Translation by Asking and Answering Questions ...
11 |
An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces ...
12 |
Findings of the WMT Shared Task on Machine Translation Using Terminologies ...
13 |
Translation Transformers Rediscover Inherent Data Domains ...
14 |
Phrase-level Active Learning for Neural Machine Translation ...
16 |
Wine is not v i n. On the Compatibility of Tokenizations across Languages ...
Abstract:
The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" (word-level) in English vs. "v i n" (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible -- a desideratum that so far has been neglected in multilingual models. ...
Keyword:
Bilingual Lexicon Induction; Language Models; Natural Language Processing
URL: https://dx.doi.org/10.48448/4bn9-4p23
https://underline.io/lecture/38413-wine-is-not-v-i-n.-on-the-compatibility-of-tokenizations-across-languages