Home
Catalogue search
Refine your search:
Keyword:
Creator / Publisher:
Edunov, Sergey (1)
Fan, angelafan@fb.com (1)
Grave, Edouard (1)
Joulin, Armand (1)
Schwenk, Holger (1)
The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing 2021 (1)
Wenzek, Guillaume (1)
Year:
2021 (1)
Medium:
Online (1)
Type:
Article (1)
BLLDB-Access
Search in the Catalogues and Directories
All fields
Title
Creator / Publisher
Keyword
Year
AND
OR
AND NOT
All fields
Title
Creator / Publisher
Keyword
Year
AND
OR
AND NOT
All fields
Title
Creator / Publisher
Keyword
Year
AND
OR
AND NOT
All fields
Title
Creator / Publisher
Keyword
Year
AND
OR
AND NOT
All fields
Title
Creator / Publisher
Keyword
Year
Sort by
creator [A → Z]
'
creator [Z → A]
'
publishing year ↑ (asc)
'
publishing year ↓ (desc)
'
title [A → Z]
'
title [Z → A]
'
Simple Search
Hits 1 – 1 of 1
1
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web ...
The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing 2021
;
Edunov, Sergey
;
Fan, angelafan@fb.com
;
Grave, Edouard
;
Joulin, Armand
;
Schwenk, Holger
;
Wenzek, Guillaume
. - : Underline Science Inc., 2021
Abstract:
Read paper: https://www.aclanthology.org/2021.acl-long.507 Abstract: We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzel et al, 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billions are aligned with English. We illustrate the capability of our scalable mining system to create high quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT'19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems ...
URL:
https://dx.doi.org/10.48448/z2vp-9188
https://underline.io/lecture/25726-ccmatrix-mining-billions-of-high-quality-parallel-sentences-on-the-web
BASE
Hide details
Mobile view
All
Catalogues
UB Frankfurt Linguistik
0
IDS Mannheim
0
OLC Linguistik
0
UB Frankfurt Retrokatalog
0
DNB Subject Category Language
0
Institut für Empirische Sprachwissenschaft
0
Leibniz-Centre General Linguistics (ZAS)
0
Bibliographies
BLLDB
0
BDSL
0
IDS Bibliografie zur deutschen Grammatik
0
IDS Bibliografie zur Gesprächsforschung
0
IDS Konnektoren im Deutschen
0
IDS Präpositionen im Deutschen
0
IDS OBELEX meta
0
MPI-SHH Linguistics Collection
0
MPI for Psycholinguistics
0
Linked Open Data catalogues
Annohub
0
Online resources
Link directory
0
Journal directory
0
Database directory
0
Dictionary directory
0
Open access documents
BASE
1
Linguistik-Repository
0
IDS Publikationsserver
0
Online dissertations
0
Language Description Heritage
0
© 2013 - 2024 Lin|gu|is|tik
|
Imprint
|
Privacy Policy
|
Datenschutzeinstellungen ändern