1 |
Neural Token Segmentation for High Token-Internal Complexity ...
|
|
|
|
Abstract:
Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step is straightforward. However, for languages with high token-internal complexity, further token-to-word segmentation is required. Previous canonical segmentation studies were based on character-level frameworks, with no contextualized representation involved. Contextualized vectors à la BERT show remarkable results in many applications, but have not been shown to improve performance on linguistic segmentation per se. Here we propose a novel neural segmentation model that combines the best of both worlds, contextualized token representation and char-level decoding, and is particularly effective for languages with high token-internal complexity and extreme morphological ambiguity. Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art, and ...
|
|
Keywords:
Computation and Language (cs.CL); FOS: Computer and information sciences
|
|
URL: https://arxiv.org/abs/2203.10845 https://dx.doi.org/10.48550/arxiv.2203.10845
|
|
BASE
|
|
2 |
Morphological Reinflection with Multiple Arguments: An Extended Annotation schema and a Georgian Case Study ...
|
|
|
|
BASE
|
|
10 |
Applying the Transformer to Character-level Transduction
|
|
|
|
In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)
|
|
BASE
|
|
11 |
Telling BERT's Full Story: from Local Attention to Global Aggregation
|
|
|
|
In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)
|
|
BASE
|
|
12 |
Disambiguatory Signals are Stronger in Word-initial Positions
|
|
|
|
In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)
|
|
BASE
|
|
14 |
Asking It All: Generating Contextualized Questions for any Semantic Role ...
|
|
|
|
BASE
|
|
15 |
The Possible, the Plausible, and the Desirable: Event-Based Modality Detection for Language Processing ...
|
|
|
|
BASE
|
|
17 |
Formae reformandae: for a reorganisation of verb form annotation in Universal Dependencies illustrated by the specific case of Latin
|
|
Cecchini, Flavio Massimiliano (ORCID: 0000-0001-9029-1822). Association for Computational Linguistics, Sofia, Bulgaria, 2021
|
|
BASE
|
|
18 |
RelWalk - A Latent Variable Model Approach to Knowledge Graph Embedding.
|
|
|
|
BASE
|
|