1 |
Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing
|
|
|
|
Abstract:
Thesis (Ph.D.)--University of Washington, 2021.
Multilingual modeling comes up in natural language processing at any scale. High-resource language corpora train high-performing models, and can be combined with other language corpora of all sizes to make better models for low-resource languages. Projects like Universal Dependencies even make it possible to train highly multilingual models from standardized morphosyntactic labels. Multilingual (or, more generally, multi-source) training does not consistently improve modeling performance, however. With an abundance of language resources comes a difficult design choice: which corpora will train better together rather than separately? More specifically, when is it worthwhile to supplement (i.e., concatenate) one corpus with another during training, rather than training on the first corpus alone? Approaches to selecting and evaluating candidate combinations tend toward two extremes: ad hoc or exhaustive. In this work, I put forth an alternative, predictive methodology for outcomes of concatenative training in dependency parsing. I leverage treebanks from the Universal Dependencies framework to assess the utility of linguistic corpus metrics in multi-source modeling. This approach is both robust and practical, using computationally simple metrics that expand upon intuitions of linguistic similarity, and making it possible to reasonably predict which conditions will yield significant improvement for a target corpus. Although the results are specific to a particular family of models and the task of dependency parsing, the approach holds promise for any number of natural language processing applications.
|
|
Keywords:
Computational Linguistics; Computer science; Corpus Linguistics; Dependency Parsing; Linguistics; Multilingual Modeling; Multitask Modeling; Natural Language Processing
|
|
URL: http://hdl.handle.net/1773/48283
|
|
BASE
|
|
2 |
ASR and Human Recognition Errors: Predictability and Lexical Factors
|
|
|
|
BASE
|
|
6 |
The Language of Law: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments
|
|
|
|
BASE
|
|
7 |
Speech to Text to Semantics: A Sequence-to-Sequence System for Spoken Language Understanding
|
|
|
|
BASE
|
|
8 |
Dialogical Signals of Stance Taking in Spontaneous Conversation
|
|
|
|
BASE
|
|
10 |
Exploring Phone Recognition in Pre-verbal and Dysarthric Speech
|
|
|
|
BASE
|
|
11 |
Enriching Scientific Paper Embeddings with Citation Context
|
|
|
|
BASE
|
|
12 |
Labeling and Automatically Identifying Basic-Level Categories
|
|
|
|
BASE
|
|
13 |
Exposing the hidden vocal channel: Analysis of vocal expression
|
|
|
|
BASE
|
|
14 |
Three Cheers For Partisanship: Lexical Framing and Applause in U.S. Presidential Primary Debates
|
|
|
|
BASE
|
|
15 |
STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
|
|
|
|
In: 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, March 6-7, 2017. Honolulu, Hawai‘i, USA (2017)
|
|
BASE
|
|
16 |
The prosody of negative ‘yeah’
|
|
|
|
In: LSA Annual Meeting Extended Abstracts; Vol 6 (2015); 6:1-5; 2377-3367
|
|
BASE
|
|
17 |
Detection of Agreement and Disagreement: An investigation of linguistic coordination and conversational features
|
|
|
|
BASE
|
|
18 |
An Independent Assessment of Phonetic Distinctive Feature Sets used to Model Pronunciation Variation
|
|
|
|
BASE
|
|