1.
Modeling Inflectional Complexity in Natural Language Processing
Degree: Doctor of Philosophy
Abstract:
Inflectional morphology presents numerous problems for traditional computational models, not least of which is an increase in the number of rare types in any corpus. Although few annotated corpora exist for morphologically complex languages, lay speakers of a language can generate data such as inflection tables, which describe patterns that machine learning algorithms can learn. We investigate four inflectional tasks: inflection generation, stemming, lemmatization, and morphological analysis, and demonstrate that each of these tasks can be accurately modeled using sequential string transduction methods. Furthermore, expert annotation is unnecessary: inflectional models are learned from crowd-sourced inflection tables. We first investigate inflection generation: given a dictionary form and a tag representing inflectional information, we produce inflected word-forms. We then refine our predictions by referring to the other forms within a paradigm. Results of experiments on six diverse languages with varying amounts of training data demonstrate that our approach improves the state of the art in terms of predicting inflected word-forms. We next investigate stemming: the removal of inflectional prefixes and suffixes from a word. Unlike the inflection generation task, it is not possible to use inflection tables to learn a fully-supervised stemming model; however, we exploit paradigmatic regularity to identify stems in an unsupervised manner with over 85% accuracy. Experiments on English, Dutch, and German show that our stemmers substantially outperform rule-based and unsupervised stemmers such as Snowball and Morfessor, and approach the accuracy of a fully-supervised system. Furthermore, the generated stems are more consistent than those annotated by experts. We also use the inflection tables to learn models that generate lemmas from inflected forms. Unlike stemming, lemmatization restores orthographic changes that have occurred during inflection. These models are more accurate than Morfette and Lemming on most datasets. Finally, we extend our lemmatization methods to produce complete morphological analyses: given a word, return the set of lemma/tag pairs that may have generated it. This task is more ambiguous than inflection generation or lemmatization, which typically produce only a small number of outputs. Thus, morphological analysis involves producing a complete list of lemma/tag analyses for a given word-form. Experiments on four languages demonstrate that our system has much higher coverage than a hand-engineered FST analyzer, and is more accurate than a state-of-the-art morphological tagger.
Keyword:
Computational Linguistics; Inflection; Morphology; Natural Language Processing; String Transduction
URL: https://era.library.ualberta.ca/items/c19e53ed-60b2-4742-bb6a-01f5683de6a0 https://doi.org/10.7939/R3HT2GR63 http://hdl.handle.net/10402/era.44191
2.
Semi-supervised learning of morphological paradigms and lexicons
In: http://aclweb.org/anthology//E/E14/E14-1060.pdf (2014)
3.
Parallel MultiTheory Annotations of Syntactic Structure
In: http://www.lrec-conf.org/proceedings/lrec2008/pdf/587_paper.pdf (2008)
4.
POS-tag based poetry generation with WordNet
In: http://wing.comp.nus.edu.sg/~antho/W/W13/W13-2121.pdf
5.
ZeuScansion: a tool for scansion of English poetry
In: http://www.aclweb.org/anthology/W/W13/W13-1803.pdf
6.
using a
In: http://www.aclweb.org/anthology-new/W/W11/W11-2605.pdf
7.
Finite state applications with Javascript
In: http://emmtee.net/oe/nodalida13/conference/91.pdf
8.
Foma: a finite-state compiler and library
In: http://aclweb.org/anthology/E/E09/E09-2008.pdf
9.
Revisiting multi-tape automata for Semitic morphological analysis and generation
In: http://aclweb.org/anthology-new/W/W09/W09-0803.pdf
10.
Using foma for language-based games
In: http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1349278797/publikoak/pdf/
11.
ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain
In: http://www.lrec-conf.org/proceedings/lrec2014/pdf/691_Paper.pdf