1. Automatic Dialect Density Estimation for African American English ...
   (BASE)

2. DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization ...

3. Dialogue State Tracking with a Language Model using Schema-Driven Prompting ...

5. Neural Models for Integrating Prosody in Spoken Language Understanding

6. Automatic Analysis of Language Use in K-16 STEM Education and Impact on Student Performance

7. Asynchronous Speech Recognition Affects Physician Editing of Notes

9. Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information ...

10. Effective Use of Cross-Domain Parsing in Automatic Speech Recognition and Error Detection
12. Data Selection for Statistical Machine Translation

Abstract:
Thesis (Ph.D.)--University of Washington, 2014 ; Machine translation, the computerized translation of one human language into another, could be used to communicate between the thousands of languages used around the world. Statistical machine translation (SMT) is an approach to building these translation engines without much human intervention, and large-scale implementations by Google, Microsoft, and Facebook are used by millions daily. The quality of an SMT system depends on the example translations used to train its models. This data can come from a variety of sources, many of which are poorly matched to a specific task of interest. The goal is to find the right data with which to train a model for a particular task. This work identifies the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and easier to train than the large-scale models. Three methods are explored for identifying task-relevant translation training data in a general data pool. The first uses only a language model to score the training data according to lexical probabilities, improving on prior results by using a bilingual score that accounts for differences between the target domain and the general data. The second is a topic-based relevance score, novel for SMT, that uses topic models to project texts into a latent semantic space; these semantic vectors are then used to compute the similarity of sentences in the general pool to the target task. This work finds that for some tasks the automatic topic models actually capture the style of the language rather than task-specific content words. This motivates the third approach, a novel style-based data selection method.
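The language-model scoring described above is in the spirit of cross-entropy-difference data selection: each pool sentence is ranked by how much better an in-domain model explains it than a general-domain model. The following is a minimal monolingual sketch using add-one-smoothed unigram models, not the thesis's implementation; the function names and toy corpora are invented for illustration.

```python
import math
from collections import Counter

def cross_entropy(sentence, counts, total, vocab_size):
    # Per-token cross-entropy under an add-one-smoothed unigram model.
    tokens = sentence.split()
    logprob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -logprob / len(tokens)

def score_pool(pool, in_domain, general):
    # Rank pool sentences by H_in(s) - H_gen(s); lower means the
    # in-domain model "prefers" the sentence, i.e. more task-relevant.
    vocab = {t for s in in_domain + general for t in s.split()}
    c_in = Counter(t for s in in_domain for t in s.split())
    c_gen = Counter(t for s in general for t in s.split())
    v = len(vocab)
    return sorted(
        pool,
        key=lambda s: cross_entropy(s, c_in, sum(c_in.values()), v)
        - cross_entropy(s, c_gen, sum(c_gen.values()), v),
    )
```

A bilingual variant of the kind the abstract alludes to would sum this difference over both sides of each candidate sentence pair, using models trained on the source and target languages respectively.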
Hybrid word and part-of-speech (POS) representations of the two corpora are constructed by retaining the discriminative words and using POS tags as a proxy for the stylistic content of the infrequent words. Language models based on these representations can be used to quantify the underlying stylistic relevance between two texts. Experiments show that style-based data selection can outperform the current state-of-the-art method for task-specific data selection, in terms of SMT system performance and vocabulary coverage. Taken together, the experimental results indicate that it is important to characterize corpus differences when selecting data for statistical machine translation.
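The hybrid representation can be sketched as follows, assuming a precomputed POS lookup: frequent (discriminative) words are kept verbatim, and infrequent words are replaced by their POS tag so that only stylistic structure remains. This is an illustrative sketch under those assumptions, not the thesis's code; `pos_tags` and the cutoff parameter are hypothetical.

```python
from collections import Counter

def hybrid_representation(sentences, pos_tags, keep_top=100):
    # Keep the keep_top most frequent words verbatim; replace all
    # other words with their POS tag as a proxy for style.
    counts = Counter(t for s in sentences for t in s.split())
    kept = {w for w, _ in counts.most_common(keep_top)}
    return [
        " ".join(t if t in kept else pos_tags.get(t, "UNK") for t in s.split())
        for s in sentences
    ]
```

A language model trained on such hybrid sequences scores a text by its stylistic fit rather than by its content vocabulary, which is what the style-based selection method exploits.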
Keywords:
Computer science; data selection; electrical engineering; language modeling; machine translation; natural language processing; topic modeling

URL: http://hdl.handle.net/1773/26146
|
16. Graph-based Algorithms for Lexical Semantics and its Applications