
Search in the Catalogues and Directories

Hits 1 – 20 of 57

1
Multi-source morphosyntactic tagging for Spoken Rusyn
In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (2017)
BASE
2
A Quantitative Approach to Swiss German Dialect Syntax
In: International Conference on Language Variation in Europe (ICLAVE 9) (2017)
BASE
3
Lexicon Induction for Spoken Rusyn – Challenges and Results
In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (2017)
BASE
4
Findings of the VarDial Evaluation Campaign 2017
In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (2017)
BASE
5
Towards automatic geolocalisation of speakers of European French
In: International Conference on Language Variation in Europe (ICLAVE 9) (2017)
BASE
6
Combien d'accents en français? Focus sur la France, la Belgique et la Suisse
In: Processus de différenciation: des pratiques langagières à leur interprétation sociale - Actes du colloque VALS-ASLA 2016, Vol. 1 (2017)
BASE
7
Schweizerdeutsche Dialekte quantitativ – Dialektometrische Analysen und Vergleich linguistischer Ebenen
In: 13. Bayerisch-Österreichische Dialektologentagung (BÖDT) (2016)
BASE
8
Modernising historical Slovene words
In: ISSN: 1351-3249 ; Natural Language Engineering, Vol. 22, No. 6 (2016), pp. 881-905
BASE
9
Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German
In: 6th Days of Swiss Linguistics (2016)
Abstract: To study and automatically process Swiss German, it is necessary to resolve the issue of variation in the written representation of words. Owing to the lack of a written tradition and to considerable regional variation, Swiss German writing is highly inconsistent, making it hard to establish identity between lexical items that are perceived as “the same word”. This poses a problem for any task that requires establishing lexical identities, such as efficient corpus querying for linguistic research, semantic processing, and information retrieval. In the context of building the general-purpose electronic corpus ArchiMob, we have chosen to create an additional annotation layer that maps the original word forms to unified normalised representations. In this paper, we argue that these normalised representations can be induced in a semi-automatic fashion using techniques from machine translation.
A lexical unit can be pronounced, and therefore transcribed, in various ways, due to dialectal variation, intra-speaker variation, code-switching or occasional transcription errors. To establish lexical identities between items perceived as “the same word”, the transcribed texts need to be normalised. We propose an approach to automatic normalisation that casts the task as simplified machine translation from inconsistently written texts to a unified representation. The resulting normalisation is treated as word-level annotation that is used internally for executing search queries but is not intended to be presented to human users. Normalisation is carried out manually on a set of six documents, which serve as training and development data.
An important feature of the particular normalised representation implemented in our work is that it diverges from both standard German and Swiss German. Many Swiss German lexical items do not have any etymologically related standard counterparts. We chose to normalise them using a convenient, etymologically motivated common construction. Thus, öpper is normalised as etwer, töff as töff, and gheie as geheien. Standard German conventions regarding word boundaries are often not applicable to Swiss German, where articles and pronouns tend to be cliticised. In these cases, we decided to keep the standard word boundaries whenever this was possible. Thus, hettemers is normalised as hätten wir es, and bimene as bei einem.
The six manually normalised documents are used as training data to automatically predict normalisation candidates for the following documents. The automatic processing is intended to speed up the annotation of our corpus, but also to replace manual annotation of new data that is not part of the corpus. Developing an automatic approach, however, is not trivial because the mappings between the transcriptions and their corresponding normalisations need to be learned from a small and extremely sparse data set. We need to be able to fit the training data, but also learn to generalise beyond the cases seen in the training set. The core of our approach is to distinguish four classes of words based on the distribution of their normalisations in the training data, and to apply an appropriate normalisation technique to each class. Words associated with only one normalisation or with one predominant normalisation are best treated using word-to-word translation. To address words associated with multiple normalisations, none of which is predominant, we train a trigram language model. Finally, to address new words that have not been seen at all in the training data, we train a full character-based statistical machine translation system (Vilar et al. 2007, Tiedemann 2009). We show that the combination of the methods gives better results than any of them individually, allowing us to obtain a relatively good automatic normalisation of a wide range of variants in Swiss German using a small training set.
Keyword: info:eu-repo/classification/ddc/410
URL: https://archive-ouverte.unige.ch/unige:90850
BASE
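The class-based strategy described in the abstract of entry 9 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.8 threshold for "predominant", and the two callables standing in for the trigram language model and the character-level translation system are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumption: a normalisation counts as "predominant" when it covers at least
# this share of a word's training occurrences. The abstract gives no threshold.
PREDOMINANT_SHARE = 0.8


def build_table(pairs):
    """Count, for each transcribed form, how often each normalisation occurs."""
    table = defaultdict(Counter)
    for source, normalised in pairs:
        table[source][normalised] += 1
    return table


def word_class(word, table):
    """Assign a word to one of four classes by its normalisation distribution."""
    counts = table.get(word)
    if not counts:
        return "unseen"
    if len(counts) == 1:
        return "unique"
    top_count = counts.most_common(1)[0][1]
    if top_count / sum(counts.values()) >= PREDOMINANT_SHARE:
        return "predominant"
    return "ambiguous"


def normalise(word, table, pick_in_context=None, translate_chars=None):
    """Dispatch a word to the technique matching its class.

    pick_in_context and translate_chars are hypothetical callables standing in
    for the trigram language model and the character-level translation system.
    """
    cls = word_class(word, table)
    if cls == "unseen":
        # No training evidence: defer to the character-level stand-in, if any.
        return translate_chars(word) if translate_chars else word
    if cls == "ambiguous" and pick_in_context:
        # Several candidates, none predominant: let the stand-in choose.
        return pick_in_context(word, list(table[word]))
    # Unique or predominant (or ambiguous without a chooser): word-to-word
    # lookup of the most frequent normalisation.
    return table[word].most_common(1)[0][0]
```

With a table built from pairs such as ("öpper", "etwer") and ("gheie", "geheien"), both taken from the abstract, `normalise("öpper", table)` returns the stored form, while an unseen form such as "hettemers" is returned unchanged unless a character-level stand-in is supplied.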
10
Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation
In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS) (2016)
BASE
11
On-line Multilingual Linguistic Services
In: ISBN: 978-4-87974-703-7 ; Proceedings of COLING 2016 System Demonstrations (2016)
BASE
12
ArchiMob - A Corpus of Spoken Swiss German
In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (2016)
BASE
13
Normalising orthographic and dialectal variants for the automatic processing of Swiss German
In: Proceedings of the 7th Language and Technology Conference (2015)
BASE
14
Crowdsourced mapping of pronunciation variants in European French
In: Proceedings of the 18th International Congress of Phonetic Science pp. 1-5 (2015)
BASE
15
Dialektometrische Analyse von schweizerdeutschen Dialektdaten
In: 18. Arbeitstagung zur alemannischen Dialektologie (2014)
BASE
16
Part-of-speech tagging for regional languages and dialects : A generic approach based on unsupervised learning
In: 8èmes Journées Suisses de la Linguistique (2014)
BASE
17
A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014) (2014)
BASE
18
Unsupervised adaptation of supervised part-of-speech taggers for closely related languages
In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial) pp. 30-38 (2014)
BASE
19
The distribution of aggregated syntactic construction types compared with other linguistic levels - A dialectometrical analysis of Swiss German dialects
In: Methods in Dialectology XV (2014)
BASE
20
Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft: Maschinelle Übersetzung und Dialektometrie
In: ISBN: 978-3-515-10343-5 ; Alemannische Dialektologie: Dialekte im Kontakt (Beiträge zur 17. Arbeitstagung für alemannische Dialektologie in Strassburg) pp. 261-278 (2014)
BASE
