1 |
MarsaTag, a tagger for French written texts and speech transcriptions
|
|
|
|
In: Second Asian Pacific Corpus linguistics Conference ; https://hal.archives-ouvertes.fr/hal-01500736 ; Second Asian Pacific Corpus linguistics Conference, Mar 2014, Hong Kong, China. pp.220-220 (2014)
|
|
BASE
|
|
2 |
Phrase extraction and rescoring in statistical machine translation
|
|
Srivastava, Ankit Kumar. - : Dublin City University. Centre for Next Generation Localisation (CNGL), 2014. : Dublin City University. School of Computing, 2014
|
|
In: Srivastava, Ankit Kumar (2014) Phrase extraction and rescoring in statistical machine translation. PhD thesis, Dublin City University. (2014)
|
|
BASE
|
|
3 |
Deep Syntax Annotation of the Sequoia French Treebank
|
|
|
|
In: International Conference on Language Resources and Evaluation (LREC) ; https://hal.inria.fr/hal-00969191 ; International Conference on Language Resources and Evaluation (LREC), May 2014, Reykjavik, Iceland (2014)
|
|
BASE
|
|
4 |
Rhapsodie: a Prosodic-Syntactic Treebank for Spoken French
|
|
|
|
In: Language Resources and Evaluation Conference ; https://hal.sorbonne-universite.fr/hal-00968959 ; Language Resources and Evaluation Conference, May 2014, Reykjavik, Iceland (2014)
|
|
BASE
|
|
5 |
Correcting and Validating Syntactic Dependency in the Spoken French Treebank Rhapsodie
|
|
|
|
In: Proceedings of the 9th Language Resources and Evaluation Conference (LREC) ; https://halshs.archives-ouvertes.fr/halshs-01011059 ; Proceedings of the 9th Language Resources and Evaluation Conference (LREC), 2014, Iceland. pp.1-6 (2014)
|
|
BASE
|
|
12 |
Building Computational Resources : The URDU.KON-TB Treebank and the Urdu Parser
|
|
|
|
Abstract:
This work presents the development of the URDU.KON-TB treebank, its annotation guidelines and their evaluation, and the construction of the Urdu parser for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank and parser will have a significant impact on the state of the art in automatic Urdu language processing. The work includes the construction of a raw corpus of 1400 sentences collected from Urdu Wikipedia and the Jang newspaper. The corpus contains text on local and international news, social stories, sports, culture, finance, religion, travel, etc. The hierarchical annotation scheme adopted combines phrase structure with hyper dependency structure. A semi-semantic part-of-speech tag set, a semi-semantic syntactic tag set and a functional tag set are proposed, which were further revised during the annotation of the raw corpus. The sentences were annotated manually. Owing to the addition of morphological, part-of-speech, syntactic, semantic, clausal, grammatical and miscellaneous features, the annotation scheme is linguistically rich. The annotation resulted in a treebank for Urdu called URDU.KON-TB, which is presented in Chapter 3. Krippendorff's alpha coefficient, a statistical measure of inter-annotator agreement, was selected to evaluate the annotation scheme. One hundred randomly selected sentences from the URDU.KON-TB treebank were given to five trained annotators for annotation. The annotated sentences were then evaluated using Krippendorff's alpha coefficient. The alpha values of inter-annotator agreement obtained for part-of-speech, syntactic and functional annotation are 0.964, 0.817 and 0.806, respectively. All three values lie in the range of perfect agreement. The evaluation is presented in Chapter 4.
The annotation guidelines devised during the development of the URDU.KON-TB treebank were revised during and after this annotation evaluation. The updated version is presented in Chapter 2. For the development of the Urdu parser, the 1400 annotated sentences in the URDU.KON-TB treebank are divided into 80% training data and 20% test data. A context-free grammar is extracted from the training data and given to the Urdu parser once the parser has been developed. The 20% portion is further divided into 10% held-out data and 10% test data, so the final test set contains 140 sentences with an average length of 13.73 words per sentence. The held-out data is used during the development of the Urdu parser. The Urdu parser is an extended version of the Earley parsing algorithm, a dynamic programming algorithm. The extensions are discussed in Chapter 5, along with the issues faced during development. All items that can occur in normal text are considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others. The PARSEVAL measures are used to evaluate the results of the Urdu parser. By applying a sufficiently rich grammar along with the extended parsing model, the parser achieves an f-score of 87% and outperforms the multi-path shift-reduce parser for Urdu, a two-stage Hindi dependency parser and a simple Hindi dependency parser, with increases in recall of 4.8%, 12.48% and 22%, respectively. The URDU.KON-TB treebank and the Urdu parser are a contribution to the overall computational resources for Urdu. By-products of this work are a semi-semantic part-of-speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar with sufficient encoded information for parsing the morphologically rich language Urdu, and a part-of-speech-tagged corpus, which can be used for training part-of-speech taggers.
These resources will be enhanced further and can be used for natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development and language identification, as sources for linguistic inquiry and psychological modeling, or for pattern matching.
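The data split and the PARSEVAL evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's code: the function names (`split_sizes`, `brackets`, `parseval`), the nested-tuple tree encoding and the English toy trees are assumptions made purely for illustration, and the scoring shown is plain labeled-bracket precision, recall and F1 over constituents whose leaves are words.

```python
from collections import Counter


def split_sizes(total, train_frac=0.8, heldout_frac=0.1):
    """Return (train, heldout, test) sentence counts for an 80/10/10 split."""
    train = round(total * train_frac)
    heldout = round(total * heldout_frac)
    test = total - train - heldout
    return train, heldout, test


def brackets(tree, start=0):
    """Collect labeled spans from a nested tuple (label, child, child, ...).

    Leaves are plain strings (words). Returns (spans, end_position), where
    each span is (label, start, end) measured in word positions.
    """
    label, *children = tree
    spans = []
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a word advances the position by one
        else:
            child_spans, pos = brackets(child, pos)
            spans.extend(child_spans)
    spans.append((label, start, pos))
    return spans, pos


def parseval(gold_tree, predicted_tree):
    """Labeled-bracket precision, recall and F1 for one sentence."""
    gold = Counter(brackets(gold_tree)[0])
    pred = Counter(brackets(predicted_tree)[0])
    matched = sum((gold & pred).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # 1400 sentences -> 1120 training, 140 held-out, 140 test
    print(split_sizes(1400))

    # Hypothetical toy trees; the predicted tree misplaces one NP bracket.
    gold = ("S", ("NP", "the", "dog"), ("VP", "saw", ("NP", "a", "cat")))
    pred = ("S", ("NP", "the", "dog"), ("VP", "saw", "a", ("NP", "cat")))
    print(parseval(gold, pred))
```

With 1400 sentences the split yields 1120 training, 140 held-out and 140 test sentences, matching the test-set size stated in the abstract. A corpus-level score would sum matched, gold and predicted bracket counts over all test sentences before dividing, rather than averaging per-sentence values.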
|
|
Keyword:
ddc:004; Functional Tagset; Semi-Semantic Part of Speech Tagset; Semi-Semantic Syntactic Tagset; Urdu Parser; Urdu Treebank Statistical Evaluation; URDU.KON.TB Treebank
|
|
URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-290530
|
|
BASE
|
|
13 |
From Syntax to Semantics. First Steps Towards Tectogrammatical Annotation of Latin
|
|
Passarotti, Marco Carlo (orcid:0000-0002-9806-7187). - : The Association for Computational Linguistics, 2014. : country:SWE, 2014. : place:Gothenburg, 2014
|
|
BASE
|
|
14 |
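Reflexões sobre anotação sintática e ferramentas de busca - Uso da linguagem XML para anotação sintática no corpus digital DOViC [Reflections on syntactic annotation and search tools: use of XML for syntactic annotation in the DOViC digital corpus]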
|
|
|
|
In: Letras & Letras; v. 30, n. 2 (2014): Linguística de Corpus: abordagem e metodologia em pesquisas linguísticas de base empírica; 82-103 ; 1981-5239 (2014)
|
|
BASE
|
|
15 |
Challenges in Enhancing the Index Thomisticus Treebank with Semantic and Pragmatic Annotation
|
|
|
|
BASE