85 |
Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0
|
|
|
|
BASE
|
|
|
|
86 |
hr500k – A Reference Training Corpus of Croatian.
|
|
|
|
In: Conference papers (2018)
|
|
BASE
|
|
|
|
88 |
Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects
|
|
|
|
DNB Subject Category Language
|
|
|
|
90 |
TEI-Lex0 guidelines for the encoding of dictionary information on written and spoken forms
|
|
|
|
In: Electronic Lexicography in the 21st Century: Proceedings of ELex 2017 Conference, Sep 2017, Leiden, Netherlands ; https://hal.inria.fr/hal-01757108 (2017)
|
|
BASE
|
|
|
|
91 |
Universal Dependencies 2.1
|
|
|
|
In: https://hal.inria.fr/hal-01682188 (2017)
|
|
BASE
|
|
|
|
92 |
Closing a gap in the language resources landscape: Groundwork and best practices from projects on computer-mediated communication in four European countries
|
|
|
|
In: CLARIN Annual Conference 2016, Oct 2016, Aix-en-Provence, France. Selected papers from the CLARIN Annual Conference 2016, Linköping Electronic Conference Proceedings 136, pp. 1-19, 2017, ISBN 978-91-7685-499-0 ; https://hal.archives-ouvertes.fr/hal-01379621 ; http://www.ep.liu.se/ecp/contents.asp?issue=136 (2017)
|
|
BASE
|
|
|
|
95 |
Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data
|
|
|
|
BASE
|
|
|
|
100 |
CMC training corpus Janes-Tag 2.0
|
|
|
|
Abstract:
Janes-Tag is a manually annotated corpus of Slovene computer-mediated communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations. Version 2.0 updates version 1.2 by correcting some minor errors and adding named entity annotation. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1084.
|
|
Keyword:
computer-mediated communication; lemmatisation; manual annotation; named entities; tagging; TEI; tokenisation; word normalisation
|
|
URL: http://hdl.handle.net/11356/1123
|
|
BASE
|
|
|
|
|
|