42 |
Avtomatsko pridobivanje besednih zvez iz korpusa z uporabo leksikona SSJ [Automatic extraction of phrases from a corpus using the SSJ lexicon]
|
|
|
|
BASE
|
|
|
|
44 |
Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)
|
|
|
|
BASE
|
|
|
|
45 |
Developmental corpus of Slovene (without language corrections) Šolar-Clear
|
|
|
|
BASE
|
|
|
|
49 |
Value of Language-Related Questions and Comments in Digital Media for Lexicographical User Research
|
|
|
|
In: International Journal of Lexicography 30 (2017) 3, 285-308
|
|
IDS OBELEX meta
|
|
|
|
50 |
CMC training corpus Janes-Tag 2.0
|
|
|
|
Abstract:
Janes-Tag is a manually annotated corpus of Slovene computer-mediated communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations. Version 2.0, an update to version 1.2, corrects some minor errors and adds named entity annotation. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1084.
|
|
Keywords:
computer-mediated communication; lemmatisation; manual annotation; named entities; tagging; TEI; tokenisation; word normalisation
|
|
URL: http://hdl.handle.net/11356/1123
|
|
BASE
|
|
|
|
|
|