Closing a gap in the language resources landscape: Groundwork and best practices from projects on computer-mediated communication in four European countries.
In: Selected papers from the CLARIN Annual Conference 2016 (Aix-en-Provence, France, October 2016). Linköping Electronic Conference Proceedings 136, pp. 1-19, 2017. ISBN 978-91-7685-499-0. https://hal.archives-ouvertes.fr/hal-01379621 ; http://www.ep.liu.se/ecp/contents.asp?issue=136
BASE
CMC training corpus Janes-Syn 1.0
Abstract:
Janes-Syn is a syntactically annotated corpus of Slovene tweets, intended as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication (CMC) and for detailed linguistic explorations that require highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). The annotations on all levels were manually corrected. The creation and structure of the corpus are described in: Arhar Holdt, Špela; Fišer, Darja; Erjavec, Tomaž; Krek, Simon (2016). Syntactic annotation of Slovene CMC: first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, pp. 3-6. http://nl.ijs.si/janes/cmc-corpora2016/proceedings/ Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123).
Keywords:
computer-mediated communication; dependency treebank; manual annotation; syntactic annotation; TEI; tokenisation
URL: http://hdl.handle.net/11356/1086