22. Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.1
BASE
26. Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1
27. Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.0
29. Multilingual comparable corpora of parliamentary debates ParlaMint 2.0
32. Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives ...
33. Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives ...
37. The CLASSLA-StanfordNLP model for lemmatisation of standard Macedonian 1.0
38. The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.0
39. Multilingual comparable corpora of parliamentary debates ParlaMint 1.0
40. Slovenian parliamentary corpus (1990-2018) siParl 2.0
Abstract:
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period (1990-1992), minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period (1992-2018), minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period (1996-2018), and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period (1996-2018). The corpus comprises over 10 thousand sessions, one million speeches, or 200 million words. It contains meta-data about the speakers, a typology of sessions etc., as well as structural, editorial and linguistic annotations. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session is in one file. This item comprises the following datasets: 1. the source DARAH-SI Parla-CLARIN encoded corpus; 2. the linguistically annotated Parla-CLARIN encoded corpus, with tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, and named entities; 3. the linguistically annotated corpus in the vertical format used by the CWB and Sketch Engine concordancers (this format is simpler and smaller, but does not contain all the information from the source TEI); 4. the linguistically annotated corpus in the CoNLL-U format as used by Universal Dependencies; 5. the plain text of the corpus. Note that each dataset also includes TSV meta-data files on sessions (files) and speakers. Compared with the previous version 1.0, this version corrects many errors, has substantially better meta-data, and its linguistic processing covers more levels and contains fewer errors.
Keyword:
Parla-CLARIN; parliamentary debates; Slovenian Parliament; TEI; universal dependencies
URL: http://hdl.handle.net/11356/1300