21
Corpus of Croatian news portals ENGRI (2014-2018)
Abstract:
The corpus consists of texts collected from the most popular news portals in Croatia in the period from 2014 to 2018: Direktno, Dnevno, Net Hr, Hrt, Index_Hr, Jutarnji, Novilist, Rtl, SlobodnaDalmacija, Večernji, Tportal, Dnevnik. Popularity was determined from the Reuters Institute Digital News Report for 2018 (retrieved from http://www.digitalnewsreport.org in April 2019). Web browsing and web crawling were used to select and store the texts together with useful HTML metadata (each article's publication date, URL, and title). The corpus was linguistically processed with the CLASSLA package (https://pypi.org/project/classla/) at the levels of tokenization, sentence splitting, morphosyntactic tagging, lemmatization, dependency parsing, and named entity recognition. This corpus is a linguistically processed version of the original corpus published at https://repository.pfri.uniri.hr/islandora/object/pfri%3A2156 and is distributed in the CoNLL-U format (https://universaldependencies.org/format.html).
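Since the corpus is distributed in CoNLL-U, a short reader may help illustrate what the annotation layers look like in practice. The following is a minimal sketch, not the authors' tooling: the sample sentence is hand-written for illustration and is not taken from the corpus, and the names `read_conllu`, `CONLLU_FIELDS`, and `SAMPLE` are hypothetical. It assumes only the standard ten tab-separated CoNLL-U columns, `#` comment lines, and blank lines between sentences.

```python
from collections import Counter

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# Hand-written illustrative sample (an assumption, NOT taken from the corpus).
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Vlada je odlučila.\n"
    "1\tVlada\tvlada\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\taux\t_\t_\n"
    "3\todlučila\todlučiti\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment/metadata line; ignored in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


sentences = list(read_conllu(SAMPLE.splitlines()))
lemma_counts = Counter(tok["lemma"] for sent in sentences for tok in sent)
```

For a real file, the same generator could be fed an open file handle (for example `read_conllu(open(path, encoding="utf-8"))`, where `path` is whatever location the downloaded corpus file has), which keeps memory use low because sentences are produced one at a time.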
Keywords:
contemporary language; news corpus
URL: http://hdl.handle.net/11356/1416
BASE
22
Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.1
BASE
26
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1
BASE
27
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.0
BASE
29
Multilingual comparable corpora of parliamentary debates ParlaMint 2.0
BASE
32
Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives ...
BASE
33
Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives ...
BASE
37
The CLASSLA-StanfordNLP model for lemmatisation of standard Macedonian 1.0
BASE
38
The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.0
BASE
39
Multilingual comparable corpora of parliamentary debates ParlaMint 1.0
BASE