
Hits 1 – 20 of 441

1	The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.4
	Ljubešić, Nikola; Krsnik, Luka. - : Jožef Stefan Institute, 2022
	BASE

2	Texte électronique enrichi par lemmatisation et étiquetage morphosyntaxique, portion de La Mort du roi Arthur, http://www.atilf.fr/dmf/MortArthur/
	Jézéquel, Jean-Michel; Bazin-Tacchella, Sylvie; Souvay, Gilles...
	In: https://hal.archives-ouvertes.fr/hal-03426756 ; 2021 (2021)
	BASE

3	Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
	Camps, Jean-Baptiste; Gabay, Simon; Fièvre, Paul...
	In: EISSN: 2416-5999 ; Journal of Data Mining and Digital Humanities ; https://halshs.archives-ouvertes.fr/halshs-02591388 ; Journal of Data Mining and Digital Humanities, Episciences.org, 2021, ⟨10.46298/jdmdh.6485⟩ (2021)
	BASE

4	Corpus and Models for Lemmatisation and POS-tagging of Old French
	Camps, Jean-Baptiste; Clérice, Thibault; Duval, Frédéric...
	In: https://halshs.archives-ouvertes.fr/halshs-03353125 ; 2021 (2021)
	BASE

5	Expanding the content model of annotationBlock
	Bartz, Alexandre; Janes, Juliette; Romary, Laurent...
	In: Next Gen TEI, 2021 - TEI Conference and Members’ Meeting ; https://hal.archives-ouvertes.fr/hal-03380805 ; Next Gen TEI, 2021 - TEI Conference and Members’ Meeting, Oct 2021, Virtual, United States (2021)
	BASE

6	Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1
	Šimko, Ivan. - : Slavic Seminary, University of Zurich, 2021
	BASE

7	Ekspress news article archive (in Estonian and Russian) 1.0
	Purver, Matthew; Pollak, Senja; Freienthal, Linda; Kuulmets, Hele-Andra; Krustok, Ivar; Shekhar, Ravi. - : Ekspress Meedia Group, 2021
	Abstract: The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles) with some in Russian (325,952 articles). Keywords are included for articles after 2015. The main archive is in file ee_articles_2009_2019; it contains JSON files of all the Estonian articles from 2009 to May 2019. These datasets are intended for use in EMBEDDIA, an H2020 project.
	Other files contain derived versions and subsets (please see the README files inside those zip files), in short:
	- eearticles2015-2019: this dataset contains Estonian and Russian articles - 5 years, with tags, that were missing in the previous versions.
	- files eearticles20152019lemmatized and eearticles20092014lemmatized are the files preprocessed by TEXTA (contact linda@texta.ee)
	- in file eeandsttarticlelemmasembeddingsand_usage you can find w2v embeddings trained by TEXTA (contact linda@texta.ee)
	Description of the Main Dataset (eearticles_2009_2019)
	There are 12 JSON files:
	- articles_2009_ver2.json contains 161394 articles from the year 2009
	- articles_2010_ver2.json contains 151033 articles from the year 2010
	- articles_2011_ver2.json contains 168273 articles from the year 2011
	- articles_2012_ver2.json contains 152772 articles from the year 2012
	- articles_2013_ver2.json contains 141012 articles from the year 2013
	- articles_2014_ver2.json contains 128388 articles from the year 2014
	- articles_2015_ver2.json contains 127425 articles from the year 2015
	- articles_2016_ver2.json contains 130704 articles from the year 2016
	- articles_2017_ver2.json contains 119318 articles from the year 2017
	- articles_2018_ver2.json contains 117388 articles from the year 2018
	- articles_2019_Jan-Apr_ver2.json contains 35076 articles from the year 2019 January to April
	- articles_2019_May_ver2.json contains 8329 articles from the year 2019 May
	In sum: 1 441 112 articles
	Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
	- id (integer) - the ID of the article
	- title (string) - the title of the article
	- lead (string) - the lead of the article (can contain HTML, e.g. tag)
	- url (string) - the URL of the article
	- tags (list of dictionaries or None) [1]: each dictionary represents one tag. The tag dictionary contains the following:
	    - domain_id (string) [2] - the ID of the domain
	    - id (string) - the ID of the tag
	    - lang (string) - the language of the tag
	    - tag (string) - the tag itself, e.g. Kert Kingo (a name)
	    - translitted_name (string) - a modified version of the tag, e.g. kert-kingo
	- rawBody (string) - the raw text of the article (contains HTML)
	- bodyText (string) - clean article text (stripped of HTML)
	- publishDate (string) - published date & time of the article
	- categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
	    - categoryId (integer) - the ID of the category
	    - categoryName (string) - the name of the category (e.g. World)
	    - channelId (integer) - the ID of the channel
	  OR
	    - articleId (integer) - the ID of the article
	    - categoryId (integer) - the ID of the category
	    - categoryName (string) - the name of the category (e.g. World)
	    - categoryPrimary (boolean) - unknown
	    - categorySort (integer) - unknown
	    - categoryUrl (string) - the URL of the category
	    - categoryVisible (boolean) - unknown
	    - channelId (integer) - the ID of the channel
	    - channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')
	    - directoryName (string) - unknown
	    - parentId (integer) - unknown
	- channelLanguage (string or None) [3] - the language of the channel
	- categoryLanguage (int or None) [4] - unknown
	- commentCount (int) [5] - the number of comments
	- relatedArticles (list of integers) - a list of related articles' IDs
	Keyword: lemmatisation; news corpus; word embeddings
	URL: http://hdl.handle.net/11356/1408
	BASE
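
	The record layout described in the abstract above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the two sample records, their values, the date format, and the helper names tag_names, category_name and summarise are all invented for illustration. Only the field names come from the abstract. The sketch shows how one might guard against the two irregular fields, tags (which may be None) and categoryPrimary (which may be an empty list instead of a dictionary).

```python
import json
from collections import Counter

# Hypothetical two-article sample shaped like the abstract's description.
# All values (including the publishDate format) are invented for illustration.
SAMPLE = json.dumps([
    {
        "id": 1,
        "title": "Example title",
        "lead": "<b>Example</b> lead",
        "url": "https://example.org/1",
        "tags": [{"domain_id": "d1", "id": "t1", "lang": "et",
                  "tag": "Kert Kingo", "translitted_name": "kert-kingo"}],
        "rawBody": "<p>Body</p>",
        "bodyText": "Body",
        "publishDate": "2019-05-01 12:00:00",
        "categoryPrimary": {"categoryId": 7, "categoryName": "World",
                            "channelId": 3},
        "channelLanguage": "et",
        "categoryLanguage": None,
        "commentCount": 4,
        "relatedArticles": [2],
    },
    {
        "id": 2,
        "title": "Article without tags",
        "lead": "",
        "url": "https://example.org/2",
        "tags": None,              # tags may be None
        "rawBody": "",
        "bodyText": "",
        "publishDate": "2019-05-02 08:30:00",
        "categoryPrimary": [],     # categoryPrimary may be an empty list
        "channelLanguage": None,
        "categoryLanguage": None,
        "commentCount": 0,
        "relatedArticles": [],
    },
])


def tag_names(article):
    """Return the tag strings of one article; an absent or None `tags` gives []."""
    return [t["tag"] for t in (article.get("tags") or [])]


def category_name(article):
    """Return the primary category name, or None if categoryPrimary is not a dict."""
    primary = article.get("categoryPrimary")
    return primary.get("categoryName") if isinstance(primary, dict) else None


def summarise(articles):
    """Count articles, tagged articles, and articles per primary category name."""
    return {
        "articles": len(articles),
        "tagged": sum(1 for a in articles if tag_names(a)),
        "categories": Counter(category_name(a) for a in articles),
    }


if __name__ == "__main__":
    articles = json.loads(SAMPLE)
    print(summarise(articles))
```

	To try this on the real data, one would presumably replace json.loads(SAMPLE) with json.load on one of the files listed in the abstract (for example articles_2019_May_ver2.json, opened with encoding="utf-8"); whether the files load cleanly in one pass is an assumption that has not been checked here.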

8	The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.3
	Ljubešić, Nikola; Krsnik, Luka. - : Jožef Stefan Institute, 2021
	BASE

9	Corpus of Written Standard Slovene Gigafida 2.0
	Krek, Simon; Erjavec, Tomaž; Repar, Andraž. - : Centre for Language Resources and Technologies, University of Ljubljana, 2021
	BASE

10	Die leksikografiese bewerking van verwantskapsterme in Sepedi
	D.J. Prinsloo
	In: Lexikos, Vol 22, Pp 261-271 (2021) (2021)
	BASE

11	Lexical explorer: extending access to the database for spoken German for user-specific purposes
	Batinić, Dolores
	In: Corpora. - Edinburgh : Univ. Press 15 (2020) 1, 55-76
	BLLDB

12	Guidelines for linguistic annotation of modern French (16th-18th c.) ; Manuel d'annotation linguistique pour le français moderne (XVIe -XVIIIe siècles)
	Gabay, Simon; Camps, Jean-Baptiste; Clérice, Thibault
	In: https://hal.archives-ouvertes.fr/hal-02571190 ; 2020 (2020)
	BASE

13	Standardizing linguistic data: method and tools for annotating (pre-orthographic) French ; Standardiser les données linguistiques: méthodes et outils pour l'annotation du français (pré-orthographique)
	Gabay, Simon; Clérice, Thibault; Camps, Jean-Baptiste...
	In: Proceedings of the 2nd International Digital Tools & Uses Congress (DTUC '20) ; https://hal.archives-ouvertes.fr/hal-03018381 ; Proceedings of the 2nd International Digital Tools & Uses Congress (DTUC '20), Oct 2020, Hammamet, Tunisia. ⟨10.1145/3423603.3423996⟩ (2020)
	BASE

14	Texte électronique enrichi par lemmatisation et étiquetage morphosyntaxique, Lais, Testament, Poésies diverses de François Villon, http://www.atilf.fr/dmf/VillonAgregation/
	Jézéquel, Jean-Michel; Bazin-Tacchella, Sylvie; Souvay, Gilles
	In: https://hal.archives-ouvertes.fr/hal-02978724 ; 2020 (2020)
	BASE

15	CORPUS17: a philological corpus for 17th c. French ; CORPUS17: un corpus philologique pour le 17ème siècle français
	Gabay, Simon; Bartz, Alexandre; Deguin, Yohann
	In: Proceedings of the 2nd International Digital Tools & Uses Congress (DTUC ’20) ; https://hal.archives-ouvertes.fr/hal-03041871 ; Proceedings of the 2nd International Digital Tools & Uses Congress (DTUC ’20), Oct 2020, Hammamet, Tunisia. ⟨10.1145/3423603.3424002⟩ (2020)
	BASE

16	A Critical Evaluation of Three Sesotho Dictionaries
	Setaka, Mmasibidi; Prinsloo, D.J.
	In: Lexikos; Vol. 30 (2020) ; 2224-0039 (2020)
	BASE

17	The CLASSLA-StanfordNLP model for lemmatisation of standard Macedonian 1.0
	Ljubešić, Nikola; Zdravkova, Katerina; Erjavec, Tomaž. - : Jožef Stefan Institute, 2020
	BASE

18	Spoken Torlak dialect corpus 1.0 (transcription)
	Vuković, Teodora. - : Slavisches Seminar, University of Zurich, 2020
	BASE

19	Annotated Corpus of Pre-Standardized Balkan Slavic Literature
	Šimko, Ivan. - : Slavic Seminary, University of Zurich, 2020
	BASE

20	The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.1
	Ljubešić, Nikola. - : Jožef Stefan Institute, 2020
	BASE

