List of Corpora

Name Size Description Language ELRA Details
2006 CoNLL Shared Task - Ten Languages 85.2 Mb 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006… ELRA-W0086 Details
2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish 45 Mb 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish consists of dependency treebanks in four languages used as pa… ELRA-W0121 Details
2007 CoNLL Shared Task - Greek, Hungarian & Italian 18 Mb 2007 CoNLL Shared Task - Greek, Hungarian & Italian consists of dependency treebanks in three languages used as part of… ELRA-W0122 Details
Al-Hayat Arabic Corpus 1.1 Gb The corpus was developed in the course of a research project at the University of Essex, in collaboration with the Open… ELRA-W0030 Details
Amaryllis Corpus - Evaluation Package 505 Mb Launched at the end of 1995, the AMARYLLIS project aimed at evaluating information retrieval software for French text c… ELRA-W0029 Details
Amharic-English bilingual corpus 15 Mb The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in translite… ELRA-W0074 Details
An-Nahar Newspaper Text Corpus 794 Mb The An-Nahar Lebanon Newspaper Text Corpus comprises articles in standard Arabic from 1995 to 2000 (6 years) stored as … ELRA-W0027 Details
Arbobanko (Esperanto Treebank) 12 Mb The Arbobanko (Esperanto Treebank) is a 52,000 token dependency treebank of Esperanto with texts from the MONATO news m… ELRA-W0129 Details
Arboretum treebank 26 Mb The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences, taken from Korp… ELRA-W0084 Details
ARCADE/ROMANSEVAL corpus 63 Mb The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions: ARCADE, an exercise on … ELRA-W0018 Details
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version 23 Mb This "scientific" corpus of modern French was produced by the University of Nantes (France) within the European Commiss… ELRA-W0025-02 Details
Bilingual Bulgarian-English corpus from the 2018 Proposal for a National Climate Change Adaptation Strategy and Action Plan from the website of the Bulgarian Ministry of Environment and Water (Processed) 12 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0263 Details
Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0173 Details
Bilingual collection of documents about the Cyprus Problem (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0132 Details
Bilingual collection of reports of the Greek Public Power Corporation (Processed) 13 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0244 Details
Bilingual Croatian-English Parallel Corpus (Processed) 18 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0204 Details
Bilingual documents Bulgarian-English in the field of ICT and Transport (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0133 Details
Bilingual documents Bulgarian-English in the field of open data, broadband and information society (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0134 Details
Bilingual documents Bulgarian-English in the field of transport (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0161 Details
Bilingual hr-en parallel corpus from Croatian Mine Action website (Processed) 12 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0131 Details
Bilingual hr-en parallel corpus from Croatian National Bank website (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0226 Details
Bilingual hr-en parallel corpus from the Journal of the Croatian Association of Civil Engineers website (Processed) 12 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0273 Details
Bilingual hr-en parallel corpus from the National and University Library in Zagreb website (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0135 Details
Bilingual resource with Bulgarian strategic documents in the field of innovations and digital growth (Bulgarian - English) (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0153 Details
Bilingual resource with Bulgarian strategic documents in the field of telecommunications and broadband (Bulgarian - English) (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0171 Details
BMI Brochures 2011-2015 (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0200 Details
BMI Brochures and Website 2016 (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0199 Details
BMVI Publications (Processed) 5 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0197 Details
BMVI Website (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0198 Details
Catalan Corpus of News Articles 645 Mb The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles … ELRA-W0047 Details
Catalan-Spanish Parallel Corpus 686 Mb This corpus contains more than 100 million words and it contains 10 years of bilingual articles from “El Periódico de C… ELRA-W0053 Details
Central Statistical Office Dataset (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0174 Details
Chinese-Vietnamese Parallel Corpus 74 Mb The Chinese-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sente… ELRA-W0312 Details
CINTIL-DeepBank 213 Mb The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical r… ELRA-W0062 Details
CINTIL-DependencyBank 1.4 Mb The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency g… ELRA-W0061 Details
CINTIL-PropBank 3.6 Mb The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, compos… ELRA-W0056 Details
CINTIL-TreeBank 3.1 Mb The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 11… ELRA-W0055 Details
Civil Aviation Regulations (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0186 Details
Compendium The Social Insurance Institution (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0225 Details
Convention against Torture and Other Cruel, Inhuman or Degrading Treatment or Punishment - United Nations (French-English-Greek) (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0309 Details
Convention on the transfer of sentenced persons (English - Greek) (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0196 Details
Corpus of Contemporaneous Spanish Novels 4.8 Mb This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporaneous aut… ELRA-W0041 Details
Corpus of Icelandic texts from the Central Bank of Iceland (Processed) 33 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0298 Details
Corpus of State-related content from the Latvian Web (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0169 Details
Corpus on Finance and Economics from Bank of Latvia (Processed) 6 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0216 Details
CRATER 2 Corpus 359 Mb The CRATER corpus was built upon the foundations of an earlier project, ET10/63, which was funded in the final phase of… ELRA-W0033 Details
CRATER corpus 276 Mb The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-Frenc… ELRA-W0003 Details
Croatian-English corpus with Acts on Biological and Landscape Diversity and Environmental Protection (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0142 Details
Croatian-English corpus with statistical reports and studies from the Croatian Bureau of Statistics website (Processed) 9 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0264 Details
Croatian-English corpus with studies on the challenges to the Croatian Accession to the European Union from the Croatian Institute of Public Finance website (Processed) 9 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0266 Details
Croatian-English corpus with the Rural Development Programme for the Period 2014-2020 from the Croatian Rural Development Programme website (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0295 Details
Croatian-English parallel corpus from the website of the Croatian Journal of Fisheries (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0294 Details
Croatian-English parallel corpus from the website of the Embassy of Finland, Zagreb (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0292 Details
Croatian-English parallel corpus from the website of the Government Office for Cooperation with NGOs (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0291 Details
Croatian-English parallel corpus from the website of the Ministry of Foreign and European Affairs, Republic of Croatia (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0293 Details
DA-EN Danish Ministry of Higher Education and Science 2 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0157 Details
DA-EN Danish Ministry of Higher Education and Science 3 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0155 Details
DA-EN Danish Ministry of Higher Education and Science 4 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0172 Details
DA-EN Danish Ministry of Higher Education and Science (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0166 Details
Danish Propbank 18 Mb The Danish Propbank (DPB) is a multi-layer treebank, annotated not only with morphosyntactic, but also with semantic in… ELRA-W0117 Details
deL1L2IM corpus 2.8 Mb The deL1L2IM corpus, created between May and August 2012 and last updated in August 2014, has been collected within the… ELRA-W0083 Details
Dutch PAROLE Distributable Corpus 70 Mb The Dutch PAROLE Distributable Corpus is a 3 million words selection from the 20 million words Dutch PAROLE Reference c… ELRA-W0019 Details
ECI-ELSNET Italian & German tagged sub-corpus 3 Mb The objective is to provide a small but fine-grained morphosyntactically tagged corpus, 50,000 running words for each o… ELRA-W0005 Details
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) 655 Mb The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual cor… ELRA-W0004 Details
ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) – set 1 802 Mb The European Comparable and Parallel Corpora of Parliamentary Speeches Archive (ECPC), compiled at the Universitat Jaum… ELRA-W0128 Details
EJTN Handbook (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0163 Details
Ema-lon Manipuri Corpus (including word embedding and language model) The Ema-lon Manipuri Corpus consists of a set of resources for Manipuri language (locally known as Meiteilon) for the p… ELRA-W0316 Details
Employment in Poland 2009 report in EN-PL (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0242 Details
English-Chinese-Vietnamese Trilingual Parallel Corpus 6 Mb The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The cor… ELRA-W0314 Details
English - Croatian parallel corpus from texts of the Swedish Crime Victim Compensation and Support Authority (Brottsoffermyndigheten) web site (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0238 Details
English-Danish Parallel corpus from Tatoeba project (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0214 Details
English-Estonian corpus from Finnish Information Bank (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0218 Details
English-Estonian Parallel corpus compiled from translated annual reports from Estonian Academy of Sciences 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0265 Details
English-Finnish corpus from Finnish Information Bank (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0217 Details
English-Icelandic parallel corpus from Statistics Iceland (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0219 Details
English-Nepali Parallel Corpus 47 Mb The Nepali Monolingual written corpus is one of the 3 resources that constitute the Nepali National Corpus. The Nepali … ELRA-W0077 Details
English-Norwegian parallel corpus from Forbruker Europa, 2017 release (Processed) 6 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0195 Details
English-Persian parallel corpus 287 Mb The English-Persian parallel corpus contains more than 200,000 aligned sentences across a variety of text types from th… ELRA-W0118 Details
English-Persian parallel Corpus 40 Mb Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and … ELRA-W0051 Details
ENGLISH/POLISH PHRASE BOOK FOR ADMINISTRATIVE STAFF of LOCAL GOVERNMENT UNITS (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0227 Details
English-Slovak corpus of annual reports from the Slovak National Centre for Human Rights website (Processed) 5 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0137 Details
English-Slovak corpus of annual reports on immigration and asylum policies from the EMN National Contact Point for the Slovak Republic website (Processed) 6 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0136 Details
English-Slovak parallel corpus of texts from The Ministry of Culture of the Slovak Republic (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0188 Details
English-Slovak parallel corpus of texts from The Ministry of Justice of the Slovak Republic (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0189 Details
English-Swedish corpus from Finnish Information Bank (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0222 Details
English-Swedish parallel corpus from Annual Reports of the Swedish Pension System (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0268 Details
English - Swedish parallel corpus from texts of the Swedish Crime Victim Compensation and Support Authority (Brottsoffermyndigheten) web site (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0237 Details
English-Swedish parallel corpus from the Annual Overview of Sweden’s Official aid Agency SIDA Activities (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0269 Details
English-Swedish parallel corpus from the translation of 'Sweden a Pocket Guide' book (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0130 Details
English-Swedish parallel corpus from the web site of the Swedish Migration Board - Migrationsverket (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0239 Details
English-Swedish parallel texts from The Swedish Agency for Economic and Regional Growth - Tillväxtverket (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0240 Details
English-Vietnamese Parallel Corpus 166 Mb This is a corpus of 500,000 English-Vietnamese sentence pairs, built to develop SMT (Statistical Machine Translation) s… ELRA-W0124 Details
English-Vietnamese Parallel Corpus 397 Mb The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sen… ELRA-W0311 Details
EUIPO - IP case law French-English (Processed) 56 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0138 Details
EUIPO - IP case law German-English (Processed) 154 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0140 Details
EUIPO - IP case law Italian-English (Processed) 22 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0141 Details
EUIPO - IP case law Spanish-English (Processed) 74 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0139 Details
EUIPO - list of goods and services French and English (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0149 Details
EUIPO - list of goods and services German and English (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0143 Details
EUIPO - list of goods and services German and French (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0145 Details
EUIPO - list of goods and services German and Italian (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0146 Details
EUIPO - list of goods and services German and Spanish (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0144 Details
EUIPO - list of goods and services Italian and English (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0150 Details
EUIPO - list of goods and services Italian and French (Processed) 11 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0152 Details
EUIPO - list of goods and services Italian and Spanish (Processed) 11 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0151 Details
EUIPO - list of goods and services Spanish and English (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0147 Details
EUIPO - list of goods and services Spanish and French (Processed) 11 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0148 Details
EUROPARL Corpus Parallel Corpora: Portuguese-English 2.3 Gb The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora), was extracted from the proceedings of the Eur… ELRA-W0090 Details
Expression of interest (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0209 Details
Financial Stability Reports from the National Bank of Poland (2013-14) (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0228 Details
Financial Stability Reports from the National Bank of Poland (2015-16) (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0229 Details
GeFRePaC - German French Reciprocal Parallel Corpus 1.3 Gb The German-French Reciprocal Parallel Corpus (GeFRePaC) was produced by the Multilinguale Forschung/Multilingual Resear… ELRA-W0031 Details
General Romanian-English bilingual corpus (Processed) 75 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0193 Details
Greek anti-corruption legislation and National Anti-Corruption Plan (greek-english) (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0164 Details
Greek-English parallel corpus from the website of the Prime Minister of the Hellenic Republic (Processed) 5 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0272 Details
Hallituskausi 2007-2011 -- Finnish-English Translation Memory (Processed) 23 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0220 Details
Hallituskausi 2011-2015 -- Finnish-English Translation Memory (Processed) 14 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0221 Details
Hellenic Ministry of Foreign Affairs Greek-English announcements corpus (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0271 Details
Helsinki Corpus of Swahili 1117 Mb This is a text corpus of Swahili language of 25 million words, annotated for part-of-speech, morphology and syntax. The… ELRA-W0119 Details
ICE-GB (British English component of the International Corpus of English) 97 Mb ICE-GB is the British component of the International Corpus of English (ICE). ICE began in 1990 with the primary aim of… ELRA-W0021 Details
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) 27 Mb The ILSP/ELEFTHEROTYPIA Corpus contains approximately 3 million words classified and annotated according to the common … ELRA-W0022 Details
International Agreements (Processed) 20 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0158 Details
Italian Syntactic-Semantic Treebank (ISST) 90 Mb ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in … ELRA-W0044 Details
Karl May Korpus (KMK) 77 Mb The "Karl-May-Korpus" is a monolingual German corpus, available in an SGML-tagged ASCII text format. It contains the wo… ELRA-W0016 Details
Khresmoi manually annotated reference corpus 1.3 Gb The Manually Annotated Reference Corpus is a collection of English web documents annotated with key entities (such as d… ELRA-W0081 Details
Korean-Vietnamese Parallel Corpus 62 Mb The Korean-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per senten… ELRA-W0313 Details
Laws of Malta (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0234 Details
Legal texts from Estonian Ministry of Justice (Processed) 23 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0167 Details
"Le Monde Diplomatique" Arabic tagged corpus 59 Mb This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see al… ELRA-W0049 Details
"Le Monde Diplomatique" Text corpus in Arabic 57 Mb Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTM… ELRA-W0036-04 Details
"Le Monde Diplomatique" Text corpus in English 28 Mb Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HT… ELRA-W0036-03 Details
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 233 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. … ELRA-W0036-01 Details
"Le Monde Diplomatique" Text corpus in French - archives from 1999 90 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTM… ELRA-W0036-02 Details
Letter of rights for persons arrested and or detained (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0308 Details
Letter of rights for persons arrested on the basis of a European Arrest Warrant (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0301 Details
LT Corpus 43 Mb The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. T… ELRA-W0059 Details
Luxembourg Museum Websites (de-en) (Processed) 45 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0201 Details
Macroeconomic Developments (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0207 Details
Malta Government Gazette (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0233 Details
Maltese-English website parallel corpus (Processed) 10 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0232 Details
Memorandum for a ESM programme (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0210 Details
Methodological Reconciliation (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0208 Details
MLCC Multilingual and Parallel Corpora 915 Mb The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different langu… ELRA-W0023 Details
Modern French Corpus including Anaphors Tagging 13 Mb The corpus that includes the tagging of the anaphors was created by the CRISTAL-GRESEC (Stendhal-Grenoble 3 University,… ELRA-W0032 Details
Monolingual documents from the Government of Lithuania (Processed) 10 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0299 Details
Monolingual Greek corpus 5.1 Mb Monolingual Greek corpus of 1 million words. The corpus consists of articles written in 1996 from the Greek daily newsp… ELRA-W0014 Details
Monolingual Vietnamese Annotated Corpus 36 Mb The Monolingual Vietnamese Annotated Corpus consists of 100,000 sentences, manually annotated with word boundaries, POS… ELRA-W0310 Details
MTP Annotated German corpus - tagged version 35 Mb This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP… ELRA-W0008-02 Details
MTP Annotated German corpus - untagged version 283 Mb This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP… ELRA-W0008-01 Details
MULTEXT JOC Corpus 114 Mb This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-… ELRA-W0017 Details
Multilingual Corpus 9.9 Mb Multilingual parallel corpus produced by Kaist Korterm containing 60 000 expressions in Korean, Chinese and English. ELRA-W0035 Details
National Health Fund Dataset (Processed) 5 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0178 Details
Natolin European Centre Dataset (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0176 Details
NE3L named entities Arabic corpus 3 Mb The NE3L project (Named Entities 3 Languages) consisted of annotating several corpora in different languages with nam… ELRA-W0078 Details
NE3L named entities Chinese corpus 4.8 Mb The NE3L project (Named Entities 3 Languages) consisted of annotating several corpora in different languages with nam… ELRA-W0079 Details
NE3L named entities Russian corpus 2.7 Mb The NE3L project (Named Entities 3 Languages) consisted of annotating several corpora in different languages with nam… ELRA-W0080 Details
NEMLAR Written Corpus 136 Mb This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the sa… ELRA-W0042 Details
Nepali Monolingual written corpus 683 Mb The Nepali Monolingual written corpus is one of the 3 resources that constitute the Nepali National Corpus. The Nepali … ELRA-W0076 Details
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) 1 Mb Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed by a c… ELRA-W0127 Details
NPChunks 412 Kb NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randoml… ELRA-W0089 Details
NUM 5M Mongolian written corpus 65 Mb This is a corpus of Mongolian text mostly from domains like online or printed daily newspapers, literature, and laws. Th… ELRA-W0120 Details
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain 11 Mb The PANACEA English-French and English-Greek parallel corpus was acquired in the framework of the PANACEA project (Plat… ELRA-W0057 Details
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain 16 Mb The PANACEA English-French and English-Greek parallel corpus was acquired in the framework of the PANACEA project (Plat… ELRA-W0058 Details
PANACEA Environment English monolingual corpus 2.7 Gb The PANACEA Environment English monolingual corpus was acquired in the framework of the PANACEA project (Platform for A… ELRA-W0063 Details
PANACEA Environment French monolingual corpus 2.1 Gb The PANACEA Environment French monolingual corpus was acquired in the framework of the PANACEA project (Platform for Au… ELRA-W0065 Details
PANACEA Environment Greek monolingual corpus 2 Gb The PANACEA Environment Greek monolingual corpus was acquired in the framework of the PANACEA project (Platform for Aut… ELRA-W0067 Details
PANACEA Environment Italian monolingual corpus 1.8 Gb The PANACEA Environment Italian monolingual corpus was acquired in the framework of the PANACEA project (Platform for A… ELRA-W0069 Details
PANACEA Environment Spanish monolingual corpus 2.3 Gb The PANACEA Environment Spanish monolingual corpus was acquired in the framework of the PANACEA project (Platform for A… ELRA-W0071 Details
PANACEA Labour English monolingual corpus 1.6 Gb The PANACEA Labour English monolingual corpus was acquired in the framework of the PANACEA project (Platform for Automa… ELRA-W0064 Details
PANACEA Labour French monolingual corpus 2.5 Gb The PANACEA Labour French monolingual corpus was acquired in the framework of the PANACEA project (Platform for Automat… ELRA-W0066 Details
PANACEA Labour Greek monolingual corpus 1.4 Gb The PANACEA Labour Greek monolingual corpus was acquired in the framework of the PANACEA project (Platform for Automati… ELRA-W0068 Details
PANACEA Labour Italian monolingual corpus 2.4 Gb The PANACEA Labour Italian monolingual corpus was acquired in the framework of the PANACEA project (Platform for Automa… ELRA-W0070 Details
PANACEA Labour Spanish monolingual corpus 1.9 Gb The PANACEA Labour Spanish monolingual corpus was acquired in the framework of the PANACEA project (Platform for Automa… ELRA-W0072 Details
Parallel corpus (Bulgarian - English) in the public administration domain (Processed) 9 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0211 Details
Parallel corpus (en-pl) from the Export Promotion Portal of Poland (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0247 Details
Parallel corpus from Bank of Estonia (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0162 Details
Parallel corpus from Estonian Cabinet of Ministers (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0154 Details
Parallel corpus from Estonian Ministry of Foreign Affairs (Processed) 12 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0168 Details
Parallel corpus from Parliament of Estonia (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0215 Details
Parallel corpus from Social Insurance Agency -- Försäkringskassan (Sweden) (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0213 Details
Parallel corpus from the website of the Chancellery of the Prime Minister of Poland (Processed) 6 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0249 Details
Parallel Corpus from the Web Site of the MFA of Latvia (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0159 Details
Parallel corpus (Greek - English) in the law domain (Processed) (Part1) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0205 Details
Parallel corpus (Greek - English) in the public administration domain (Processed) 14 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0203 Details
Parallel corpus (Polish - English) from the website of the Polish Investment and Trade Agency (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0212 Details
Parallel Global Voices (Bulgarian - English) (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0297 Details
Parallel Global Voices (English - Polish) (Processed) 28 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0241 Details
Parallel Global Voices (Greek - English) (Processed) 43 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0202 Details
Parallel texts from Swedish Labour market agency. Part 2 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0300 Details
Parallel texts from Swedish Labour market agency (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0302 Details
Parallel texts from Swedish National Food Agency (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0305 Details
Parallel texts from Swedish Social Security Authority (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0303 Details
Parallel texts from Swedish Work environment Authority (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0304 Details
Parallel texts from the Swedish Competition Authority - Konkurrensverket (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0231 Details
PAROLE French Corpus 349 Mb The PAROLE French corpus contains the following data: Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingual an… ELRA-W0020 Details
PAROLE Irish Distributable Corpus 25 Mb The PAROLE Irish Distributable Corpus consists of over 8 million words (a subset of the 15+ million words Irish Referen… ELRA-W0026 Details
PAROLE Italian Corpus 44 Mb The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: • newspapers: 2,179,800 words… ELRA-W0043 Details
PAROLE Portuguese Corpus - complete version 57 Mb The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Mediu… ELRA-W0024-01 Details
Persian 1984 corpus (Multext-East framework) 5.9 Mb This corpus contains the Persian (Farsi) translation of a part of the novel “1984” (G. Orwell) annotated in the Multext… ELRA-W0054 Details
Persian Ezafe Construction Dataset The Persian Ezafe Construction Dataset includes gold Ezafe tags in almost 30 thousand Persian sentences. The sentences … ELRA-W0315 Details
PKN Orlen Dataset (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0175 Details
Polish-English parallel corpus from the website "Business in Poland" (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0274 Details
Polish-English parallel corpus from the website "geoportal.gov.pl" (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0285 Details
Polish-English parallel corpus from the website of Public Employment Services in Poland (member of EURES network) (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0259 Details
Polish-English parallel corpus from the website of the Central Statistical Office (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0279 Details
Polish-English parallel corpus from the website of the Citizens Information Board (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0251 Details
Polish-English parallel corpus from the website of the ING Polish Art Foundation (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0261 Details
Polish-English parallel corpus from the website of the Institute of Mathematics of the Polish Academy of Sciences (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0283 Details
Polish-English parallel corpus from the website of the Ministry of Agriculture and Rural Development (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0252 Details
Polish-English parallel corpus from the website of the Ministry of Culture and National Heritage (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0257 Details
Polish-English parallel corpus from the website of the Ministry of Development (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0253 Details
Polish-English parallel corpus from the website of the Ministry of Digital Affairs (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0284 Details
Polish-English parallel corpus from the website of the Ministry of Digitization (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0255 Details
Polish-English parallel corpus from the website of the Ministry of Foreign Affairs (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0256 Details
Polish-English parallel corpus from the website of the Ministry of Justice (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0254 Details
Polish-English parallel corpus from the website of the Ministry of National Defence (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0250 Details
Polish-English parallel corpus from the website of the Ministry of Regional Development (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0282 Details
Polish-English parallel corpus from the website of the Ministry of Science and Higher Education (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0286 Details
Polish-English parallel corpus from the website of the Ministry of the Interior and Administration (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0258 Details
Polish-English parallel corpus from the website of the National Audiovisual Institute (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0289 Details
Polish-English parallel corpus from the website of the National Centre for Nuclear Research (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0278 Details
Polish-English parallel corpus from the website of the National Centre for Research and Development (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0280 Details
Polish-English parallel corpus from the website of the National Digital Archives (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0290 Details
Polish-English parallel corpus from the website of the National Science Centre (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0260 Details
Polish-English parallel corpus from the website of the National Security Bureau (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0262 Details
Polish-English parallel corpus from the website of the Office of the Commissioner for Human Rights (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0281 Details
Polish-English parallel corpus from the website of the Polish Tourism Organisation (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0276 Details
Polish-English parallel corpus from the website of the State Marine Accident Investigation Commission (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0288 Details
Polish-English parallel corpus from the website of the U.S. EMBASSY and CONSULATE IN POLAND (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0277 Details
Polish-English parallel corpus from the website "Polish Aid" (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0275 Details
Polish-English parallel corpus from the website "Science in Poland" (Processed) 18 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0287 Details
Polish Food 4 & Food Policy Dataset (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0179 Details
Polish Food Dataset 2 (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0180 Details
Polish Food DataSet 3 (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0181 Details
Polish Food Dataset (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0177 Details
Polish Ministry of Foreign Affairs Historical Dataset (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0183 Details
Polish Ministry of Foreign Affairs Regional Dataset (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0182 Details
Polish Ministry of Foreign Affairs reports in EN and PL (Processed) 3 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0235 Details
Polish Ministry of Foreign Affairs Youth 2011 Report (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0184 Details
Portuguese-English bilingual corpus from Legislation concerning the Portuguese Parliament (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0245 Details
Portuguese-English bilingual corpus from the Portuguese Constitution (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0246 Details
PRESS 65 6.3 Mb Språkdata has made available the first of its many Swedish corpora, PRESS 65. It consists of one million running words … ELRA-W0010 Details
PTPARL Corpus 25 Mb The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The … ELRA-W0060 Details
Public Procurement Dataset 1 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0187 Details
Public Procurement Dataset 2 (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0185 Details
Quaero Old Press Extended Named Entity corpus 6.8 Gb The Quaero Old Press Extended Named Entity corpus consists of the manual annotation of 76 newspaper issues published in… ELRA-W0073 Details
Qualified POS Tagged Corpus 66 Mb Monolingual corpus in .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. Th… ELRA-W0034 Details
Quarterly Reports of the Parliamentary Budget Office (Hellenic Parliament) (Processed) 15 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0243 Details
ROCO Romanian journalistic corpus 729 Mb ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. … ELRA-W0085 Details
Romanian-English corpus with studies, reports and statistical data in the field of culture from the National Institute for Cultural Research and Training website (Processed) 8 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0270 Details
Romanian - English literature corpus (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0192 Details
Romanian – English New Criminal Procedure Code (Processed) 4 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0170 Details
Romanian - English news corpus (Processed) 63 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0194 Details
Romanian Ombudsman archive (Processed) 5 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0206 Details
ROMBAC - Romanian balanced corpus 1.1 Gb ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, me… ELRA-W0088 Details
Secretariat-General parallel corpus SL-EN and EN-SL (part 1) (Processed) 34 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0190 Details
Secretariat-General parallel corpus SL-EN and EN-SL (part 2) (Processed) 39 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0191 Details
SIP Publications (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0306 Details
Slovenian-English corpus with statistical reports from the Statistical Office of the Republic of Slovenia website (Processed) 9 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0267 Details
Spanish-English website parallel corpus (Processed) 9 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0248 Details
Tagged text in French (MEMODATA) with rules of morphological disambiguation 3.1 Gb More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged … ELRA-W0012 Details
Tagged text in French (MEMODATA) with typographic tags 247 Mb More than 170 books (classical novels, legal texts...) are tagged with typographic tags. A tagged corpus of 50 books is… ELRA-W0011 Details
Text corpus of "Le Monde" 3.9 Gb Electronic archiving of "Le Monde" articles started on 1 January 1987. Some 200 articles are added every day, and as of… ELRA-W0015 Details
The CINTIL Corpus – International Corpus of Portuguese 20 Mb CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portugue… ELRA-W0050 Details
The Coimisineir Teanga Bilingual Corpus of Reference Documents (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0224 Details
The Coimisineir Teanga Bilingual Corpus of Reports and Press Releases (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0230 Details
The Croatian-English corpus with the nature protection strategy of Croatia (Processed) 1 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0296 Details
The EMILLE/CIIL Corpus 1.5 Gb The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen mo… ELRA-W0037 Details
The Gaois bilingual corpus of English-Irish legislation (Processed) 26 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0223 Details
The Lancaster Corpus of Mandarin Chinese (LCMC) 45 Mb The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora for moder… ELRA-W0039 Details
TRAD Arabic-English Mailing lists Parallel corpus - Development set 2 Mb This is a parallel corpus of 10,000 words in Arabic and a reference translation in English. The source texts are emails… ELRA-W0108 Details
TRAD Arabic-English Mailing lists Parallel corpus - Test set 2 Mb This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are email… ELRA-W0106 Details
TRAD Arabic-English Newspaper Parallel corpus - Test set 1 2 Mb This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are artic… ELRA-W0099 Details
TRAD Arabic-English Parallel corpus of transcribed Broadcast News Speech 2 Mb This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are trans… ELRA-W0102 Details
TRAD Arabic-English Web domain (blogs) Parallel corpus 2 Mb This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are blog … ELRA-W0104 Details
TRAD Arabic-French Mailing lists Parallel corpus - Development set 1 Mb This is a parallel corpus of 10,000 words in Arabic and a reference translation in French. The source texts are emails … ELRA-W0107 Details
TRAD Arabic-French Mailing lists Parallel corpus - Test set 2 Mb This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are emails… ELRA-W0105 Details
TRAD Arabic-French Newspaper Parallel corpus - Test set 1 2 Mb This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are articl… ELRA-W0098 Details
TRAD Arabic-French Newspaper Parallel corpus - Test set 2 2 Mb This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in French. The source texts are articl… ELRA-W0100 Details
TRAD Arabic-French Parallel corpus of transcribed Broadcast News Speech 2 Mb This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are transc… ELRA-W0101 Details
TRAD Arabic-French Web domain (blogs) Parallel corpus 2 Mb This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are blog a… ELRA-W0103 Details
TRAD Chinese-English Email Parallel corpus – Development Set 1 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in E… ELRA-W0113 Details
TRAD Chinese-English Email Parallel corpus – Test Set 1 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0115 Details
TRAD Chinese-English News Articles Parallel corpus 1 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0112 Details
TRAD Chinese-English Web domain (blogs) Parallel corpus 1 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0110 Details
TRAD Chinese-French Email Parallel corpus – Development Set 2 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in F… ELRA-W0114 Details
TRAD Chinese-French Email Parallel corpus – Test Set 2 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0116 Details
TRAD Chinese-French News Articles Parallel corpus 2 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0111 Details
TRAD Chinese-French Web domain (blogs) Parallel corpus 2 Mb This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in … ELRA-W0109 Details
TRAD Pashto-English News Articles Parallel corpus 602 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The… ELRA-W0097 Details
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data 575 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 bro… ELRA-W0095 Details
TRAD Pashto-French News Articles Parallel corpus 970 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … ELRA-W0096 Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data 29 Mb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … ELRA-W0094 Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data 473 Mb The corpus consists of the transcription of 106 hours of recordings in Pashto translated into French. The transcription… ELRA-W0093 Details
TRAD Pashto Monolingual text Corpus 2.2 Gb This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different b… ELRA-W0092 Details
Training and test data for Arabizi detection and transliteration 1 Mb The dataset is composed of two distinct resources: 1) A collection of mixed English and Arabizi text intended to train a… ELRA-W0126 Details
Translation memories from The Ministry of Foreign Affairs of Norway (Processed) 620 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0156 Details
Translation memory from Swedish National Audit Office (NAO) - Riksrevisionen (Processed) 12 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0236 Details
Translations of Lithuanian legislation from Seimas of the Republic of Lithuania (Processed) 70 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0165 Details
Trilingual Documents related to International Judicial Cooperation in Civil Matters (Greek-English-French) (Processed) 2 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0307 Details
TSNLP (Test Suites for NLP Testing) 4.5 Mb The TSNLP project (LRE 62-089) has produced a database of test suites for English, French and German containing over 4,… ELRA-W0013 Details
Venice Italian Treebank (VIT) 149 Mb The VIT (Venice Italian Treebank) is the result of a collaboration of people working at the Laboratory of Computationa… ELRA-W0040 Details
Website of the President of the Republic of Lithuania (Processed) 7 Mb This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Eur… ELRA-W0160 Details
Wolverhampton Business English Corpus 118 Mb The WBE was created by the Computational Linguistics Group at the University of Wolverhampton through funding from ELRA i… ELRA-W0028 Details