List of corpora

Name Size Description Language ELRA Details Your selection
"Le Monde Diplomatique" Arabic tagged corpus 59 Mb This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see ... Arabic (ara) ELRA-W0049 Details

"Le Monde Diplomatique" Arabic tagged corpus

Name "Le Monde Diplomatique" Arabic tagged corpus (ELRA-W0049)
URL http://catalog.elra.info/product_info.php?products_id=1096
Description This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text is associated with 3 files: raw text in Arabic, vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
Languages Arabic (ara)
"Le Monde Diplomatique" Text corpus in Arabic 57 Mb Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each H... Arabic (ara) ELRA-W0036-04 Details

"Le Monde Diplomatique" Text corpus in Arabic

Name "Le Monde Diplomatique" Text corpus in Arabic (ELRA-W0036-04)
URL http://catalog.elra.info/product_info.php?products_id=717
Description Electronic archive of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article.
Languages Arabic (ara)
"Le Monde Diplomatique" Text corpus in English 28 Mb Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each ... English (eng) ELRA-W0036-03 Details

"Le Monde Diplomatique" Text corpus in English

Name "Le Monde Diplomatique" Text corpus in English (ELRA-W0036-03)
URL http://catalog.elra.info/product_info.php?products_id=8
Description Electronic archive of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages English (eng)
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 233 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML... French (fre) ELRA-W0036-01 Details

"Le Monde Diplomatique" Text corpus in French - archives 1980-1998

Name "Le Monde Diplomatique" Text corpus in French - archives 1980-1998 (ELRA-W0036-01)
URL http://catalog.elra.info/product_info.php?products_id=7
Description Electronic archive of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. Each HTML file contains one article.
Languages French (fre)
"Le Monde Diplomatique" Text corpus in French - archives from 1999 90 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each H... French (fre) ELRA-W0036-02 Details

"Le Monde Diplomatique" Text corpus in French - archives from 1999

Name "Le Monde Diplomatique" Text corpus in French - archives from 1999 (ELRA-W0036-02)
URL http://catalog.elra.info/product_info.php?products_id=9
Description Electronic archive of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages French (fre)
2006 CoNLL Shared Task - Ten Languages 85.2 Mb 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 20... Turkish (tur); Bulgarian (bul); Dutch, ... ELRA-W0086 Details

2006 CoNLL Shared Task - Ten Languages

Name 2006 CoNLL Shared Task - Ten Languages (ELRA-W0086)
URL http://catalog.elra.info/product_info.php?products_id=1250
Description 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish. The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format.
Languages
  • Turkish (tur)
  • Bulgarian (bul)
  • Dutch, Flemish (dut)
  • German (ger)
  • Japanese (jpn)
  • Spanish, Castilian (spa)
  • Danish (dan)
  • Portuguese (por)
  • Swedish (swe)
  • Slovenian (slv)
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version 23 Mb Produced through funding from ELRA in the framework of the European Commission project LRsP&P... French (fre) ELRA-W0025-02 Details

A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version

Name A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version (ELRA-W0025-02)
URL http://catalog.elra.info/product_info.php?products_id=595
Description Produced through funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the corpus contains all articles published in La Recherche magazine in 1998, from issue 305 (January) to issue 315 (December), which amounts to 447,244 tokens and 30,238 types. Two versions are available: the raw data (XML format) and the complete version (XML and SGML formats).
Languages French (fre)
ARCADE/ROMANSEVAL corpus 63 Mb The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission... English (eng); French (fre); Italian (i... ELRA-W0018 Details

ARCADE/ROMANSEVAL corpus

Name ARCADE/ROMANSEVAL corpus (ELRA-W0018)
URL http://catalog.elra.info/product_info.php?products_id=535
Description The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether. It comprises: semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian; and word-level alignment of all the occurrences of the test words between French and English.
Languages
  • English (eng)
  • French (fre)
  • Italian (ita)
Al-Hayat Arabic Corpus 1.1 Gb The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for language engineering ... Arabic (ara) ELRA-W0030 Details

Al-Hayat Arabic Corpus

Name Al-Hayat Arabic Corpus (ELRA-W0030)
URL http://catalog.elra.info/product_info.php?products_id=632
Description The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for the development of language engineering applications.
Languages Arabic (ara)
Amaryllis Corpus - Evaluation Package 505 Mb AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Ag... French (fre) ELRA-W0029 Details

Amaryllis Corpus - Evaluation Package

Name Amaryllis Corpus - Evaluation Package (ELRA-W0029)
URL http://catalog.elra.info/product_info.php?products_id=626
Description AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT) to create document corpora, questions and answers, in the framework of the Action de Recherche Concertée (ARC A1, renamed Amaryllis - Access to text information in French), with the aim of producing work comparable to the United States TREC project. All corpora are structured as SGML files with ISO Latin character encoding.
Languages French (fre)
Amharic-English bilingual corpus 15 Mb The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transli... English (eng); Amharic (amh) ... ELRA-W0074 Details

Amharic-English bilingual corpus

Name Amharic-English bilingual corpus (ELRA-W0074)
URL http://catalog.elra.info/product_info.php?products_id=1215
Description The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus contains 232,653 words in Amharic and 291,701 in English.
Languages
  • English (eng)
  • Amharic (amh)
An-Nahar Newspaper Text Corpus 794 Mb The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML ... Arabic (ara) ELRA-W0027 Details

An-Nahar Newspaper Text Corpus

Name An-Nahar Newspaper Text Corpus (ELRA-W0027)
URL http://catalog.elra.info/product_info.php?products_id=767
Description The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45,000 articles and 24 million words.
Languages Arabic (ara)
Arboretum treebank 26 Mb The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists o... Danish (dan) ELRA-W0084 Details

Arboretum treebank

Name Arboretum treebank (ELRA-W0084)
URL http://catalog.elra.info/product_info.php?products_id=1248
Description The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens and there are ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats: (1) native dependency format (Constraint Grammar format); (2) dependency annotation converted to MALT XML format; (3) native constituent tree format (Cross-language VISL standard); (4) constituent format converted to TIGER XML.
Languages Danish (dan)
CINTIL-DeepBank 213 Mb The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical... Portuguese (por) ELRA-W0062 Details

CINTIL-DeepBank

Name CINTIL-DeepBank (ELRA-W0062)
URL http://catalog.elra.info/product_info.php?products_id=1181
Description The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-DependencyBank 1.4 Mb The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency... Portuguese (por) ELRA-W0061 Details

CINTIL-DependencyBank

Name CINTIL-DependencyBank (ELRA-W0061)
URL http://catalog.elra.info/product_info.php?products_id=1180
Description The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-PropBank 3.6 Mb The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, comp... Portuguese (por) ELRA-W0056 Details

CINTIL-PropBank

Name CINTIL-PropBank (ELRA-W0056)
URL http://catalog.elra.info/product_info.php?products_id=1176
Description The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-TreeBank 3.1 Mb The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and ... Portuguese (por) ELRA-W0055 Details

CINTIL-TreeBank

Name CINTIL-TreeBank (ELRA-W0055)
URL http://catalog.elra.info/product_info.php?products_id=1174
Description The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CRATER 2 Corpus 359 Mb The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. ... English (eng); French (fre); Spanish, C... ELRA-W0033 Details

CRATER 2 Corpus

Name CRATER 2 Corpus (ELRA-W0033)
URL http://catalog.elra.info/product_info.php?products_id=636
Description The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It consists of 1,500,000 tokens each for English and French and of 1,000,000 tokens for Spanish, with morphosyntactic annotations. CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003).
Languages
  • English (eng)
  • French (fre)
  • Spanish, Castilian (spa)
Catalan Corpus of News Articles 645 Mb The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These article... Catalan, Valencian (cat) ELRA-W0047 Details

Catalan Corpus of News Articles

Name Catalan Corpus of News Articles (ELRA-W0047)
URL http://catalog.elra.info/product_info.php?products_id=990
Description The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by quarter, with no chronological ordering within each group.
Languages Catalan, Valencian (cat)
Catalan-Spanish Parallel Corpus 686 Mb This corpus contains more than 100 million words and it contains 10 years of bilingual articles from El Periódico de ... Spanish, Castilian (spa); Catalan, Vale... ELRA-W0053 Details

Catalan-Spanish Parallel Corpus

Name Catalan-Spanish Parallel Corpus (ELRA-W0053)
URL http://catalog.elra.info/product_info.php?products_id=1122
Description This corpus contains more than 100 million words from 10 years of bilingual articles from El Periódico de Catalunya. The data are aligned at sentence level and stored in text files, one sentence per line. The data are provided in plain text, with no encoding whatsoever.
Languages
  • Spanish, Castilian (spa)
  • Catalan, Valencian (cat)
Corpus of Contemporaneous Spanish Novels 4.8 Mb This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporaneous a... Spanish, Castilian (spa) ELRA-W0041 Details

Corpus of Contemporaneous Spanish Novels

Name Corpus of Contemporaneous Spanish Novels (ELRA-W0041)
URL http://catalog.elra.info/product_info.php?products_id=847
Description This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporary author.
Languages Spanish, Castilian (spa)
Dutch PAROLE Distributable Corpus 70 Mb This Dutch corpus is a 3 million words selection built according to the specifications of the PAROLE project. Over 25... Dutch, Flemish (dut) ELRA-W0019 Details

Dutch PAROLE Distributable Corpus

Name Dutch PAROLE Distributable Corpus (ELRA-W0019)
URL http://catalog.elra.info/product_info.php?products_id=543
Description This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,000 words of corpus texts (with TEI markup suppressed) have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked.
Languages Dutch, Flemish (dut)
ECI-ELSNET Italian & German tagged sub-corpus 3 Mb The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. ... German (ger); Italian (ita) ... ELRA-W0005 Details

ECI-ELSNET Italian & German tagged sub-corpus

Name ECI-ELSNET Italian & German tagged sub-corpus (ELRA-W0005)
URL http://catalog.elra.info/product_info.php?products_id=86
Description The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It contains the following domains: Economy (17,000 words), Politics (14,000 words), Culture (18,000 words), Sports (9,000 words), Local Events (8,500 words).
Languages
  • German (ger)
  • Italian (ita)
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) 655 Mb Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese,... Turkish (tur); Albanian (alb); Bulgaria... ELRA-W0004 Details

ECI/MCI (European Corpus Initiative/Multilingual Corpus I)

Name ECI/MCI (European Corpus Initiative/Multilingual Corpus I) (ELRA-W0004)
URL http://catalog.elra.info/product_info.php?products_id=85
Description Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
Languages
  • Turkish (tur)
  • Albanian (alb)
  • Bulgarian (bul)
  • Chinese (chi)
  • Czech (cze)
  • Dutch, Flemish (dut)
  • English (eng)
  • Estonian (est)
  • French (fre)
  • Gaelic, Scottish Gaelic (gla)
  • German (ger)
  • Greek, Modern (1453-) (gre)
  • Italian (ita)
  • Japanese (jpn)
  • Latin (lat)
  • Lithuanian (lit)
  • Malay (may)
  • Spanish, Castilian (spa)
  • Serbian (scc)
  • Danish (dan)
  • Russian (rus)
  • Norwegian (nor)
  • Uzbek (uzb)
  • Portuguese (por)
  • Swedish (swe)
EUROPARL Corpus Parallel Corpora: Portuguese-English 2.3 Gb The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. ... English (eng); Portuguese (por) ... ELRA-W0090 Details

EUROPARL Corpus Parallel Corpora: Portuguese-English

Name EUROPARL Corpus Parallel Corpora: Portuguese-English (ELRA-W0090)
URL http://catalog.elra.info/product_info.php?products_id=1257
Description The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
Languages
  • English (eng)
  • Portuguese (por)
English-Nepali Parallel Corpus 47 Mb This corpus consists of a collection of national development texts in English and Nepali. A small set of data is alig... English (eng); Nepali (nep) ... ELRA-W0077 Details

English-Nepali Parallel Corpus

Name English-Nepali Parallel Corpus (ELRA-W0077)
URL http://catalog.elra.info/product_info.php?products_id=1217
Description This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
Languages
  • English (eng)
  • Nepali (nep)
English-Persian parallel Corpus 40 Mb Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English an... English (eng); Persian (per) ... ELRA-W0051 Details

English-Persian parallel Corpus

Name English-Persian parallel Corpus (ELRA-W0051)
URL http://catalog.elra.info/product_info.php?products_id=1111
Description Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences). The files are Unicode-encoded. The corpus was originally created with SQL Server but is distributed as an Access file.
Languages
  • English (eng)
  • Persian (per)
GeFRePaC - German French Reciprocal Parallel Corpus 1.3 Gb GeFRePaC was produced in the framework of the LRsP&P project. It co... French (fre); German (ger) ... ELRA-W0031 Details

GeFRePaC - German French Reciprocal Parallel Corpus

Name GeFRePaC - German French Reciprocal Parallel Corpus (ELRA-W0031)
URL http://catalog.elra.info/product_info.php?products_id=633
Description GeFRePaC was produced in the framework of the LRsP&P project. It contains 30 million words (15 million for each language) for the purpose of developing, enhancing and improving translation aids.
Languages
  • French (fre)
  • German (ger)
ICE-GB (British English component of the International Corpus of English) 97 Mb British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse tree... English (eng) ELRA-W0021 Details

ICE-GB (British English component of the International Corpus of English)

Name ICE-GB (British English component of the International Corpus of English) (ELRA-W0021)
URL http://catalog.elra.info/product_info.php?products_id=762
Description British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program).
Languages English (eng)
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) 27 Mb This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated ... Greek, Modern (1453-) (gre) ... ELRA-W0022 Details

ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)

Name ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) (ELRA-W0022)
URL http://catalog.elra.info/product_info.php?products_id=763
Description This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated according to the common core PAROLE encoding standard. The corpus is distributed as SGML files. A subset of the corpus (250,000 words) is morpho-syntactically tagged; all the words are also lemmatised and checked.
Languages Greek, Modern (1453-) (gre)
Italian Syntactic-Semantic Treebank (ISST) 90 Mb ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted i... Italian (ita) ELRA-W0044 Details

Italian Syntactic-Semantic Treebank (ISST)

Name Italian Syntactic-Semantic Treebank (ISST) (ELRA-W0044)
URL http://catalog.elra.info/product_info.php?products_id=887
Description ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. This Treebank has a five-level structure covering orthographic, morpho-syntactic, syntactic, semantic and lexico-semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs).
Languages Italian (ita)
Karl May Korpus (KMK) 77 Mb Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works ... German (ger) ELRA-W0016 Details

Karl May Korpus (KMK)

Name Karl May Korpus (KMK) (ELRA-W0016)
URL http://catalog.elra.info/product_info.php?products_id=450
Description Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).
Languages German (ger)
Khresmoi manually annotated reference corpus 1.3 Gb This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). Th... English (eng) ELRA-W0081 Details

Khresmoi manually annotated reference corpus

Name Khresmoi manually annotated reference corpus (ELRA-W0081)
URL http://catalog.elra.info/product_info.php?products_id=1237
Description This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts: (1) the initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators, with document sizes between 26 and 8,306 tokens each; (2) the main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation, with document sizes between 200 and 2,000 tokens each. The corpus uses the GATE XML format.
Languages English (eng)
LT Corpus 43 Mb The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens.... Portuguese (por) ELRA-W0059 Details

LT Corpus

Name LT Corpus (ELRA-W0059)
URL http://catalog.elra.info/product_info.php?products_id=1178
Description The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by its PoS tag and lemma, and is annotated for NP chunks.
Languages Portuguese (por)
MLCC Multilingual and Parallel Corpora 915 Mb The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The F... Dutch, Flemish (dut); English (eng); Fr... ELRA-W0023 Details

MLCC Multilingual and Parallel Corpora

Name MLCC Multilingual and Parallel Corpora (ELRA-W0023)
URL http://catalog.elra.info/product_info.php?products_id=764
Description The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words). The second set consists of a parallel corpus of translated data in the nine European official languages (1992-1994) divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
Languages
  • Dutch, Flemish (dut)
  • English (eng)
  • French (fre)
  • German (ger)
  • Italian (ita)
  • Spanish, Castilian (spa)
MTP Annotated German corpus - untagged version 283 Mb A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung ... German (ger) ELRA-W0008-01 Details

MTP Annotated German corpus - untagged version

Name MTP Annotated German corpus - untagged version (ELRA-W0008-01)
URL http://catalog.elra.info/product_info.php?products_id=47
Description A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992.
Languages German (ger)
MULTEXT JOC Corpus 114 Mb This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 6... English (eng); French (fre); German (ge... ELRA-W0017 Details

MULTEXT JOC Corpus

Name MULTEXT JOC Corpus (ELRA-W0017)
URL http://catalog.elra.info/product_info.php?products_id=534
Description This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
Languages
  • English (eng)
  • French (fre)
  • German (ger)
  • Italian (ita)
  • Spanish, Castilian (spa)
Modern French Corpus including Anaphors Tagging 13 Mb This modern French corpus contains over 1 million words with a tagging of the anaphors, and cover many different aspe... French (fre) ELRA-W0032 Details

Modern French Corpus including Anaphors Tagging

Name Modern French Corpus including Anaphors Tagging (ELRA-W0032)
URL http://catalog.elra.info/product_info.php?products_id=634
Description This modern French corpus contains over 1 million words with anaphora tagging, and covers many different aspects of the French language (scientific and human sciences articles, extracts from newspapers and magazines, legal texts, etc.). The annotation scheme was defined in XML.
Languages French (fre)
Monolingual Greek corpus 5.1 Mb Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA. Greek, Modern (1453-) (gre) ... ELRA-W0014 Details

Monolingual Greek corpus

Name Monolingual Greek corpus (ELRA-W0014)
URL http://catalog.elra.info/product_info.php?products_id=716
Description Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA.
Languages Greek, Modern (1453-) (gre)
Multilingual Corpus 9.9 Mb Multilingual parallel corpus produced by Kaist Korterm containing 60 000 expressions in Korean, Chinese and English. Chinese (chi); English (eng); Korean (k... ELRA-W0035 Details

Multilingual Corpus

Name Multilingual Corpus (ELRA-W0035)
URL http://catalog.elra.info/product_info.php?products_id=655
Description Multilingual parallel corpus produced by KAIST KORTERM containing 60,000 expressions in Korean, Chinese and English.
Languages
  • Chinese (chi)
  • English (eng)
  • Korean (kor)
NE3L named entities Arabic corpus 3 Mb The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and pub... Arabic (ara) ELRA-W0078 Details

NE3L named entities Arabic corpus

Name NE3L named entities Arabic corpus (ELRA-W0078)
URL http://catalog.elra.info/product_info.php?products_id=1226
Description The Arabic corpus contains 103,363 words from articles extracted from the Le Monde Diplomatique newspaper and published in 2004. Two named entity categories were taken into account: Time and Amount.
Languages Arabic (ara)
NE3L named entities Chinese corpus 4.8 Mb The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and pub... Chinese (chi) ELRA-W0079 Details

NE3L named entities Chinese corpus

Name NE3L named entities Chinese corpus (ELRA-W0079)
URL http://catalog.elra.info/product_info.php?products_id=1227
Beschreibung The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation.
Sprachen Chinese (chi)
NE3L named entities Russian corpus 2.7 Mb The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 199... Russian (rus) ELRA-W0080 Details

NE3L named entities Russian corpus

Name NE3L named entities Russian corpus (ELRA-W0080)
URL http://catalog.elra.info/product_info.php?products_id=1228
Beschreibung The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount.
Sprachen Russian (rus)
NEMLAR Written Corpus 136 Mb The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is ... Arabic (ara) ELRA-W0042 Details

NEMLAR Written Corpus

Name NEMLAR Written Corpus (ELRA-W0042)
URL http://catalog.elra.info/product_info.php?products_id=873
Beschreibung The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags.
Sprachen Arabic (ara)
NPChunks 412 Kb NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected rando... Portuguese (por) ELRA-W0089 Details

NPChunks

Name NPChunks (ELRA-W0089)
URL http://catalog.elra.info/product_info.php?products_id=1256
Beschreibung NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun Phrases were annotated with specific tags. It was automatically PoS-tagged with MBT tagger, and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
Sprachen Portuguese (por)
Nepali Monolingual written corpus 683 Mb The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample... Nepali (nep) ELRA-W0076 Details

Nepali Monolingual written corpus

Name Nepali Monolingual written corpus (ELRA-W0076)
URL http://catalog.elra.info/product_info.php?products_id=1216
Beschreibung The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2,000 words each, published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
Sprachen Nepali (nep)
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain 11 Mb This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment do... English (eng); French (fre) ... ELRA-W0057 Details

PANACEA English-French and English-Greek parallel corpus acquired for Environment domain

Name PANACEA English-French and English-Greek parallel corpus acquired for Environment domain (ELRA-W0057)
URL http://catalog.elra.info/product_info.php?products_id=1182
Beschreibung This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Sprachen
  • English (eng)
  • French (fre)
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain 16 Mb This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legisla... English (eng); Greek, Modern (1453-) (g... ELRA-W0058 Details

PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain

Name PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain (ELRA-W0058)
URL http://catalog.elra.info/product_info.php?products_id=1183
Beschreibung This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Sprachen
  • English (eng)
  • Greek, Modern (1453-) (gre)
PANACEA Environment English monolingual corpus 2.7 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the English l... English (eng) ELRA-W0063 Details

PANACEA Environment English monolingual corpus

Name PANACEA Environment English monolingual corpus (ELRA-W0063)
URL http://catalog.elra.info/product_info.php?products_id=1184
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites.
Sprachen English (eng)
PANACEA Environment French monolingual corpus 2.1 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the French la... French (fre) ELRA-W0065 Details

PANACEA Environment French monolingual corpus

Name PANACEA Environment French monolingual corpus (ELRA-W0065)
URL http://catalog.elra.info/product_info.php?products_id=1186
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites.
Sprachen French (fre)
PANACEA Environment Greek monolingual corpus 2 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek lan... Greek, Modern (1453-) (gre) ... ELRA-W0067 Details

PANACEA Environment Greek monolingual corpus

Name PANACEA Environment Greek monolingual corpus (ELRA-W0067)
URL http://catalog.elra.info/product_info.php?products_id=1188
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites.
Sprachen Greek, Modern (1453-) (gre)
PANACEA Environment Italian monolingual corpus 1.8 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian l... Italian (ita) ELRA-W0069 Details

PANACEA Environment Italian monolingual corpus

Name PANACEA Environment Italian monolingual corpus (ELRA-W0069)
URL http://catalog.elra.info/product_info.php?products_id=1190
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites.
Sprachen Italian (ita)
PANACEA Environment Spanish monolingual corpus 2.3 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish l... Spanish, Castilian (spa) ELRA-W0071 Details

PANACEA Environment Spanish monolingual corpus

Name PANACEA Environment Spanish monolingual corpus (ELRA-W0071)
URL http://catalog.elra.info/product_info.php?products_id=1192
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites.
Sprachen Spanish, Castilian (spa)
PANACEA Labour English monolingual corpus 1.6 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the English l... English (eng) ELRA-W0064 Details

PANACEA Labour English monolingual corpus

Name PANACEA Labour English monolingual corpus (ELRA-W0064)
URL http://catalog.elra.info/product_info.php?products_id=1185
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites.
Sprachen English (eng)
PANACEA Labour French monolingual corpus 2.5 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the French la... French (fre) ELRA-W0066 Details

PANACEA Labour French monolingual corpus

Name PANACEA Labour French monolingual corpus (ELRA-W0066)
URL http://catalog.elra.info/product_info.php?products_id=1187
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites.
Sprachen French (fre)
PANACEA Labour Greek monolingual corpus 1.4 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek lan... Greek, Modern (1453-) (gre) ... ELRA-W0068 Details

PANACEA Labour Greek monolingual corpus

Name PANACEA Labour Greek monolingual corpus (ELRA-W0068)
URL http://catalog.elra.info/product_info.php?products_id=1189
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites.
Sprachen Greek, Modern (1453-) (gre)
PANACEA Labour Italian monolingual corpus 2.4 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian l... Italian (ita) ELRA-W0070 Details

PANACEA Labour Italian monolingual corpus

Name PANACEA Labour Italian monolingual corpus (ELRA-W0070)
URL http://catalog.elra.info/product_info.php?products_id=1191
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites.
Sprachen Italian (ita)
PANACEA Labour Spanish monolingual corpus 1.9 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish l... Spanish, Castilian (spa) ELRA-W0072 Details

PANACEA Labour Spanish monolingual corpus

Name PANACEA Labour Spanish monolingual corpus (ELRA-W0072)
URL http://catalog.elra.info/product_info.php?products_id=1193
Beschreibung This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites.
Sprachen Spanish, Castilian (spa)
PAROLE French Corpus 349 Mb The PAROLE French corpus contains the following data: Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingua... French (fre) ELRA-W0020 Details

PAROLE French Corpus

Name PAROLE French Corpus (ELRA-W0020)
URL http://catalog.elra.info/product_info.php?products_id=565
Beschreibung The PAROLE French corpus contains the following data:
  • Miscellaneous: data provided by ELRA (CRATER, MLCC Multilingual and Parallel Corpora), 2 025 964 words
  • Books: CNRS Editions, 3 267 409 words
  • Periodicals: CNRS Info, Hermès, 942 963 words
  • Newspapers: Le Monde, provided by ELRA, 13 856 763 words
  • Total: 20 093 099 words
Sprachen French (fre)
PAROLE Irish Distributable Corpus 25 Mb This corpus consists of over 8 million words. The text is marked-up in accordance with the PAROLE encoding standard. A... Irish (gle) ELRA-W0026 Details

PAROLE Irish Distributable Corpus

Name PAROLE Irish Distributable Corpus (ELRA-W0026)
URL http://catalog.elra.info/product_info.php?products_id=597
Beschreibung This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All the files are in SGML format with a detailed header and the body of the text tagged to paragraph level. A subset of the corpus is morpho-syntactically tagged. Included in this distribution are approximately 3,000 manually checked words.
Sprachen Irish (gle)
PAROLE Italian Corpus 44 Mb The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 word... Italian (ita) ELRA-W0043 Details

PAROLE Italian Corpus

Name PAROLE Italian Corpus (ELRA-W0043)
URL http://catalog.elra.info/product_info.php?products_id=886
Beschreibung The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words), periodicals (143,810 words), books (564,964 words), miscellaneous (247,077 words). About 250,000 words were morphosyntactically annotated and lemmatized.
Sprachen Italian (ita)
PAROLE Portuguese Corpus - complete version 57 Mb The parole Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Med... Portuguese (por) ELRA-W0024-01 Details

PAROLE Portuguese Corpus - complete version

Name PAROLE Portuguese Corpus - complete version (ELRA-W0024-01)
URL http://catalog.elra.info/product_info.php?products_id=765
Beschreibung The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by medium (Newspaper, Book, Periodical, Miscellaneous). The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML. Also available is a subcorpus of about 250,000 morpho-syntactically tagged words. Disambiguation was manually checked.
Sprachen Portuguese (por)
PTPARL Corpus 25 Mb The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. Th... Portuguese (por) ELRA-W0060 Details

PTPARL Corpus

Name PTPARL Corpus (ELRA-W0060)
URL http://catalog.elra.info/product_info.php?products_id=1179
Beschreibung The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file is one token per line, followed by pos tag and lemma, and is annotated for NP chunks.
Sprachen Portuguese (por)
Persian 1984 corpus (Multext-East framework) 5.9 Mb This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext... Persian (per) ELRA-W0054 Details

Persian 1984 corpus (Multext-East framework)

Name Persian 1984 corpus (Multext-East framework) (ELRA-W0054)
URL http://catalog.elra.info/product_info.php?products_id=1124
Beschreibung This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in the XML-format following the TEI guidelines. Annotation includes POS (part-of-speech) and lemmas.
Sprachen Persian (per)
Quaero Old Press Extended Named Entity corpus 6.8 Gb This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the Frenc... French (fre) ELRA-W0073 Details

Quaero Old Press Extended Named Entity corpus

Name Quaero Old Press Extended Named Entity corpus (ELRA-W0073)
URL http://catalog.elra.info/product_info.php?products_id=1194
Beschreibung This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
Sprachen French (fre)
Qualified POS Tagged Corpus 66 Mb Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. ... Korean (kor) ELRA-W0034 Details

Qualified POS Tagged Corpus

Name Qualified POS Tagged Corpus (ELRA-W0034)
URL http://catalog.elra.info/product_info.php?products_id=654
Beschreibung Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. This corpus is morphologically analyzed, POS tagged, and rectified 3 times by specialists.
Sprachen Korean (kor)
ROCO Romanian journalistic corpus 729 Mb ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626... Romanian (rum) ELRA-W0085 Details

ROCO Romanian journalistic corpus

Name ROCO Romanian journalistic corpus (ELRA-W0085)
URL http://catalog.elra.info/product_info.php?products_id=1249
Beschreibung ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
Sprachen Romanian (rum)
ROMBAC - Romanian balanced corpus 1.1 Gb ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, ... Romanian (rum) ELRA-W0088 Details

ROMBAC - Romanian balanced corpus

Name ROMBAC - Romanian balanced corpus (ELRA-W0088)
URL http://catalog.elra.info/product_info.php?products_id=1253
Beschreibung ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
Sprachen Romanian (rum)
TRAD Pashto Monolingual text Corpus 2.2 Gb This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different... Pushto (pus) ELRA-W0092 Details

TRAD Pashto Monolingual text Corpus

Name TRAD Pashto Monolingual text Corpus (ELRA-W0092)
URL http://catalog.elra.info/product_info.php?products_id=1266
Beschreibung This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different blogs and websites.
Sprachen Pushto (pus)
TRAD Pashto-English News Articles Parallel corpus 602 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. T... English (eng); Pushto (pus) ... ELRA-W0097 Details

TRAD Pashto-English News Articles Parallel corpus

Name TRAD Pashto-English News Articles Parallel corpus (ELRA-W0097)
URL http://catalog.elra.info/product_info.php?products_id=1271
Beschreibung This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Sprachen
  • English (eng)
  • Pushto (pus)
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data 575 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 b... English (eng); Pushto (pus) ... ELRA-W0095 Details

TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data

Name TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0095)
URL http://catalog.elra.info/product_info.php?products_id=1269
Beschreibung This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Sprachen
  • English (eng)
  • Pushto (pus)
TRAD Pashto-French News Articles Parallel corpus 970 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. Th... French (fre); Pushto (pus) ... ELRA-W0096 Details

TRAD Pashto-French News Articles Parallel corpus

Name TRAD Pashto-French News Articles Parallel corpus (ELRA-W0096)
URL http://catalog.elra.info/product_info.php?products_id=1270
Beschreibung This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Sprachen
  • French (fre)
  • Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data 29 Mb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. Th... French (fre); Pushto (pus) ... ELRA-W0094 Details

TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data

Name TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0094)
URL http://catalog.elra.info/product_info.php?products_id=1268
Beschreibung This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Sprachen
  • French (fre)
  • Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data 473 Mb This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Sp... French (fre); Pushto (pus) ... ELRA-W0093 Details

TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data

Name TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA-W0093)
URL http://catalog.elra.info/product_info.php?products_id=1267
Beschreibung This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381) translated into French. It contains about 832,000 source words and 747,000 target words.
Sprachen
  • French (fre)
  • Pushto (pus)
TSNLP (Test Suites for NLP Testing) 4.5 Mb Test Suites for Natural Language Processing. 4,000 test items (sentences or fragments of sentences) in English, Fren... English (eng); French (fre); German (ge... ELRA-W0013 Details

TSNLP (Test Suites for NLP Testing)

Name TSNLP (Test Suites for NLP Testing) (ELRA-W0013)
URL http://catalog.elra.info/product_info.php?products_id=51
Beschreibung Test Suites for Natural Language Processing. 4,000 test items (sentences or fragments of sentences) in English, French & German, useful for NL system evaluation.
Sprachen
  • English (eng)
  • French (fre)
  • German (ger)
Tagged text in French (MEMODATA) with rules of morphological disambiguation 3.1 Gb More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagge... French (fre) ELRA-W0012 Details

Tagged text in French (MEMODATA) with rules of morphological disambiguation

Name Tagged text in French (MEMODATA) with rules of morphological disambiguation (ELRA-W0012)
URL http://catalog.elra.info/product_info.php?products_id=50
Beschreibung More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It consists of several authors of the 19th century (Balzac, Hugo, Stendhal). See also W0011.
Sprachen French (fre)
Tagged text in French (MEMODATA) with typographic tags 247 Mb Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 boo... French (fre) ELRA-W0011 Details

Tagged text in French (MEMODATA) with typographic tags

Name Tagged text in French (MEMODATA) with typographic tags (ELRA-W0011)
URL http://catalog.elra.info/product_info.php?products_id=49
Beschreibung Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books is available for research only. The books consist of authors of the 19th century. See also W0012.
Sprachen French (fre)
The CINTIL Corpus International Corpus of Portuguese 20 Mb CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portug... Portuguese (por) ELRA-W0050 Details

The CINTIL Corpus International Corpus of Portuguese

Name The CINTIL Corpus International Corpus of Portuguese (ELRA-W0050)
URL http://catalog.elra.info/product_info.php?products_id=1102
Beschreibung CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus is developed over raw textual materials of several types, of which 30% are spoken materials.
Sprachen Portuguese (por)
The EMILLE/CIIL Corpus 1.5 Gb The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian l... Urdu (urd); Telugu (tel); Tamil (tam); ... ELRA-W0037 Details

The EMILLE/CIIL Corpus

Name The EMILLE/CIIL Corpus (ELRA-W0037)
URL http://catalog.elra.info/product_info.php?products_id=696
Beschreibung The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).
Sprachen
  • Urdu (urd)
  • Telugu (tel)
  • Tamil (tam)
  • Sinhalese (sin)
  • Panjabi, Punjabi (pan)
  • Oriya (ori)
  • Marathi (mar)
  • Malayalam (mal)
  • Kashmiri (kas)
  • Kannada (kan)
  • Hindi (hin)
  • Gujarati (guj)
  • Bengali (ben)
  • Assamese (asm)
The Lancaster Corpus of Mandarin Chinese (LCMC) 45 Mb The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories including news, literary texts, ac... Chinese (chi) ELRA-W0039 Details

The Lancaster Corpus of Mandarin Chinese (LCMC)

Name The Lancaster Corpus of Mandarin Chinese (LCMC) (ELRA-W0039)
URL http://catalog.elra.info/product_info.php?products_id=715
Beschreibung The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The corpus is encoded in Unicode (UTF-8) and marked up in XML.
Sprachen Chinese (chi)
Venice Italian Treebank (VIT) 149 Mb The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, p... Italian (ita) ELRA-W0040 Details

Venice Italian Treebank (VIT)

Name Venice Italian Treebank (VIT) (ELRA-W0040)
URL http://catalog.elra.info/product_info.php?products_id=831
Beschreibung The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated. The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also made available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes. The format is plain text with square bracketing. However, a UPenn style version which is readable by the open source query language CorpusSearch is also provided.
Sprachen Italian (ita)
Wolverhampton Business English Corpus 118 Mb Produced by the Computational Linguistics Group at University of Wolverhampton through a funding from ELRA in the fra... English (eng) ELRA-W0028 Details

Wolverhampton Business English Corpus

Name Wolverhampton Business English Corpus (ELRA-W0028)
URL http://catalog.elra.info/product_info.php?products_id=627
Beschreibung Produced by the Computational Linguistics Group at University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the Business English Corpus consists of 10,186,259 words collected from 23 different Web sites related to business.
Sprachen English (eng)
deL1L2IM corpus 2.8 Mb The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus... German (ger) ELRA-W0083 Details

deL1L2IM corpus

Name deL1L2IM corpus (ELRA-W0083)
URL http://catalog.elra.info/product_info.php?products_id=1243
Beschreibung The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus contains ca. 52,000 words and 4,800 messages and has a file size of 0.5 Mb. Nine pairs of participants, i.e. nine learners and four native speakers, were required, with 8 dialogues per pair. The interactions have undergone linguistic analysis, with annotation performed only on repair/correction sequences (incomplete learner error annotation). The corpus is delivered in one written text file (in XML format, customized under TEI P5).
Sprachen German (ger)