2006 CoNLL Shared Task - Ten Languages |
85.2 Mb |
2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006… |
Turkish (tur); Bulgarian (bul); Dutch, Fl…
|
ELRA-W0086 |
Details
2006 CoNLL Shared Task - Ten Languages
Name |
2006 CoNLL Shared Task - Ten Languages (ELRA-W0086) |
URL |
http://catalog.elra.info/product_info.php?products_id=1250 |
Description |
2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish. The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. |
Languages |
- Turkish (tur)
- Bulgarian (bul)
- Dutch, Flemish (dut)
- German (ger)
- Japanese (jpn)
- Spanish, Castilian (spa)
- Danish (dan)
- Portuguese (por)
- Swedish (swe)
- Slovenian (slv)
|
|
|
Al-Hayat Arabic Corpus |
1.1 Gb |
The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for language engineering ap… |
Arabic (ara)
|
ELRA-W0030 |
Details
|
|
Amaryllis Corpus - Evaluation Package |
505 Mb |
AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agen… |
French (fre)
|
ELRA-W0029 |
Details
Amaryllis Corpus - Evaluation Package
Name |
Amaryllis Corpus - Evaluation Package (ELRA-W0029) |
URL |
http://catalog.elra.info/product_info.php?products_id=626 |
Description |
AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT) to create document corpora, questions and answers, in the framework of the Action de Recherche Concertée (ARC A1, renamed Amaryllis - Access to text information in French), with the aim of producing work comparable to the United States TREC project. All corpora are structured as SGML files with ISO Latin character encoding. |
Languages |
French (fre)
|
|
|
Amharic-English bilingual corpus |
15 Mb |
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in translite… |
English (eng); Amharic (amh)
…
|
ELRA-W0074 |
Details
Amharic-English bilingual corpus
Name |
Amharic-English bilingual corpus (ELRA-W0074) |
URL |
http://catalog.elra.info/product_info.php?products_id=1215 |
Description |
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus contains 232,653 words in Amharic and 291,701 words in English. |
Languages |
- English (eng)
- Amharic (amh)
|
|
|
An-Nahar Newspaper Text Corpus |
794 Mb |
The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML fi… |
Arabic (ara)
|
ELRA-W0027 |
Details
An-Nahar Newspaper Text Corpus
Name |
An-Nahar Newspaper Text Corpus (ELRA-W0027) |
URL |
http://catalog.elra.info/product_info.php?products_id=767 |
Description |
The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45,000 articles and 24 million words. |
Languages |
Arabic (ara)
|
|
|
Arboretum treebank |
26 Mb |
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of … |
Danish (dan)
|
ELRA-W0084 |
Details
Arboretum treebank
Name |
Arboretum treebank (ELRA-W0084) |
URL |
http://catalog.elra.info/product_info.php?products_id=1248 |
Description |
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT xml format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER xml |
Languages |
Danish (dan)
|
|
|
ARCADE/ROMANSEVAL corpus |
63 Mb |
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (… |
English (eng); French (fre); Italian (ita…
|
ELRA-W0018 |
Details
ARCADE/ROMANSEVAL corpus
Name |
ARCADE/ROMANSEVAL corpus (ELRA-W0018) |
URL |
http://catalog.elra.info/product_info.php?products_id=535 |
Description |
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether. It comprises semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian, and word-level alignment of all the occurrences of the test words between French and English. |
Languages |
- English (eng)
- French (fre)
- Italian (ita)
|
|
|
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version |
23 Mb |
Produced through funding from ELRA in the framework of the European Commission project LRsP&P… |
French (fre)
|
ELRA-W0025-02 |
Details
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version
Name |
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version (ELRA-W0025-02) |
URL |
http://catalog.elra.info/product_info.php?products_id=595 |
Description |
Produced through funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the corpus contains all articles published in La Recherche magazine in 1998, including issues 305 (January) to 315 (December), which amounts to 447,244 tokens and 30,238 types. Two versions are available: the raw data (XML format) and the complete version (XML and SGML formats). |
Languages |
French (fre)
|
|
|
Catalan Corpus of News Articles |
645 Mb |
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles … |
Catalan, Valencian (cat)
|
ELRA-W0047 |
Details
Catalan Corpus of News Articles
Name |
Catalan Corpus of News Articles (ELRA-W0047) |
URL |
http://catalog.elra.info/product_info.php?products_id=990 |
Description |
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by trimester, with no chronological ordering within each trimester. |
Languages |
Catalan, Valencian (cat)
|
|
|
Catalan-Spanish Parallel Corpus |
686 Mb |
This corpus contains more than 100 million words from 10 years of bilingual articles from El Periódico de Ca… |
Spanish, Castilian (spa); Catalan, Valenc…
|
ELRA-W0053 |
Details
Catalan-Spanish Parallel Corpus
Name |
Catalan-Spanish Parallel Corpus (ELRA-W0053) |
URL |
http://catalog.elra.info/product_info.php?products_id=1122 |
Description |
This corpus contains more than 100 million words from 10 years of bilingual articles from El Periódico de Catalunya. The data are aligned at sentence level and stored in text files, one sentence per line. The data are provided in plain text, with no encoding whatsoever. |
Languages |
- Spanish, Castilian (spa)
- Catalan, Valencian (cat)
|
|
|
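As an illustration of the one-sentence-per-line layout described for the Catalan-Spanish Parallel Corpus, the following is a minimal sketch of reading sentence-aligned text. It assumes, without confirmation from the catalogue, that each language is stored in its own file and that line N of one file is aligned with line N of the other. The function name `read_aligned_pairs`, the file paths and the default encoding are hypothetical.

```python
from itertools import zip_longest


def read_aligned_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: one file per language, one sentence per line, and
    line N of the first file is aligned with line N of the second.
    The encoding is a guess; the catalogue does not state one.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a length mismatch instead of silently dropping sentences.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

Raising on a length mismatch is a deliberate choice here: with line-number alignment, a single missing line would shift every later pair.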
CINTIL-DeepBank |
213 Mb |
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical r… |
Portuguese (por)
|
ELRA-W0062 |
Details
CINTIL-DeepBank
Name |
CINTIL-DeepBank (ELRA-W0062) |
URL |
http://catalog.elra.info/product_info.php?products_id=1181 |
Description |
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus. |
Languages |
Portuguese (por)
|
|
|
CINTIL-DependencyBank |
1.4 Mb |
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency g… |
Portuguese (por)
|
ELRA-W0061 |
Details
CINTIL-DependencyBank
Name |
CINTIL-DependencyBank (ELRA-W0061) |
URL |
http://catalog.elra.info/product_info.php?products_id=1180 |
Description |
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus. |
Languages |
Portuguese (por)
|
|
|
CINTIL-PropBank |
3.6 Mb |
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, compos… |
Portuguese (por)
|
ELRA-W0056 |
Details
CINTIL-PropBank
Name |
CINTIL-PropBank (ELRA-W0056) |
URL |
http://catalog.elra.info/product_info.php?products_id=1176 |
Description |
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus. |
Languages |
Portuguese (por)
|
|
|
CINTIL-TreeBank |
3.1 Mb |
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 11… |
Portuguese (por)
|
ELRA-W0055 |
Details
CINTIL-TreeBank
Name |
CINTIL-TreeBank (ELRA-W0055) |
URL |
http://catalog.elra.info/product_info.php?products_id=1174 |
Description |
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus. |
Languages |
Portuguese (por)
|
|
|
Corpus of Contemporaneous Spanish Novels |
4.8 Mb |
This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporaneous aut… |
Spanish, Castilian (spa)
|
ELRA-W0041 |
Details
Corpus of Contemporaneous Spanish Novels
Name |
Corpus of Contemporaneous Spanish Novels (ELRA-W0041) |
URL |
http://catalog.elra.info/product_info.php?products_id=847 |
Description |
This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporary author. |
Languages |
Spanish, Castilian (spa)
|
|
|
CRATER 2 Corpus |
359 Mb |
The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It… |
English (eng); French (fre); Spanish, Cas…
|
ELRA-W0033 |
Details
CRATER 2 Corpus
Name |
CRATER 2 Corpus (ELRA-W0033) |
URL |
http://catalog.elra.info/product_info.php?products_id=636 |
Description |
The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It consists of 1,500,000 tokens for English and French and of 1,000,000 tokens for Spanish, with morphosyntactic annotations.
CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003) |
Languages |
- English (eng)
- French (fre)
- Spanish, Castilian (spa)
|
|
|
deL1L2IM corpus |
2.8 Mb |
The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus c… |
German (ger)
|
ELRA-W0083 |
Details
deL1L2IM corpus
Name |
deL1L2IM corpus (ELRA-W0083) |
URL |
http://catalog.elra.info/product_info.php?products_id=1243 |
Description |
The deL1L2IM corpus is composed of 72 dialogues, each lasting 20 to 45 minutes. The whole corpus contains ca. 52,000 words and 4,800 messages and has a file size of 0.5 Mb. Nine pairs of participants (nine learners and four native speakers) were required, with 8 dialogues per pair. The interactions have undergone linguistic analysis, with annotation performed only on repair/correction sequences (incomplete learner error annotation). The corpus is delivered in one written text file (in XML format, customised under TEI P5). |
Languages |
German (ger)
|
|
|
Dutch PAROLE Distributable Corpus |
70 Mb |
This Dutch corpus is a 3 million words selection built according to the specifications of the PAROLE project. Over 250,… |
Dutch, Flemish (dut)
|
ELRA-W0019 |
Details
Dutch PAROLE Distributable Corpus
Name |
Dutch PAROLE Distributable Corpus (ELRA-W0019) |
URL |
http://catalog.elra.info/product_info.php?products_id=543 |
Description |
This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,000 words of corpus texts (with TEI markup suppressed) have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked. |
Languages |
Dutch, Flemish (dut)
|
|
|
ECI-ELSNET Italian & German tagged sub-corpus |
3 Mb |
The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It… |
German (ger); Italian (ita)
…
|
ELRA-W0005 |
Details
ECI-ELSNET Italian & German tagged sub-corpus
Name |
ECI-ELSNET Italian & German tagged sub-corpus (ELRA-W0005) |
URL |
http://catalog.elra.info/product_info.php?products_id=86 |
Description |
The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It contains the following domains: Economy (17,000 words), Politics (14,000 words), Culture (18,000 words), Sports (9,000 words), Local Events (8,500 words). |
Languages |
- German (ger)
- Italian (ita)
|
|
|
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) |
655 Mb |
Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, M… |
Turkish (tur); Albanian (alb); Bulgarian …
|
ELRA-W0004 |
Details
ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
Name |
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) (ELRA-W0004) |
URL |
http://catalog.elra.info/product_info.php?products_id=85 |
Description |
Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. |
Languages |
- Turkish (tur)
- Albanian (alb)
- Bulgarian (bul)
- Chinese (chi)
- Czech (cze)
- Dutch, Flemish (dut)
- English (eng)
- Estonian (est)
- French (fre)
- Gaelic, Scottish Gaelic (gla)
- German (ger)
- Greek, Modern (1453-) (gre)
- Italian (ita)
- Japanese (jpn)
- Latin (lat)
- Lithuanian (lit)
- Malay (may)
- Spanish, Castilian (spa)
- Serbian (scc)
- Danish (dan)
- Russian (rus)
- Norwegian (nor)
- Uzbek (uzb)
- Portuguese (por)
- Swedish (swe)
|
|
|
English-Nepali Parallel Corpus |
47 Mb |
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligne… |
English (eng); Nepali (nep)
…
|
ELRA-W0077 |
Details
English-Nepali Parallel Corpus
Name |
English-Nepali Parallel Corpus (ELRA-W0077) |
URL |
http://catalog.elra.info/product_info.php?products_id=1217 |
Description |
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali). |
Languages |
- English (eng)
- Nepali (nep)
|
|
|
English-Persian parallel Corpus |
40 Mb |
Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and … |
English (eng); Persian (per)
…
|
ELRA-W0051 |
Details
English-Persian parallel Corpus
Name |
English-Persian parallel Corpus (ELRA-W0051) |
URL |
http://catalog.elra.info/product_info.php?products_id=1111 |
Description |
Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences). The files are encoded in Unicode. The corpus was originally created with SQL Server, but it is delivered as an Access file. |
Languages |
- English (eng)
- Persian (per)
|
|
|
EUROPARL Corpus Parallel Corpora: Portuguese-English |
2.3 Gb |
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It… |
English (eng); Portuguese (por)
…
|
ELRA-W0090 |
Details
EUROPARL Corpus Parallel Corpora: Portuguese-English
Name |
EUROPARL Corpus Parallel Corpora: Portuguese-English (ELRA-W0090) |
URL |
http://catalog.elra.info/product_info.php?products_id=1257 |
Description |
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token. |
Languages |
- English (eng)
- Portuguese (por)
|
|
|
GeFRePaC - German French Reciprocal Parallel Corpus |
1.3 Gb |
GeFRePac was produced in the framework of the LRsP&P project. It cont… |
French (fre); German (ger)
…
|
ELRA-W0031 |
Details
GeFRePaC - German French Reciprocal Parallel Corpus
Name |
GeFRePaC - German French Reciprocal Parallel Corpus (ELRA-W0031) |
URL |
http://catalog.elra.info/product_info.php?products_id=633 |
Description |
GeFRePac was produced in the framework of the LRsP&P project. It contains 30 million words (15 million for each language) for the purpose of developing, enhancing and improving translation aids. |
Languages |
- French (fre)
- German (ger)
|
|
|
ICE-GB (British English component of the International Corpus of English) |
97 Mb |
British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees,… |
English (eng)
|
ELRA-W0021 |
Details
ICE-GB (British English component of the International Corpus of English)
Name |
ICE-GB (British English component of the International Corpus of English) (ELRA-W0021) |
URL |
http://catalog.elra.info/product_info.php?products_id=762 |
Description |
ICE-GB, the British component of the International Corpus of English (ICE), consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program). |
Languages |
English (eng)
|
|
|
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) |
27 Mb |
This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated ac… |
Greek, Modern (1453-) (gre)
…
|
ELRA-W0022 |
Details
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
Name |
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) (ELRA-W0022) |
URL |
http://catalog.elra.info/product_info.php?products_id=763 |
Description |
This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated according to the common core PAROLE encoding standard. The corpus is provided as SGML files. A subset of the corpus (250,000 words) is morpho-syntactically tagged; all the words are also lemmatised and checked. |
Languages |
Greek, Modern (1453-) (gre)
|
|
|
Italian Syntactic-Semantic Treebank (ISST) |
90 Mb |
ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in … |
Italian (ita)
|
ELRA-W0044 |
Details
Italian Syntactic-Semantic Treebank (ISST)
Name |
Italian Syntactic-Semantic Treebank (ISST) (ELRA-W0044) |
URL |
http://catalog.elra.info/product_info.php?products_id=887 |
Description |
ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. This Treebank has a five-level structure covering orthographic, morpho-syntactic, syntactic, semantic and lexico-semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information; ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs). |
Languages |
Italian (ita)
|
|
|
Karl May Korpus (KMK) |
77 Mb |
Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of… |
German (ger)
|
ELRA-W0016 |
Details
Karl May Korpus (KMK)
Name |
Karl May Korpus (KMK) (ELRA-W0016) |
URL |
http://catalog.elra.info/product_info.php?products_id=450 |
Description |
Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each). |
Languages |
German (ger)
|
|
|
Khresmoi manually annotated reference corpus |
1.3 Gb |
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The … |
English (eng)
|
ELRA-W0081 |
Details
Khresmoi manually annotated reference corpus
Name |
Khresmoi manually annotated reference corpus (ELRA-W0081) |
URL |
http://catalog.elra.info/product_info.php?products_id=1237 |
Description |
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts:
1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each.
2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each.
The corpus uses the GATE XML format. |
Languages |
English (eng)
|
|
|
"Le Monde Diplomatique" Arabic tagged corpus |
59 Mb |
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see al… |
Arabic (ara)
|
ELRA-W0049 |
Details
"Le Monde Diplomatique" Arabic tagged corpus
Name |
"Le Monde Diplomatique" Arabic tagged corpus (ELRA-W0049) |
URL |
http://catalog.elra.info/product_info.php?products_id=1096 |
Description |
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text is associated with 3 files: the raw text in Arabic, the vowelised text in Arabic, and an XML file containing the morphological annotation of the text. |
Languages |
Arabic (ara)
|
|
|
"Le Monde Diplomatique" Text corpus in Arabic |
57 Mb |
Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTM… |
Arabic (ara)
|
ELRA-W0036-04 |
Details
"Le Monde Diplomatique" Text corpus in Arabic
Name |
"Le Monde Diplomatique" Text corpus in Arabic (ELRA-W0036-04) |
URL |
http://catalog.elra.info/product_info.php?products_id=717 |
Description |
Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article. |
Languages |
Arabic (ara)
|
|
|
"Le Monde Diplomatique" Text corpus in English |
28 Mb |
Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HT… |
English (eng)
|
ELRA-W0036-03 |
Details
"Le Monde Diplomatique" Text corpus in English
Name |
"Le Monde Diplomatique" Text corpus in English (ELRA-W0036-03) |
URL |
http://catalog.elra.info/product_info.php?products_id=8 |
Description |
Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HTML file contains one article. |
Languages |
English (eng)
|
|
|
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 |
233 Mb |
Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. … |
French (fre)
|
ELRA-W0036-01 |
Details
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998
Name |
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 (ELRA-W0036-01) |
URL |
http://catalog.elra.info/product_info.php?products_id=7 |
Description |
Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. Each HTML file contains one article. |
Languages |
French (fre)
|
|
|
"Le Monde Diplomatique" Text corpus in French - archives from 1999 |
90 Mb |
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTM… |
French (fre)
|
ELRA-W0036-02 |
Details
"Le Monde Diplomatique" Text corpus in French - archives from 1999
Name |
"Le Monde Diplomatique" Text corpus in French - archives from 1999 (ELRA-W0036-02) |
URL |
http://catalog.elra.info/product_info.php?products_id=9 |
Description |
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article. |
Languages |
French (fre)
|
|
|
LT Corpus |
43 Mb |
The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. T… |
Portuguese (por)
|
ELRA-W0059 |
Details
LT Corpus
Name |
LT Corpus (ELRA-W0059) |
URL |
http://catalog.elra.info/product_info.php?products_id=1178 |
Description |
The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, each followed by its PoS tag and lemma, and is annotated for NP chunks. |
Languages |
Portuguese (por)
|
|
|
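To make the token-per-line layout described for the LT Corpus cqpweb file more concrete, here is a minimal parsing sketch. It rests on assumptions the catalogue does not confirm: that the three columns (token, PoS tag, lemma) are tab-separated, and that NP chunk boundaries and other structure appear as XML-like tag lines, which this sketch simply skips. The names `Token` and `parse_token_lines` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def parse_token_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield Token records from token-per-line text.

    Assumed layout (not confirmed by the catalogue entry):
    form <TAB> PoS tag <TAB> lemma, one token per line.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and XML-like structural lines such as chunk
        # markers. Caveat: a literal "<" token would also be skipped.
        if not line.strip() or line.lstrip().startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a token line under the assumed layout
        yield Token(fields[0], fields[1], fields[2])
```

The function accepts any iterable of lines, so an open file object can be passed directly, which avoids loading the whole 1,781,083-token file into memory.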
MLCC Multilingual and Parallel Corpora |
915 Mb |
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Fin… |
Dutch, Flemish (dut); English (eng); Fren…
|
ELRA-W0023 |
Details
MLCC Multilingual and Parallel Corpora
Name |
MLCC Multilingual and Parallel Corpora (ELRA-W0023) |
URL |
http://catalog.elra.info/product_info.php?products_id=764 |
Description |
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words).
The second set consists of a parallel corpus of translated data in the nine European official languages (1992-1994) divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language). |
Languages |
- Dutch, Flemish (dut)
- English (eng)
- French (fre)
- German (ger)
- Italian (ita)
- Spanish, Castilian (spa)
|
|
|
Modern French Corpus including Anaphors Tagging |
13 Mb |
This modern French corpus contains over 1 million words with anaphor tagging and covers many different aspect… |
French (fre)
|
ELRA-W0032 |
Details
Modern French Corpus including Anaphors Tagging
Name |
Modern French Corpus including Anaphors Tagging (ELRA-W0032) |
URL |
http://catalog.elra.info/product_info.php?products_id=634 |
Description |
This modern French corpus contains over 1 million words with anaphor tagging and covers many different aspects of the French language (scientific and human sciences articles, extracts from newspapers and magazines, legal texts, etc.). The annotation scheme was defined in XML. |
Languages |
French (fre)
|
|
|
Monolingual Greek corpus |
5.1 Mb |
Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA. |
Greek, Modern (1453-) (gre)
…
|
ELRA-W0014 |
Details
Monolingual Greek corpus
|
|
MTP Annotated German corpus - tagged version |
35 Mb |
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an… |
German (ger)
|
ELRA-W0008-02 |
Details
MTP Annotated German corpus - tagged version
Name |
MTP Annotated German corpus - tagged version (ELRA-W0008-02) |
URL |
http://catalog.elra.info/product_info.php?products_id=480 |
Description |
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992. |
Languages |
German (ger)
|
|
|
MTP Annotated German corpus - untagged version |
283 Mb |
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an… |
German (ger)
|
ELRA-W0008-01 |
Details
MTP Annotated German corpus - untagged version
Name |
MTP Annotated German corpus - untagged version (ELRA-W0008-01) |
URL |
http://catalog.elra.info/product_info.php?products_id=47 |
Description |
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992. |
Languages |
German (ger)
|
|
|
MULTEXT JOC Corpus |
114 Mb |
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-… |
English (eng); French (fre); German (ger)…
|
ELRA-W0017 |
Details
MULTEXT JOC Corpus
Name |
MULTEXT JOC Corpus (ELRA-W0017) |
URL |
http://catalog.elra.info/product_info.php?products_id=534 |
Description |
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level. |
Languages |
- English (eng)
- French (fre)
- German (ger)
- Italian (ita)
- Spanish, Castilian (spa)
|
|
|
Multilingual Corpus |
9.9 Mb |
Multilingual parallel corpus produced by KAIST KORTERM containing 60,000 expressions in Korean, Chinese and English. |
Chinese (chi); English (eng); Korean (kor…
|
ELRA-W0035 |
|
|
NE3L named entities Arabic corpus |
3 Mb |
The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi… |
Arabic (ara)
|
ELRA-W0078 |
Details
NE3L named entities Arabic corpus
Name |
NE3L named entities Arabic corpus (ELRA-W0078) |
URL |
http://catalog.elra.info/product_info.php?products_id=1226 |
Description |
The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount. |
Languages |
Arabic (ara)
|
|
|
NE3L named entities Chinese corpus |
4.8 Mb |
The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi… |
Chinese (chi)
|
ELRA-W0079 |
Details
NE3L named entities Chinese corpus
Name |
NE3L named entities Chinese corpus (ELRA-W0079) |
URL |
http://catalog.elra.info/product_info.php?products_id=1227 |
Description |
The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation. |
Languages |
Chinese (chi)
|
|
|
NE3L named entities Russian corpus |
2.7 Mb |
The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995.… |
Russian (rus)
|
ELRA-W0080 |
Details
NE3L named entities Russian corpus
Name |
NE3L named entities Russian corpus (ELRA-W0080) |
URL |
http://catalog.elra.info/product_info.php?products_id=1228 |
Description |
The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount. |
Languages |
Russian (rus)
|
|
|
NEMLAR Written Corpus |
136 Mb |
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is pr… |
Arabic (ara)
|
ELRA-W0042 |
Details
NEMLAR Written Corpus
Name |
NEMLAR Written Corpus (ELRA-W0042) |
URL |
http://catalog.elra.info/product_info.php?products_id=873 |
Description |
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags. |
Languages |
Arabic (ara)
|
|
|
Nepali Monolingual written corpus |
683 Mb |
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (… |
Nepali (nep)
|
ELRA-W0076 |
Details
Nepali Monolingual written corpus
Name |
Nepali Monolingual written corpus (ELRA-W0076) |
URL |
http://catalog.elra.info/product_info.php?products_id=1216 |
Description |
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2000 words each published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words. |
Languages |
Nepali (nep)
|
|
|
NPChunks |
412 Kb |
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randoml… |
Portuguese (por)
|
ELRA-W0089 |
Details
NPChunks
Name |
NPChunks (ELRA-W0089) |
URL |
http://catalog.elra.info/product_info.php?products_id=1256 |
Description |
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun Phrases were annotated with specific tags. It was automatically PoS-tagged with MBT tagger, and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese. |
Languages |
Portuguese (por)
|
|
|
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain |
11 Mb |
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment doma… |
English (eng); French (fre)
…
|
ELRA-W0057 |
Details
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain
Name |
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain (ELRA-W0057) |
URL |
http://catalog.elra.info/product_info.php?products_id=1182 |
Description |
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets. |
Languages |
- English (eng)
- French (fre)
|
|
|
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain |
16 Mb |
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislati… |
English (eng); Greek, Modern (1453-) (gre…
|
ELRA-W0058 |
Details
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain
Name |
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain (ELRA-W0058) |
URL |
http://catalog.elra.info/product_info.php?products_id=1183 |
Description |
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets. |
Languages |
- English (eng)
- Greek, Modern (1453-) (gre)
|
|
|
PANACEA Environment English monolingual corpus |
2.7 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan… |
English (eng)
|
ELRA-W0063 |
Details
PANACEA Environment English monolingual corpus
Name |
PANACEA Environment English monolingual corpus (ELRA-W0063) |
URL |
http://catalog.elra.info/product_info.php?products_id=1184 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites. |
Languages |
English (eng)
|
|
|
PANACEA Environment French monolingual corpus |
2.1 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang… |
French (fre)
|
ELRA-W0065 |
Details
PANACEA Environment French monolingual corpus
Name |
PANACEA Environment French monolingual corpus (ELRA-W0065) |
URL |
http://catalog.elra.info/product_info.php?products_id=1186 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites. |
Languages |
French (fre)
|
|
|
PANACEA Environment Greek monolingual corpus |
2 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu… |
Greek, Modern (1453-) (gre)
…
|
ELRA-W0067 |
Details
PANACEA Environment Greek monolingual corpus
Name |
PANACEA Environment Greek monolingual corpus (ELRA-W0067) |
URL |
http://catalog.elra.info/product_info.php?products_id=1188 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites. |
Languages |
Greek, Modern (1453-) (gre)
|
|
|
PANACEA Environment Italian monolingual corpus |
1.8 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan… |
Italian (ita)
|
ELRA-W0069 |
Details
PANACEA Environment Italian monolingual corpus
Name |
PANACEA Environment Italian monolingual corpus (ELRA-W0069) |
URL |
http://catalog.elra.info/product_info.php?products_id=1190 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites. |
Languages |
Italian (ita)
|
|
|
PANACEA Environment Spanish monolingual corpus |
2.3 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan… |
Spanish, Castilian (spa)
|
ELRA-W0071 |
Details
PANACEA Environment Spanish monolingual corpus
Name |
PANACEA Environment Spanish monolingual corpus (ELRA-W0071) |
URL |
http://catalog.elra.info/product_info.php?products_id=1192 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites. |
Languages |
Spanish, Castilian (spa)
|
|
|
PANACEA Labour English monolingual corpus |
1.6 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan… |
English (eng)
|
ELRA-W0064 |
Details
PANACEA Labour English monolingual corpus
Name |
PANACEA Labour English monolingual corpus (ELRA-W0064) |
URL |
http://catalog.elra.info/product_info.php?products_id=1185 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites. |
Languages |
English (eng)
|
|
|
PANACEA Labour French monolingual corpus |
2.5 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang… |
French (fre)
|
ELRA-W0066 |
Details
PANACEA Labour French monolingual corpus
Name |
PANACEA Labour French monolingual corpus (ELRA-W0066) |
URL |
http://catalog.elra.info/product_info.php?products_id=1187 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites. |
Languages |
French (fre)
|
|
|
PANACEA Labour Greek monolingual corpus |
1.4 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu… |
Greek, Modern (1453-) (gre)
…
|
ELRA-W0068 |
Details
PANACEA Labour Greek monolingual corpus
Name |
PANACEA Labour Greek monolingual corpus (ELRA-W0068) |
URL |
http://catalog.elra.info/product_info.php?products_id=1189 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites. |
Languages |
Greek, Modern (1453-) (gre)
|
|
|
PANACEA Labour Italian monolingual corpus |
2.4 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan… |
Italian (ita)
|
ELRA-W0070 |
Details
PANACEA Labour Italian monolingual corpus
Name |
PANACEA Labour Italian monolingual corpus (ELRA-W0070) |
URL |
http://catalog.elra.info/product_info.php?products_id=1191 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites. |
Languages |
Italian (ita)
|
|
|
PANACEA Labour Spanish monolingual corpus |
1.9 Gb |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan… |
Spanish, Castilian (spa)
|
ELRA-W0072 |
Details
PANACEA Labour Spanish monolingual corpus
Name |
PANACEA Labour Spanish monolingual corpus (ELRA-W0072) |
URL |
http://catalog.elra.info/product_info.php?products_id=1193 |
Description |
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites. |
Languages |
Spanish, Castilian (spa)
|
|
|
PAROLE French Corpus |
349 Mb |
The PAROLE French corpus contains the following data:
Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingual … |
French (fre)
|
ELRA-W0020 |
Details
PAROLE French Corpus
Name |
PAROLE French Corpus (ELRA-W0020) |
URL |
http://catalog.elra.info/product_info.php?products_id=565 |
Description |
The PAROLE French corpus contains the following data:
- Miscellaneous (data provided by ELRA: CRATER, MLCC Multilingual and Parallel Corpora): 2 025 964 words
- Books (CNRS Editions): 3 267 409 words
- Periodicals (CNRS Info, Hermès): 942 963 words
- Newspapers (Le Monde, provided by ELRA): 13 856 763 words
- Total: 20 093 099 words |
Languages |
French (fre)
|
|
|
PAROLE Irish Distributable Corpus |
25 Mb |
This corpus consists of over 8 million words. The text is marked-up in accordance with the PAROLE encoding standard. All… |
Irish (gle)
|
ELRA-W0026 |
Details
PAROLE Irish Distributable Corpus
Name |
PAROLE Irish Distributable Corpus (ELRA-W0026) |
URL |
http://catalog.elra.info/product_info.php?products_id=597 |
Description |
This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All the files are in SGML format with a detailed header and the body of the text tagged to paragraph level. A subset of the corpus is morpho-syntactically tagged. This distribution includes approximately 3,000 manually checked words. |
Languages |
Irish (gle)
|
|
|
PAROLE Italian Corpus |
44 Mb |
The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words)… |
Italian (ita)
|
ELRA-W0043 |
Details
PAROLE Italian Corpus
Name |
PAROLE Italian Corpus (ELRA-W0043) |
URL |
http://catalog.elra.info/product_info.php?products_id=886 |
Description |
The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words), periodicals (143,810 words), books (564,964 words), miscellaneous (247,077 words). About 250,000 words were morphosyntactically annotated and lemmatized. |
Languages |
Italian (ita)
|
|
|
PAROLE Portuguese Corpus - complete version |
57 Mb |
The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Mediu… |
Portuguese (por)
|
ELRA-W0024-01 |
Details
PAROLE Portuguese Corpus - complete version
Name |
PAROLE Portuguese Corpus - complete version (ELRA-W0024-01) |
URL |
http://catalog.elra.info/product_info.php?products_id=765 |
Description |
The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by medium (Newspaper, Book, Periodical, Miscellaneous).
The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML.
Also available is a subcorpus of about 250,000 morpho-syntactically tagged words. Disambiguation was manually checked. |
Languages |
Portuguese (por)
|
|
|
Persian 1984 corpus (Multext-East framework) |
5.9 Mb |
This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-E… |
Persian (per)
|
ELRA-W0054 |
Details
Persian 1984 corpus (Multext-East framework)
Name |
Persian 1984 corpus (Multext-East framework) (ELRA-W0054) |
URL |
http://catalog.elra.info/product_info.php?products_id=1124 |
Description |
This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in XML format following the TEI guidelines. Annotation includes POS (part-of-speech) tags and lemmas. |
Languages |
Persian (per)
|
|
|
PRESS 65 |
6.3 Mb |
Over 1 million running words taken from Swedish newspapers from the year 1965. |
Swedish (swe)
|
ELRA-W0010 |
|
|
PTPARL Corpus |
25 Mb |
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The … |
Portuguese (por)
|
ELRA-W0060 |
Details
PTPARL Corpus
Name |
PTPARL Corpus (ELRA-W0060) |
URL |
http://catalog.elra.info/product_info.php?products_id=1179 |
Description |
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file is one token per line, followed by pos tag and lemma, and is annotated for NP chunks. |
Languages |
Portuguese (por)
|
|
|
Quaero Old Press Extended Named Entity corpus |
6.8 Gb |
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French … |
French (fre)
|
ELRA-W0073 |
Details
Quaero Old Press Extended Named Entity corpus
Name |
Quaero Old Press Extended Named Entity corpus (ELRA-W0073) |
URL |
http://catalog.elra.info/product_info.php?products_id=1194 |
Description |
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition. |
Languages |
French (fre)
|
|
|
Qualified POS Tagged Corpus |
66 Mb |
Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. Th… |
Korean (kor)
|
ELRA-W0034 |
Details
Qualified POS Tagged Corpus
Name |
Qualified POS Tagged Corpus (ELRA-W0034) |
URL |
http://catalog.elra.info/product_info.php?products_id=654 |
Description |
Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. This corpus is morphologically analyzed, POS tagged, and rectified 3 times by specialists. |
Languages |
Korean (kor)
|
|
|
ROCO Romanian journalistic corpus |
729 Mb |
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. … |
Romanian (rum)
|
ELRA-W0085 |
Details
ROCO Romanian journalistic corpus
Name |
ROCO Romanian journalistic corpus (ELRA-W0085) |
URL |
http://catalog.elra.info/product_info.php?products_id=1249 |
Description |
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded. |
Languages |
Romanian (rum)
|
|
|
ROMBAC - Romanian balanced corpus |
1.1 Gb |
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, me… |
Romanian (rum)
|
ELRA-W0088 |
Details
ROMBAC - Romanian balanced corpus
Name |
ROMBAC - Romanian balanced corpus (ELRA-W0088) |
URL |
http://catalog.elra.info/product_info.php?products_id=1253 |
Description |
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded. |
Languages |
Romanian (rum)
|
|
|
Tagged text in French (MEMODATA) with rules of morphological disambiguation |
3.1 Gb |
More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged … |
French (fre)
|
ELRA-W0012 |
Details
Tagged text in French (MEMODATA) with rules of morphological disambiguation
Name |
Tagged text in French (MEMODATA) with rules of morphological disambiguation (ELRA-W0012) |
URL |
http://catalog.elra.info/product_info.php?products_id=50 |
Description |
More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It includes works by several 19th-century authors (Balzac, Hugo, Stendhal).
See also W0011. |
Languages |
French (fre)
|
|
|
Tagged text in French (MEMODATA) with typographic tags |
247 Mb |
Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books… |
French (fre)
|
ELRA-W0011 |
Details
Tagged text in French (MEMODATA) with typographic tags
Name |
Tagged text in French (MEMODATA) with typographic tags (ELRA-W0011) |
URL |
http://catalog.elra.info/product_info.php?products_id=49 |
Description |
Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books is available for research only. The books are by 19th-century authors.
See also W0012. |
Languages |
French (fre)
|
|
|
Text corpus of "Le Monde" (1987-2012) |
3.9 Gb |
Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data… |
French (fre)
|
ELRA-W0015 |
Details
Text corpus of "Le Monde" (1987-2012)
Name |
Text corpus of "Le Monde" (1987-2012) (ELRA-W0015) |
URL |
http://catalog.elra.info/product_info.php?products_id=438 |
Description |
Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data ranging from 1987 until 2012 are available (total 1,199,143 articles). |
Languages |
French (fre)
|
|
|
The CINTIL Corpus International Corpus of Portuguese |
20 Mb |
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portugue… |
Portuguese (por)
|
ELRA-W0050 |
Details
The CINTIL Corpus International Corpus of Portuguese
Name |
The CINTIL Corpus International Corpus of Portuguese (ELRA-W0050) |
URL |
http://catalog.elra.info/product_info.php?products_id=1102 |
Description |
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus is developed over raw textual materials of several types, of which 30% are spoken materials. |
Languages |
Portuguese (por)
|
|
|
The EMILLE/CIIL Corpus |
1.5 Gb |
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian lan… |
Urdu (urd); Telugu (tel); Tamil (tam); Si…
|
ELRA-W0037 |
Details
The EMILLE/CIIL Corpus
Name |
The EMILLE/CIIL Corpus (ELRA-W0037) |
URL |
http://catalog.elra.info/product_info.php?products_id=696 |
Description |
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus). |
Languages |
- Urdu (urd)
- Telugu (tel)
- Tamil (tam)
- Sinhalese (sin)
- Panjabi, Punjabi (pan)
- Oriya (ori)
- Marathi (mar)
- Malayalam (mal)
- Kashmiri (kas)
- Kannada (kan)
- Hindi (hin)
- Gujarati (guj)
- Bengali (ben)
- Assamese (asm)
|
|
|
The Lancaster Corpus of Mandarin Chinese (LCMC) |
45 Mb |
The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories including news, literary texts, acad… |
Chinese (chi)
|
ELRA-W0039 |
Details
The Lancaster Corpus of Mandarin Chinese (LCMC)
Name |
The Lancaster Corpus of Mandarin Chinese (LCMC) (ELRA-W0039) |
URL |
http://catalog.elra.info/product_info.php?products_id=715 |
Description |
The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The corpus is encoded in Unicode (UTF-8) and marked up in XML. |
Languages |
Chinese (chi)
|
|
|
TRAD Pashto-English News Articles Parallel corpus |
602 Kb |
This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The… |
English (eng); Pushto (pus)
…
|
ELRA-W0097 |
Details
TRAD Pashto-English News Articles Parallel corpus
Name |
TRAD Pashto-English News Articles Parallel corpus (ELRA-W0097) |
URL |
http://catalog.elra.info/product_info.php?products_id=1271 |
Description |
This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto. |
Languages |
- English (eng)
- Pushto (pus)
|
|
|
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data |
575 Kb |
This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 bro… |
English (eng); Pushto (pus)
…
|
ELRA-W0095 |
Details
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
Name |
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0095) |
URL |
http://catalog.elra.info/product_info.php?products_id=1269 |
Description |
This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381). |
Languages |
- English (eng)
- Pushto (pus)
|
|
|
TRAD Pashto-French News Articles Parallel corpus |
970 Kb |
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … |
French (fre); Pushto (pus)
…
|
ELRA-W0096 |
Details
TRAD Pashto-French News Articles Parallel corpus
Name |
TRAD Pashto-French News Articles Parallel corpus (ELRA-W0096) |
URL |
http://catalog.elra.info/product_info.php?products_id=1270 |
Description |
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto. |
Languages |
- French (fre)
- Pushto (pus)
|
|
|
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data |
29 Mb |
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … |
French (fre); Pushto (pus)
…
|
ELRA-W0094 |
Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
Name |
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0094) |
URL |
http://catalog.elra.info/product_info.php?products_id=1268 |
Description |
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381). |
Languages |
- French (fre)
- Pushto (pus)
|
|
|
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data |
473 Mb |
This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Spee… |
French (fre); Pushto (pus)
…
|
ELRA-W0093 |
Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
Name |
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA-W0093) |
URL |
http://catalog.elra.info/product_info.php?products_id=1267 |
Description |
This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381) translated into French. It contains about 832,000 source words and 747,000 target words. |
Languages |
- French (fre)
- Pushto (pus)
|
|
|
TRAD Pashto Monolingual text Corpus |
2.2 Gb |
This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different b… |
Pushto (pus)
|
ELRA-W0092 |
Details
TRAD Pashto Monolingual text Corpus
|
|
TSNLP (Test Suites for NLP Testing) |
4.5 Mb |
Test Suites for Natural Language Processing.
4,000 test items (sentences or fragments of sentences) in English, French… |
English (eng); French (fre); German (ger)…
|
ELRA-W0013 |
Details
TSNLP (Test Suites for NLP Testing)
Name |
TSNLP (Test Suites for NLP Testing) (ELRA-W0013) |
URL |
http://catalog.elra.info/product_info.php?products_id=51 |
Description |
Test Suites for Natural Language Processing.
4,000 test items (sentences or fragments of sentences) in English, French & German, useful for NL system evaluation. |
Languages |
- English (eng)
- French (fre)
- German (ger)
|
|
|
Venice Italian Treebank (VIT) |
149 Mb |
The VIT (Venice Italian Treebank) contains about 272,000 words distributed over six different domains: bureaucratic, pol… |
Italian (ita)
|
ELRA-W0040 |
Details
Venice Italian Treebank (VIT)
Name |
Venice Italian Treebank (VIT) (ELRA-W0040) |
URL |
http://catalog.elra.info/product_info.php?products_id=831 |
Description |
The VIT (Venice Italian Treebank) contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.
The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes. The format is plain text with square bracketing. A UPenn-style version, readable by the open-source query language CorpusSearch, is also provided. |
Languages |
Italian (ita)
|
|
|
Wolverhampton Business English Corpus |
118 Mb |
Produced by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the frame… |
English (eng)
|
ELRA-W0028 |
Details
Wolverhampton Business English Corpus
Name |
Wolverhampton Business English Corpus (ELRA-W0028) |
URL |
http://catalog.elra.info/product_info.php?products_id=627 |
Description |
Produced by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the Business English Corpus consists of 10,186,259 words collected from 23 different Web sites related to business. |
Languages |
English (eng)
|
|
|