Hits 81 – 100 of 117

81	Chinese Gigaword Fourth Edition
	Parker, Robert; Graff, David; Chen, Ke. - : Linguistic Data Consortium, 2009. : https://www.ldc.upenn.edu, 2009
	BASE

82	English Gigaword Fourth Edition
	Parker, Robert; Graff, David; Kong, Junbo. - : Linguistic Data Consortium, 2009. : https://www.ldc.upenn.edu, 2009
	BASE

83	Arabic Gigaword Fourth Edition ...
	Parker, Robert; Graff, David; Chen, Ke. - : Linguistic Data Consortium, 2009
	BASE

84	English Gigaword Fourth Edition ...
	Parker, Robert; Graff, David; Kong, Junbo. - : Linguistic Data Consortium, 2009
	BASE

85	REFLEX Entity Translation Training/DevTest ...
	Walker, Christopher; Chen, Song; Strassel, Stephanie. - : Linguistic Data Consortium, 2009
	BASE

86	Chinese Gigaword Fourth Edition ...
	Parker, Robert; Graff, David; Chen, Ke. - : Linguistic Data Consortium, 2009
	BASE

87	Syntax Sensitive And Language Independent Detection Of Code Clones ...
	Maeda, Kazuaki. - : Zenodo, 2009
	BASE

88	Syntax Sensitive And Language Independent Detection Of Code Clones ...
	Maeda, Kazuaki. - : Zenodo, 2009
	BASE

89	GALE Phase 1 Distillation Training
	Babko-Malaya, Olga; Chen, Song; Zakhary, Ramez. - : Linguistic Data Consortium, 2007. : https://www.ldc.upenn.edu, 2007
	BASE

90	English Gigaword Third Edition
	Graff, David; Kong, Junbo; Chen, Ke. - : Linguistic Data Consortium, 2007. : https://www.ldc.upenn.edu, 2007
	BASE

91	English Gigaword Third Edition ...
	Graff, David; Kong, Junbo; Chen, Ke. - : Linguistic Data Consortium, 2007
	BASE

92	GALE Phase 1 Distillation Training ...
	Babko-Malaya, Olga; Chen, Song; Zakhary, Ramez. - : Linguistic Data Consortium, 2007
	BASE

93	Speech Controlled Computing
	Cieri, Christopher; Miller, David; Martey, Nii O.. - : Linguistic Data Consortium, 2006. : https://www.ldc.upenn.edu, 2006
	BASE

94	TDT5 Topics and Annotations
	Glenn, Meghan; Strassel, Stephanie; Kong, Junbo. - : Linguistic Data Consortium, 2006. : https://www.ldc.upenn.edu, 2006
	BASE

95	TDT5 Multilingual Text
	Graff, David; Kong, Junbo; Maeda, Kazuaki. - : Linguistic Data Consortium, 2006. : https://www.ldc.upenn.edu, 2006
	BASE

96	Arabic Gigaword Second Edition
	Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. - : Linguistic Data Consortium, 2006. : https://www.ldc.upenn.edu, 2006
	Abstract:

	Introduction

	Arabic Gigaword Second Edition was developed by the Linguistic Data Consortium (LDC) and contains 1.6 million documents of Arabic newswire text collected by LDC. This second edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

	Data

	The following table contains information for this corpus, broken down by source. The information includes source codes represented in the corpus as well as their codes from the first edition, the collection span and number of documents new to this edition, the number of documents total, and the K-words (thousands of words) for each source. Ummah Press is a new source included in the second edition and therefore has no first edition info.

	Source                Second Edition Codes  First Edition Codes  Second Edition Collection Span  New Docs  Total Docs  K-words
	Agence France Presse  afp_arb               afa                  01/2003 - 12/2004                143,766     660,621  123,594
	Al Hayat News Agency  hyt_arb               alh                  01/2002 - 12/2003                 64,308     369,555  169,100
	An Nahar News Agency  nhr_arb               ann                  01/2003 - 01/2004                 16,316     344,084  151,078
	Ummah Press           umh_arb               n/a                  01/2003 - 12/2004                  4,641       4,641    1,201
	Xinhua News Agency    xin_arb               xia                  06/2003 - 12/2004                106,236     213,082   36,933
	Total                                                                                             335,267   1,591,983  481,906

	Further statistics for each source are included in the corpus documentation.

	All text files in this corpus have been converted to UTF-8 character encoding. Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.

	Each data file name consists of the seven-letter prefix, an underscore character ("_"), and a six-digit date representing the year and month during which the file contents were generated by the respective news source. Therefore, each file contains all the usable data received by LDC for the given month from the given news source.

	All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

	Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). All sources have received a uniform treatment in terms of quality control, and have been categorized into three distinct "types":

	- story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
	- multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
	- other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on

	Samples

	For an example of the data in this corpus, please view this sample (TXT).
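The file-naming convention described in the abstract (a seven-character source prefix, an underscore, and a six-digit date) can be illustrated with a small parser. This is only a sketch under stated assumptions: the abstract does not give a concrete file name, so the example name `afp_arb_200301`, the YYYYMM digit order, the optional file extension, and the helper `parse_file_name` are all hypothetical and not taken from the corpus documentation.

```python
import re
from typing import NamedTuple


class GigawordFileName(NamedTuple):
    source: str  # seven-character source code from the table, e.g. "afp_arb"
    year: int
    month: int


# Assumed layout: <3 letters>_<3 letters>_<YYYYMM>, optionally followed by
# one or more extensions (e.g. ".gz"). The YYYYMM order is an assumption.
_NAME_RE = re.compile(r"([a-z]{3}_[a-z]{3})_(\d{4})(\d{2})(?:\.\w+)*")


def parse_file_name(name: str) -> GigawordFileName:
    """Split a data file name into source code, year, and month."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return GigawordFileName(source, year, month)


if __name__ == "__main__":
    # Hypothetical file name built from the Agence France Presse source code.
    print(parse_file_name("afp_arb_200301"))
```

Grouping files by the parsed `source` and `(year, month)` would then mirror the one-file-per-source-per-month organisation the abstract describes.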
	URL: https://catalog.ldc.upenn.edu/LDC2006T02
	BASE

97	ACE 2005 Multilingual Training Corpus
	Walker, Christopher; Strassel, Stephanie; Medero, Julie. - : Linguistic Data Consortium, 2006. : https://www.ldc.upenn.edu, 2006
	BASE

98	Speech Controlled Computing ...
	Cieri, Christopher; Miller, David; Martey, Nii O.. - : Linguistic Data Consortium, 2006
	BASE

99	Arabic Gigaword Second Edition ...
	Graff, David; Chen, Ke; Kong, Junbo. - : Linguistic Data Consortium, 2006
	BASE

100	TDT5 Topics and Annotations ...
	Glenn, Meghan; Strassel, Stephanie; Kong, Junbo. - : Linguistic Data Consortium, 2006
	BASE
