
Search in the Catalogues and Directories

Hits 1 – 20 of 73

1
Das ZDL-Regionalkorpus: Ein Korpus für die lexikografische Beschreibung der diatopischen Variation im Standarddeutschen
Nolda, Andreas (author); Barbaresi, Adrien (author)
IDS Mannheim
2
A Reproducible IT-Blog Corpus
In: Journal of Open Humanities Data; Vol 7 (2021); 17 ; 2059-481X (2021)
BASE
3
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event) ...
Lüngen, Harald; Kupietz, Marc; Bański, Piotr. - : Leibniz-Institut für Deutsche Sprache, 2021
BASE
4
Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction ...
BASE
5
Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora
Arnold, Denis [author]; Fisseni, Bernhard [author]; Kamocki, Paweł [author]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2020
DNB Subject Category Language
6
Using Full Text Indices for Querying Spoken Language Data
Frick, Elena [author]; Schmidt, Thomas [author]; Bański, Piotr [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2020
DNB Subject Category Language
7
Proceedings of the LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020, 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8)
Bański, Piotr [editor]; Barbaresi, Adrien [editor]; Clematide, Simon [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2020
DNB Subject Category Language
8
Evaluating a Dependency Parser on DeReKo
Fankhauser, Peter [author]; Do, Bich-Ngoc [author]; Kupietz, Marc [author]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2020
DNB Subject Category Language
9
Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
In: Language Resources and Evaluation Conference (LREC 2020) ; https://hal.archives-ouvertes.fr/hal-02732851 ; Language Resources and Evaluation Conference (LREC 2020), 2020, pp.5-13 (2020)
Abstract: This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction? The evaluation grounds on a comparative benchmark of open-source tools used on pages in five different languages (Chinese, English, Greek, Polish and Russian), it features several metrics to obtain more fine-grained differentiations. Our experiments highlight the diversity of web page layouts across languages or publishing countries. These discrepancies are reflected by diverging performances so that the right tool has to be chosen accordingly.
Keyword: [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Boilerplate removal; Cleaneval; Evaluation metrics; Web Content Extraction; Web corpus construction
URL: https://hal.archives-ouvertes.fr/hal-02732851/document
https://hal.archives-ouvertes.fr/hal-02732851/file/2020.wac-1.2.pdf
https://hal.archives-ouvertes.fr/hal-02732851
BASE
10
htmldate: A Python package to extract publication dates from web pages ...
Barbaresi, Adrien. - : Zenodo, 2020
BASE
11
Proceedings of the LREC 2020: 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8)
In: Proceedings of the LREC 2020: 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8). Edited by: Bański, Piotr; Barbaresi, Adrien; Clematide, Simon; Kupietz, Marc; Lüngen, Harald; Pisetta, Ines (2020). Marseille, France: European Language Resources Association. (2020)
BASE
12
What's New in EuReCo? Interoperability, Comparable Corpora, Licensing
Kupietz, Marc [author]; Margaretha, Eliza [author]; Diewald, Nils [author]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
13
The Vast and the Focused: On the need for domain-focused web corpora
Barbaresi, Adrien [author]; Bański, Piotr [editor]; Barbaresi, Adrien [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
14
Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures
Ortiz Suárez, Pedro Javier [author]; Sagot, Benoît [author]; Romary, Laurent [author]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
15
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22 July 2019
Bański, Piotr [editor]; Barbaresi, Adrien [editor]; Biber, Hanno [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
16
Modelling large parallel corpora. The Zurich Parallel Corpus Collection
Graën, Johannes [author]; Kew, Tannon [author]; Shaitarova, Anastassia [author]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
17
Deduplication in large web corpora
Benko, Vladimír [author]; Bański, Piotr [editor]; Barbaresi, Adrien [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
18
The best of both worlds: Multi-billion word “dynamic” corpora
Lüngen, Harald [editor]; Breiteneder, Evelyn [editor]; Barbaresi, Adrien [editor]. - Mannheim : Leibniz-Institut für Deutsche Sprache (IDS), Bibliothek, 2019
DNB Subject Category Language
19
Diving Into The Complexities Of The Tech Blog Sphere
In: Digital Humanities 2019 ; https://hal.archives-ouvertes.fr/hal-02201532 ; Digital Humanities 2019, ADHO, Jul 2019, Utrecht, Netherlands ; https://dev.clariah.nl/files/dh2019/boa/0964.html (2019)
BASE
20
German Political Speeches Corpus ...
Barbaresi, Adrien. - : Zenodo, 2019
BASE
