
Search in the Catalogues and Directories

Hits 1 – 3 of 3

1
For a fistful of blogs: Discovery and comparative benchmarking of republishable German content
BASE
2
Challenges in the linguistic exploitation of specialized republishable web corpora
In: RESAW conference 2015 ; https://halshs.archives-ouvertes.fr/halshs-01167324 ; RESAW conference 2015, Jun 2015, Aarhus, Denmark (2015)
Abstract: Short paper talk at the RESAW 2015 conference (Aarhus, Denmark). ; International audience ; I would like to present work on text corpora in German, gathered on the Web and processed in order to be made available to linguists and a broader user community via a web interface. The corpora are specialized in the sense that they only address a particular text genre or source at a time. Web crawling techniques are used to download the documents, which are then stored roughly in the way web archives do. More precisely, I would like to talk about two cases where texts are expected to be republishable: a "standard" case, political speeches, and a "borderline" case, German blogs under CC license.

The work is performed in the context of a digital dictionary of German. The primary user base consists of lexicographers, who need valuable or at least exploitable evidence, in the form of precise quotes or definition elements.

The actual gathering and processing of the corpora is described elsewhere (anonymized references). In this talk I would like to focus on a series of challenges that are to be solved in order to make data from web archives accessible to researchers and to study web text corpora: metadata extraction, quality assurance, licensing, and "scientificity".

1. Proper metadata extraction is needed in order to make further downstream applications possible. It has to be performed meticulously, since experience shows that even small or rare mistakes, in date encoding for instance, may cause the application to be disregarded or discarded by researchers in the humanities: linguistic trends cannot be identified properly if the content is not ordered in time. Easily available metadata in the case of speeches contrast with different content types, encodings, and markup patterns concerning the blogs. Compromises have to be made without sacrificing recall, since republishable texts are rather rare.

2. Regarding the content, quality assurance is paramount, since high quality is expected by users, all the more since they may feel reluctant to use web texts for their studies. In fact, providing "Hi-Fi" web corpora also means promoting the cause of web sources and the modernization of research methodology.

3. The results are hosted in Germany, and thus German copyright laws apply, which can be considered to be more restrictive than others. Additionally, there are a number of issues with licensing in general and CC licenses in particular, even with manual verification: the CC ND and (to a lesser extent) NC predicates can hinder proper republication. There are also potential copyright issues regarding blog comments.

To sum up the issues described above, much work flows into ensuring the "scientificity" of web texts and making the texts not only available but also citable in a scholarly sense.
Keyword: [SHS.LANGUE]Humanities and Social Sciences/Linguistics; Creative Commons licenses; Linguistic Corpus; Quality Assessment; Web Archives
URL: https://halshs.archives-ouvertes.fr/halshs-01167324/document
https://halshs.archives-ouvertes.fr/halshs-01167324/file/ABarbaresi_RESAW15_article.pdf
https://halshs.archives-ouvertes.fr/halshs-01167324
BASE
3
Collection, Description, and Visualization of the German Reddit Corpus
In: 2nd Workshop on Natural Language Processing for Computer-Mediated Communication ; https://hal.archives-ouvertes.fr/hal-01207311 ; 2nd Workshop on Natural Language Processing for Computer-Mediated Communication, Sep 2015, Essen, Germany. pp.7-11 ; https://sites.google.com/site/nlp4cmc2015/program (2015)
BASE
