
Search in the Catalogues and Directories

Hits 41 – 60 of 267

41
TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources ...
Romary, Laurent; Tasovac, Toma. - : Zenodo, 2019
BASE
42
TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language
BASE
43
TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language
BASE
44
MKM – ein Metamodell für Korpusmetadaten
Odebrecht, Carolin [author]; Lüdeling, Anke [reviewer]; Romary, Laurent [reviewer]. - Berlin : Humboldt-Universität zu Berlin, 2018
DNB Subject Category Language
45
Tutoring Systems and Computer-Assisted Language Learning (CALL)
Mehler, Alexander [editor]; Lobin, Henning [author]; Rösler, Dietmar [author]. - Mannheim : Institut für Deutsche Sprache, Bibliothek, 2018
DNB Subject Category Language
46
[tiger2] As a standardized serialisation for ISO 24615 - SynAF
Pareja-Lora, Antonio [author]; Zeldes, Amir [author]; Romary, Laurent [author]. - Mannheim : Institut für Deutsche Sprache, Bibliothek, 2018
DNB Subject Category Language
47
Representing human and machine dictionaries in markup languages (SGML, XML)
Witt, Andreas [author]; Romary, Laurent [author]; Schweickard, Wolfgang [editor]. - Mannheim : Institut für Deutsche Sprache, Bibliothek, 2018
DNB Subject Category Language
48
Bridging the Gaps between Digital Humanities, Lexicography, and Linguistics: A TEI Dictionary for the Documentation of Mixtepec-Mixtec
In: ISSN: 2160-5076 ; Dictionaries: Journal of the Dictionary Society of North America ; https://hal.inria.fr/hal-01968871 ; Dictionaries: Journal of the Dictionary Society of North America, Dictionary Society of North America, 2018, 39 (2), pp.79-106 (2018)
BASE
49
TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources
In: TEI Conference and Members' Meeting ; https://hal.inria.fr/hal-02265312 ; TEI Conference and Members' Meeting, Sep 2018, Tokyo, Japan (2018)
BASE
50
Enhancing Usability for Automatically Structuring Digitised Dictionaries
In: GLOBALEX workshop at LREC 2018 ; https://hal.archives-ouvertes.fr/hal-01708137 ; GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan (2018)
BASE
51
Retro-digitizing and Automatically Structuring a Large Bibliography Collection
In: European Association for Digital Humanities (EADH) Conference ; https://hal.archives-ouvertes.fr/hal-01941534 ; European Association for Digital Humanities (EADH) Conference, EADH, Dec 2018, Galway, Ireland (2018)
BASE
52
A stand-off XML-TEI representation of reference annotation
In: DGfS 2018: 40. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft ; https://hal.inria.fr/hal-01876327 ; DGfS 2018: 40. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, Mar 2018, Stuttgart, Germany. 2017 (2018)
BASE
53
A Diachronic Digital Edition of the Petit Larousse illustré
In: Journée d'étude CORLI : Traitements et standardisation des corpus multimodaux et web 2.0. ; https://hal.archives-ouvertes.fr/hal-01873805 ; Journée d'étude CORLI : Traitements et standardisation des corpus multimodaux et web 2.0., May 2018, Paris, France (2018)
BASE
54
Automatically Encoding Encyclopedic-like Resources in TEI
In: The annual TEI Conference and Members Meeting ; https://hal.inria.fr/hal-01819505 ; The annual TEI Conference and Members Meeting, Sep 2018, Tokyo, Japan ; https://tei2018.dhii.asia/ (2018)
BASE
55
TEI-Lex0 Etym - towards terse(r) recommendations for the encoding of etymological information
In: TEI Conference and Members' Meeting ; https://hal.inria.fr/hal-02075506 ; TEI Conference and Members' Meeting, Sep 2018, Tokyo, Japan (2018)
BASE
56
Encoding Mixtepec-Mixtec Etymology in TEI
In: TEI Conference and Members' Meeting ; https://hal.inria.fr/hal-02003975 ; TEI Conference and Members' Meeting, Sep 2018, Tokyo, Japan (2018)
BASE
57
Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré
In: GLOBALEX 2018 - Globalex workshop at LREC2018 ; https://hal.archives-ouvertes.fr/hal-01728328 ; GLOBALEX 2018 - Globalex workshop at LREC2018, May 2018, Miyazaki, Japan. pp.1-6 ; https://globalex.link/globalex2018/ (2018)
BASE
58
MKM – ein Metamodell für Korpusmetadaten
Odebrecht, Carolin. - : Humboldt-Universität zu Berlin, 2018
BASE
59
TBX in ODD: Schema-agnostic specification and documentation for TermBase eXchange
In: LOTKS 2017- Workshop on Language, Ontology, Terminology and Knowledge Structures ; https://hal.inria.fr/hal-01581440 ; LOTKS 2017- Workshop on Language, Ontology, Terminology and Knowledge Structures, Sep 2017, Montpellier, France ; https://langandonto.github.io/LangOnto-TermiKS-2017/ (2017)
BASE
60
Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields
In: electronic lexicography, eLex 2017 ; https://hal.archives-ouvertes.fr/hal-01508868 ; electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands (2017)
Abstract: A large number of digitized lexical resources remain unexploited because their content is unstructured. Manually structuring such resources is costly given their multifold complexity. Our goal is to find an approach that automatically structures digitized dictionaries, independently of the language or the lexicographic school or style. In this paper we present a first version of GROBID-Dictionaries, an open source machine learning system for lexical information extraction. Our approach is twofold: we perform a cascading structure extraction, and at each level we select specific features for training. We followed a "divide to conquer" strategy to dismantle text constructs in a digitized dictionary, based on the observation of their layout. Main pages (see Figure 1) in almost any dictionary share three blocks: a header (green), a footer (blue) and a body (orange). The body is, in turn, made up of several entries (red). Each lexical entry can be further decomposed (see Figure 2) into form (green), etymology (blue), sense (red) and/or related entry. The same logic could be applied further to each extracted block, but in the scope of this paper we focus on the first three levels only. The cascading approach ensures a better understanding of the learning process's output and consequently simplifies the feature selection process. Having a limited number of exclusive text blocks per level helps significantly in diagnosing the cause of prediction errors. It allows early detection and replacement of irrelevant selected features that can bias a trained model. With such a segmentation, it becomes easier to notice that, for instance, the token position in the page is very relevant for detecting headers and footers but has almost no pertinence for capturing a sense in a lexical entry, which is very often split across two pages. To implement our approach, we built on the available infrastructure of GROBID [7], a machine learning system for the extraction of bibliographic metadata. GROBID adopts the same cascading approach and uses Conditional Random Fields (CRF) [6] to label text sequences. The output of GROBID-Dictionaries is planned to be a TEI-compliant encoding [2, 9] in which the various segmentation levels are associated with an appropriate XML tessellation. Collaboration with COST ENeL is ongoing to ensure maximal compatibility with existing dictionary projects. Our experiments so far justify our choices: models for the first two levels, trained on two different dictionary samples, have given high precision and recall with a small amount of annotated data. Relying mainly on the text layout, we tried to diversify the selected features for each model, at the token and line levels. We are working on tuning features and annotating more data to maintain the good results with new samples and to improve the third segmentation level. While just a few task-specific attempts [1] have used machine learning in this research direction, the landscape remains dominated by rule-based techniques [4, 3, 8], which are ad hoc and costly, even impossible, to adapt to new lexical resources.
Keyword: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; [SHS.LANGUE]Humanities and Social Sciences/Linguistics; [STAT.ML]Statistics [stat]/Machine Learning [stat.ML]; automatic structuring; CRF; digitized dictionaries; machine learning; TEI
URL: https://hal.archives-ouvertes.fr/hal-01508868v2/document
https://hal.archives-ouvertes.fr/hal-01508868
https://hal.archives-ouvertes.fr/hal-01508868v2/file/eLex-2017-Template.pdf
BASE
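The cascading idea in the abstract of hit 60 (segment the page first, then label only the body tokens with a different feature set) can be illustrated with a minimal sketch. Everything below is hypothetical and is not taken from GROBID-Dictionaries: the `Token` fields, the feature names, and above all the two toy labelers, which are simple hand-written stand-ins for the trained CRF models the abstract describes. The intermediate level that splits the body into entries is skipped for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str
    line_index: int      # index of the token's line within the page
    lines_in_page: int   # total number of lines on the page
    bold: bool = False   # hypothetical typographic cue


def page_level_features(tok: Token) -> Dict[str, object]:
    """Page level (header / body / footer): position in the page is kept."""
    rel = tok.line_index / max(tok.lines_in_page - 1, 1)
    return {
        "rel_pos": round(rel, 2),
        "is_first_line": tok.line_index == 0,
        "is_last_line": tok.line_index == tok.lines_in_page - 1,
    }


def entry_level_features(tok: Token) -> Dict[str, object]:
    """Entry level (form / etym / sense): position is dropped, typography kept."""
    return {
        "bold": tok.bold,
        "is_bracketed": tok.text.startswith("["),
        "is_capitalized": tok.text[:1].isupper(),
    }


def toy_page_labeler(feats: List[Dict[str, object]]) -> List[str]:
    """Placeholder for a trained page-level sequence model (assumption)."""
    return [
        "header" if f["is_first_line"] else "footer" if f["is_last_line"] else "body"
        for f in feats
    ]


def toy_entry_labeler(feats: List[Dict[str, object]]) -> List[str]:
    """Placeholder for a trained entry-level sequence model (assumption)."""
    return [
        "form" if f["bold"] else "etym" if f["is_bracketed"] else "sense"
        for f in feats
    ]


def cascade(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Label the page, then pass only body tokens to the entry-level model."""
    page_labels = toy_page_labeler([page_level_features(t) for t in tokens])
    body = [t for t, label in zip(tokens, page_labels) if label == "body"]
    entry_labels = toy_entry_labeler([entry_level_features(t) for t in body])
    return [(t.text, label) for t, label in zip(body, entry_labels)]


if __name__ == "__main__":
    page = [
        Token("A", 0, 4),
        Token("chat", 1, 4, bold=True),
        Token("[lat.", 1, 4),
        Token("feline", 2, 4),
        Token("12", 3, 4),
    ]
    print(cascade(page))
```

Keeping one feature function per level mirrors the abstract's point that token position helps with headers and footers but not with senses; in a real system each toy labeler would be replaced by a model trained on annotated pages.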