
Search in the Catalogues and Directories

Hits 1 – 20 of 38

1
Assessing the impact of OCR noise on multilingual event detection over digitised documents
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
Abstract: Event detection (ED) is a crucial task for natural language processing (NLP). It involves identifying instances of specified types of events in text and classifying them into event types. Detecting events in digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have addressed event detection in historical documents, the transcribed documents needed to be hand-validated, which required considerable human expertise and labor-intensive manual work. Thus, in this study, we explore the robustness of two different language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We aim to analyse their ability to mitigate problems caused by the low quality of the digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. To create the noisy synthetic data, we chose four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
Keyword: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Digitised Documents; Event Detection; Information Extraction
URL: https://hal.archives-ouvertes.fr/hal-03635985/file/IJDL2022-Assessing%20the%20Impact%20of%20OCR%20Noise%20on%20Multilingual%20Event%20Detection%20over%20Digitised%20Documents.pdf
https://doi.org/10.1007/s00799-022-00325-2
https://hal.archives-ouvertes.fr/hal-03635985/document
https://hal.archives-ouvertes.fr/hal-03635985
BASE
2
Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents
In: Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II ; https://hal.archives-ouvertes.fr/hal-03635971 ; Matthias Hagen; Suzan Verberne; Craig Macdonald; Christin Seifert; Krisztian Balog; Kjetil Nørvåg; Vinay Setty. Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, 13186, Springer International Publishing, pp.347-354, 2022, Lecture Notes in Computer Science, 978-3-030-99738-0. ⟨10.1007/978-3-030-99739-7_44⟩ (2022)
BASE
3
État de l'art du changement sémantique à partir de plongements contextualisés
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320337 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtuel), France (2021)
BASE
4
Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques
In: Conférence en Recherche d'Informations et Applications (CORIA 2021) ; https://hal.archives-ouvertes.fr/hal-03320332 ; Conférence en Recherche d'Informations et Applications (CORIA 2021), ARIA : Association Francophone de Recherche d’Information (RI) et Applications, Apr 2021, Grenoble (virtuel), France. pp.1 - 7 ; http://coria.asso-aria.org/2021/articles/mini_24/main.pdf (2021)
BASE
5
Multilingual Epidemic Event Extraction
In: Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings ; https://hal.archives-ouvertes.fr/hal-03480551 ; Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.139-156, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_12⟩ (2021)
BASE
6
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320343 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtuel), France (2021)
BASE
7
A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers
In: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval ; https://hal.archives-ouvertes.fr/hal-03418387 ; SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 2021, Virtual Event, Canada. pp.2328-2334, ⟨10.1145/3404835.3463255⟩ (2021)
BASE
8
Identification et gestion des données personnelles dans les textes ; Identification et gestion des données personnelles dans les textes: modèle sémantique et applications
In: CiDE.22 : 22éme édition du Colloque International sur le Document Electronique Données Documents Connaissances : Perspectives de recherche et d’enseignement ; https://hal.archives-ouvertes.fr/hal-03506075 ; CiDE.22 : 22éme édition du Colloque International sur le Document Electronique Données Documents Connaissances : Perspectives de recherche et d’enseignement, Dec 2021, Paris, France (2021)
BASE
9
Data Papers et dissémination des données de la recherche : quelles pratiques en SHS ?
In: Colloque DHNord2021 : Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux ; https://hal.archives-ouvertes.fr/hal-03506077 ; Colloque DHNord2021 : Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux, Nov 2021, virtuelle, France (2021)
BASE
10
Building An Automated Gesture Imitation Game For Teenagers with ASD
In: ISSN: 0973-7006 ; Far East Journal of Electronics and Communications ; https://hal-imt-atlantique.archives-ouvertes.fr/hal-02894314 ; Far East Journal of Electronics and Communications, 2020, 23 (1), pp.1 - 10. ⟨10.17654/EC023010001⟩ (2020)
BASE
11
Dataset for Temporal Analysis of English-French Cognates
In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) ; 12th Conference on Language Resources and Evaluation (LREC 2020) ; https://hal.archives-ouvertes.fr/hal-03026957 ; 12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp.855-859, ⟨10.5281/zenodo.3693650⟩ (2020)
BASE
12
A Dataset for Multi-lingual Epidemiological Event Extraction
In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) ; https://hal.archives-ouvertes.fr/hal-02732848 ; Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp.4139-4144 (2020)
BASE
13
Impact Analysis of Document Digitization on Event Extraction
In: CEUR Workshop Proceedings ; 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020) ; https://hal.archives-ouvertes.fr/hal-03026148 ; 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020), Nov 2020, Virtual, Italy. pp.17-28 ; http://sag.art.uniroma2.it/NL4AI/ (2020)
BASE
14
Entity Linking for Historical Documents: Challenges and Solutions
In: 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020 ; https://hal.archives-ouvertes.fr/hal-03034492 ; 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020, 12504, Springer, pp.215-231, 2020, Lecture Notes in Computer Science, 978-3-030-64452-9. ⟨10.1007/978-3-030-64452-9_19⟩ (2020)
BASE
15
Robust Named Entity Recognition and Linking on Historical Multilingual Documents
In: Conference and Labs of the Evaluation Forum (CLEF 2020) ; https://hal.archives-ouvertes.fr/hal-03026969 ; Conference and Labs of the Evaluation Forum (CLEF 2020), Sep 2020, Thessaloniki, Greece. pp.1-17, ⟨10.5281/zenodo.4068074⟩ ; http://ceur-ws.org/Vol-2696/paper_171.pdf (2020)
BASE
16
Linking Named Entities across Languages using Multilingual Word Embeddings
In: JCDL '20: The ACM/IEEE Joint Conference on Digital Libraries in 2020 ; ACM/IEEE Joint Conference on Digital Libraries - JCDL 2020 ; https://hal.archives-ouvertes.fr/hal-03026933 ; ACM/IEEE Joint Conference on Digital Libraries - JCDL 2020, Aug 2020, Wuhan, Hubei - Virtual event, China. pp.329-332, ⟨10.1145/3383583.3398597⟩ ; https://dl.acm.org/doi/10.1145/3383583.3398597 (2020)
BASE
17
Concevoir un dispositif innovant pour professionaliser la formation au référencement web: le projet SEO-ELP
In: ACFAS ; https://hal.archives-ouvertes.fr/hal-03506079 ; ACFAS, 2020, Sherbrooke, Canada (2020)
BASE
18
L'école est morte, vive l'école
In: Langues romanes : interaction entre littératures et cultures nationales ; https://hal.archives-ouvertes.fr/hal-03554956 ; Langues romanes : interaction entre littératures et cultures nationales, Université d’Etat de la Région de Moscou, Jun 2020, Moscou, Russie. pp.403-422 (2020)
BASE
19
Beyond Metadata: the New Challenges in Mining Scientific Papers
In: the Eighth workshop on Bibliometric-enhanced Information Retrieval (BIR 2019) co-located with the 41st European Conference on Information Retrieval (ECIR 2019) ; https://hal.archives-ouvertes.fr/hal-03506084 ; the Eighth workshop on Bibliometric-enhanced Information Retrieval (BIR 2019) co-located with the 41st European Conference on Information Retrieval (ECIR 2019), Apr 2019, Cologne, Germany (2019)
BASE
20
Semantically-driven Competitive Intelligence Information Extraction : Linguistic Model and Application
In: CONTENT ; https://hal.archives-ouvertes.fr/hal-03506081 ; CONTENT, 2019, Venise, Italy (2019)
BASE
