1 |
Assessing the impact of OCR noise on multilingual event detection over digitised documents
|
|
|
|
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
|
|
Abstract:
Event detection (ED) is a crucial task in natural language processing (NLP): it involves identifying instances of specified types of events in text and classifying them into event types. Detecting events in digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the degradation of digitised documents and the quality of optical character recognition (OCR) tools might hinder the performance of an event detection system. Although several studies have detected events in historical documents, the transcribed documents had to be validated by hand, which required considerable human expertise and labour-intensive manual work. In this study, we therefore explore the robustness of two language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We analyse their ability to mitigate problems caused by the low quality of digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. To create the noisy synthetic data, we use four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
|
|
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Digitised Documents; Event Detection; Information Extraction
|
|
URL: https://hal.archives-ouvertes.fr/hal-03635985/file/IJDL2022-Assessing%20the%20Impact%20of%20OCR%20Noise%20on%20Multilingual%20Event%20Detection%20over%20Digitised%20Documents.pdf https://doi.org/10.1007/s00799-022-00325-2 https://hal.archives-ouvertes.fr/hal-03635985/document https://hal.archives-ouvertes.fr/hal-03635985
|
|
BASE
|
|
2 |
Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
|
|
|
|
BASE
|
|
3 |
Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
|
|
|
|
BASE
|
|
4 |
L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers
|
|
|
|
In: WWW '21: Companion Proceedings of the Web Conference 2021 ; WWW '21: The Web Conference 2021 ; https://hal.sorbonne-universite.fr/hal-03256324 ; WWW '21: The Web Conference 2021, Apr 2021, Ljubljana (virtual), Slovenia. pp.302-306, ⟨10.1145/3442442.3451384⟩ (2021)
|
|
BASE
|
|
5 |
Discovering Spatial Relations in Litterature: what is the influence of OCR noise ?
|
|
|
|
In: NewsEye’s international conference ; https://hal.archives-ouvertes.fr/hal-03199729 ; NewsEye’s international conference, Mar 2021, Paris, France (2021)
|
|
BASE
|
|
6 |
Multilingual Epidemic Event Extraction
|
|
|
|
In: Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings ; https://hal.archives-ouvertes.fr/hal-03480551 ; Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.139-156, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_12⟩ (2021)
|
|
BASE
|
|
7 |
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie
|
|
|
|
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320343 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtual), France (2021)
|
|
BASE
|
|
8 |
« Exploiter un corpus de données textuelles sans post-traitement : l’écriture burlesque de la Fronde »
|
|
|
|
In: ISSN: 2736-2337 ; Humanités numériques ; https://hal.archives-ouvertes.fr/hal-03500616 ; Humanités numériques, Bruxelles: Humanistica, 2021 (2021)
|
|
BASE
|
|
9 |
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
|
|
|
|
BASE
|
|
11 |
Impact Analysis of Document Digitization on Event Extraction ...
|
|
|
|
BASE
|
|
12 |
Token-level Multilingual Epidemic Dataset for Event Extraction ...
|
|
|
|
BASE
|
|
13 |
Impact Analysis of Document Digitization on Event Extraction ...
|
|
|
|
BASE
|
|
14 |
Token-level Multilingual Epidemic Dataset for Event Extraction ...
|
|
|
|
BASE
|
|
16 |
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
|
|
|
|
BASE
|
|
17 |
Multilingual Epidemiological Text Classification: A Comparative Study ...
|
|
|
|
BASE
|
|
18 |
Multilingual Epidemiological Text Classification: A Comparative Study ...
|
|
|
|
BASE
|
|
19 |
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
|
|
|
|
In: CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum ; https://hal.inria.fr/hal-02984746 ; CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Sep 2020, Thessaloniki / Virtual, Greece ; https://impresso.github.io/CLEF-HIPE-2020/ (2020)
|
|
BASE
|
|
20 |
Daniel@FinTOC’2 Shared Task: Title Detection and Structure Extraction
|
|
|
|
In: 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020 ; https://hal.archives-ouvertes.fr/hal-03024867 ; 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020, Dec 2020, Barcelona, Spain (2020)
|
|
BASE