1 |
Between History and Natural Language Processing: Study, Enrichment and Online Publication of French Parliamentary Debates of the Early Third Republic (1881-1899)
|
|
|
|
In: ParlaCLARIN III at LREC2022 - Workshop on Creating, Enriching and Using Parliamentary Corpora, Jun 2022, Marseille, France ; https://hal.archives-ouvertes.fr/hal-03623351 ; https://www.clarin.eu/ParlaCLARIN-III (2022)
|
|
Abstract:
We present the AGODA (Analyse sémantique et Graphes relationnels pour l'Ouverture des Débats à l'Assemblée nationale) project, which aims to create a platform for consulting and exploring digitised French parliamentary debates (1881-1940) available in the digital library of the National Library of France. This project brings together historians and NLP specialists: parliamentary debates are an essential source not only for contemporary French history but also for linguistics. The project therefore aims to produce a corpus of texts that can be easily exploited with computational methods and that respects the TEI standard. Historical parliamentary debates are also an excellent case study for the development and application of tools for publishing and exploring large historical corpora. In this paper, we present the steps necessary to produce such a corpus. We detail the processing and publication chain for these documents, with particular attention to the problems of extracting text from digitised images. We also introduce the first analyses that we have carried out on this corpus with "bag-of-words" techniques that are relatively insensitive to OCR quality (namely topic modelling and word embedding).
|
|
Keywords:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-CY]Computer Science [cs]/Computers and Society [cs.CY]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; [SHS.HIST]Humanities and Social Sciences/History; France; OCR; Parliamentary debates; Third Republic; Topic modelling; Word embedding; XML-TEI
|
|
URL: https://hal.archives-ouvertes.fr/hal-03623351/document https://hal.archives-ouvertes.fr/hal-03623351 https://hal.archives-ouvertes.fr/hal-03623351/file/puren_bourgeois_pellet_vernus_agoda2022.pdf
|
|
BASE
|
|
|
|
2 |
Ensemble of Opinion Dynamics Models to Understand the Role of the Undecided in the Vaccination Debate ...
|
|
|
|
BASE
|
|
|
|
3 |
Zum Ungleichgewicht digital vermittelten Sachunterrichts und sprachlich-kommunikativer Anforderungen ...
|
|
|
|
BASE
|
|
|
|
4 |
Zum Ungleichgewicht digital vermittelten Sachunterrichts und sprachlich-kommunikativer Anforderungen
|
|
|
|
In: Sachunterricht in der Informationsgesellschaft. Bad Heilbrunn: Verlag Julius Klinkhardt 2022, pp. 114-121. - (Probleme und Perspektiven des Sachunterrichts; 32) (2022)
|
|
BASE
|
|
|
|
6 |
Cross-Lingual Query-Based Summarization of Crisis-Related Social Media: An Abstractive Approach Using Transformers ...
|
|
|
|
BASE
|
|
|
|
7 |
MMTAfrica: Multilingual Machine Translation for African Languages ...
|
|
|
|
BASE
|
|
|
|
8 |
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers ...
|
|
|
|
BASE
|
|
|
|
9 |
MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset ...
|
|
|
|
BASE
|
|
|
|
10 |
Korean Online Hate Speech Dataset for Multilabel Classification: How Can Social Science Improve Dataset on Hate Speech? ...
|
|
|
|
BASE
|
|
|
|
11 |
Quantifying knowledge synchronisation in the 21st century ...
|
|
|
|
BASE
|
|
|
|
12 |
An NLP Solution to Foster the Use of Information in Electronic Health Records for Efficiency in Decision-Making in Hospital Care ...
|
|
|
|
BASE
|
|
|
|
13 |
Networks and Identity Drive Geographic Properties of the Diffusion of Linguistic Innovation ...
|
|
|
|
BASE
|
|
|
|
14 |
Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparative Study ...
|
|
|
|
BASE
|
|
|
|
15 |
Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations ...
|
|
|
|
BASE
|
|
|
|
17 |
Towards Responsible Natural Language Annotation for the Varieties of Arabic ...
|
|
|
|
BASE
|
|
|
|
18 |
Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models ...
|
|
|
|
BASE
|
|
|
|
19 |
Who will share Fake-News on Twitter? Psycholinguistic cues in online post histories discriminate Between actors in the misinformation ecosystem ...
|
|
|
|
BASE
|
|
|
|
|
|