1 |
Preparing Legal Documents for NLP Analysis: Improving the Classification of Text Elements by Using Page Features
|
|
|
|
Abstract:
Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of the collection to be analyzed have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types. We then train a separate classifier for each of the three page types and thereby improve the accuracy on a manually annotated collection of 70 legal documents consisting of 20,938 text elements. When we split by page type, we achieve an improvement from 0.95 to 0.98 for single-column pages with left marginalia and from 0.95 to 0.96 for double-column pages. We developed our own feature-based method for page layout detection, which we benchmark against a standard implementation of a CNN image classifier. The approach presented here is based on a corpus of freely available German contracts and general terms and conditions. Both the corpus and all manual annotations are made freely available. The method is language agnostic.
|
|
Keywords:
Automatic classification; Image recognition; ddc:020; Document analysis; Machine learning; Law; Expository text; Text mining
|
|
URL: https://serwiss.bib.hs-hannover.de/files/2161/csit120102.pdf https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/2161 http://nbn-resolving.org/urn:nbn:de:bsz:960-opus4-21618 https://doi.org/10.25968/opus-2161 https://nbn-resolving.org/urn:nbn:de:bsz:960-opus4-21618
|
|
BASE
|
|
|
|
4 |
Detecting Paraphrases of Standard Clause Titles in Insurance Contracts
|
|
|
|
BASE
|
|
|
|
6 |
A taxonomy of user guidance devices for e-lexicography
|
|
|
|
In: Lexicographica. Internationales Jahrbuch für Lexikographie. International annual for lexicography. Revue internationale de lexicographie 33 (2018), 391-422
|
|
IDS OBELEX meta
|
|
|
|
7 |
Semi-automating the Reading Programme for a Historical Dictionary Project
|
|
|
|
In: Lexikos; Vol. 28 (2018) ; 2224-0039 (2018)
|
|
BASE
|
|
|
|
8 |
Direct User Guidance in e-Dictionaries for Text Production and Text Reception - The Verbal Relative in Sepedi as a Case Study
|
|
|
|
In: Lexikos. Journal of the African Association for Lexicography 27 (2017), 403-426
|
|
IDS OBELEX meta
|
|
|
|
9 |
Direct User Guidance in e-Dictionaries for Text Production and Text Reception — The Verbal Relative in Sepedi as a Case Study
|
|
|
|
In: Lexikos; Vol. 27 (2017) ; 2224-0039 (2017)
|
|
BASE
|
|
|
|
17 |
Recent Initiatives towards New Standards for Language Resources
|
|
|
|
In: International Conference of the German Society for Computational Linguistics and Language Technology ; https://hal.inria.fr/hal-01464476 ; International Conference of the German Society for Computational Linguistics and Language Technology, Sep 2015, Essen, Germany (2015)
|
|
BASE
|
|
|
|
20 |
Natural Language Processing Techniques for Improved User-friendliness of Electronic Dictionaries
|
|
|
|
In: Proceedings of the 16th EURALEX International Congress: The User in Focus, Bolzano/Bozen, Italy, 15-19 July 2014 (2014), 47-61
|
|
IDS OBELEX meta