1 |
Measuring the quality of unstructured text in routinely collected electronic health data: a review and application
Abstract:
Introduction: Routinely collected electronic health data (RCEHD) can comprise structured, semi-structured, or unstructured information. Electronic medical records (EMRs), one type of RCEHD, often contain unstructured text data (UTD), which are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. At present, few studies examine the specific types of NLP methods used to preprocess UTD to address data quality issues prior to analysis or modelling.
Purpose & Objectives: The purpose was to examine preprocessing methods for UTD and evaluate the quality of UTD in EMRs. The objectives were to: 1) systematically document current research and practices for preprocessing UTD to describe or improve its quality, and 2) apply data quality indicators identified from current research and practices to UTD in EMRs from the Manitoba Primary Care Research Network and describe the quality of these data.
Methods: Objective 1 involved a scoping review. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature on current research and practices to prepare UTD for analysis, up to and including 2021. For objective 2, a case study was undertaken in which data quality indicators and preprocessing methods identified in the scoping review were applied to UTD from EMRs.
Results: A total of 41 articles were included in the scoping review for objective 1; over 50% were published between 2016 and 2021 and over 90% were empirical research articles. Data quality indicator topics for UTD in EMRs included misspelled words, security, word variability, sources of noise, quality of annotations, ambiguous abbreviations, and manual annotations. For objective 2, we selected 193,206 clinical encounter notes from EMRs between 1985 and 2020. Overall, the clinical encounter notes contained an average (standard deviation [SD]) of 27.3 (27.0) stop words, 25.7 (27.8) punctuation symbols, 12.1 (11.1) spelling errors, and 2.9 (2.6) special characters. The average (SD) length of a clinical encounter note was 555.8 (551.1) characters and 71.5 (59.7) words. Lexical diversity had a mean (SD) of 86.2 (11.9).
Conclusion: This study identified multiple data quality indicators that have been used to preprocess UTD in published literature and demonstrated their application to real-world data.
February 2022
Keywords:
Data quality; Electronic medical records; Health research; Natural language processing; Pre-processing unstructured text data
URL: http://hdl.handle.net/1993/36163
BASE
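The per-note indicators reported in the abstract of record 1 (stop words, punctuation symbols, special characters, note length, lexical diversity) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the study's method: the tiny `STOP_WORDS` list, the definition of a special character as any non-alphanumeric, non-whitespace character outside ASCII punctuation, and lexical diversity taken as unique tokens divided by total tokens, times 100. Spelling errors are left out because they need a dictionary that the record does not specify.

```python
import string
from statistics import mean, stdev

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was",
              "for", "on", "with", "no"}


def note_indicators(note: str) -> dict:
    """Compute simple quality indicators for one free-text note."""
    words = note.split()
    # Lowercase and strip surrounding ASCII punctuation before comparing tokens.
    tokens = [w.strip(string.punctuation).lower() for w in words]
    tokens = [t for t in tokens if t]
    n_punct = sum(ch in string.punctuation for ch in note)
    # Assumed definition: symbols that are neither letters, digits,
    # whitespace, nor ASCII punctuation (for example the degree sign).
    n_special = sum(
        1 for ch in note
        if not ch.isalnum() and not ch.isspace() and ch not in string.punctuation
    )
    n_stop = sum(t in STOP_WORDS for t in tokens)
    # Assumed definition: type-token ratio expressed as a percentage.
    diversity = 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0
    return {
        "characters": len(note),
        "words": len(words),
        "stop_words": n_stop,
        "punctuation": n_punct,
        "special_characters": n_special,
        "lexical_diversity": diversity,
    }


def summarize(notes, key):
    """Return the mean and sample SD of one indicator across notes."""
    values = [note_indicators(n)[key] for n in notes]
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


# Made-up example notes, not taken from any real record.
notes = ["Pt seen for cough. No fever, no chills.", "Temp 38°C"]
for key in ("characters", "words", "stop_words", "punctuation",
            "special_characters", "lexical_diversity"):
    m, sd = summarize(notes, key)
    print(f"{key}: mean={m:.1f}, SD={sd:.1f}")
```

Reporting each indicator as a mean with a standard deviation mirrors how the abstract presents its results; a different tokenizer or stop-word list would change the numbers.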
2 |
LEXICON BASED RULE EXTRACTION FOR SENTIMENT ANALYSIS UNDER BIG DATA ENVIRONMENT ...
BASE
3 |
LEXICON BASED RULE EXTRACTION FOR SENTIMENT ANALYSIS UNDER BIG DATA ENVIRONMENT ...
BASE
4 |
NgramPOS: A Bigram-based Linguistic and Statistical Feature Process Model for Unstructured Text Classification
BASE
5 |
Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
BASE
6 |
Face value of companies: deep learning for nonverbal communication ...
BASE
7 |
Face value of companies: deep learning for nonverbal communication
BASE
8 |
Supervised Process of Un-structured Data Analysis for Knowledge Chaining
In: Procedia CIRP ; CIRP design conference ; https://hal.archives-ouvertes.fr/hal-01347030 ; CIRP design conference, KTH, Jun 2016, Stockholm, Sweden. pp.436-441, ⟨10.1016/j.procir.2016.04.123⟩ ; http://cirpdesign2016.org/ (2016)
BASE
9 |
Leveraging Lexical Link Analysis (LLA) To Discover New Knowledge
In: Military Cyber Affairs (2016)
BASE
10 |
A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text
BASE
12 |
Sentiment Big Data Flow Analysis by Means of Dynamic Linguistic Patterns
BASE
13 |
Lexical Link Analysis Application: Improving Web Service to Acquisition Visibility Portal
In: DTIC (2013)
BASE
14 |
Automated Extraction and Characterisation of Social Network Data from Unstructured Sources -- An Ontology-Based Approach
In: DTIC (2013)
BASE
15 |
Applications of Lexical Link Analysis Web Service for Large-Scale Automation, Validation, Discovery, Visualization, and Real-Time Program Awareness
In: DTIC (2012)
BASE
16 |
System Self-Awareness and Related Methods for Improving the Use and Understanding of Data within DoD
BASE
17 |
Collective knowledge systems: Where the social web meets the semantic web
In: http://www.websemanticsjournal.org/papers/2007119/CollectiveKnowledgeSystemsGruberV6I1.pdf (2008)
BASE
18 |
A conceptual-modeling approach to extracting data from the web
In: http://www.deg.byu.edu/papers/er98.pdf (1998)
BASE
19 |
A Conceptual-Modeling Approach to Extracting Data from the Web
In: http://osm7.cs.byu.edu/deg/papers/er98.ps (1998)
BASE
20 |
A Conceptual-Modeling Approach to Extracting Data from the Web
In: http://lantern.cs.byu.edu/papers/er98.ps (1998)
BASE