Cohort selection for construction of a clinical natural language processing corpus

Computer Methods and Programs in Biomedicine Update

Naga Lalitha Valli ALLA, Aipeng CHEN, Sean BATONGBACAL, Chandini NEKKANTTI, Hong-Jie DAI, Jitendra JONNAGADDALA

Abstract

In Electronic Health Record (EHR) systems, key patient information is often captured in the form of unstructured clinical notes. The information in these notes can be extracted using clinical Natural Language Processing (NLP). The training corpus is a key factor in the development of efficient clinical NLP models. Clinical NLP corpus construction is complex and multifaceted. There are several challenges in corpus construction, but one that is not well researched is cohort selection. In this study, we present the methods employed and challenges encountered in cohort selection for construction of a clinical NLP corpus. Specifically, we present our methods for selecting cancer pathology reports to construct a corpus for automatic deidentification. A total of 2100 pathology reports were extracted from 1833 (518 male and 1313 female) cancer patients using the Health Level-7 (HL7) message standard. In terms of age distribution, the 60–70 years age group was the largest, with 872 patients. Our findings suggest that deciphering the segment information in HL7 messages collected from different hospitals is a challenging task. The quality of HL7 messages also varied significantly, with inconsistent tags making it difficult to identify reports that meet criteria set a priori. One of the key lessons learned is that linking the HL7 report data with additional EMR data, such as admissions, would help in identifying high-quality reports and resolving duplicates. Our findings also suggest that, in general, EHR data quality is poor, with clinical coding and metadata standards varying between hospitals. It is vital to identify and address these challenges for the development of a high-quality corpus.
