Journal of Medical Internet Research (SCI)
You-Qian Lee, Ching-Tai Chen, Chien-Chang Chen, Chung-Hong Lee, Pei-Tsz Chen, Chi-Shin Wu, Hong-Jie Dai
Abstract
Background: The widespread use of electronic health records in clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintaining privacy. However, a significant portion of this information is recorded in unstructured textual form, which poses a challenge for de-identification. In multilingual countries, medical records may be written in a mixture of more than one language, referred to as code-mixing (CM). Most current clinical natural language processing techniques are designed for monolingual text, so there is a need to address the de-identification of CM texts.

Objective: The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pre-trained language models (PLMs) in identifying PHI in a CM context. We also aimed to evaluate the potential of prompting large language models (LLMs) to recognize PHI in a zero-shot manner.

Methods: We compiled the first clinical CM de-identification dataset, consisting of texts written in Chinese and English. We explored the effectiveness of fine-tuning PLMs to recognize PHI in CM content, focusing on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models' outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning with LLMs for recognizing PHI in CM text.
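To make the fine-tuning setup concrete, the following is a minimal sketch of PHI recognition framed as token classification with a multilingual PLM. The model checkpoint, BIO label set, and example sentence are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch: PHI recognition as token classification over CM text.
# Checkpoint, labels, and example are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO-style tags for a few assumed PHI types.
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-ID", "I-ID"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# A Chinese-English code-mixed sentence like those in the corpus.
sentence = "病人 Wang 於 2020-05-01 admitted to ward 5A"
inputs = tokenizer(sentence, return_tensors="pt")

# After fine-tuning on annotated discharge summaries, the per-token
# predictions mark the PHI spans to be removed or resynthesized.
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, [labels[i] for i in pred_ids])))
```

A multilingual checkpoint is a natural starting point here because its shared subword vocabulary can represent Chinese and English tokens within the same sequence.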
Results: The developed methods were evaluated on a CM de-identification corpus of 1,700 discharge summaries. We observed that different PHI types tended to occur in different types of language-mixed sentences and that PLMs could effectively recognize PHI by exploiting the learned naming regularity. However, the models exhibited suboptimal results when the regularity was weak or when mentions contained unknown words for which good representations could not be generated. We also found that the availability of CM training instances is essential to the models' performance. Furthermore, LLM-based de-identification is a feasible and appealing approach that can be controlled and enhanced through natural language prompts.

Conclusions: This study contributes to understanding the underlying mechanism of PLMs in the de-identification of CM text and highlights the significance of incorporating CM training instances into the model training phase. To support the advancement of research, we have made a manipulated subset of the resynthesized dataset available for research purposes. Based on the compiled dataset, we find that LLM-based de-identification is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in hospital settings requires careful consideration of data security and privacy concerns. While this study has advanced the understanding of de-identifying clinical text in CM languages, several limitations should be acknowledged. First, our research is primarily based on a Chinese-English CM corpus from Taiwan, which has its own writing conventions and style; consequently, the fine-tuned models may not generalize seamlessly to other multilingual contexts. Second, we recognize that imbalanced training sets and challenges associated with machine translation influenced our models' performance. Additionally, the sensitivity of the LLM-based framework to prompt crafting, the dynamic nature of model versions, and the choice of repetitions in prompts all affect the reported performance for specific PHI categories. Further research could explore augmenting PLMs and LLMs with external knowledge to improve their ability to recognize rare PHI.
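To illustrate the prompt-based, zero-shot setup discussed above, the following is a minimal sketch using a hosted LLM API. The client, model name, and prompt wording are assumptions for illustration, not the paper's exact prompts; as noted in the conclusions, prompts must be crafted carefully, and any real deployment must address data security and privacy before sending clinical text to an external service.

```python
# Minimal sketch: zero-shot PHI recognition via a natural language prompt.
# Model name and prompt wording are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()
note = "病人 Wang 於 2020-05-01 admitted to ward 5A"

prompt = (
    "You are a clinical de-identification assistant. "
    "The note below mixes Chinese and English. "
    "List every protected health information (PHI) mention with its type "
    "(e.g., NAME, DATE, ID). Output one 'type: mention' pair per line and "
    "nothing else.\n\nNote: " + note
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic decoding reduces unwanted output variation
)
print(response.choices[0].message.content)
```

The strict output-format instruction in the prompt reflects the abstract's observation that carefully crafted prompts are needed to keep the model from producing unwanted output.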