PEHRT: A Common Pipeline for Harmonizing Electronic Health Record Data for Translational Research

Jessica Gronsbell, Vidul Ayakulangara Panickan, Chris Lin, Thomas Charlon, Chuan Hong, Doudou Zhou, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan, Tianxi Cai

September 10, 2025

Abstract: Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce PEHRT, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and, based on our extensive real-world experience, designed for streamlined execution across institutions. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.

Keywords: Data harmonization, Data pre-processing, Electronic health records, Integrative analysis, Representation learning
The growing availability of data from Electronic Health Records (EHRs) has transformed translational biomedical research. In the past decade, EHR data has been harnessed in a wide range of applications that have improved healthcare delivery and deepened our understanding of human health. These applications include dynamic risk prediction of diseases, real-world treatment comparisons, development of medical knowledge graphs, and a broad range of genomic studies [1]–[11]. To fully leverage the potential of these applications, integrative analysis of EHR data across diverse healthcare settings has emerged as a key strategy to enhance the generalizability of scientific findings, boost statistical power, and support the development of robust models for precision medicine. The COVID-19 pandemic, in particular, catalyzed a new era of multi-institutional EHR-based research as several international collaborative networks were rapidly established to conduct large-scale, federated studies [12]–[14]. These initiatives significantly accelerated knowledge generation and amplified the impact of EHR-based research on the treatment and management of COVID-19.
Progress notwithstanding, there are numerous barriers to effectively utilizing multi-institutional EHR data in translational applications. A key challenge is the lack of semantic interoperability across EHR systems, which results in substantial heterogeneity in clinical documentation and medical coding practices [15]–[19]. The foundation of any collaborative research study therefore rests on careful standardization of data elements across different data sources, a process known as data harmonization. Currently, there are no universally accepted or standardized procedures for harmonizing EHR data for an integrative analysis, despite the importance of such standards for ensuring the validity, transparency, and reproducibility of research findings [20]. The significance of proper data preparation became particularly evident during the COVID-19 pandemic when two high-profile studies published in The Lancet and The New England Journal of Medicine were retracted within months of publication [21], [22]. Despite passing some of the most rigorous peer review, the authors could not verify the data or the processing procedures that underpinned the validity of their conclusions. These incidents highlight the need for comprehensive and rigorous standards for harmonization to ensure the scientific integrity and credibility of collaborative research.
To address this need, we developed PEHRT, an efficient and comprehensive pipeline for harmonizing EHR data for translational biomedical research. PEHRT consists of two core modules: (1) data pre-processing and (2) representation learning. Our pipeline maps raw EHR data to standardized coding systems and uses advanced machine learning techniques to efficiently curate a multi-institutional EHR dataset without sharing individual-level data or requiring that the data be represented in any particular data model. The output of PEHRT is a robust, research-ready dataset suitable for a wide range of scientific studies across healthcare institutions, including medical knowledge graph construction, phenotyping, predictive modeling, clinical studies, and federated learning. Importantly, PEHRT is available as open source R and Python software that is fully documented and executed within a user-friendly online tutorial (https://celehs.github.io/PEHRT/). We further illustrate the utility and execution of PEHRT in several downstream tasks using diverse EHR data from multiple healthcare systems.
PEHRT was motivated by recent efforts in establishing federated networks of EHR data for translational and Artificial Intelligence (AI) research, including the Consortium for Clinical Characterization of COVID-19 by EHR (4CE) and the Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity (AIM-AHEAD) program [12], [14]. 4CE is an international research collaboration that was established in 2021 to study COVID-19 [12]. With nearly 100 hospitals across seven countries, 4CE successfully harmonized EHR data to investigate the epidemiology and clinical characteristics of COVID-19 across healthcare systems [23]. The consortium’s work provided critical insights into temporal trends in laboratory values, demographic variations, and the effects of pre-existing conditions on patient outcomes [12], [24], [25]. Successful federated EHR networks such as 4CE have set a new standard for managing the complexities of diverse EHR data in collaborative research by demonstrating the importance of high-quality data processing for producing trustworthy scientific results [26]. AIM-AHEAD is pursuing a similar strategy by developing its own federated network, with a focus on leveraging AI and machine learning applied to EHR data to help reduce health disparities.
PEHRT was informed by lessons learned from conducting translational studies across multiple EHR systems within these networks. Our pipeline improves the efficiency and thoroughness of EHR data harmonization to provide researchers with a strong framework for conducting valid, transparent, and reproducible collaborative biomedical research. PEHRT is equipped with a suite of resources, including several R and Python packages with detailed documentation and example notebooks, web Application Programming Interfaces (APIs) for data visualization, and a dataset used to assist researchers in applying PEHRT for their own purposes. The only requirement to use PEHRT is that the EHR data of interest are available in a relational database.
The output of PEHRT is a research-ready dataset that integrates EHR data from multiple institutions without the sharing of individual-level data to adhere to data privacy standards [27]–[30]. Datasets from PEHRT can be used for many of the same purposes as data from a single healthcare institution, but with the goal of reaching more generalizable scientific conclusions. For example, PEHRT enables the construction of medical knowledge graphs as well as the precise identification of patient cohorts with specific phenotypes for applications in risk prediction, drug efficacy assessment, and epidemiological studies [6], [10], [31]–[33]. When EHR data are linked to specimen biorepositories, PEHRT can be applied upstream of genetic studies, such as Phenome-Wide Association Studies (PheWAS) that uncover the association between a novel biomarker and a set of clinical or demographic phenotypes [34]–[39]. Additionally, our pipeline can be used to curate data for real-world evidence generation, post-marketing device surveillance, and clinical decision support tool development [29], [30], [40]–[43]. Federated learning, which enables statistical inference and machine learning across multiple decentralized data sources, can also be implemented downstream of PEHRT [44].
Existing research has primarily focused on specific aspects of EHR data preparation within individual institutions, including data cleaning, data standardization, medical code aggregation, and quality assessment [45]–[48]. Data cleaning involves transforming and normalizing raw EHR data, such as converting relational databases into flat file formats, conducting exploratory data analysis, detecting anomalies, and scaling and transforming data [20], [49]–[51]. Standardization involves mapping raw data to common data models and aligning medical codes with established medical coding systems or ontologies. Open-source tools, including Electronic Health Record Quality Control (EHR-QC), Cohort Migrator Toolkit (CMToolkit), and the Observational Health Data Sciences and Informatics (OHDSI) network's Themis, are available to convert data to the widely used Observational Medical Outcomes Partnership-Common Data Model (OMOP-CDM) [20], [48], [52], [53].
Following standardization, medical codes are aggregated or “rolled up” into broader medical concepts to represent clinically meaningful variables, as disaggregated data are often too granular for research purposes. For example, codes within standard medical coding systems, such as the International Classification of Diseases (ICD) for diagnoses or National Drug Codes (NDC) for medications, are typically rolled up into higher-level concepts using established ontologies. Code roll-up can be done manually or with machine learning approaches [54]. Lastly, quality assessment of EHR data is conducted using established criteria or open-source tools, such as the Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) or the Data Quality Dashboard (DQD) from the OHDSI network [20], [55]–[57]. To the best of our knowledge, ehrapy is the only end-to-end tool currently available for the curation and analysis of EHR data [58]. ehrapy is a modular, open-source Python framework designed for exploratory data analysis and consists of modules for data preprocessing and ontology mapping as well as analysis tools for causal inference, survival analysis, and patient stratification.
In spite of the large volume of work devoted to EHR data preparation, significant gaps remain when working with multi-institutional EHR data [59]. Existing tools designed for data from a single institution fail to address the variability in coding practices across institutions, a key challenge of integrative analysis. Many health systems use local medical codes (i.e., codes specific to their system) that are not mapped to standardized coding systems. To enable analysis, these local codes must first be standardized and harmonized across datasets. Traditionally, standardization has been achieved by mapping local codes within specific domains (e.g., diagnostic or medication codes) to standard coding systems, either manually or using automated tools like Medication Extraction and Normalization (MedXN) [20], [49], [60]. However, recent advances in Large Language Models (LLMs) and representation learning have facilitated the generation of semantic embeddings, which are vector representations that capture the meanings of EHR codes and their relationships. Embeddings significantly enhance the efficiency and accuracy of data standardization, which is a critical aspect of preparing multi-institutional EHR data. To the best of our knowledge, existing tools, such as ehrapy, lack user-friendly modules for code roll-up or for standardizing local codes and do not incorporate advances in representation learning for this purpose.
A key innovation of PEHRT is its inclusion of code and documentation for state-of-the-art methods for representation learning and harmonization that generate semantic embeddings from summary-level EHR data from multiple institutions and from LLMs. PEHRT also includes detailed protocols for data pre-processing that are not fully integrated into any existing tools. For example, rolling up medical codes to higher-level concepts is especially challenging for researchers unfamiliar with EHR data, as multiple ontologies may represent a single concept. PEHRT provides researchers with detailed guidance on medical code roll-up as well as general instructions for processing a broad range of structured data (e.g., diagnostic codes, medication prescriptions, laboratory tests, procedures) and unstructured data in the form of free-text (e.g., progress notes, radiology reports).
PEHRT consists of two modules: (1) data pre-processing and (2) representation learning. The inputs of PEHRT are original EHR datasets from one or more institutions and the outputs are robust, research-ready datasets that are suitable for a wide range of scientific purposes. In the setting of multi-institutional data, PEHRT outputs a harmonized dataset that harnesses information across the different data sources. One of our key contributions is an online tutorial (see Figure 1), which guides users through each step of PEHRT using publicly available EHR data from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database.
Prior to utilizing PEHRT, it is important for researchers to familiarize themselves with their EHR data sources, including relevant documentation, data structure, and coding systems. The data must be stored in a relational database, but need not be represented in a common data model. Additional details about the equipment and software requirements for PEHRT can be found in the tutorial introduction.
Figure 1: PEHRT enables users to prepare multi-institution Electronic Health Record (EHR) data for a variety of scientific purposes with two modules: (1) data pre-processing and (2) representation learning. Each step of PEHRT is implemented in our user-friendly online tutorial using publicly available EHR data.
Data pre-processing is a meticulous process that involves several sub-steps, including (1.1) data cleaning, (1.2) code mapping and roll-up, (1.3) Natural Language Processing (NLP) of free-text data, and (1.4) cohort creation. Pre-processing is performed on each EHR dataset that is input to PEHRT. The goals of data pre-processing are to transform raw EHR data into a more usable format and to standardize data across institutions to support integrative analysis and consistent data interpretation. The PEHRT pipeline enables processing of a broad range of structured data, including diagnostic codes, medication prescriptions, laboratory tests, and procedure codes. Prior to pre-processing, it is necessary to set up the computing environment and extract the desired data; details are provided in Module 1 of the tutorial.
Step 1.1: Data Cleaning. PEHRT employs a multi-step data cleaning process to enhance the quality of noisy and fragmented EHR data. Data cleaning generally begins by merging relevant data tables and standardizing the data format across tables. For example, standardizing how time is represented across the EHR system is often necessary: some data tables may include exact timestamps while others only contain dates. Time entries are often standardized by retaining only the date component, creating a consistent format for daily-level analysis. Next, variables irrelevant to downstream analytical tasks are excluded to improve computational efficiency and reduce memory demands. Additionally, since EHR data frequently include errors, particularly in time-related fields, records with implausible dates, such as those prior to the 1980s or beyond the current year, are identified and removed. Lastly, exact duplicate records, which may arise during the aggregation of timestamp-level data into a daily format, are removed to produce a cleaned dataset. When working with large EHR datasets, we recommend processing the data in batches, as illustrated in our online tutorial.
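To make these steps concrete, below is a minimal pandas sketch of the cleaning logic described above; the file and column names (e.g., `diagnoses.csv`, `diag_date`) are hypothetical placeholders rather than names prescribed by PEHRT.

```python
import pandas as pd

def clean_ehr_table(df: pd.DataFrame, date_col: str = "diag_date") -> pd.DataFrame:
    """Standardize dates, drop implausible records, and de-duplicate."""
    df = df.copy()
    # Standardize time: keep only the date component for daily-level analysis.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce").dt.normalize()
    # Remove records with missing or implausible dates
    # (before 1980 or beyond the current date).
    today = pd.Timestamp.today().normalize()
    df = df[(df[date_col] >= pd.Timestamp("1980-01-01")) & (df[date_col] <= today)]
    # Remove exact duplicates created by collapsing timestamps to dates.
    return df.drop_duplicates()

# Hypothetical batch processing of a large extract to limit memory use.
for chunk in pd.read_csv("diagnoses.csv", chunksize=500_000):
    clean_ehr_table(chunk).to_csv("diagnoses_clean.csv", mode="a",
                                  header=False, index=False)
```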
Step 1.2: Code Mapping and Roll-Up. Medical codes are often too specific for research studies. To address this issue, PEHRT standardizes codes by mapping them to recognized coding systems and then aggregates or “rolls up” the codes into higher-level categories across the domains of interest. Code roll-up provides consistency across diverse EHR datasets while also ensuring data are at an appropriate level of granularity. PEHRT implements the standardization and roll-up process by mapping granular codes to higher-level concepts across four domains: diagnoses, medications, laboratory tests, and procedures.
To standardize the four coding domains, we use established medical coding systems: (1) International Classification of Diseases, Ninth or Tenth Revision (ICD-9 or ICD-10, respectively) for diagnoses, (2) Prescription Normalized Names and Codes (RxNorm) for medications, (3) Logical Observation Identifiers Names and Codes (LOINC) for laboratory measurements, and (4) Current Procedural Terminology, Fourth Edition (CPT-4), Healthcare Common Procedure Coding System (HCPCS), and ICD, Ninth or Tenth Revision, Procedure Coding System (ICD-9-PCS and ICD-10-PCS, respectively) for procedures. Due to the transition of ICD-9 codes to ICD-10 codes in 2015, older diagnosis data are largely represented by ICD-9 while recent data use ICD-10. It is critical to use mappings that synchronize the two versions when using longitudinal data before and after 2015 [61].
Following standardization, codes are rolled up to higher-level medical concepts according to common ontologies. For diagnoses, we recommend the Phenotype Code (PheCode) hierarchy for ICD codes [61]. The PheCode hierarchy provides a total of 1875 integer, 1-digit, and 2-digit level codes that capture a wide range of disease conditions with sufficient granularity while maintaining a reasonable number of distinct codes. The hierarchy also provides parent-child relationships that characterize associations between PheCodes. For medications, we recommend rolling up RxNorm codes to RxNorm ingredient codes unless the study specifically requires dosage information. For studies involving drug classes, these ingredient-level codes can be further rolled up into drug classes according to existing ontologies, including the Anatomical Therapeutic Chemical (ATC) classification, the Accrual to Clinical Trials (ACT) ontology, or the Veterans Affairs (VA) drug classes, depending on the researchers' needs [62], [63]. Laboratory measurements for the same analyte can vary due to differences in the specimen, time of measurement, method, or scale, resulting in multiple LOINC codes. We recommend rolling up LOINC codes to the lowest level of LOINC part (LP) according to the LOINC component hierarchy [64]. Note that PEHRT supports only laboratory codes, not laboratory result values; preparing result data is an involved process that requires unit harmonization and specialized quality control by informatics experts familiar with the EHR datasets of interest.
Unfortunately, few established hierarchies exist for procedure codes. We recommend rolling up procedure codes into categories according to the Clinical Classification Software (CCS). Many institutions use both CPT and ICD procedure codes, so it is important to include both when rolling up codes. Additionally, medications are sometimes coded as procedures in EHRs due to the way certain treatments are administered or billed. As such, it is necessary to map medication procedure codes to the relevant RxNorm codes. PEHRT includes visualizations of the ICD, LOINC, and RxNorm hierarchies within a searchable and downloadable web API (see the visualizations linked in the online tutorial).
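The sketch below illustrates the roll-up idea for diagnoses, assuming a hypothetical ICD-to-PheCode mapping table; the column names and PheCode values are illustrative, and in practice the published PheCode mapping files for ICD-9-CM and ICD-10-CM would be loaded instead.

```python
import pandas as pd

# Hypothetical ICD-to-PheCode mapping table; PheCode values are illustrative.
icd_to_phecode = pd.DataFrame({
    "icd_code":    ["E11.9", "E11.65", "I50.9"],
    "icd_version": ["ICD10", "ICD10", "ICD10"],
    "phecode":     ["250.2", "250.2", "428.1"],
})

diagnoses = pd.DataFrame({
    "patient_id":  [1, 1, 2],
    "icd_code":    ["E11.9", "E11.65", "I50.9"],
    "icd_version": ["ICD10", "ICD10", "ICD10"],
})

# Roll up: join diagnoses to the map, then truncate to the integer-level
# parent PheCode when coarser granularity is desired.
rolled = diagnoses.merge(icd_to_phecode, on=["icd_code", "icd_version"], how="left")
rolled["phecode_integer"] = rolled["phecode"].str.split(".").str[0]
print(rolled[["patient_id", "phecode", "phecode_integer"]])
```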
Step 1.3: Natural Language Processing. When free-text clinical notes are available, one may employ natural language processing (NLP) tools to extract clinical concepts from unstructured notes by identifying and mapping terms such as diseases, symptoms, and medications to Concept Unique Identifiers (CUIs) in the Unified Medical Language System (UMLS). Existing NLP software tools like NILE, cTAKES, or MetaMap enable this extraction, allowing for semantic analysis and structured representation of clinical text, which is then integrated into the dataset for downstream analysis [65]–[67]. We previously introduced a pipeline for EHR phenotyping, which contains detailed steps for running NLP as well as an online tutorial (https://celehs.github.io/PheCAP/) [68].
Step 1.4: Cohort Creation. EHR-based studies are typically conducted on a group of patients who meet specific inclusion/exclusion criteria, such as those with certain diagnoses, medications, procedures, or implanted devices. PEHRT streamlines cohort identification by leveraging the standardized and rolled-up codes from Step 1.2. For example, when a cohort is defined by a particular disease diagnosis, a common strategy is to identify patients using the corresponding ICD codes [18], [69], [70]. However, ICD codes can be overly granular, which often leads different studies to use inconsistent sets of ICD codes to capture the condition of interest. To address this issue, PEHRT utilizes PheCodes from Step 1.2 to identify patients associated with the condition of interest. For the identified cohort, PEHRT then aggregates the structured data from Steps 1.1–1.2 as well as the CUIs derived from unstructured notes in Step 1.3 if free-text data are available for analysis. For studies involving temporal analysis, we recommend further aggregating patient-level longitudinal data into time windows, such as monthly counts or averages. For chronic conditions like rheumatoid arthritis, monthly aggregation typically provides sufficient granularity while simplifying downstream analysis.
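As a minimal sketch under the same assumptions as above (a rolled-up table with patient_id, date, and phecode columns), cohort identification and monthly aggregation might look like the following; the PheCode for rheumatoid arthritis (714.1) is used purely as an example.

```python
import pandas as pd

def build_cohort(rolled: pd.DataFrame, target_phecode: str,
                 min_count: int = 1) -> pd.DataFrame:
    """Identify patients with at least min_count occurrences of a PheCode."""
    counts = (rolled.loc[rolled["phecode"] == target_phecode]
                    .groupby("patient_id").size())
    return counts[counts >= min_count].reset_index(name="phecode_count")

def monthly_counts(rolled: pd.DataFrame, cohort_ids) -> pd.DataFrame:
    """Aggregate longitudinal codes into monthly counts for cohort patients."""
    df = rolled[rolled["patient_id"].isin(cohort_ids)].copy()
    df["month"] = df["date"].dt.to_period("M")
    return (df.groupby(["patient_id", "month", "phecode"])
              .size().reset_index(name="count"))

# Example usage for a rheumatoid arthritis cohort:
# cohort = build_cohort(rolled, target_phecode="714.1")
# features = monthly_counts(rolled, cohort["patient_id"])
```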
Following data pre-processing, representation learning is used to develop institution-specific embeddings. Embeddings are vector representations of the EHR data that capture the semantic and relational properties of codes from structured data and CUIs from free-text. The embeddings can be used for a variety of downstream tasks within each institution, including medical knowledge graph construction, phenotyping, and predictive modeling [49], [71]. If multi-institutional EHR data are available, PEHRT also contains a module that implements a novel matrix-completion technique to train a joint embedding [72]. The joint embedding leverages information across the data sources without requiring the sharing of individual-level data and can be used for collaborative analyses. Additionally, PEHRT incorporates embeddings from pre-trained language models (PLMs) into its representation learning module to further enhance the quality of the learned data representations. Using strategies similar to those in [71], we structure Module 2 to have four sub-steps: (2.1) EHR embedding training, (2.2) PLM-based embedding generation, (2.3) joint multi-institutional EHR embedding training, and (2.4) embedding validation.
Step 2.1: EHR Embedding Training. PEHRT first generates EHR embeddings from summary-level data using the Singular Value Decomposition of the Pointwise Mutual Information (SVD-PMI) algorithm [73]. This method factorizes a PMI matrix constructed from co-occurrence counts of codes and CUIs. As a variant of the widely adopted word2vec algorithm, SVD-PMI has proven to be highly effective in learning meaningful and interpretable clinical embeddings [74].
The SVD-PMI algorithm consists of three steps. First, a co-occurrence matrix \({\boldsymbol{C}} = [C(w, c)]\) is constructed, where each element represents the number of patients in which a target code or CUI \(w\) co-occurs with a context code or CUI \(c\) within a predefined time window (e.g., 30 days). This matrix captures the local context of clinical concepts and provides a foundation for computing semantic similarity. Because calculating \({\boldsymbol{C}}\) at scale is computationally intensive, we developed an optimized algorithm for efficient co-occurrence computation in our prior work, enabling scalable training of PEHRT embeddings for large EHR datasets [75]–[77].
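For intuition, a naive construction of the patient-level co-occurrence counts could look like the sketch below; the column names are assumptions, the per-patient loop is quadratic, and the optimized algorithm cited above should be used at scale.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

def cooccurrence_counts(df: pd.DataFrame, window_days: int = 30) -> dict:
    """Patient-level co-occurrence: C[(w, c)] = number of patients in whom
    codes/CUIs w and c occur within window_days of each other.
    Assumes columns: patient_id, date (datetime64), code."""
    C = defaultdict(int)
    for _, g in df.groupby("patient_id"):
        pairs = set()
        for (d1, w), (d2, c) in combinations(zip(g["date"], g["code"]), 2):
            if w != c and abs((d1 - d2).days) <= window_days:
                pairs.add((min(w, c), max(w, c)))
        # Each patient contributes at most once per code pair.
        for pair in pairs:
            C[pair] += 1
    return C
```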
Next, the co-occurrence matrix is used to calculate the shifted positive PMI (SPPMI) matrix, which represents the relationships among codes and CUIs. The SPPMI matrix is defined as \[\text{SPPMI}(w, c) = \max \left\{ \log \frac{C(w, c) \cdot |D|}{C(w, \cdot)\, C(\cdot, c)} - \log k,\; 0 \right\}\] where \(C(w, \cdot)\) is the row sum of \(C(w,c)\), \(C( \cdot, c)\) is the column sum of \(C(w,c)\), \(|D|\) is the total sum of the co-occurrence counts, and \(k\) is the negative sampling rate. We have found that the embedding quality is generally not sensitive to the length of the time window, but is best when \(k = 1\) [49]. Lastly, the SPPMI matrix is decomposed with its rank-\(d\) SVD, represented as \({\boldsymbol{Q}}_{d} {\boldsymbol{\Lambda}}_{d} {\boldsymbol{Q}}_{d}^{\top}\). PEHRT outputs the \(d\)-dimensional embedding vectors as \(\mathbf{X}_{\scriptscriptstyle \sf EHR}= {\boldsymbol{Q}}_d {\boldsymbol{\Lambda}}_{d}^{1/2}\). To select \(d\), we recommend retaining a large amount of variation in the SVD (e.g., 95%) by evaluating the eigenvalue decay [75], [78]. Alternatively, \(d\) can be selected by maximizing the area under the receiver operating characteristic curve (AUC) for discriminating between pairs of codes and CUIs with known relationships against randomly selected pairs (see Step 2.4 for further details) [79], [80].
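A compact numpy sketch of these steps, assuming a dense symmetric co-occurrence matrix (for large vocabularies, a sparse truncated SVD such as scipy.sparse.linalg.svds would be used instead):

```python
import numpy as np

def sppmi_embeddings(C: np.ndarray, d: int, k: float = 1.0) -> np.ndarray:
    """SVD-PMI: embed codes/CUIs from a symmetric co-occurrence matrix C."""
    row = C.sum(axis=1, keepdims=True)    # C(w, .)
    col = C.sum(axis=0, keepdims=True)    # C(., c)
    total = C.sum()                       # |D|
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col)) - np.log(k)
    # Clip the shifted PMI at zero; zero counts yield zero SPPMI entries.
    sppmi = np.maximum(np.nan_to_num(pmi, nan=0.0, neginf=0.0), 0.0)
    # Rank-d SVD of the symmetric SPPMI matrix: X = Q_d Lambda_d^{1/2}.
    U, s, _ = np.linalg.svd(sppmi, hermitian=True)
    return U[:, :d] * np.sqrt(s[:d])
```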
Step 2.2: PLM-Based Embeddings. The SVD-PMI embeddings are derived from the co-occurrence matrix and therefore capture the meaning of codes and CUIs based on how they are used within a healthcare system. Additional semantic information about the meaning of codes and CUIs can be obtained from their textual descriptions to complement this system-level perspective. To leverage textual descriptions in embedding training, PEHRT produces a second set of embeddings using PLMs. PLMs are trained on large text corpora and, in some cases, further fine-tuned with biomedical knowledge sources such as PubMed articles, clinical notes, and knowledge graphs. Commonly used PLMs include Self-Aligned Pre-trained Bidirectional Encoder Representations from Transformers (SapBERT), ClinicalBERT, Cross-lingual knowledge-infused medical term embedding (CODER), PubMedBERT, BERT for Biomedical Text Mining (BioBERT), and BAAI General Embeddings (BGE), many of which were fine-tuned from the original BERT model [81]–[86]. Given the text string of a code or CUI, a PLM produces a corresponding embedding vector. PEHRT contains embeddings from many common PLMs, including CODER, SapBERT, PubMedBERT, and BioBERT. Obtaining the PLM-based embeddings is generally not computationally burdensome, though users can alternatively use OpenAI's text-embedding-3-small model via the OpenAI API.
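As an illustration, embeddings for code descriptions can be obtained from a PLM with the Hugging Face transformers library; the SapBERT checkpoint name below is our assumption of a commonly used model identifier, and any of the PLMs listed above could be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any BERT-style PLM can be substituted.
MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def plm_embed(descriptions: list) -> torch.Tensor:
    """Embed code/CUI text descriptions using the [CLS] representation."""
    batch = tokenizer(descriptions, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0, :]  # one [CLS] vector per description

X_plm = plm_embed(["Type 2 diabetes mellitus", "Metformin", "Hemoglobin A1c"])
```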
When working with data from a single institution, the PLM-based embeddings can be integrated with the SVD-PMI embeddings from Step 2.1 to enhance overall embedding quality. A simple yet effective approach is to create a weighted concatenation of the two embeddings, with the weighting adjusted to the specific downstream task. Specifically, we let \[\mathbf{X}_{\scriptscriptstyle \sf INT}= [ w \mathbf{X}_{\scriptscriptstyle \sf EHR}, (1-w)\mathbf{X}_{\scriptscriptstyle \sf PLM}] \label{eq:w}\tag{1}\] where \(\mathbf{X}_{\scriptscriptstyle \sf PLM}\) is the PLM-based embedding and \(w \in [0,1]\) is the weight. The integrated embedding captures the complementary strengths of SVD-PMI and PLM-based embeddings: SVD-PMI excels at identifying clinically related codes (e.g., drug-disease pairs), while PLMs capture semantic similarity between codes [33], [84].
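A minimal sketch of Equation (1) follows; the row-wise L2 normalization is our own assumption to put the two blocks on a comparable scale before weighting, and \(w\) can be tuned using the validation metrics in Step 2.4.

```python
import numpy as np

def integrate(X_ehr: np.ndarray, X_plm: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted concatenation of EHR and PLM embeddings (Equation 1).
    Row-wise L2 normalization (our assumption) puts the two blocks on a
    comparable scale so that w, not the raw norms, controls the balance."""
    norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.hstack([w * norm(X_ehr), (1 - w) * norm(X_plm)])
```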
Step 2.3: Joint Multi-Institution EHR Embedding Training. When data from multiple institutions are available, PEHRT uses the BONMI algorithm [72] to derive a shared representation of EHR concepts by aligning and completing institution-specific SPPMI matrices efficiently with near-optimal error bounds. BONMI constructs an aggregated matrix covering all unique codes and CUIs, assigning weighted averages to overlapping pairs and marking the others as missing. The weights are based on data quality using user-defined or data-driven metrics, and the missing values are imputed by aligning institution-specific embeddings via orthogonal transformations. The completed matrix is then factorized with an SVD to generate the joint embedding, with rank selection as in Step 2.1. The joint embedding can be integrated with PLM-based embeddings through weighted concatenation, following the procedure in Step 2.2.
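The core alignment idea can be illustrated with an orthogonal Procrustes problem; this is only a sketch of the rotation step behind BONMI's imputation, not the full algorithm.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_to_target(X_src: np.ndarray, shared_src: np.ndarray,
                    shared_tgt: np.ndarray) -> np.ndarray:
    """Rotate one institution's embeddings into another's space using the
    codes the two institutions share (rows of shared_src and shared_tgt
    correspond to the same codes). Illustrates BONMI's alignment step only."""
    R, _ = orthogonal_procrustes(shared_src, shared_tgt)
    return X_src @ R
```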
Step 2.4: Embedding Validation. To evaluate the quality of the trained embeddings, PEHRT provides simple metrics quantifying their performance in discriminating between concept pairs with known relationships and randomly selected pairs. The relationships can be curated from existing ontologies and the UMLS. For each pair under consideration, the cosine similarity of the corresponding embedding vectors is calculated to measure their degree of relatedness. Embedding quality is then quantified by the AUC of the cosine similarity in distinguishing between the related and random pairs (i.e., the probability that a randomly selected related pair has a higher cosine similarity than a randomly selected random pair). These metrics can be used to evaluate the performance of the institution-specific EHR-based embeddings, the PLM-based embeddings, and the joint embedding when multi-institutional EHR data are available. In the latter case, we have found that the BONMI embeddings generally achieve the highest performance in a wide variety of applications, but we recommend comparing their performance with PLM-based and institution-specific embeddings for a thorough evaluation [71], [77], [87].
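A minimal sketch of this validation metric using scikit-learn; the code_index mapping and the pair lists are assumed inputs curated from ontologies or the UMLS.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def validation_auc(X: np.ndarray, code_index: dict,
                   related_pairs: list, random_pairs: list) -> float:
    """AUC of cosine similarity for discriminating known related pairs
    from randomly selected pairs (Step 2.4)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = lambda pairs: [float(Xn[code_index[a]] @ Xn[code_index[b]])
                         for a, b in pairs]
    scores = cos(related_pairs) + cos(random_pairs)
    labels = [1] * len(related_pairs) + [0] * len(random_pairs)
    return roc_auc_score(labels, scores)
```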
Below we illustrate how to use PEHRT to: (i) obtain pre-processed EHR data, (ii) develop embeddings using EHR data from multiple institutions for simple predictive modeling tasks, and (iii) perform an integrative predictive modeling task leveraging a joint embedding. We use data from Mass General Brigham (MGB), the Veterans Health Administration (VA), Boston Children's Hospital (BCH), the University of Pittsburgh Medical Center (UPMC), and MIMIC-IV. In our analysis, the MGB EHR data contain 2.5 million patients from 1998 to 2018. The VA Corporate Data Warehouse (CDW) aggregates data from 150 VA facilities into a single data warehouse, with records from 1999 to 2019 covering 12.6 million patients. The BCH data contain 251K patients from 2009 to 2022, and the UPMC EHR data include 95K patients from 2004 to 2022, focusing on individuals with at least one occurrence of ICD codes related to Alzheimer's disease and dementia or multiple sclerosis. The MIMIC-IV dataset contains data on over 65K ICU admissions and over 200K emergency department admissions at Beth Israel Deaconess Medical Center in Boston, Massachusetts, spanning 2008 to 2019.
We used Module 1 of PEHRT to pre-process EHR data from all of the institutions. For illustrative purposes, all of the pre-processing steps are implemented in our online tutorial for the MIMIC-IV dataset, beginning with instructions on how to set up a workspace, gain access to the MIMIC-IV data, and become familiar with the data structure and content. The inputs to Module 1 are the original data tables from the MIMIC-IV database, and the output is a pre-processed dataset. Pre-processing begins with data cleaning, which involves merging the appropriate data tables, standardizing the data format across tables, removing irrelevant and redundant information, constraining the data to the relevant time window, and processing the data in batches. Next, code roll-up is performed for diagnosis, procedure, and medication codes; MIMIC-IV uses common coding systems, so code mapping is not required. We generally recommend NILE for text processing, but show a lightweight example using a custom NLP module for purposes of illustration in our tutorial. Following the structured and unstructured data pre-processing, we also illustrate how to refine the data to a cohort for a specific analysis, using a study of asthma as an example.
Within each institution, we obtained EHR embeddings following the procedure in Module 2 based on PheCodes, CCS categories, RxNorm codes, LOINC codes, and local codes specific to a particular institution. Table 1 shows the number of unique codes across the five institutions and the various coding domains. As expected, there is substantial heterogeneity in the number of unique codes within each domain. We also obtained PLM-based embeddings using CODER, SapBERT, BioBERT, PubMedBERT, BGE, and OpenAI's text-embedding-3-small model.
Table 1: Number of unique codes by coding domain at each institution.

Institution | PheCode | CCS | RxNorm | LOINC | Local Codes | Total
---|---|---|---|---|---|---
MGB | 1772 | 243 | 1235 | 6370 | 0 | 9620
VA | 1776 | 224 | 1257 | 1034 | 2673 | 6964
BCH | 1543 | 209 | 1509 | 1942 | 0 | 5203
UPMC | 1841 | 245 | 1987 | 5833 | 8080 | 17986
MIMIC-IV | 637 | 129 | 959 | 0 | 2894 | 4619
Total | 1869 | 248 | 4103 | 11198 | 13366 | 30784
We evaluated the quality of the individual PLM embeddings, the joint embeddings trained with BONMI, and the joint embeddings integrated with CODER embeddings (BONMI+). The quality of the embeddings derived from the various methods was evaluated in detecting related versus random pairs of codes as described in Step 2.4. We also assessed embedding quality in mapping local lab codes in the VA data to LOINC/LP codes using \(11,808\) curated mappings from OMOP [88]. We reported the top-\(k\) accuracy for each set of embeddings, defined as the proportion of test cases in which the correct mapping for a given code appears among the top \(k\) predictions generated by the embeddings.
To highlight the practical utility of PEHRT, we used the trained embeddings to improve the identification and selection of relevant features for predictive modeling. We focused on eleven diseases: Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), Alzheimer’s Disease (AD), Depression (DP), Coronary Atherosclerosis (CA), Congestive Heart Failure (CHF), Congestive Heart Failure - Nonhypertensive (CHFN), Regional Enteritis (RE), Ulcerative Colitis (UC), Rheumatoid Arthritis (RA), and Rheumatoid Arthritis and Other Inflammatory Polyarthropathies (RAO). For each disease, we identified the top \(100\) features with the highest cosine similarities to the disease’s PheCode using each embedding method. Additionally, we randomly selected negative features from the complement of the union of features identified by all methods. To evaluate the accuracy of identifying relevant features, we assigned relevance scores (ranging from \(0\) to \(1\)) to each feature using GPT-4. We then computed the AUC for each method, treating the top \(100\) features as positive cases and the randomly selected features as negative cases, with the GPT-4 relevance scores serving as probabilities. A higher AUC indicates greater accuracy in selecting relevant features.
We also considered two predictive modeling tasks: predicting future disability status in multiple sclerosis (MS) patients and predicting time to nursing home admission or death in Alzheimer's disease (AD) patients. Both tasks were evaluated at UPMC and MGB based on models incorporating demographics (age at baseline, sex, race/ethnicity), healthcare utilization, and features selected using the procedure described in the previous paragraph. For model training, counts of the selected features and the number of visits, a measure of healthcare utilization, were aggregated over the pre-specified baseline period (i.e., 1 year for predicting future disability status and 2 years for predicting time to nursing home admission or death). We also log-transformed (\(x \mapsto \log(x+1)\)) the count features to improve the stability of model fitting. A lasso-penalized logistic regression model was trained for the disability status outcome and a lasso-penalized Cox proportional hazards model was trained for the time to nursing home admission or death outcome. The lasso penalty parameter was tuned through five-fold cross-validation.
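For the binary disability outcome, a sketch of this training recipe with scikit-learn is below; the feature standardization step is our own assumption rather than part of the described procedure, and the Cox model would be fit analogously with a lasso-penalized survival package (e.g., glmnet in R).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso_logistic(X_counts: np.ndarray, y: np.ndarray):
    """Lasso-penalized logistic regression with five-fold cross-validation;
    counts are log(x + 1)-transformed as described in the text."""
    X = np.log1p(X_counts)
    model = make_pipeline(
        StandardScaler(),  # scaling is our assumption, not from the text
        LogisticRegressionCV(penalty="l1", solver="liblinear",
                             cv=5, scoring="roc_auc"),
    )
    return model.fit(X, y)
```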
Table 2: AUC for discriminating known similarity and relatedness pairs from random pairs.

Task | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
Similarity | 0.916 | 0.966 | 0.950 | 0.755 | 0.537 | 0.565 | 0.951 | 0.801
Relatedness | 0.815 | 0.842 | 0.811 | 0.682 | 0.477 | 0.547 | 0.832 | 0.690
The embedding validation results for detecting known relationship pairs are summarized in Table 2. Overall, PEHRT-based embeddings outperform most PLM-based embeddings in terms of discrimination. Among PLM-based methods, OpenAI and CODER achieve the strongest results, yet they still fall short of BONMI+. This gap arises because PLM-based embeddings are primarily trained on biomedical text corpora and therefore fail to capture the nuanced disease patterns and clinical associations reflected in real-world EHR data. By contrast, the PEHRT-based BONMI+ embedding achieves the highest performance on both tasks as it draws on the representational strengths of PLMs while also integrating information across EHR data from multiple institutions. For code mapping, the results in Table 3 show that the PLM-based embeddings from SapBERT and OpenAI are superior to BONMI+. This result is expected, as code mapping relies heavily on the semantic meaning of code descriptions; it underscores our recommendation to validate the PLM-based embeddings individually, as they may be more appropriate for some tasks.
Table 3: Top-\(k\) accuracy (%) for mapping VA local laboratory codes to LOINC/LP codes.

Accuracy | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
Top-1 | 0.20 | 23.55 | 30.83 | 44.26 | 6.45 | 7.18 | 49.34 | 34.98
Top-5 | 2.64 | 50.66 | 58.38 | 66.05 | 8.94 | 12.60 | 79.92 | 52.96
Top-10 | 4.49 | 62.53 | 69.27 | 72.45 | 11.48 | 16.32 | 85.00 | 59.36
Top-20 | 8.70 | 74.55 | 76.70 | 76.84 | 14.70 | 21.25 | 88.72 | 65.36
For the feature selection task, Table 4 presents the rank correlation between the cosine similarities of the candidate features and the GPT-4 scores for the eleven target diseases. BONMI+ achieves the highest rank correlation on average. In particular, both BONMI and BONMI+ outperform the PLM-based embeddings, as feature selection inherently depends on relationships between codes and CUIs that are well captured in real-world EHR data.
Table 4: Rank correlation between the cosine similarities of candidate features and GPT-4 relevance scores for the eleven target diseases.

Disease | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
T1D | 0.385 | 0.429 | 0.295 | 0.144 | -0.072 | 0.045 | 0.369 | 0.138
T2D | 0.479 | 0.497 | 0.303 | 0.087 | -0.045 | 0.057 | 0.477 | 0.179
AD | 0.313 | 0.362 | 0.289 | 0.164 | -0.079 | 0.021 | 0.382 | 0.289
DP | 0.449 | 0.489 | 0.361 | 0.024 | 0.014 | 0.002 | 0.440 | 0.216
CA | 0.448 | 0.478 | 0.343 | 0.055 | -0.028 | 0.033 | 0.426 | 0.007
CHF | 0.484 | 0.540 | 0.444 | 0.377 | 0.035 | -0.032 | 0.444 | 0.113
CHFN | 0.687 | 0.735 | 0.607 | 0.464 | 0.035 | 0.174 | 0.642 | 0.078
RE | 0.289 | 0.252 | 0.115 | 0.059 | 0.080 | 0.005 | 0.206 | 0.107
UC | 0.262 | 0.215 | 0.067 | 0.048 | -0.006 | 0.029 | 0.267 | 0.163
RA | 0.328 | 0.291 | 0.184 | 0.030 | 0.034 | -0.002 | 0.338 | 0.073
RAO | 0.499 | 0.463 | 0.249 | 0.223 | 0.054 | 0.000 | 0.490 | 0.044
Average | 0.420 | 0.432 | 0.296 | 0.152 | 0.002 | 0.030 | 0.407 | 0.128
Figures 2 and 3 present the AUCs of the models for MS disability prediction and for time to nursing home admission or death in AD patients at MGB and UPMC, respectively. Consistent with our results measuring the quality of feature selection, models incorporating the BONMI- and BONMI+-selected features have the strongest performance. Interestingly, models with features selected by the institution-specific EHR embeddings achieved better performance than those selected by the PLM embeddings at MGB, but not at UPMC. This finding underscores our recommendation to validate multiple embeddings, as results can vary across tasks and institutions.
Data harmonization is essential for ensuring the validity, transparency, and reproducibility of multi-institutional EHR-based research. However, significant heterogeneity across data sources complicates harmonization and no comprehensive and standardized procedures currently exist to address this challenge. To fill this gap, we introduced PEHRT, a common pipeline for harmonization of EHR data for translational applications. PEHRT operates entirely on summary-level data and preserves data privacy. We designed our pipeline for easy implementation through our online tutorial and suite of resources, including R and Python modules, notebooks, and APIs. We also demonstrated the utility of our pipeline in several modeling tasks using data from five healthcare systems. Beyond these applications, PEHRT supports a wide range of scientific objectives, including phenotyping, cross-institutional clinical studies, knowledge graph construction, and federated learning, making it a versatile tool for advancing clinical research and practice [87].
The authors gratefully acknowledge [funding sources to be listed in the unblinded version].
https://platform.openai.com/docs/guides/embeddings/embedding-models