PEHRT: A Common Pipeline for Harmonizing Electronic Health Record Data for Translational Research

Jessica Gronsbell, Vidul Ayakulangara Panickan, Chris Lin, Thomas Charlon, Chuan Hong, Doudou Zhou, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan, Tianxi Cai

September 10, 2025

Abstract: Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce PEHRT, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and, based on our extensive real-world experience, designed for streamlined execution across institutions. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.

Keywords: Data harmonization, Data pre-processing, Electronic health records, Integrative analysis, Representation learning
The growing availability of data from Electronic Health Records (EHRs) has transformed translational biomedical research. In the past decade, EHR data has been harnessed in a wide range of applications that have improved healthcare delivery and deepened our understanding of human health. These applications include dynamic risk prediction of diseases, real-world treatment comparisons, development of medical knowledge graphs, and a broad range of genomic studies [1]–[11]. To fully leverage the potential of these applications, integrative analysis of EHR data across diverse healthcare settings has emerged as a key strategy to enhance the generalizability of scientific findings, boost statistical power, and support the development of robust models for precision medicine. The COVID-19 pandemic, in particular, catalyzed a new era of multi-institutional EHR-based research as several international collaborative networks were rapidly established to conduct large-scale, federated studies [12]–[14]. These initiatives significantly accelerated knowledge generation and amplified the impact of EHR-based research on the treatment and management of COVID-19.
Progress notwithstanding, there are numerous barriers to effectively utilizing multi-institutional EHR data in translational applications. A key challenge is the lack of semantic interoperability across EHR systems, which results in substantial heterogeneity in clinical documentation and medical coding practices [15]–[19]. The foundation of any collaborative research study therefore rests on careful standardization of data elements across different data sources, a process known as data harmonization. Currently, there are no universally accepted or standardized procedures for harmonizing EHR data for an integrative analysis, despite the importance of such standards for ensuring the validity, transparency, and reproducibility of research findings [20]. The significance of proper data preparation became particularly evident during the COVID-19 pandemic when two high-profile studies published in The Lancet and The New England Journal of Medicine were retracted within months of publication [21], [22]. Despite passing some of the most rigorous peer review, the authors could not verify the data or the processing procedures that underpinned the validity of their conclusions. These incidents highlight the need for comprehensive and rigorous standards for harmonization to ensure the scientific integrity and credibility of collaborative research.
To address this need, we developed PEHRT, an efficient and comprehensive pipeline for harmonizing EHR data for translational biomedical research. PEHRT consists of two core modules: (1) data pre-processing and (2) representation learning. Our pipeline maps raw EHR data to standardized coding systems and uses advanced machine learning techniques to efficiently curate a multi-institutional EHR dataset without sharing individual-level data or requiring that the data be represented in any particular data model. The output of PEHRT is a robust, research-ready dataset suitable for a wide range of scientific studies across healthcare institutions, including medical knowledge graph construction, phenotyping, predictive modeling, clinical studies, and federated learning. Importantly, PEHRT is available as open source R and Python software that is fully documented and executed within a user-friendly online tutorial (https://celehs.github.io/PEHRT/). We further illustrate the utility and execution of PEHRT in several downstream tasks using diverse EHR data from multiple healthcare systems.
PEHRT was motivated by recent efforts in establishing federated networks of EHR data for translational and Artificial Intelligence (AI) research, including the Consortium for Clinical Characterization of COVID-19 by EHR (4CE) and the Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity (AIM-AHEAD) program [12], [14]. 4CE is an international research collaboration that was established in 2021 to study COVID-19 [12]. With nearly 100 hospitals across seven countries, 4CE successfully harmonized EHR data to investigate the epidemiology and clinical characteristics of COVID-19 across healthcare systems [23]. The consortium’s work provided critical insights into temporal trends in laboratory values, demographic variations, and the effects of pre-existing conditions on patient outcomes [12], [24], [25]. Successful federated EHR networks such as 4CE have set a new standard for managing the complexities of diverse EHR data in collaborative research by demonstrating the importance of high-quality data processing for producing trustworthy scientific results [26]. AIM-AHEAD is pursuing a similar strategy by developing its own federated network, with a focus on leveraging AI and machine learning applied to EHR data to help reduce health disparities.
PEHRT was informed by lessons learned from conducting translational studies across multiple EHR systems within these networks. Our pipeline improves the efficiency and thoroughness of EHR data harmonization to provide researchers with a strong framework for conducting valid, transparent, and reproducible collaborative biomedical research. PEHRT is equipped with a suite of resources, including several R and Python packages with detailed documentation and example notebooks, web Application Programming Interfaces (APIs) for data visualization, and a dataset used to assist researchers in applying PEHRT for their own purposes. The only requirement to use PEHRT is that the EHR data of interest are available in a relational database.
The output of PEHRT is a research-ready dataset that integrates EHR data from multiple institutions without the sharing of individual-level data to adhere to data privacy standards [27]–[30]. Datasets from PEHRT can be used for many of the same purposes as data from a single healthcare institution, but with the goal of reaching more generalizable scientific conclusions. For example, PEHRT enables the construction of medical knowledge graphs as well as the precise identification of patient cohorts with specific phenotypes for applications in risk prediction, drug efficacy assessment, and epidemiological studies [6], [10], [31]–[33]. When EHR data are linked to specimen biorepositories, PEHRT can be applied upstream of genetic studies, such as Phenome-Wide Association Studies (PheWAS) that uncover the association between a novel biomarker and a set of clinical or demographic phenotypes [34]–[39]. Additionally, our pipeline can be used to curate data for real-world evidence generation, post-marketing device surveillance, and clinical decision support tool development [29], [30], [40]–[43]. Federated learning, which enables statistical inference and machine learning across multiple decentralized data sources, can also be implemented downstream of PEHRT [44].
Existing research has primarily focused on specific aspects of EHR data preparation within individual institutions, including data cleaning, data standardization, medical code aggregation, and quality assessment [45]–[48]. Data cleaning involves transforming and normalizing raw EHR data, such as converting relational databases into flat file formats, conducting exploratory data analysis, detecting anomalies, and scaling and transforming data [20], [49]–[51]. Standardization involves mapping raw data to common data models and aligning medical codes with established medical coding systems or ontologies. Open-source tools, including Electronic Health Record Quality Control (EHR-QC), Cohort Migrator Toolkit (CMToolkit), and the Observational Health Data Sciences and Informatics (OHDSI) network's Themis, are available to convert data to the widely used Observational Medical Outcomes Partnership-Common Data Model (OMOP-CDM) [20], [48], [52], [53].
Following standardization, medical codes are aggregated or “rolled up” into broader medical concepts to represent clinically meaningful variables, as disaggregated data are often too granular for research purposes. For example, codes within standard medical coding systems, such as the International Classification of Diseases (ICD) for diagnoses or National Drug Codes (NDC) for medications, are typically rolled up into higher-level concepts using established ontologies. Code roll-up can be done manually or with machine learning approaches [54]. Lastly, quality assessment of EHR data is conducted using established criteria or open-source tools, such as the Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) or the Data Quality Dashboard (DQD) from the OHDSI network [20], [55]–[57]. To the best of our knowledge, ehrapy is the only end-to-end tool currently available for the curation and analysis of EHR data [58]. ehrapy is a modular, open-source Python framework designed for exploratory data analysis and consists of modules for data preprocessing and ontology mapping as well as analysis tools for causal inference, survival analysis, and patient stratification.
In spite of the large volume of work devoted to EHR data preparation, significant gaps remain when working with multi-institutional EHR data [59]. Existing tools designed for data from a single institution fail to address the variability in coding practices across institutions, a key challenge of integrative analysis. Many health systems use local medical codes (i.e., codes specific to their system) that are not mapped to standardized coding systems. To enable analysis, these local codes must first be standardized and harmonized across datasets. Traditionally, standardization has been achieved by mapping local codes within specific domains (e.g., diagnostic or medication codes) to standard coding systems, either manually or using automated tools like Medication Extraction and Normalization (MedXN) [20], [49], [60]. However, recent advances in Large Language Models (LLMs) and representation learning have facilitated the generation of semantic embeddings, which are vector representations that capture the meanings of EHR codes and their relationships. Embeddings significantly enhance the efficiency and accuracy of data standardization, which is a critical aspect of preparing multi-institutional EHR data. To the best of our knowledge, existing tools, such as ehrapy, lack user-friendly modules for code roll-up or for standardizing local codes and do not incorporate advances in representation learning for this purpose.
A key innovation of PEHRT is its inclusion of code and documentation for state-of-the-art methods for representation learning and harmonization that generate semantic embeddings from summary-level EHR data from multiple institutions and from LLMs. PEHRT also includes detailed protocols for data pre-processing that are not fully integrated into any existing tools. For example, rolling up medical codes to higher-level concepts is especially challenging for researchers unfamiliar with EHR data, as multiple ontologies may represent a single concept. PEHRT provides researchers with detailed guidance on medical code roll-up as well as general instructions for processing a broad range of structured data (e.g., diagnostic codes, medication prescriptions, laboratory tests, procedures) and unstructured data in the form of free-text (e.g., progress notes, radiology reports).
PEHRT consists of two modules: (1) data pre-processing and (2) representation learning. The inputs of PEHRT are original EHR datasets from one or more institutions and the outputs are robust, research-ready datasets that are suitable for a wide range of scientific purposes. In the setting of multi-institutional data, PEHRT outputs a harmonized dataset that harnesses information across the different data sources. One of our key contributions is an online tutorial (see Figure 1), which guides users through each step of PEHRT using publicly available EHR data from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database.
Prior to utilizing PEHRT, it is important for researchers to familiarize themselves with their EHR data sources, including relevant documentation, data structure, and coding systems. The data must be stored in a relational database, but need not be represented in a common data model. Additional details about the equipment and software requirements for PEHRT can be found in the tutorial introduction.
Figure 1: PEHRT enables users to prepare multi-institution Electronic Health Record (EHR) data for a variety of scientific purposes with two modules: (1) data pre-processing and (2) representation learning. Each step of PEHRT is implemented in our user-friendly online tutorial using publicly available EHR data.
Data pre-processing is a meticulous process that involves several sub-steps, including (1.1) data cleaning, (1.2) code mapping and roll-up, (1.3) Natural Language Processing (NLP) of free-text data, and (1.4) cohort creation. Pre-processing is performed on each EHR dataset that is input to PEHRT. The goals of data pre-processing are to transform raw EHR data into a more usable format and to standardize data across institutions to support integrative analysis and consistent data interpretation. The PEHRT pipeline enables processing of a broad range of structured data, including diagnostic codes, medication prescriptions, laboratory tests, and procedure codes. Prior to pre-processing, it is necessary to set up the computing environment and extract the desired data; details are provided in Module 1 of the tutorial.
Step 1.1: Data Cleaning. PEHRT employs a multi-step data cleaning process to enhance the quality of noisy and fragmented EHR data. Data cleaning generally begins by merging relevant data tables and standardizing the data format across tables. For example, standardizing how time is represented across the EHR system is often necessary: some data tables may include exact timestamps while others only contain dates. Time entries are often standardized by retaining only the date component, creating a consistent format for daily-level analysis. Next, variables irrelevant to downstream analytical tasks are excluded to improve computational efficiency and reduce memory demands. Additionally, since EHR data frequently include errors, particularly in time-related fields, records with implausible dates, such as those prior to the 1980s or beyond the current year, are identified and removed. Lastly, exact duplicate records, which may arise during the aggregation of timestamp-level data into a daily format, are removed to produce a cleaned dataset. When working with large EHR datasets, we recommend processing the data in batches, as illustrated in our online tutorial.
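To make these steps concrete, below is a minimal pandas sketch of the cleaning logic described above; the file and column names (e.g., `diagnoses.csv`, `diag_date`) are hypothetical placeholders rather than names prescribed by PEHRT.

```python
import pandas as pd

def clean_ehr_table(df: pd.DataFrame, date_col: str = "diag_date") -> pd.DataFrame:
    """Standardize dates, drop implausible records, and de-duplicate."""
    df = df.copy()
    # Standardize time: keep only the date component for daily-level analysis.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce").dt.normalize()
    # Remove records with missing or implausible dates
    # (before 1980 or beyond the current date).
    today = pd.Timestamp.today().normalize()
    df = df[(df[date_col] >= pd.Timestamp("1980-01-01")) & (df[date_col] <= today)]
    # Remove exact duplicates created by collapsing timestamps to dates.
    return df.drop_duplicates()

# Hypothetical batch processing of a large extract to limit memory use.
for chunk in pd.read_csv("diagnoses.csv", chunksize=500_000):
    clean_ehr_table(chunk).to_csv("diagnoses_clean.csv", mode="a",
                                  header=False, index=False)
```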
Step 1.2: Code Mapping and Roll-Up. Medical codes are often too specific for research studies. To address this issue, PEHRT standardizes codes by mapping them to recognized coding systems and then aggregates or “rolls up” the codes into higher-level categories across the domains of interest. Code roll-up provides consistency across diverse EHR datasets while also ensuring data are at an appropriate level of granularity. PEHRT implements the standardization and roll-up process by mapping granular codes to higher-level concepts across four domains: diagnoses, medications, laboratory tests, and procedures.
To standardize the four coding domains, we use established medical coding systems: (1) International Classification of Diseases, Ninth or Tenth Revision (ICD-9 or ICD-10, respectively) for diagnoses, (2) Prescription Normalized Names and Codes (RxNorm) for medications, (3) Logical Observation Identifiers Names and Codes (LOINC) for laboratory measurements, and (4) Current Procedural Terminology, Fourth Edition (CPT-4), Healthcare Common Procedure Coding System (HCPCS), and ICD, Ninth or Tenth Revision, Procedure Coding System (ICD-9-PCS and ICD-10-PCS, respectively) for procedures. Due to the transition of ICD-9 codes to ICD-10 codes in 2015, older diagnosis data are largely represented by ICD-9 while recent data use ICD-10. It is critical to use mappings that synchronize the two versions when using longitudinal data before and after 2015 [61].
Following standardization, codes are rolled up to higher-level medical concepts according to common ontologies. For diagnoses, we recommend the Phenotype Code (PheCode) hierarchy for ICD codes [61]. The PheCode hierarchy provides a total of 1875 integer, 1-digit, and 2-digit level codes that capture a wide range of disease conditions with sufficient granularity while maintaining a reasonable number of distinct codes. The hierarchy also provides parent-child relationships that characterize associations between PheCodes. For medications, we recommend rolling up RxNorm codes to RxNorm ingredient codes unless the study specifically requires dosage information. For studies involving drug classes, these ingredient-level codes can be further rolled up into drug classes according to existing ontologies, including the Anatomical Therapeutic Chemical (ATC) classification, the Accrual to Clinical Trials (ACT) ontology, or the Veterans Affairs (VA) drug classes, depending on the researchers' needs [62], [63]. Laboratory measurements for the same analyte can vary due to differences in the specimen, time of measurement, method, or scale, resulting in multiple LOINC codes. We recommend rolling up LOINC codes to the lowest level of LOINC part (LP) according to the LOINC component hierarchy [64]. Note that PEHRT supports only laboratory codes, not laboratory result values; preparing result data is an involved process that requires unit harmonization and specialized quality control by informatics experts familiar with the EHR datasets of interest.
Unfortunately, few established hierarchies exist for procedure codes. We recommend rolling up procedure codes into categories according to the Clinical Classification Software (CCS). Many institutions use both CPT and ICD procedure codes, so it is important to include both when rolling up codes. Additionally, medications are sometimes coded as procedures in EHRs due to the way certain treatments are administered or billed. As such, it is necessary to map medication procedure codes to the relevant RxNorm codes. PEHRT includes visualizations of the ICD, LOINC, and RxNorm hierarchies within a searchable and downloadable web API (see the visualizations linked in the online tutorial).
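The sketch below illustrates the roll-up idea for diagnoses, assuming a hypothetical ICD-to-PheCode mapping table; the column names and PheCode values are illustrative, and in practice the published PheCode mapping files for ICD-9-CM and ICD-10-CM would be loaded instead.

```python
import pandas as pd

# Hypothetical ICD-to-PheCode mapping table; PheCode values are illustrative.
icd_to_phecode = pd.DataFrame({
    "icd_code":    ["E11.9", "E11.65", "I50.9"],
    "icd_version": ["ICD10", "ICD10", "ICD10"],
    "phecode":     ["250.2", "250.2", "428.1"],
})

diagnoses = pd.DataFrame({
    "patient_id":  [1, 1, 2],
    "icd_code":    ["E11.9", "E11.65", "I50.9"],
    "icd_version": ["ICD10", "ICD10", "ICD10"],
})

# Roll up: join diagnoses to the map, then truncate to the integer-level
# parent PheCode when coarser granularity is desired.
rolled = diagnoses.merge(icd_to_phecode, on=["icd_code", "icd_version"], how="left")
rolled["phecode_integer"] = rolled["phecode"].str.split(".").str[0]
print(rolled[["patient_id", "phecode", "phecode_integer"]])
```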
Step 1.3: Natural Language Processing. When free-text clinical notes are available, one may employ natural language processing (NLP) tools to extract clinical concepts from unstructured notes by identifying and mapping terms such as diseases, symptoms, and medications to Concept Unique Identifiers (CUIs) in the Unified Medical Language System (UMLS). Existing NLP software tools like NILE, cTAKES, or MetaMap enable this extraction, allowing for semantic analysis and structured representation of clinical text, which is then integrated into the dataset for downstream analysis [65]–[67]. We previously introduced a pipeline for EHR phenotyping, which contains detailed steps for running NLP as well as an online tutorial (https://celehs.github.io/PheCAP/) [68].
Step 1.4: Cohort Creation. EHR-based studies are typically conducted on a group of patients who meet specific inclusion/exclusion criteria, such as those with certain diagnoses, medications, procedures, or implanted devices. PEHRT streamlines cohort identification by leveraging the standardized and rolled-up codes from Step 1.2. For example, when a cohort is defined by a particular disease diagnosis, a common strategy is to identify patients using the corresponding ICD codes [18], [69], [70]. However, ICD codes can be overly granular, which often leads different studies to use inconsistent sets of ICD codes to capture the condition of interest. To address this issue, PEHRT utilizes PheCodes from Step 1.2 to identify patients associated with the condition of interest. For the identified cohort, PEHRT then aggregates the structured data from Steps 1.1–1.2 as well as the CUIs derived from unstructured notes in Step 1.3 if free-text data are available for analysis. For studies involving temporal analysis, we recommend further aggregating patient-level longitudinal data into time windows, such as monthly counts or averages. For chronic conditions like rheumatoid arthritis, monthly aggregation typically provides sufficient granularity while simplifying downstream analysis.
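As a minimal sketch under the same assumptions as above (a rolled-up table with patient_id, date, and phecode columns), cohort identification and monthly aggregation might look like the following; the PheCode for rheumatoid arthritis (714.1) is used purely as an example.

```python
import pandas as pd

def build_cohort(rolled: pd.DataFrame, target_phecode: str,
                 min_count: int = 1) -> pd.DataFrame:
    """Identify patients with at least min_count occurrences of a PheCode."""
    counts = (rolled.loc[rolled["phecode"] == target_phecode]
                    .groupby("patient_id").size())
    return counts[counts >= min_count].reset_index(name="phecode_count")

def monthly_counts(rolled: pd.DataFrame, cohort_ids) -> pd.DataFrame:
    """Aggregate longitudinal codes into monthly counts for cohort patients."""
    df = rolled[rolled["patient_id"].isin(cohort_ids)].copy()
    df["month"] = df["date"].dt.to_period("M")
    return (df.groupby(["patient_id", "month", "phecode"])
              .size().reset_index(name="count"))

# Example usage for a rheumatoid arthritis cohort:
# cohort = build_cohort(rolled, target_phecode="714.1")
# features = monthly_counts(rolled, cohort["patient_id"])
```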
Following data pre-processing, representation learning is used to develop institution-specific embeddings. Embeddings are vector representations of the EHR data that capture the semantic and relational properties of codes from structured data and CUIs from free-text. The embeddings can be used for a variety of downstream tasks within each institution, including medical knowledge graph construction, phenotyping, and predictive modeling [49], [71]. If multi-institutional EHR data are available, PEHRT also contains a module that implements a novel matrix-completion technique to train a joint embedding [72]. The joint embedding leverages information across the data sources without requiring the sharing of individual-level data and can be used for collaborative analyses. Additionally, PEHRT incorporates embeddings from pre-trained language models (PLMs) into its representation learning module to further enhance the quality of the learned data representations. Using strategies similar to those in [71], we structure Module 2 to have four sub-steps: (2.1) EHR embedding training, (2.2) PLM-based embedding generation, (2.3) joint multi-institutional EHR embedding training, and (2.4) embedding validation.
Step 2.1: EHR Embedding Training. PEHRT first generates EHR embeddings from summary-level data using the Singular Value Decomposition of the Pointwise Mutual Information (SVD-PMI) algorithm [73]. This method factorizes a PMI matrix constructed from co-occurrence counts of codes and CUIs. As a variant of the widely adopted word2vec algorithm, SVD-PMI has proven to be highly effective in learning meaningful and interpretable clinical embeddings [74].
The SVD-PMI algorithm consists of three steps. First, a co-occurrence matrix \({\boldsymbol{C}} = [C(w, c)]\) is constructed, where each element represents the number of patients in which a target code or CUI \(w\) co-occurs with a context code or CUI \(c\) within a predefined time window (e.g., 30 days). This matrix captures the local context of clinical concepts and provides a foundation for computing semantic similarity. Because calculating \({\boldsymbol{C}}\) at scale is computationally intensive, we developed an optimized algorithm for efficient co-occurrence computation in our prior work, enabling scalable training of PEHRT embeddings for large EHR datasets [75]–[77].
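For intuition, a naive construction of the patient-level co-occurrence counts could look like the sketch below; the column names are assumptions, the per-patient loop is quadratic, and the optimized algorithm cited above should be used at scale.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

def cooccurrence_counts(df: pd.DataFrame, window_days: int = 30) -> dict:
    """Patient-level co-occurrence: C[(w, c)] = number of patients in whom
    codes/CUIs w and c occur within window_days of each other.
    Assumes columns: patient_id, date (datetime64), code."""
    C = defaultdict(int)
    for _, g in df.groupby("patient_id"):
        pairs = set()
        for (d1, w), (d2, c) in combinations(zip(g["date"], g["code"]), 2):
            if w != c and abs((d1 - d2).days) <= window_days:
                pairs.add((min(w, c), max(w, c)))
        # Each patient contributes at most once per code pair.
        for pair in pairs:
            C[pair] += 1
    return C
```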
Next, the co-occurrence matrix is used to calculate the shifted positive PMI (SPPMI) matrix, which represents the relationships among codes and CUIs. The SPPMI matrix is defined as \[\text{SPPMI}(w, c) = \max \left\{ \log \frac{C(w, c) \cdot |D|}{C(w, \cdot)\, C(\cdot, c)} - \log k,\; 0 \right\}\] where \(C(w, \cdot)\) is the row sum of \(C(w,c)\), \(C( \cdot, c)\) is the column sum of \(C(w,c)\), \(|D|\) is the total sum of the co-occurrence counts, and \(k\) is the negative sampling rate. We have found that the embedding quality is generally not sensitive to the length of the time window, but is best when \(k = 1\) [49]. Lastly, the SPPMI matrix is decomposed with its rank-\(d\) SVD, represented as \({\boldsymbol{Q}}_{d} {\boldsymbol{\Lambda}}_{d} {\boldsymbol{Q}}_{d}^{\top}\). PEHRT outputs the \(d\)-dimensional embedding vectors as \(\mathbf{X}_{\scriptscriptstyle \sf EHR}= {\boldsymbol{Q}}_d {\boldsymbol{\Lambda}}_{d}^{1/2}\). To select \(d\), we recommend retaining a large amount of variation in the SVD (e.g., 95%) by evaluating the eigenvalue decay [75], [78]. Alternatively, \(d\) can be selected by maximizing the area under the receiver operating characteristic curve (AUC) for discriminating between pairs of codes and CUIs with known relationships against randomly selected pairs (see Step 2.4 for further details) [79], [80].
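A compact numpy sketch of these steps, assuming a dense symmetric co-occurrence matrix (for large vocabularies, a sparse truncated SVD such as scipy.sparse.linalg.svds would be used instead):

```python
import numpy as np

def sppmi_embeddings(C: np.ndarray, d: int, k: float = 1.0) -> np.ndarray:
    """SVD-PMI: embed codes/CUIs from a symmetric co-occurrence matrix C."""
    row = C.sum(axis=1, keepdims=True)    # C(w, .)
    col = C.sum(axis=0, keepdims=True)    # C(., c)
    total = C.sum()                       # |D|
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col)) - np.log(k)
    # Clip the shifted PMI at zero; zero counts yield zero SPPMI entries.
    sppmi = np.maximum(np.nan_to_num(pmi, nan=0.0, neginf=0.0), 0.0)
    # Rank-d SVD of the symmetric SPPMI matrix: X = Q_d Lambda_d^{1/2}.
    U, s, _ = np.linalg.svd(sppmi, hermitian=True)
    return U[:, :d] * np.sqrt(s[:d])
```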
Step 2.2: PLM-Based Embeddings. The SVD-PMI embeddings are derived from the co-occurrence matrix and therefore capture the meaning of codes and CUIs based on how they are used within a healthcare system. Additional semantic information about the meaning of codes and CUIs can be obtained from their textual descriptions to complement this system-level perspective. To leverage textual descriptions in embedding training, PEHRT produces a second set of embeddings using PLMs. PLMs are trained on large text corpora and, in some cases, further fine-tuned with biomedical knowledge sources such as PubMed articles, clinical notes, and knowledge graphs. Commonly used PLMs include Self-Aligned Pre-trained Bidirectional Encoder Representations from Transformers (SapBERT), ClinicalBERT, Cross-lingual knowledge-infused medical term embedding (CODER), PubMedBERT, BERT for Biomedical Text Mining (BioBERT), and BAAI General Embeddings (BGE), many of which were fine-tuned from the original BERT model [81]–[86]. Given the text string of a code or CUI, a PLM produces a corresponding embedding vector. PEHRT contains embeddings from many common PLMs, including CODER, SapBERT, PubMedBERT, and BioBERT. Obtaining the PLM-based embeddings is generally not computationally burdensome, though users can alternatively use OpenAI's text-embedding-3-small model via the OpenAI API.
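As an illustration, embeddings for code descriptions can be obtained from a PLM with the Hugging Face transformers library; the SapBERT checkpoint name below is our assumption of a commonly used model identifier, and any of the PLMs listed above could be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any BERT-style PLM can be substituted.
MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def plm_embed(descriptions: list) -> torch.Tensor:
    """Embed code/CUI text descriptions using the [CLS] representation."""
    batch = tokenizer(descriptions, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0, :]  # one [CLS] vector per description

X_plm = plm_embed(["Type 2 diabetes mellitus", "Metformin", "Hemoglobin A1c"])
```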
When working with data from a single institution, the PLM-based embeddings can be integrated with the SVD-PMI embeddings from Step 2.1 to enhance overall embedding quality. A simple yet effective approach is to create a weighted concatenation of the two embeddings, with the weighting adjusted to the specific downstream task. Specifically, we let \[\mathbf{X}_{\scriptscriptstyle \sf INT}= [ w \mathbf{X}_{\scriptscriptstyle \sf EHR}, (1-w)\mathbf{X}_{\scriptscriptstyle \sf PLM}] \label{eq:w}\tag{1}\] where \(\mathbf{X}_{\scriptscriptstyle \sf PLM}\) is the PLM-based embedding and \(w \in [0,1]\) is the weight. The integrated embedding captures the complementary strengths of SVD-PMI and PLM-based embeddings: SVD-PMI excels at identifying clinically related codes (e.g., drug-disease pairs), while PLMs capture semantic similarity between codes [33], [84].
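A minimal sketch of Equation (1) follows; the row-wise L2 normalization is our own assumption to put the two blocks on a comparable scale before weighting, and \(w\) can be tuned using the validation metrics in Step 2.4.

```python
import numpy as np

def integrate(X_ehr: np.ndarray, X_plm: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted concatenation of EHR and PLM embeddings (Equation 1).
    Row-wise L2 normalization (our assumption) puts the two blocks on a
    comparable scale so that w, not the raw norms, controls the balance."""
    norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.hstack([w * norm(X_ehr), (1 - w) * norm(X_plm)])
```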
Step 2.3: Joint Multi-Institution EHR Embedding Training. When data from multiple institutions are available, PEHRT uses the BONMI algorithm [72] to derive a shared representation of EHR concepts by aligning and completing institution-specific SPPMI matrices efficiently with near-optimal error bounds. BONMI constructs an aggregated matrix covering all unique codes and CUIs, assigning weighted averages to overlapping pairs and marking the others as missing. The weights are based on data quality using user-defined or data-driven metrics, and the missing values are imputed by aligning institution-specific embeddings via orthogonal transformations. The completed matrix is then factorized with an SVD to generate the joint embedding, with rank selection as in Step 2.1. The joint embedding can be integrated with PLM-based embeddings through weighted concatenation, following the procedure in Step 2.2.
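The core alignment idea can be illustrated with an orthogonal Procrustes problem; this is only a sketch of the rotation step behind BONMI's imputation, not the full algorithm.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_to_target(X_src: np.ndarray, shared_src: np.ndarray,
                    shared_tgt: np.ndarray) -> np.ndarray:
    """Rotate one institution's embeddings into another's space using the
    codes the two institutions share (rows of shared_src and shared_tgt
    correspond to the same codes). Illustrates BONMI's alignment step only."""
    R, _ = orthogonal_procrustes(shared_src, shared_tgt)
    return X_src @ R
```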
Step 2.4: Embedding Validation. To evaluate the quality of the trained embeddings, PEHRT provides simple metrics quantifying their performance in discriminating between concept pairs with known relationships and randomly selected pairs. The relationships can be curated from existing ontologies and the UMLS. For each pair under consideration, the cosine similarity of the corresponding embedding vectors is calculated to measure their degree of relatedness. Embedding quality is then quantified by the AUC of the cosine similarity in distinguishing between the related and random pairs (i.e., the probability that a randomly selected related pair has a higher cosine similarity than a randomly selected random pair). These metrics can be used to evaluate the performance of the institution-specific EHR-based embeddings, the PLM-based embeddings, and the joint embedding when multi-institutional EHR data are available. In the latter case, we have found that the BONMI embeddings generally achieve the highest performance in a wide variety of applications, but we recommend comparing their performance with PLM-based and institution-specific embeddings for a thorough evaluation [71], [77], [87].
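A minimal sketch of this validation metric using scikit-learn; the code_index mapping and the pair lists are assumed inputs curated from ontologies or the UMLS.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def validation_auc(X: np.ndarray, code_index: dict,
                   related_pairs: list, random_pairs: list) -> float:
    """AUC of cosine similarity for discriminating known related pairs
    from randomly selected pairs (Step 2.4)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = lambda pairs: [float(Xn[code_index[a]] @ Xn[code_index[b]])
                         for a, b in pairs]
    scores = cos(related_pairs) + cos(random_pairs)
    labels = [1] * len(related_pairs) + [0] * len(random_pairs)
    return roc_auc_score(labels, scores)
```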
Below we illustrate how to use PEHRT to: (i) obtain pre-processed EHR data, (ii) develop embeddings using EHR data from multiple institutions for simple predictive modeling tasks, and (iii) perform an integrative predictive modeling task leveraging a joint embedding. We use data from Mass General Brigham (MGB), the Veterans Health Administration (VA), Boston Children's Hospital (BCH), the University of Pittsburgh Medical Center (UPMC), and MIMIC-IV. In our analysis, the MGB EHR data contain 2.5 million patients from 1998 to 2018. The VA Corporate Data Warehouse (CDW) aggregates data from 150 VA facilities into a single data warehouse, with records from 1999 to 2019 covering 12.6 million patients. The BCH data contain 251K patients from 2009 to 2022, and the UPMC EHR data include 95K patients from 2004 to 2022, focusing on individuals with at least one occurrence of ICD codes related to Alzheimer's disease and dementia or multiple sclerosis. The MIMIC-IV dataset contains data on over 65K ICU admissions and over 200K emergency department admissions at Beth Israel Deaconess Medical Center in Boston, Massachusetts, spanning 2008 to 2019.
We used Module 1 of PEHRT to pre-process EHR data from all of the institutions. For illustrative purposes, all of the pre-processing steps are implemented in our online tutorial for the MIMIC-IV dataset, beginning with instructions on how to set up a workspace, gain access to the MIMIC-IV data, and become familiar with the data structure and content. The inputs to Module 1 are the original data tables from the MIMIC-IV database, and the output is a pre-processed dataset. Pre-processing begins with data cleaning, which involves merging the appropriate data tables, standardizing the data format across tables, removing irrelevant and redundant information, constraining the data to the relevant time window, and processing the data in batches. Next, code roll-up is performed for diagnosis, procedure, and medication codes; MIMIC-IV uses common coding systems, so code mapping is not required. We generally recommend NILE for text processing, but show a lightweight example using a custom NLP module for purposes of illustration in our tutorial. Following the structured and unstructured data pre-processing, we also illustrate how to refine the data to a cohort for a specific analysis, using a study of asthma as an example.
Within each institution, we obtained EHR embeddings following the procedure in Module 2 based on PheCodes, CCS categories, RxNorm codes, LOINC codes, and local codes specific to a particular institution. Table 1 shows the number of unique codes across the five institutions and the various coding domains. As expected, there is substantial heterogeneity in the number of unique codes within each domain. We also obtained PLM-based embeddings using CODER, SapBERT, BioBERT, PubMedBERT, BGE, and OpenAI's text-embedding-3-small model.
Table 1: Number of unique codes by coding domain at each institution.

Institution | PheCode | CCS | RxNorm | LOINC | Local Codes | Total
---|---|---|---|---|---|---
MGB | 1772 | 243 | 1235 | 6370 | 0 | 9620
VA | 1776 | 224 | 1257 | 1034 | 2673 | 6964
BCH | 1543 | 209 | 1509 | 1942 | 0 | 5203
UPMC | 1841 | 245 | 1987 | 5833 | 8080 | 17986
MIMIC-IV | 637 | 129 | 959 | 0 | 2894 | 4619
Total | 1869 | 248 | 4103 | 11198 | 13366 | 30784
We evaluated the quality of the individual PLM embeddings, the joint embeddings trained with BONMI, and the joint embeddings integrated with CODER embeddings (BONMI+). The quality of the embeddings derived from the various methods was evaluated in detecting related versus random pairs of codes as described in Step 2.4. We also assessed embedding quality in mapping local lab codes in the VA data to LOINC/LP codes using \(11,808\) curated mappings from OMOP [88]. We reported the top-\(k\) accuracy for each set of embeddings, defined as the proportion of test cases in which the correct mapping for a given code appears among the top \(k\) predictions generated by the embeddings.
To highlight the practical utility of PEHRT, we used the trained embeddings to improve the identification and selection of relevant features for predictive modeling. We focused on eleven diseases: Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), Alzheimer’s Disease (AD), Depression (DP), Coronary Atherosclerosis (CA), Congestive Heart Failure (CHF), Congestive Heart Failure - Nonhypertensive (CHFN), Regional Enteritis (RE), Ulcerative Colitis (UC), Rheumatoid Arthritis (RA), and Rheumatoid Arthritis and Other Inflammatory Polyarthropathies (RAO). For each disease, we identified the top \(100\) features with the highest cosine similarities to the disease’s PheCode using each embedding method. Additionally, we randomly selected negative features from the complement of the union of features identified by all methods. To evaluate the accuracy of identifying relevant features, we assigned relevance scores (ranging from \(0\) to \(1\)) to each feature using GPT-4. We then computed the AUC for each method, treating the top \(100\) features as positive cases and the randomly selected features as negative cases, with the GPT-4 relevance scores serving as probabilities. A higher AUC indicates greater accuracy in selecting relevant features.
We also considered two predictive modeling tasks: predicting future disability status in multiple sclerosis (MS) patients and predicting time to nursing home admission or death in Alzheimer's disease (AD) patients. Both tasks were evaluated at UPMC and MGB based on models incorporating demographics (age at baseline, sex, race/ethnicity), healthcare utilization, and features selected using the procedure described in the previous paragraph. For model training, counts of the selected features and the number of visits, a measure of healthcare utilization, were aggregated over the pre-specified baseline period (i.e., 1 year for predicting future disability status and 2 years for predicting time to nursing home admission or death). We also log-transformed (\(x \mapsto \log(x+1)\)) the count features to improve the stability of model fitting. A lasso-penalized logistic regression model was trained for the disability status outcome and a lasso-penalized Cox proportional hazards model was trained for the time to nursing home admission or death outcome. The lasso penalty parameter was tuned through five-fold cross-validation.
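For the binary disability outcome, a sketch of this training recipe with scikit-learn is below; the feature standardization step is our own assumption rather than part of the described procedure, and the Cox model would be fit analogously with a lasso-penalized survival package (e.g., glmnet in R).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso_logistic(X_counts: np.ndarray, y: np.ndarray):
    """Lasso-penalized logistic regression with five-fold cross-validation;
    counts are log(x + 1)-transformed as described in the text."""
    X = np.log1p(X_counts)
    model = make_pipeline(
        StandardScaler(),  # scaling is our assumption, not from the text
        LogisticRegressionCV(penalty="l1", solver="liblinear",
                             cv=5, scoring="roc_auc"),
    )
    return model.fit(X, y)
```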
Table 2: AUC for discriminating known similarity and relatedness pairs from random pairs.

Task | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
Similarity | 0.916 | 0.966 | 0.950 | 0.755 | 0.537 | 0.565 | 0.951 | 0.801
Relatedness | 0.815 | 0.842 | 0.811 | 0.682 | 0.477 | 0.547 | 0.832 | 0.690
The embedding validation results for detecting known relationship pairs are summarized in Table 2. Overall, PEHRT-based embeddings outperform most PLM-based embeddings in terms of discrimination. Among PLM-based methods, OpenAI and CODER achieve the strongest results, yet they still fall short of BONMI+. This gap arises because PLM-based embeddings are primarily trained on biomedical text corpora and therefore fail to capture the nuanced disease patterns and clinical associations reflected in real-world EHR data. By contrast, the PEHRT-based BONMI+ embedding achieves the highest performance on both tasks as it draws on the representational strengths of PLMs while also integrating information across EHR data from multiple institutions. For code mapping, the results in Table 3 show that the PLM-based embeddings from SapBERT and OpenAI are superior to BONMI+. This result is expected, as code mapping relies heavily on the semantic meaning of code descriptions; it underscores our recommendation to validate the PLM-based embeddings individually, as they may be more appropriate for some tasks.
Table 3: Top-\(k\) accuracy (%) for mapping VA local laboratory codes to LOINC/LP codes.

Accuracy | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
Top-1 | 0.20 | 23.55 | 30.83 | 44.26 | 6.45 | 7.18 | 49.34 | 34.98
Top-5 | 2.64 | 50.66 | 58.38 | 66.05 | 8.94 | 12.60 | 79.92 | 52.96
Top-10 | 4.49 | 62.53 | 69.27 | 72.45 | 11.48 | 16.32 | 85.00 | 59.36
Top-20 | 8.70 | 74.55 | 76.70 | 76.84 | 14.70 | 21.25 | 88.72 | 65.36
For the feature selection task, Table 4 presents the rank correlation between the cosine similarities of the candidate features and the GPT-4 scores for the eleven target diseases. BONMI+ achieves the highest rank correlation on average. In particular, both BONMI and BONMI+ outperform the PLM-based embeddings, as feature selection inherently depends on relationships between codes and CUIs that are well captured in real-world EHR data.
Table 4: Rank correlation between the cosine similarities of candidate features and GPT-4 relevance scores for the eleven target diseases.

Disease | BONMI | BONMI+ | CODER | SapBERT | BioBERT | PubMedBERT | OpenAI | BGE
---|---|---|---|---|---|---|---|---
T1D | 0.385 | 0.429 | 0.295 | 0.144 | -0.072 | 0.045 | 0.369 | 0.138
T2D | 0.479 | 0.497 | 0.303 | 0.087 | -0.045 | 0.057 | 0.477 | 0.179
AD | 0.313 | 0.362 | 0.289 | 0.164 | -0.079 | 0.021 | 0.382 | 0.289
DP | 0.449 | 0.489 | 0.361 | 0.024 | 0.014 | 0.002 | 0.440 | 0.216
CA | 0.448 | 0.478 | 0.343 | 0.055 | -0.028 | 0.033 | 0.426 | 0.007
CHF | 0.484 | 0.540 | 0.444 | 0.377 | 0.035 | -0.032 | 0.444 | 0.113
CHFN | 0.687 | 0.735 | 0.607 | 0.464 | 0.035 | 0.174 | 0.642 | 0.078
RE | 0.289 | 0.252 | 0.115 | 0.059 | 0.080 | 0.005 | 0.206 | 0.107
UC | 0.262 | 0.215 | 0.067 | 0.048 | -0.006 | 0.029 | 0.267 | 0.163
RA | 0.328 | 0.291 | 0.184 | 0.030 | 0.034 | -0.002 | 0.338 | 0.073
RAO | 0.499 | 0.463 | 0.249 | 0.223 | 0.054 | 0.000 | 0.490 | 0.044
Average | 0.420 | 0.432 | 0.296 | 0.152 | 0.002 | 0.030 | 0.407 | 0.128
Figures 2 and 3 present the AUCs of the models for MS disability prediction and for time to nursing home admission or death in AD patients at MGB and UPMC, respectively. Consistent with our results measuring the quality of feature selection, models incorporating the BONMI- and BONMI+-selected features have the strongest performance. Interestingly, models with features selected by the institution-specific EHR embeddings achieved better performance than those selected by the PLM embeddings at MGB, but not at UPMC. This finding underscores our recommendation to validate multiple embeddings, as results can vary across tasks and institutions.
Data harmonization is essential for ensuring the validity, transparency, and reproducibility of multi-institutional EHR-based research. However, significant heterogeneity across data sources complicates harmonization and no comprehensive and standardized procedures currently exist to address this challenge. To fill this gap, we introduced PEHRT, a common pipeline for harmonization of EHR data for translational applications. PEHRT operates entirely on summary-level data and preserves data privacy. We designed our pipeline for easy implementation through our online tutorial and suite of resources, including R and Python modules, notebooks, and APIs. We also demonstrated the utility of our pipeline in several modeling tasks using data from five healthcare systems. Beyond these applications, PEHRT supports a wide range of scientific objectives, including phenotyping, cross-institutional clinical studies, knowledge graph construction, and federated learning, making it a versatile tool for advancing clinical research and practice [87].
The authors gratefully acknowledge [funding sources to be listed in the unblinded version].
https://platform.openai.com/docs/guides/embeddings/embedding-models