A drug classification pipeline for Medicaid claims using RxNorm

Nicholas Williams
Department of Epidemiology
Columbia University
New York, NY 10032
Kara E. Rudolph
Department of Epidemiology
Columbia University
New York, NY 10032


Objective: To provide a free method for preprocessing drug codes recorded in electronic health records and insurance claims into drug classes that can then be used in biomedical research.
Materials and Methods: We developed a drug classification pipeline for linking National Drug Codes to the World Health Organization Anatomical Therapeutic Chemical classification. To implement our solution, we created an R package interface to the National Library of Medicine’s RxNorm API.
Results: Using the classification pipeline, 59.4% of all unique NDC were linked to an ATC, resulting in 95.5% of all claims being successfully linked to a drug classification. We identified 12,004 unique NDC codes that were classified as being an opioid or non-opioid prescription for treating pain.
Discussion: Our proposed pipeline performed similarly well to other NDC classification routines using commercial databases. A check of a small, random sample of non-active NDC found the pipeline to be accurate for classifying these codes.
Conclusion: The RxNorm NDC classification pipeline is a practical and reliable tool for categorizing drugs in large-scale administrative claims data.

1 Background and Significance

Electronic health record (EHR) and administrative insurance claims data have become increasingly important assets for observational epidemiological and pharmacological research [1], [2]. Much of this research requires classifying drugs in terms of mechanism of action, therapeutic intent, or chemical structure, commonly referred to as drug class. However, drug data in the EHR are typically recorded using codes that do not indicate the general drug class, and so require preprocessing before they can be used for biomedical research. In the United States (US), EHR drug data are recorded as National Drug Codes (NDCs), which are unique three-segment, 11-digit identifiers assigned to all human drugs. The first segment is a manufacturer identifier, the second segment is a product identifier, and the third is a package identifier. Although an NDC uniquely identifies a specific drug product, it does not directly identify the drug class. Thus, preprocessing these data by converting NDCs to useful study-specific drug classes is a key step in applied research.

One option for classifying NDC is to pay for a commercial database [3], such as IBM Micromedex Red Book [4] or Multum from Oracle Health [5]. However, paying for one of these services limits access to researchers with large enough budgets and introduces a roadblock for reproducibility. Another approach is to manually create a list of all the NDCs that belong to a given drug class of interest [6], [7]. While free, this approach introduces variability between research teams, is time-consuming, and may perform poorly, as the number of NDC that belong to a certain class may be well into the thousands (see Table 1).

We address this gap by providing freely available, user-friendly software that performs this preprocessing step, linking NDCs with drug class codes. In performing this linkage, we use the free-to-use US National Library of Medicine’s (NLM’s) RxNorm, which links many products to different drug classification systems. We use RxNorm’s linkage to the World Health Organization’s (WHO’s) Anatomical Therapeutic Chemical classification (ATC), which classifies substances "according to the organ or system on which they act and their therapeutic, pharmacological and chemical properties" [8]. In the ATC system, drugs are assigned a code that indicates a hierarchical subdivision into five levels of increasing specificity. We are only interested in the first four ATC levels, as the fifth level indicates the specific substance. For example, the ATC code for fentanyl, up to the fourth level, is N01AH: "N" indicates that fentanyl acts on the nervous system, "01" indicates that it is an anesthetic, "A" indicates that it is part of the "general" subgroup of anesthetics, and "H" indicates that it is an opioid anesthetic.
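The hierarchy described above can be illustrated by slicing an ATC code into its levels. The following is a minimal Python sketch, not part of the R package described in this paper. The function name `atc_levels` is hypothetical, and the sketch assumes the standard ATC layout in which levels 1 through 5 occupy 1, 3, 4, 5, and 7 characters respectively.

```python
# Character lengths of ATC levels 1-5 (e.g. N, N01, N01A, N01AH, N01AH01).
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code: str) -> dict:
    """Return the prefix of `code` at each ATC level the code is long enough to define."""
    code = code.strip().upper()
    return {
        level: code[:length]
        for level, length in enumerate(ATC_LEVEL_LENGTHS, start=1)
        if len(code) >= length
    }


# Fentanyl's fifth-level ATC code is N01AH01; only levels 1-4 are of interest here.
levels = atc_levels("N01AH01")
fourth_level = levels[4]
print(fourth_level)
```

Truncating to the fourth level in this way groups all substances of a subgroup under one code, which is the granularity used for the classes in Table 1.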

2 Objective

We were motivated to develop this software by the need to identify opioid and non-opioid drugs prescribed for pain in a large cohort of Medicaid claims data. We use our motivating example to demonstrate the software’s utility and performance. Our software uses the NLM RxNorm API and an R package interface. We evaluate its performance by the proportion of prescription claims that are linked to a drug class.

3 Materials and Methods

3.1 Classification pipeline

Our approach retrieves drug information using a set of free-to-use application programming interfaces (APIs) that NLM provides via representational state transfer (REST) for accessing the RxNorm and RxTerms datasets: the RxNorm and RxTerms APIs [9] and the RxClass API [10]. To avoid sending potentially thousands of requests to the NLM servers, we also make use of RxNav-in-a-Box, a locally installable version of the APIs, in combination with Docker [11]. To query the APIs, we created an R [12] package interface, which is available for download on GitHub.
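To make the REST access pattern concrete, the sketch below builds request URLs against either the public NLM server or a local RxNav-in-a-Box instance. It is a Python illustration only and is not the R package interface. The resource paths (`ndcstatus`, `rxcui`, `rxclass/class/byRxcui`), the `relaSource` values `ATC` and `ATCPROD`, and the local address `http://localhost:4000` are assumptions based on the public RxNav documentation and should be checked against the current API reference before use. All function names are hypothetical.

```python
from urllib.parse import urlencode

PUBLIC_BASE = "https://rxnav.nlm.nih.gov/REST"
# Assumed address of a local RxNav-in-a-Box instance; adjust to your Docker setup.
LOCAL_BASE = "http://localhost:4000/REST"


def build_url(base: str, path: str, **params: str) -> str:
    """Compose a JSON request URL for an RxNorm or RxClass REST resource."""
    return f"{base}/{path}.json?{urlencode(params)}"


def ndc_status_url(ndc: str, base: str = LOCAL_BASE) -> str:
    """URL for looking up the status of an NDC (assumed resource: ndcstatus)."""
    return build_url(base, "ndcstatus", ndc=ndc)


def rxcui_for_ndc_url(ndc: str, base: str = LOCAL_BASE) -> str:
    """URL for finding the RxCUI assigned to an NDC (assumed resource: rxcui)."""
    return build_url(base, "rxcui", idtype="NDC", id=ndc)


def atc_for_rxcui_url(rxcui: str, product_level: bool = True, base: str = LOCAL_BASE) -> str:
    """URL for the ATC classes of an RxCUI (assumed resource: rxclass/class/byRxcui)."""
    rela_source = "ATCPROD" if product_level else "ATC"
    return build_url(base, "rxclass/class/byRxcui", rxcui=rxcui, relaSource=rela_source)


print(ndc_status_url("00173024955"))
```

Keeping the base URL as a parameter means the same calls can be pointed at the local instance for bulk runs and at the public server for spot checks.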

First, we retrieve an NDC’s status. In RxNorm, an NDC is classified as "active", "obsolete", "alien", or "unknown". We ignore NDCs with a status of "unknown" as they are items that have never appeared in RxNorm. Often these products are medical supplies, vitamins or minerals, dietary drugs, and other over-the-counter medications.

If an NDC is considered "active", we then query RxNorm for the RxCUI assigned to the NDC and then submit a query for the associated product-level ATC. If a product-level ATC is not found, we instead submit a query for the ingredient-level ATC of all active ingredients in the drug. Note that product-level ATC are assigned by NLM, while ingredient-level ATC are assigned to a substance by the WHO.

If an NDC is considered "obsolete" or "alien", we again query RxNorm for the RxCUI assigned to the NDC, but then perform an additional query for the RxCUI status to see if the RxCUI has been "remapped", "quantified", or considered "not current" or "obsolete". If the RxCUI is "remapped" or "not current", we attempt to replace the original RxCUI with the current RxCUI, and again search for the product-level or ingredient-level ATC. If the RxCUI status is "obsolete", we attempt to link the NDC to a new RxCUI through the semantic clinical drug (SCD) concept, which is itself a mapping of a drug product into its ingredient, strength, and dose form; we then search for the product-level or ingredient-level ATC of the RxCUI associated with the SCD. An RxCUI status of "quantified" indicates that the RxCUI is missing a "quantity factor" but has been linked to another related concept that may be quantified, which we then attempt to link to a product-level or ingredient-level ATC.
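The branching described above can be sketched as follows. This is a simplified Python illustration of the decision logic, not the R script itself: the `Lookups` class is a hypothetical in-memory stand-in for the RxNorm and RxClass queries, every identifier in the toy data is made up, and the "remapped", "not current", and "quantified" cases are collapsed into a single lookup of a related RxCUI for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Lookups:
    """Hypothetical in-memory stand-ins for the RxNorm and RxClass queries."""
    ndc_status: Dict[str, str] = field(default_factory=dict)
    ndc_rxcui: Dict[str, str] = field(default_factory=dict)
    rxcui_status: Dict[str, str] = field(default_factory=dict)
    related_rxcui: Dict[str, str] = field(default_factory=dict)  # remapped / not current / quantified
    scd_rxcui: Dict[str, str] = field(default_factory=dict)      # obsolete RxCUI -> RxCUI found via SCD
    product_atc: Dict[str, List[str]] = field(default_factory=dict)
    ingredient_atc: Dict[str, List[str]] = field(default_factory=dict)


def find_atc(rxcui: str, db: Lookups) -> List[str]:
    """Prefer product-level ATC; fall back to ingredient-level ATC."""
    return db.product_atc.get(rxcui) or db.ingredient_atc.get(rxcui) or []


def classify_ndc(ndc: str, db: Lookups) -> List[str]:
    """Return the ATC codes linked to an NDC, or an empty list if none are found."""
    status = db.ndc_status.get(ndc, "unknown")
    if status == "unknown":
        return []  # never appeared in RxNorm, so it is skipped
    rxcui = db.ndc_rxcui.get(ndc)
    if rxcui is None:
        return []
    if status in ("obsolete", "alien"):
        rx_status = db.rxcui_status.get(rxcui)
        if rx_status in ("remapped", "not current", "quantified"):
            rxcui = db.related_rxcui.get(rxcui, rxcui)
        elif rx_status == "obsolete":
            rxcui = db.scd_rxcui.get(rxcui, rxcui)
    return find_atc(rxcui, db)


# Toy data with made-up identifiers, for illustration only.
db = Lookups(
    ndc_status={"00000000001": "active", "00000000002": "obsolete"},
    ndc_rxcui={"00000000001": "111", "00000000002": "222"},
    rxcui_status={"222": "remapped"},
    related_rxcui={"222": "333"},
    product_atc={"111": ["N02AA"]},
    ingredient_atc={"333": ["M01AE"]},
)
print(classify_ndc("00000000001", db), classify_ndc("00000000002", db))
```

In a real run, each dictionary lookup would be replaced by the corresponding API request.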

The R script used for running the above classification pipeline is available on GitHub.

Table 1: ATC codes of interest
ATC | Description | No. of NDC linked
Opioids for pain
N02A | Opioid analgesics | 2,264
N01AH | Opioid anesthetics | 117
N07BC | Drugs used in opioid dependence | 169
Non-opioids for pain
A03D | Antispasmodics in combination with analgesics | 0
A03EA | Antispasmodics, psycholeptics and analgesics in combination | 0
M01 | Anti-inflammatory and anti-rheumatic products | 2,529
M02A | Topical products for joint and muscular pain | 295
M03 | Muscle relaxants | 848
N02B | Other analgesics and anti-pyretics | 3,154
N06A | Anti-depressants | 2,628
Total | | 12,004
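Because ATC codes are hierarchical, membership in one of the classes in Table 1 reduces to a prefix check on the linked ATC code. The sketch below shows one way to do this in Python. It is an illustration under that assumption and not code from the R package, and the function name `pain_category` is hypothetical.

```python
from typing import Optional

# ATC codes of interest, copied from Table 1.
OPIOID_PREFIXES = ("N02A", "N01AH", "N07BC")
NON_OPIOID_PREFIXES = ("A03D", "A03EA", "M01", "M02A", "M03", "N02B", "N06A")


def pain_category(atc: str) -> Optional[str]:
    """Label an ATC code as an opioid or non-opioid pain class, or None if neither."""
    code = atc.strip().upper()
    if code.startswith(OPIOID_PREFIXES):
        return "opioid"
    if code.startswith(NON_OPIOID_PREFIXES):
        return "non-opioid"
    return None


print(pain_category("N02AA"), pain_category("M01AE"), pain_category("C01AA"))
```

An NDC linked to several ATC codes would need a rule for combining the labels; that choice is left open here.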

3.2 Data

Our dataset includes prescription drug claims in the Medicaid T-MSIS Analytic Files (TAF) pharmacy and other services files from non-pregnant adults aged 35-64 years who were non-dual-eligible Medicaid beneficiaries enrolled during 2016-2019 in one of the following 26 states that implemented Medicaid expansion under the Affordable Care Act in or prior to 2014: ND, VT, NH, CA, OR, MI, IA, NV, OH, IL, NY, MD, MA, RI, HI, WV, WA, KY, DE, AZ, NJ, MN, NM, CT, CO, AR [13]. We limited our dataset to valid NDC, defined as codes that were 11 digits long and contained only the values 0-9. We refer the reader to [14] for further background.
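The validity filter described above amounts to a simple pattern check. A minimal Python sketch follows; the function name `is_valid_ndc` is hypothetical, and the sketch assumes NDC are stored as strings so that leading zeros are preserved.

```python
import re

# Exactly eleven ASCII digits; [0-9] is used because \d also matches non-ASCII digits.
NDC_PATTERN = re.compile(r"[0-9]{11}")


def is_valid_ndc(value: str) -> bool:
    """Return True if `value` is an 11-character string made up only of the digits 0-9."""
    return NDC_PATTERN.fullmatch(value) is not None


candidates = ["00173024955", "0173024955", "0017-3024-55", "0017302495A"]
valid = [ndc for ndc in candidates if is_valid_ndc(ndc)]
print(valid)
```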

4 Results

Our goal was to identify opioids and non-opioids typically prescribed for treating pain. The classification pipeline was run on a 2021 MacBook Pro with an Apple M1 Pro chip and 32GB of RAM. Using parallel processing, the pipeline ran in approximately 12.5 minutes. The ATC codes used to identify these categories and the number of NDC that were found to belong to an ATC class are shown in Table 1. In total, we identified 12,004 NDC codes that belonged to one of the opioid and non-opioid ATC categories of interest.

The dataset consisted of 809,484,945 claims and 126,604 unique NDC. Of the unique NDC, 50,142 (39.6%) had an NDC status of "active", 23,520 (18.6%) were "obsolete", 34,267 (27.1%) were "unknown", and 18,675 (14.8%) were "alien". Among the "active" NDC, 49,861 (99.4%) were successfully mapped to a product-level or ingredient-level ATC; 20,946 (89.1%) of the "obsolete" NDC and 4,484 (24.0%) of the "alien" NDC were also mapped to an ATC. Of all NDC, 75,270 (59.4%) were successfully mapped to an ATC. Although only about three in five unique NDC were classified, these NDC accounted for 95.5% of all claims in the dataset.
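The gap between the share of unique NDC classified and the share of claims covered arises because a small number of NDC account for most claims. The Python sketch below computes both quantities on toy data. The function name `coverage` and the example codes are hypothetical, and the numbers are illustrative rather than drawn from the Medicaid data.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def coverage(claim_ndcs: Iterable[str], classified: Set[str]) -> Tuple[float, float]:
    """Return (share of unique NDC classified, share of claims with a classified NDC).

    Assumes at least one claim is supplied.
    """
    counts = Counter(claim_ndcs)
    unique_share = sum(1 for ndc in counts if ndc in classified) / len(counts)
    claim_share = sum(n for ndc, n in counts.items() if ndc in classified) / sum(counts.values())
    return unique_share, claim_share


# Toy claims: one frequently dispensed classified NDC and three rare unclassified ones.
claims = ["00000000001"] * 97 + ["00000000002", "00000000003", "00000000004"]
unique_share, claim_share = coverage(claims, classified={"00000000001"})
print(unique_share, claim_share)
```

Here only one of four unique codes is classified, yet 97 of 100 claims are covered, which mirrors the pattern reported above.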

Among the NDC that had a linked RxCUI but no ATC associated with that RxCUI, 13,289 (77.4%) had an RxCUI status of "not current", 1,974 (11.5%) had an RxCUI status of "obsolete", 1,195 (7.0%) had an RxCUI status of "remapped", and 75 (0.4%) had an RxCUI status of "quantified". A further 635 "active" RxCUI could not be mapped to an ATC. Of the "obsolete" RxCUI, 1,078 (54.6%) were matched to an RxCUI with an associated ATC; 947 (79.2%) of the "remapped" RxCUI and 74 (99%) of the "quantified" RxCUI were matched to a new RxCUI with a corresponding ATC. As an audit for accuracy, we took a 0.01% sample of the non-"active" NDC that were linked to an ATC code and did an ad-hoc search for their appropriate ATC category. The results of this check are shown in Table 2. Two NDC were incorrectly classified; both were classified as "R05D" (cough suppressants, excl. combinations with expectorants) while the correct classification was "R05F" (cough suppressants and expectorants, combinations).

Table 3 shows the concept names for the five most common "active", "alien", or "obsolete" NDC and the five most common "unknown" NDC that were not linked to an ATC code. Of these, six were glucose testing strips, two were condoms, one was a needle, and one was a chamber used for an inhaler.

Table 2: Random sample of "alien" or "obsolete" NDC that were linked to an ATC code, with the pipeline's ATC compared against an ad-hoc search.
NDC | NDC status | Pipeline ATC | Ad-hoc ATC
00173024955 | Obsolete | C01AA | C01AA
11917014765 | Alien | A11CC | A11CC
57770005500 | Obsolete | S01XA | S01XA/S01KA
00781286531 | Obsolete | A02BA | A02BA
66870050512 | Alien | A02AA | A02AA
51079012220 | Obsolete | A02BD/P01AB | P01AB
51079078719 | Obsolete | C10AB | C10AB
00904526161 | Obsolete | A02BA | A02BA
11822074050 | Obsolete | N02BE | N02BE
50428322435 | Alien | D06AX | D06AX
40986001765 | Alien | B03BA | B03BA
43063005202 | Obsolete | A04AA | A04AA
43292055802 | Alien | A11GA | A11GA
79854001535 | Alien | A11HA | A11HA
00093213193 | Obsolete | J01XE | J01XE
60432004504 | Obsolete | R05DA | R05FA
52544093628 | Obsolete | G03AA/G03AB | G03AB
00440632530 | Obsolete | C02AC | C02AC
50428034138 | Obsolete | N05CH | N05CH
76439013004 | Obsolete | M05BA | M05BA
60687029825 | Obsolete | N06BA | N06BA
70030014843 | Alien | P03AC | P03AC
50428689594 | Obsolete | A02AF | A02AF
54569317700 | Obsolete | R05DA | R05FA
45802091334 | Obsolete | D10AE | D10AE
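
The comparison in Table 2 can be reproduced with a short Python sketch. The agreement rule used here, that a row agrees when the pipeline and ad-hoc columns share at least one ATC code, is an assumption about how partial matches were counted; it is consistent with the two misclassifications reported in the text. The function name `agrees` is hypothetical.

```python
# (NDC, pipeline ATC, ad-hoc ATC) rows copied from Table 2.
AUDIT = [
    ("00173024955", "C01AA", "C01AA"),
    ("11917014765", "A11CC", "A11CC"),
    ("57770005500", "S01XA", "S01XA/S01KA"),
    ("00781286531", "A02BA", "A02BA"),
    ("66870050512", "A02AA", "A02AA"),
    ("51079012220", "A02BD/P01AB", "P01AB"),
    ("51079078719", "C10AB", "C10AB"),
    ("00904526161", "A02BA", "A02BA"),
    ("11822074050", "N02BE", "N02BE"),
    ("50428322435", "D06AX", "D06AX"),
    ("40986001765", "B03BA", "B03BA"),
    ("43063005202", "A04AA", "A04AA"),
    ("43292055802", "A11GA", "A11GA"),
    ("79854001535", "A11HA", "A11HA"),
    ("00093213193", "J01XE", "J01XE"),
    ("60432004504", "R05DA", "R05FA"),
    ("52544093628", "G03AA/G03AB", "G03AB"),
    ("00440632530", "C02AC", "C02AC"),
    ("50428034138", "N05CH", "N05CH"),
    ("76439013004", "M05BA", "M05BA"),
    ("60687029825", "N06BA", "N06BA"),
    ("70030014843", "P03AC", "P03AC"),
    ("50428689594", "A02AF", "A02AF"),
    ("54569317700", "R05DA", "R05FA"),
    ("45802091334", "D10AE", "D10AE"),
]


def agrees(pipeline: str, adhoc: str) -> bool:
    """Assumed rule: agreement means the two slash-separated code sets overlap."""
    return bool(set(pipeline.split("/")) & set(adhoc.split("/")))


mismatches = [ndc for ndc, pipeline, adhoc in AUDIT if not agrees(pipeline, adhoc)]
print(len(AUDIT) - len(mismatches), "of", len(AUDIT), "rows agree")
```

A stricter rule requiring identical code sets would also flag the three rows where one column lists an additional code.
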
Table 3: Examples of NDC and their concept names that were not matched to an ATC code.
NDC Concept Name
NDC with a status of "active", "obsolete", or "alien"
"unknown" NDC

5 Discussion

EHR data are an increasingly common and valuable source for health research. However, drug data in the EHR often require classification before being usable. Motivated by the need to classify opioid and non-opioid prescription claims for pain medications in Medicaid, we developed a free-to-use, open-source, and reliable drug classification pipeline using the NLM RxNorm API and R.

Applying our proposed pipeline to our motivating example of classifying prescription pain medications in Medicaid claims data, we were able to link 59.4% of all valid unique NDC to an ATC code. Limiting to NDC with an NDC status of "active", "obsolete", or "alien", we linked 81.5% of NDC to an ATC code. These results are similar to those of [3], who found that RxNorm linked 60% of NDC to an ATC; [15], who found that 84.2% of NDC could be linked to a drug class when including historical versions of RxNorm; and [16], who linked 79.6% of clinical drugs in RxNorm to ATC. The classified NDC accounted for 95.5% of all the claims in the cohort, which is on par with the 98.2% coverage from [3] when using a commercial database. We evaluated the accuracy of the pipeline for classifying "obsolete" and "alien" NDC using a small random sample of the classified codes and found its performance to be reliable.

A limitation of our pipeline is that we restricted drug classification to ATC. RxNorm can link to other class sources, such as VA class or a classification for diseases that may be treatable with a drug product from the Medication Reference Terminology, both produced by the Veterans Health Administration. A future improvement to the pipeline could be to attempt to link the unclassified NDC to one of these other classes as a way to increase claims coverage. While we used the pipeline to classify Medicaid claims, we expect it to be easily adaptable to other data as it only depends on NDC. Another limitation of our approach is that it can only classify drugs in the United States. Recently, however, [17] developed RxNorm Extension, which extends RxNorm to drugs outside of the United States. Future work could use this tool to examine the performance of classifying drugs not used in the United States with ATC.

6 Conclusion

We developed a pipeline to link NDC to ATC in R using the National Library of Medicine’s RxNorm API. Applying the classification pipeline to a large-scale Medicaid claims dataset, we found it to be a practical and reliable tool for categorizing NDC.


References

[1] Joan A Casey, Brian S Schwartz, Walter F Stewart, and Nancy E Adler. Using electronic health records for population health research: a review of methods and applications. Annual Review of Public Health, 37: 61–81, 2016.
[2] Reem Farjo, Hsou-Mei Hu, Jennifer F Waljee, Michael J Englesbe, Chad M Brummett, and Mark C Bicket. Comparison of methods to identify individuals prescribed opioid analgesics for pain. Regional Anesthesia & Pain Medicine, 2024.
[3] Mark L Homer, Nathan P Palmer, Olivier Bodenreider, Aurel Cami, Laura Chadwick, and Kenneth D Mandl. The drug data to knowledge pipeline: Large-scale claims data classification for pharmacologic insight. AMIA Summits on Translational Science Proceedings, 2016: 105, 2016.
[4] IBM. 2024. URL https://www.redbooks.ibm.com/.
[5] Oracle. 2024. URL https://www.oracle.com/health/service-lines-departments/pharmacy/#rc30p5.
[6] Centers for Disease Control and Prevention. CDC data resources on opioids. https://www.cdc.gov/opioids/data-resources/index.html. Accessed: 2024-03-11.
[7] Gabrielle F Miller, Gery P Guy Jr, Kun Zhang, Christina A Mikosz, and Likang Xu. Prevalence of nonopioid and opioid prescriptions among commercially insured patients with chronic pain. Pain Medicine, 20 (10): 1948–1954, 2019.
[8] WHO Collaborating Centre for Drug Statistics Methodology. 2022.
[9] United States National Library of Medicine. RxNorm. Available at: https://www.nlm.nih.gov/research/umls/rxnorm/index.html, 2024. Accessed on: January 19, 2024.
[10] United States National Library of Medicine. RxClass. Available at: https://mor.nlm.nih.gov/RxClass/, 2024. Accessed on: January 19, 2024.
[11] Dirk Merkel. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014 (239): 2, 2014.
[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. URL https://www.R-project.org/.
[13] Kaiser Family Foundation. Status of state Medicaid expansion decisions: Interactive map. https://www.kff.org/medicaid/issue-brief/status-of-state-medicaid-expansion-decisions-interactive-map/, 2020.
[14] Katherine L Hoffman, Floriana Milazzo, Nicholas T Williams, Hillary Samples, Mark Olfson, Ivan Diaz, Lisa Doan, Magdalena Cerda, Stephen Crystal, and Kara E Rudolph. Independent and joint contributions of physical disability and chronic pain to incident opioid use disorder and opioid overdose among Medicaid patients. Psychological Medicine, pages 1–12, 2023.
[15] Lee B Peters and Olivier Bodenreider. Approaches to supporting the analysis of historical medication datasets with RxNorm. In AMIA Annual Symposium Proceedings, volume 2015, page 1034. American Medical Informatics Association, 2015.
[16] Anna Ostropolets, Polina Talapova, Marcel De Wilde, Hamed Abedtash, Peter Rijnbeek, and Christian G Reich. A high-fidelity combined ATC-RxNorm drug hierarchy for large-scale observational research. In MEDINFO 2023—The Future Is Accessible, pages 53–57. IOS Press, 2024.
[17] Christian Reich, Anna Ostropolets, Patrick Ryan, Peter Rijnbeek, Martijn Schuemie, Alexander Davydov, Dmitry Dymshyts, and George Hripcsak. OHDSI standardized vocabularies—a large-scale centralized reference ontology for international data harmonization. Journal of the American Medical Informatics Association, 31 (3): 583–590, 2024.