Data on health care claims may be useful for COVID-19 research despite significant limitations

Although health care billing claims data have been widely used to study health care use, spending, and policy changes, their use in the study of infectious diseases has been limited. Other data sources, including those from the Centers for Disease Control and Prevention (CDC), have provided more timely reporting to outbreak experts. However, given the scope of the pandemic caused by SARS-CoV-2, the causative agent of coronavirus disease 2019 (COVID-19), and the multidimensional impact of the crisis on the health care system, analyses that rely on health care claims data have begun to appear. Claims-based COVID-19 studies have a role to play, but understanding the limitations of these data is critical. We are concerned that many of these weaknesses may not be recognized by those more familiar with other forms of patient-level data. Below, we review several important considerations and suggest where claims data can best be leveraged to inform policy and decision making.


The first major problem is that claims data are not representative of many populations in the United States and are further fragmented by insurance coverage. Large commercial claims databases typically include individuals covered by a subset of large employers and insurers, and would not be representative of, for example, all working adults. If individuals change jobs, their records typically cannot be linked from year to year, even if they remain in the database. This leads to an evolving and difficult-to-define target population. Medicare claims are available nationwide, but Medicaid claims are not standardized and are handled by individual states or multi-state consortia. Most COVID-19 analyses of claims data will not leverage all of these sources, given the timeliness and access considerations discussed below. A clear articulation of who is and is not captured in the study sample is critical.

Underestimation and misclassification

We know that many people who are ill, especially with respiratory infections, do not appear in claims data because they never have an encounter with the health system. This contributes both to the lack of representativeness of the data and to an underestimation of cases. The impact falls disproportionately on marginalized groups, including those who distrust the health system because of its perpetuation of systemic racism, and on rural communities with insufficient access to hospitals. During a pandemic, the reasons to forgo a visit to a provider multiply: people avoid care because of fear of infection, reduced income, and a lack of safe, socially distanced transportation. Hospitalization data, although still incomplete, may be adequate for some research questions in a way that disease incidence data are not.

Underestimation problems are exacerbated during the emergence of new respiratory diseases, when new International Statistical Classification of Diseases and Related Health Problems (ICD) codes are not immediately available. An ICD-10 code for COVID-19 (that is, U07.1) became effective for use in the United States on April 1, 2020. However, given that limited transmission of SARS-CoV-2 in the United States began in January 2020, COVID-19 cases in the interim would be misclassified, as would subsequent cases in which providers had yet to use the new code consistently. Limited testing capacity and long turnaround times will also affect the accuracy of COVID-19 ICD-10 codes, with providers potentially choosing to record COVID-19 codes for presumptive positive cases before diagnostic results are returned, leading to misclassification in the other direction (that is, false positives).
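One practical consequence of the coding timeline above is that an analysis may want to label each claim by whether a COVID-19 code could have appeared on it at all. The following is a minimal sketch of that idea, not a validated method. The function name, the period labels, and the choice of January 1 as the start of US transmission are illustrative assumptions; the text only says "January 2020".

```python
from datetime import date

# Effective date of ICD-10 code U07.1 in the United States, as stated in the text.
U071_EFFECTIVE = date(2020, 4, 1)
# The text says transmission began in "January 2020"; the exact day is an assumption.
US_TRANSMISSION_START = date(2020, 1, 1)


def covid_code_period(service_date):
    """Label a claim's service date by COVID-19 code availability (hypothetical helper)."""
    if service_date < US_TRANSMISSION_START:
        return "before_us_transmission"
    if service_date < U071_EFFECTIVE:
        # Cases in this window cannot carry U07.1 and would be misclassified.
        return "code_unavailable"
    # The code exists, but uptake by providers may still be inconsistent.
    return "code_available"


periods = [covid_code_period(d) for d in (date(2019, 12, 15), date(2020, 2, 15), date(2020, 5, 1))]
```

Claims falling in the "code_unavailable" window would need a different case definition (or exclusion), and even "code_available" claims remain subject to the false-positive and inconsistent-uptake issues described above.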

Timeliness and access

Timeliness of and access to health care claims data are salient considerations. Because of lags in providers submitting claims, most claims databases let at least 90 days pass before deeming the data usable for analysis. Further revisions and other delays mean that researchers may not have access to the data for a year or more. In contrast, while CDC data do not include patient-level information, they are updated weekly with a lag of approximately one to two weeks. CDC data are also freely available, whereas acquiring health care claims data often involves considerable cost, detailed data use agreements, and expensive, secure servers to host the data. This advantages researchers from well-resourced institutions, and we should be aware that these privileged groups will be making decisions about who and what is studied as the pandemic evolves. Diverse perspectives on these teams and community-based participatory research should be prioritized.

Trend monitoring

Where could health care claims be useful for studying COVID-19? One area is tracking trends over time, which can provide important insight into ongoing waves of SARS-CoV-2 infection at a more granular geographic level (that is, three-digit ZIP code) than currently offered by publicly available case data from the CDC (that is, at the county level). To demonstrate the usefulness of a large commercial claims database in discerning such trends for respiratory pathogens such as SARS-CoV-2, we conducted an analysis of five recent years of influenza data. Influenza data for 2013-2017 were collected from the CDC and the IBM MarketScan Research Databases. CDC data include both hospitalizations and outpatient visits for influenza. Similarly, MarketScan data include both inpatient and outpatient visits, and a flu case for a given week was defined as a unique enrollee with at least one flu-related ICD code. We observed differences between the cases recorded in claims and the CDC estimates, as expected, but found similar temporal trends (see Figure 1). However, analyses of trends in health care claims will still need to account for the specific context of COVID-19, including the aforementioned changes in care seeking and the availability of tests.

Figure 1: Influenza cases in the United States recorded in a commercial claims database and estimated by the CDC, 2013-2017.

Source: Authors’ analysis of IBM MarketScan Research databases and CDC data. Notes: ICD-9 codes used in 2013-20: 4870, 4871, 4878, 48801, 48802, 48809, 48811, 48812, 48819, 48881, 48889. ICD-10 codes used in 2015-20: J09X1, J09X2, J09X3, J09X9, J1000, J1001, J1008, J101, J102, J1089, J1100, J1108, J111, J112, J1181, J1182, J1183, J1189.
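The weekly case definition used for the claims series (a unique enrollee with at least one flu-related ICD code in a given week) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code. The record layout `(enrollee_id, service_date, icd_code)` is a hypothetical simplification rather than the MarketScan schema, the code set is a small subset of the codes in the figure notes, and grouping by ISO week is an assumption about how weeks are delimited.

```python
from collections import defaultdict
from datetime import date

# A subset of the influenza codes listed in the figure notes (written without decimal points).
FLU_CODES = {"4870", "4871", "4878", "J09X1", "J09X2", "J1000", "J101", "J111"}


def weekly_flu_cases(claims):
    """Count unique enrollees with at least one flu-coded claim per ISO week.

    `claims` is an iterable of (enrollee_id, service_date, icd_code) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    enrollees_by_week = defaultdict(set)
    for enrollee_id, service_date, icd_code in claims:
        if icd_code in FLU_CODES:
            iso = service_date.isocalendar()
            # Key on (ISO year, ISO week); a set removes repeat claims per enrollee.
            enrollees_by_week[(iso[0], iso[1])].add(enrollee_id)
    return {week: len(ids) for week, ids in sorted(enrollees_by_week.items())}


sample_claims = [
    ("A", date(2016, 1, 4), "J101"),
    ("A", date(2016, 1, 6), "J111"),   # same enrollee, same week: counted once
    ("B", date(2016, 1, 5), "4871"),
    ("B", date(2016, 1, 12), "J101"),  # the following week
    ("C", date(2016, 1, 5), "E119"),   # not an influenza code
]
weekly_counts = weekly_flu_cases(sample_claims)
```

Counting enrollees rather than claims keeps a patient with several visits in one week from inflating the series, which matters when comparing the shape of the claims trend with CDC estimates.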

Longitudinal analyses

A further area where health care claims can be helpful to our understanding of COVID-19 is tracking cases longitudinally to document long-term outcomes after infection. As the pandemic has been active for less than a year, these effects are currently unknown, although early studies of cardiac health following SARS-CoV-2 infection suggest they are likely not negligible, even in mild cases. A key problem when studying patients longitudinally in health care claims is that individuals often enter and leave these databases because of changes in health insurance coverage. Therefore, restricting a sample to only patients with data spanning, for example, two years (that is, those continuously enrolled during this period) will further reduce the representativeness of the study for the target population. Medicare claims may be particularly suitable for these studies, both because a larger percentage of individuals can be followed for longer periods (except when they switch to Medicare Advantage managed care plans) and because of the severity of cases among elderly and disabled adults.
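To make the continuous-enrollment restriction concrete, the sketch below checks whether a person's enrollment spans cover an entire study window without a gap, and shows how quickly a cohort can shrink. It is a simplified illustration under stated assumptions: the data structure (a list of inclusive start and end dates per enrollee) and all names are hypothetical, and real studies often allow short gaps, which this sketch does not.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def continuously_enrolled(spans, start, end):
    """Return True if the inclusive (span_start, span_end) pairs cover start..end with no gap."""
    covered_through = start - ONE_DAY
    for span_start, span_end in sorted(spans):
        if span_start > covered_through + ONE_DAY:
            break  # a gap in coverage before this span begins
        covered_through = max(covered_through, span_end)
        if covered_through >= end:
            return True
    return covered_through >= end


# Hypothetical enrollment histories for three people.
enrollment = {
    "A": [(date(2018, 1, 1), date(2018, 12, 31)), (date(2019, 1, 1), date(2019, 12, 31))],
    "B": [(date(2018, 1, 1), date(2018, 6, 30)), (date(2018, 9, 1), date(2019, 12, 31))],  # two-month gap
    "C": [(date(2019, 1, 1), date(2019, 12, 31))],  # enters the database late
}
window_start, window_end = date(2018, 1, 1), date(2019, 12, 31)
cohort = sorted(
    person for person, spans in enrollment.items()
    if continuously_enrolled(spans, window_start, window_end)
)
retained_share = len(cohort) / len(enrollment)
```

In this toy example only one of three people survives the two-year restriction. Reporting the retained share, and comparing those retained with those dropped, is one way to make the loss of representativeness visible.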

Causal inference and clinical prediction

Studies of policy impacts using health care claims will be challenging because many interventions and system shocks are occurring simultaneously, limiting clean natural experiments and creating unmeasured confounding. However, such studies may be feasible in select settings, such as some telemedicine studies. In contrast, we should be wary of using health care claims for prediction (for example, to predict which COVID-19 patients will have worse outcomes) or for evaluating treatment efficacy, as the lack of clinical information and mismeasurement are of concern when attempting to use these data to inform provider decision making.


We believe that, in general, health care claims data are often better suited to studying the health system and specific types of longitudinal questions than to clinical applications. However, even this requires a thorough understanding of the underlying processes that generated the data, regulatory changes, provider behavior, and more in order to inform policy and decision making.

Notes from the authors

This project was supported in part by the National Institutes of Health (NIH) through an NIH Director's New Innovator Award, DP2-MD012722.
