Healthcare is notorious for its lack of consistent and widely adopted data formats. The one consistent exception is the billing information exchanged between payors and providers. These files are often referred to as “claims.” Because of their ubiquity, many of today’s analytical approaches - from epidemiology to public health, actuarial sciences, business intelligence, and risk scores - rely heavily, sometimes exclusively, on claims files.

Claims files are reliable enough for understanding the number and amounts of services rendered. However, one should be wary of relying on claims to understand why something happened, whether it worked, or what should happen. In effect, claims are unsuitable for answering the fundamental questions of improvement.

First, a high-level primer on what’s in claims files:

  • Procedural codes represent specific health interventions taken by medical professionals. Reimbursement for the units or types of services rendered is derived largely from these codes.
  • Diagnostic codes (often called disease codes) document the diseases, disorders, or symptoms afflicting each person. These codes can affect the reimbursement rates for the procedures performed.
  • Pharmaceutical codes show the drugs prescribed to a patient. These must be accurate for billing, but they do not track whether the patient actually took the prescribed medication.

Claims files are intended to support billing and subsequent reimbursement. What they capture, how they capture it, and the value of that information are shaped entirely by these intended uses. To receive reimbursement, claims files must contain evidence of what was done and to whom. The units and complexity of the services are also captured. Incomplete capture of this information leads to trouble getting paid. Misrepresentation leads to audits and legal jeopardy. We should therefore feel quite comfortable relying on claims data to tell us what happened from a transactional standpoint.

Understanding why something happened or whether it worked is another matter.

No clinician would turn to claims files to answer these questions. They know that the information contained within claims is neither reliable nor complete enough to begin to address them. In addition to the numerous studies on the shortcomings of claims, every epidemiologist or health services researcher has their fair share of “claims are bad” anecdotes. In our case, Cyft was created to use all of healthcare’s data partly because our founder, Dr. Leonard D’Avolio, ran headfirst into the shortcomings of claims as a newly minted Ph.D. While working with six academic medical centers on a quality improvement project, his group’s first step was to find the people with the target disease - in this case, colorectal cancer. The project started by querying the disease codes in claims data. Then, to verify the results, the group conducted a chart review, comparing what was in each patient’s medical record with that patient’s ICD codes.

By reviewing the charts, they found that 80% of all patients with codes indicating colorectal cancer did not have the cancer. The following years confirmed that this frightening pattern of inaccurate disease codes is more the rule than the exception. More recently, at an ACO we were speaking with, we found that ~50% of heart disease codes were wrong. This is not exactly a solid foundation for improving our understanding of healthcare.
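For readers who want to run this kind of check on their own data, here is a minimal Python sketch of the comparison a chart review makes possible: of the patients flagged by a disease code, what fraction does the medical record actually confirm? The function name, the patient IDs, and the ten-person cohort are all hypothetical, chosen only so the arithmetic lines up with the 80% figure; this is an illustration, not the code used in the original project.

```python
def positive_predictive_value(coded_ids, chart_confirmed_ids):
    """Fraction of code-flagged patients whose chart review confirmed the disease."""
    coded = set(coded_ids)
    if not coded:
        raise ValueError("no coded patients to evaluate")
    # Only patients who carry the code AND were confirmed on chart review count.
    confirmed = coded & set(chart_confirmed_ids)
    return len(confirmed) / len(coded)


# Hypothetical toy cohort: ten patients carry the disease code in claims,
# and chart review confirms the disease in only two of them.
coded_patients = [f"p{i}" for i in range(1, 11)]
chart_confirmed = ["p3", "p7"]

ppv = positive_predictive_value(coded_patients, chart_confirmed)
print(f"Coded patients confirmed by chart review: {ppv:.0%}")
print(f"Coded patients without the disease: {1 - ppv:.0%}")
```

A check like this only measures how often a code is right when it appears. It says nothing about patients who have the disease but were never coded, which would require reviewing charts beyond the coded group.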

There’s also reason to believe things will get worse before they get better. Incorrect disease codes were prevalent for years while people (and software) used the International Classification of Diseases, Ninth Revision (ICD-9), which has ~12,000 different codes. In 2015, CMS made ICD-10 the mandatory standard, expanding the possible codes to over 65,000. With more than five times as many codes to choose from, we do not anticipate improvements in accuracy or consistency as a result.

So how to use claims for improvement?

Most approaches to analytics simply reuse existing models for other customers. To make the models transferable, they rely on claims-only approaches because claims are the only consistent data in healthcare. Fortunately, from a technical standpoint, we’re beyond that. Or at least, we should be. A few thousand studies from about 20 years of research demonstrate that advanced analytical approaches such as machine learning and natural language processing are better able to unlock insights from all available data sources.

The challenge now is that no one has figured out how to put these technologies to work within the complex realities of healthcare and drive real value from them. That’s why we started Cyft - to help healthcare organizations finally begin to learn from the oceans of data that go largely untapped and to apply those insights at the right moments to the right people.

See how we are helping customers like Beacon Health Options use all of their data, or drop us a quick note to find out how we could help move your organization beyond relying on claims data.


Leonard D’Avolio, PhD

CEO & Co-founder, Cyft