Health Affairs January 15, 2020

A recent study by Ziad Obermeyer and colleagues in Science identified racial bias in a risk stratification algorithm used to prioritize patients for care management. Like most algorithms currently in use, it relies on past cost to identify the individuals most in need of help. Because less is spent on black patients at a given level of sickness, white patients tend to have higher medical expenses and are prioritized over sicker black patients. The researchers show that correcting the bias would increase the proportion of black patients prioritized for care from 17.7 percent to 46.5 percent.

Less than a week later, news of the study appeared in Nature, Business Insider, the Wall Street Journal, and Wired. The State of New York is investigating and threatening suit against UnitedHealthcare and others that employ such approaches.

While this recent discovery is rightfully gaining attention, it is just one of many known biases and shortcomings of the health care system’s current approach to risk stratification. Obermeyer and his colleagues’ study and the concerns it raised offer an opportunity to carefully consider unintended consequences of the prevalent approaches of stratifying risk to find a new way forward.

Flaws And Unintended Consequences

The algorithms in question are decades-old adaptations of actuarial models. They rely mostly on claims (that is, billing) data as input. With the introduction of managed care in the 1980s, health plans needed a way to prioritize their care management activities. The same approach used to estimate the “risk” of populations was applied to predict individuals’ future health care use.

Soon after these approaches were adapted to help care management teams prioritize their limited resources, researchers began publishing studies that identified deficiencies and unintended consequences. A systematic review of 30 studies of readmission risk prediction models, published in the Journal of the American Medical Association in 2011, concluded, “Most current readmission risk prediction models that were designed for either comparative or clinical purposes perform poorly.” Despite these findings, little has been done to address the problem in the past decade. A number of factors contribute to this poor performance:

Reliance On Claims

As the agreed-upon means of justifying reimbursement, claims data are one of the few standardized, widely available data sources in health care. However, the disease codes assigned in claims files are notoriously inaccurate. While such errors may be less problematic when estimating the cost of a population over the next year, their effects can be amplified when assigning priority to individuals.

Assumption That Past Cost Equals Future Need

Hockey great Wayne Gretzky credits his success to skating to where the puck will be. In contrast, risk stratification algorithms direct care management teams to where the puck was by finding the patients who already cost the most. Research has shown that for patients who use large amounts of health care services, the need is often intense yet temporary. Algorithms that prioritize those who consumed the most services in the past therefore inadvertently prioritize many patients nearing the end of life and others whose medical needs are already subsiding, creating a past-consumption bias.

Use Of One-Size-Fits-All Formulas

Many factors—including the nature and stage or severity of the disease, extent of social support, and number of preventable emergency department visits, hospital admissions, or readmissions—may be indicators of medical need. Yet, traditional risk scores ignore these factors, which can provide important context for prioritizing patients.

This myopic approach leads to a type of condition bias in which the diseases that generate the most health care use are prioritized by risk scores. For example, people on dialysis or with late-stage cancer are more likely to be prioritized over people with early signs of type 2 diabetes because patients suffering from the former are likely to have accrued greater medical expenses than the latter. Yet, the greatest opportunity for clinical and financial impact is often in the earlier stages of disease.

Use of one-size-fits-all formulas also introduces age bias; younger sick people are ignored by most algorithms in favor of older people with more chronic, complex conditions. For example, a child with rising risk of potentially fatal diabetic ketoacidosis is unlikely to be prioritized over a 55-year-old with a chronic condition that has led to intense spending over the past 12 months. This bias is particularly problematic for care management organizations serving high-need Medicaid and dually eligible (covered by both Medicare and Medicaid) populations with a wide age range.

The Impetus For Change

Providers’ adoption of value-based contracts has led to significant new investments in care management. Organizations are expecting measurable returns from these investments in the form of improved outcomes and reduced medical expenditures. Different levels of care and interventions are being introduced to address specific needs at different times, often outside of the clinic. Examples include remote monitoring programs, more in-home and telemedicine programs for patients with chronic and complex needs, and community-based palliative care. These programs typically are costly to implement, thus raising the stakes for efforts to identify patients who are benefiting most and who are most likely to benefit.

New data from electronic medical records and medical devices, together with technologies such as machine learning and natural language processing, are creating more opportunities to identify the patients most likely to benefit, not necessarily those who cost the most in the past.

The Way Forward

As pointed out by Obermeyer and colleagues, the way forward is not as simple as swapping out one variable for another. Neither is the answer to simply apply new data and new math to the traditional method of risk stratification. To meet the evolving needs of care management and capitalize on access to new data and technology, we need to rethink our approach. The next generation of risk stratification approaches should:

Use All Relevant Data, Not Just The Data That Everyone Else Has

It’s no longer necessary or even appropriate to limit models to the same common denominator data that all institutions have access to (for example, claims). Just as companies such as Amazon and Google use all of the data at their disposal to tailor their approaches to selling books and advertisements to individual consumers, health care can use data to move beyond one-size-fits-all approaches to risk stratification.

For example, Cyft, the analytics company I work for, recently collaborated with a care management organization responsible for the health of a Medicaid population. The organization cared for people of all ages with behavioral health needs. However, the state offered a special program for people younger than age 18 with behavioral health needs. Unfortunately, the organization’s risk scoring system did not consider age, nor was it designed to detect patterns that would indicate behavioral health need. Without the support of customized risk stratification, referrals to the under-18 program were limited to patients already engaged with clinicians who happened to be aware of the state program.

To help the team identify and prioritize those likely to benefit, we built two risk stratification models, one trained with (that is, learned from) the data of people younger than 18 and one trained on the data of patients 18 and older. This approach prevented younger people from being crowded out by older people with greater health care use. Rather than prioritizing based on cost, both models were designed to predict inpatient psychiatric admissions as a proxy for impending behavioral health need. The results of the two models were then used to match individuals with interventions tailored to their age- and condition-specific needs.
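The sketch below illustrates the general shape of this approach, with hypothetical column names and open-source tooling standing in for the system we actually built:

```python
# A hypothetical sketch of cohort-specific modeling with scikit-learn.
# Column names and features are illustrative assumptions, not Cyft's schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["prior_ed_visits", "behavioral_dx_count", "prior_psych_admits"]  # assumed

def train_cohort_models(df: pd.DataFrame) -> dict:
    """Fit one model per age cohort so younger patients are ranked against
    their peers rather than crowded out by older, costlier patients."""
    cohorts = {
        "under_18": df[df["age"] < 18],
        "18_plus": df[df["age"] >= 18],
    }
    models = {}
    for name, cohort in cohorts.items():
        model = LogisticRegression(max_iter=1000)
        # Target is a future inpatient psychiatric admission (1 = admitted),
        # a proxy for impending behavioral health need, not past cost.
        model.fit(cohort[FEATURES], cohort["psych_admit_next_6mo"])
        models[name] = model
    return models

def prioritize(cohort: pd.DataFrame, model, top_n: int) -> pd.DataFrame:
    """Rank a cohort by predicted admission risk; keep the top N the team can reach."""
    scored = cohort.assign(risk=model.predict_proba(cohort[FEATURES])[:, 1])
    return scored.nlargest(top_n, "risk")
```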

As organizations capture more data—including clinical data in electronic medical records and care management systems, as well as survey data on topics such as activities of daily living and social determinants of health—this information can be used to prioritize patients, not just for care management but for the programs or interventions best suited to their unique characteristics and needs.

Include Clinicians Throughout The Design Process

Most risk stratification algorithms are licensed and installed with little feedback and even less design input from the clinical team they are intended to support. As a result, interventions may not adequately account for limitations in the clinical team’s capacity or workflows, and they may fail to achieve optimal outcomes. To maximize the potential for algorithms to advance positive results, models should be designed with a collaborative approach in which clinicians lead the discussion about intended use. In the state behavioral health intervention described above, the decisions to model by age and focus on inpatient psychiatric admissions were the result of a design process that included clinicians from the start.

Evaluate Performance With Your Own Population

Clinicians should not be asked to “trust” the results of models that were not evaluated within their own population. Importantly, local evaluation also means that each model’s output can be checked for inadvertent bias by analyzing how prioritized patients are distributed across subpopulations defined by age, sex, race, and disease.
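One simple version of that check, sketched here with assumed column names, compares each subgroup’s share of the top-ranked list against its share of the overall population; a large gap in either direction is a flag for clinical review.

```python
# A hypothetical sketch of a local bias audit. "risk" and the group columns
# are assumed names: compare who the model prioritizes against who is in
# the population, in the spirit of the disparity the Science study surfaced.
import pandas as pd

def selection_audit(scored: pd.DataFrame, group_col: str, top_n: int) -> pd.DataFrame:
    top = scored.nlargest(top_n, "risk")
    audit = pd.DataFrame({
        "population_share": scored[group_col].value_counts(normalize=True),
        "prioritized_share": top[group_col].value_counts(normalize=True),
    }).fillna(0.0)
    # Large gaps flag subgroups the model may be over- or under-prioritizing.
    audit["gap"] = audit["prioritized_share"] - audit["population_share"]
    return audit

# Repeat along each dimension of interest, for example:
# for col in ("race", "sex", "age_band", "primary_condition"):
#     print(selection_audit(scored_patients, col, top_n=100))
```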

Use Appropriate Measures

There is no single “best statistic” for all stratification applications, and understanding which is the right tool for the job is critical for teams planning and evaluating their efforts. Unfortunately, clinical research has a tradition of measuring model performance with diagnostic accuracy measures that largely reflect how good a model is at identifying the people who do not need help (for example, negative predictive value). The commonly used area under the receiver operating characteristic curve (AUROC, or c-statistic) depends in part on specificity, so in a population where the vast majority of people are not high need, a model can earn a strong score largely by correctly identifying true negatives. Both risk stratification vendors and researchers benefit from this type of accounting, which rewards correctly classifying the hundreds of thousands of people who are not most in need over identifying the hundreds who are. However, these measures offer little insight to teams hoping to allocate limited resources to those most in need.

The metric that matters for effective care management is how good the algorithm is at prioritizing people who do need help (true positives), or the positive predictive value (PPV) of the algorithm. Even more relevant is the PPV computed over the number of people the team can realistically reach within a given period of time. In other words, measuring the PPV of an algorithm applied to all 100,000 people in a population is less relevant than the PPV of the first 100 predictions per week if that’s the volume and frequency of outreach the care management program can achieve.
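As a rough sketch of the distinction (variable names are assumptions), the standard AUROC can be computed alongside the PPV of only the top k predictions, with k matched to the team’s outreach capacity:

```python
# A hypothetical sketch contrasting AUROC with PPV among the top k predictions.
# y_true marks patients who truly needed help; y_score is the model's risk score.
import numpy as np
from sklearn.metrics import roc_auc_score

def ppv_at_k(y_true, y_score, k: int) -> float:
    """Fraction of the k highest-scoring patients who truly needed help,
    the figure that matters when outreach capacity is k people per week."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.asarray(y_true)[top_k].mean())

# In a population of 100,000 where only a few hundred need help, the AUROC
# can look strong while PPV among the first 100 predictions stays low:
# auc = roc_auc_score(y_true, y_score)
# ppv_100 = ppv_at_k(y_true, y_score, k=100)
```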

Once Models Are Deployed, Conduct Ongoing Monitoring Of Performance And Output

Most risk stratification models are static, yet they are deployed in evolving and complex environments. The introduction of new data sources, changes in reimbursement contracts and policies, and the redesign of care management programs are not uncommon. Without a system of monitoring and periodic assessment to determine whether the model is meeting the clinical teams’ needs, clinical end users may not notice that models are producing irrelevant or inaccurate results.
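What such monitoring might look like in practice is sketched below, under assumed inputs: a recurring job that compares the live score distribution against a baseline from validation time and recomputes realized precision for last month’s top-ranked patients.

```python
# A hypothetical monthly monitoring job. Inputs are assumed: a snapshot of
# scores from validation time, the current scores, and last month's outcomes.
import numpy as np
from scipy.stats import ks_2samp

def monthly_check(baseline_scores, current_scores, y_true, y_score, k=100):
    # Score drift: a small p-value suggests the inputs or the population
    # have shifted since the model was validated on the local population.
    drift_pvalue = ks_2samp(baseline_scores, current_scores).pvalue
    # Realized precision: did last month's top-k picks actually need help?
    top_k = np.argsort(y_score)[::-1][:k]
    realized_ppv = float(np.asarray(y_true)[top_k].mean())
    return {"ks_pvalue": drift_pvalue, "ppv_at_k": realized_ppv}
```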

Measure What Happens Next

The best predictions are merely suggestions. To have impact, a care management program must lead to a series of cascading activities, from outreach to enrollment to intervention. Today, surprisingly few care management teams measure the activities or the outcomes of their programs. Those that do often rely on annual assessments and biased pre- versus post-evaluations.

Moving forward, organizations that use stratification algorithms should do so as part of a system of ongoing measurement and improvement. Clinical teams should participate in the design of what’s measured to be sure that metrics are useful for advancing program goals. While an institution’s leadership may believe it is important to measure admission rates on an annual basis, care management is more likely to benefit from monitoring key metrics on a monthly basis, such as how many people identified as “at risk” received outreach from care managers, the number and method of outreach attempts, and enrollment rates. Measures of improvement should be compared against a control group with similar characteristics. Such information can be used to make incremental improvements that can help reduce admission rates over time.
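For illustration only, a monthly funnel report of this kind might be assembled as follows; the field names are assumptions about what a care management system records, not a prescribed schema.

```python
# A hypothetical monthly funnel report. The boolean columns are assumptions
# about what a care management system records for each flagged patient.
import pandas as pd

def monthly_funnel(df: pd.DataFrame) -> pd.DataFrame:
    flagged = df[df["flagged_at_risk"]]
    funnel = flagged.groupby("month").agg(
        flagged=("flagged_at_risk", "sum"),
        reached=("outreach_attempted", "sum"),
        enrolled=("enrolled", "sum"),
    )
    funnel["outreach_rate"] = funnel["reached"] / funnel["flagged"]
    funnel["enrollment_rate"] = funnel["enrolled"] / funnel["flagged"]
    return funnel
```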

Obermeyer and team have done health care an important service. By diagnosing a major shortcoming of the current approach to prioritizing patients for care management, this research should help prompt organizations to think more carefully about the use of algorithms. In doing so, it is important to recognize racial bias as one of several unintended consequences—along with past-consumption bias, condition bias, and age bias.

We now have a unique opportunity to modernize care management. It is time to replace the traditional, one-size-fits-all approach with models that are customized to local populations, informed by clinician expertise, and designed to prioritize those most in need. Deployment of these models should not be viewed as a one-time endeavor but rather as an evolving process aligned with a system of continuous quality improvement.

Author’s Note

The author is the founder and CEO of Cyft, a company that focuses on exactly this issue. He is also an adviser to other companies (Datalogue, Firefly Health) and philanthropies (the Helmsley Charitable Trust) that are responsible for using data to get the right care to the right people at the right time.