We learned that a team of health services researchers at Dartmouth College recently published a study in the Journal of Patient Safety using an early research version of Cyft to detect falls in inpatient notes.  No one on their research team is a data scientist, and with a relatively small sample size they outperformed previous efforts at this task by a considerable margin.  They did the work entirely without us – they never even asked for our advice.

Why is this a best case scenario for us?

What if every quality improvement, research, case management, or clinical operations team could use all of their data – from claims to free text – to identify exactly which patients are at risk?  What if they could do it quickly, without teams of data scientists, and for an unlimited number of very specific applications (like falls) based on their own highest priorities?   The positive impact on healthcare would be substantial and widespread.  That’s been the goal of our research for years and now as Cyft, we’re providing the software to healthcare organizations to make it a reality.

Here are some highlights of how this work brings us one step closer to all of healthcare being able to capitalize on these approaches:

1) No one on the study team has data science experience.  

Machine learning and natural language processing are new to healthcare, so many assume that benefitting from these technologies requires long-term consulting deals or multi-million-dollar software packages that offer prescriptive, pre-fabricated models focused on what others think matters.  We believe these technologies will only reach their full potential when they’re accessible to everyone.  The fact that a team of psychiatrists and health services researchers executed this study without the help of data scientists is an important step in the right direction.


2) Despite having no data scientists on the team, they outperformed previous research efforts.

Shiner et al. weren’t the first to try to find falls in free-text progress reports.  They report that Toyabe (2012) achieved an F-measure of 0.12, “whereas our best performing algorithm achieved an f-measure of .67,” a more than 5x improvement in accuracy.*

So how did Shiner et al. improve so dramatically over what had been done in the past?  And how did they know which was the “best performing algorithm”?

Cyft builds hundreds of models using millions of data points, and it measures its own performance by running real empirical experiments against historical data.  It does this by blinding itself to some of the answers – a method called cross-validation – and returns results in terms of standard statistics including recall, precision, F-measure, AUC-ROC (or c-statistic), sensitivity, and specificity.
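
For readers curious what that kind of evaluation looks like in practice, here is a minimal sketch in Python with scikit-learn.  To be clear, this is a generic illustration of cross-validated text classification with the same kinds of statistics – it is not Cyft’s pipeline, and the notes and labels below are synthetic placeholders.

```python
# Sketch of cross-validated evaluation of a free-text classifier.
# Generic scikit-learn illustration, NOT Cyft's implementation; the
# notes/labels below are synthetic stand-ins for a labeled chart review.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Tiny synthetic stand-in for a few hundred labeled notes (1 = documented fall).
fall_phrases = ["pt found on floor beside bed", "unwitnessed fall from bed", "slipped near commode"]
other_phrases = ["ambulating in hallway without difficulty", "tolerating diet", "denies pain"]
notes = fall_phrases * 50 + other_phrases * 50
labels = [1] * 150 + [0] * 150

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))

# "Blind" the model to some answers: each note is scored by a model
# that never saw that note's label during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probs = cross_val_predict(model, notes, labels, cv=cv, method="predict_proba")[:, 1]
preds = (probs >= 0.5).astype(int)

print("precision:", precision_score(labels, preds))
print("recall   :", recall_score(labels, preds))
print("F-measure:", f1_score(labels, preds))
print("AUC-ROC  :", roc_auc_score(labels, probs))
```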

As was the case here, the results of a Cyft experiment can be submitted for peer review.  We think this is far preferable to simply accepting that a model built elsewhere, on someone else’s population, will perform exactly the same in all institutions on all challenges.

3) The researchers’ sample sizes were really small.

A question I’m often asked about machine learning is “doesn’t that require large amounts of data?”  This is one of the many reasons I am not a fan of the term “big data.”  In Shiner et al.’s study, they achieved impressive results with only a few hundred records – not the millions that the term “big data” implies.  Yes, we’re always excited to dig into huge data sets, but we don’t always have that luxury in healthcare.

4) Falls are one of many important yet under-reported adverse events.  

A recent CDC report cites the growing risk of falls to older Americans, with 29 million falls in 2014 causing 7 million injuries.  As terrible as falls are, they represent one of thousands of adverse events that are simply not well understood because of the challenges of measuring healthcare’s mostly unstructured and often unreliable data.

Just how widespread are falls? Hospital-acquired infections? How many critical comorbidities are missed? For any given area of medicine, the list of what matters most but simply can’t be counted today is long.

Measurement matters and must be made ubiquitous – even when healthcare data doesn’t play nice.  Shiner’s work, and our experience working with more than 250 hospitals, shows that it’s finally possible to capitalize on all the data we collect and to focus on any number of problems, not just “one size fits all” risk scores based on cost, death, or readmission.

Congratulations and thanks to Dr. Shiner and team.  Keep up the good work!

Leonard D’Avolio, PhD

CEO & Co-Founder, Cyft

scholar.harvard.edu/len

@ldavolio

cyft.wpengine.com

*F-measure (or F-score) is the harmonic mean of recall and precision, and a widely accepted standard for measuring the performance of classification systems (especially for binary tasks).  Rather than dive into a treatise on the merits of one measure or another, it’s safe to say that Shiner et al. were more than 5x more effective at their task.
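
For reference, here is the arithmetic behind that statistic.  The precision and recall values in the example below are illustrative only – they are not numbers reported in the study – they just show why a balanced precision/recall pair yields an F-measure near 0.67 while a lopsided pair stays low.

```python
# F-measure (F1): the harmonic mean of precision and recall.
# Inputs below are illustrative only, not values reported by Shiner et al.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.70, 0.64))  # ~0.67: balanced precision and recall
print(f_measure(0.90, 0.07))  # ~0.13: high precision can't rescue very low recall
```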