Data scientists like Sid Henriksen, a Ph.D. student nearing graduation, often ask me how they can succeed in healthcare. With Sid's permission, here are a few questions and insights for aspiring healthcare data scientists.

How applicable is generic data science in healthcare?

The core data science skillset of machine learning, data visualization, and statistics is the foundation for working with all data, healthcare included. Many areas of healthcare use these tools to assess risk, predict patient behavior and outcomes, and even diagnose disease. Generic skills are necessary but not sufficient, though. It's like being a business reporter: you must be a capable writer to get the job, but you also need to understand business to succeed.

What additional skills are essential for data scientists in healthcare?

Beyond generic data science skills, it is critical to understand the domain you will be working in. The first thing to understand is that healthcare is not a single industry but perhaps a dozen sub-industries, such as hospitals, health plans, pharmaceuticals, and medical devices. Each of these sub-industries is trying to solve different problems and has its own ways of doing things.

Once you've focused on a particular business or clinical problem, it is critical to understand what constitutes success. There are many possible evaluation metrics to choose from, and it is important to match the right metric to the problem. For example, if you are building models to target low-cost interventions and you can't afford to miss anyone who needs your help, sensitivity (the share of true cases the model catches) is much more important than the area under the ROC curve (AUC).
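To make the sensitivity-versus-AUC point concrete, here is a minimal sketch in plain Python. The labels, the scores for the two hypothetical models, and the 0.5 cutoff are all made up for illustration; they do not come from any real project. Both metrics are computed from their textbook definitions rather than through a particular library.

```python
def sensitivity(y_true, y_score, threshold):
    """Share of true positives flagged at the given score threshold."""
    flagged = [s >= threshold for y, s in zip(y_true, y_score) if y == 1]
    return sum(flagged) / len(flagged)


def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison: the probability
    that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 4 patients who need the intervention, 6 who do not.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A ranks every positive above every negative, but scores two
# of the positives below the 0.5 operating threshold.
scores_a = [0.9, 0.8, 0.45, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

# Model B ranks less cleanly, but every positive clears the threshold.
scores_b = [0.9, 0.8, 0.7, 0.6, 0.65, 0.75, 0.2, 0.1, 0.1, 0.05]

THRESHOLD = 0.5  # assumed operating point for this illustration

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(f"Model {name}: AUC={auc(labels, scores):.3f}, "
          f"sensitivity={sensitivity(labels, scores, THRESHOLD):.2f}")
# Model A: AUC=1.000, sensitivity=0.50
# Model B: AUC=0.875, sensitivity=1.00
```

In this toy setup, model A has the higher AUC yet misses half of the patients at the chosen cutoff, while model B catches all of them. Lowering A's threshold would recover those patients, which is the point: AUC summarizes ranking across all thresholds and says nothing about how the model performs at the one you actually operate at.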

Healthcare data is also notoriously messy, and much of the most important information is hidden in free text. To unlock this information, you need clinical NLP tools that reduce free-text notes to a manageable set of structured features. For example, cTAKES is a great open-source platform for clinical NLP.

What characteristics make people successful?

In my experience, successful data scientists in healthcare (and, I believe, in all industries) work closely with their end users. They do not simply build a model and email off a report. They work collaboratively to ensure that the appropriate statistics are being used and that the model is optimized to promote the desired outcome. To take an example from care management, a truly successful data scientist would embed themselves in the care management team and continually refine their models to ensure that they provide the most actionable guidance for directing care management interventions.