The Cyft Supermarket
Building state-of-the-art models regularly is much harder, and takes much longer, if you don’t have a robust, repeatable pipeline to process your data. It gets even more unwieldy on a team, where one giant notebook doesn’t cut it. You also need a way to scale up to process all of the incoming data, or each run will take hours and possibly days. And you need to do all of that securely, for multiple projects simultaneously.
At Cyft, we handle data about hundreds of thousands of beneficiaries, each with years of clinical history, all in a secure, HIPAA-compliant environment. From that data we produce results from advanced machine learning models, plus dozens of reports tracking the progress of clinical teams, all while dealing with the messiness of healthcare data. We needed an environment capable of functioning across all of our projects to let us:
- Keep messy analysis notebooks separate from the data flow, but still in a place that makes sense with the flow.
- Take advantage of Spark’s optimization around long skinny tables to quickly process the data, as opposed to brute forcing it on our GPU instances, but maintain the ability to run deep learning models on GPUs if that makes sense.
- Have a consistent structure that anyone on the team can jump into from another project and know how to navigate. That way, ramp-up can focus on the things that are actually unique between projects.
- Involve both our data scientists and engineers from soup to nuts. There are no assumptions made in a vacuum and everyone can follow the trail of where things came from.
- Make code reuse easy and obvious. If you figure out something on one project, it’s apparent where it should go in the next project. Making things easy to just slot in from previous work means we get better with every project.
- Allow team members with varying degrees of experience and familiarity to dive into atomic sections.
- Execute on all of the aspects of using data to improve healthcare, including analysis, modeling, and reporting.
The best metaphor for our environment is a supermarket. We organize all the data into different “aisles” — either raw ingredients or some more prepared foods (feature engineering). The aisles are long and skinny (which distributed clusters like). The aisles are combined into one “supermarket”. From that one centralized file, we can take out various “carts” for different cohorts and models. The supermarket itself doesn’t change for different use cases, and once we’ve stocked the shelves we can keep shopping there.
Here’s our overall structure:
- Validation and Exploration
- Spark models
- Meal prep (tensorize)
- Deep learning models
To do this work, we use both Spark and GPU instances. The vast majority of our data pipelining happens on Spark, so we can take advantage of the distributed cluster. Processing that would take hours on a brute-forced GPU instance can be run much more efficiently on a distributed cluster, especially since that cluster can be scaled up during high demand and then spun back down. On the other hand, distributed clusters have difficulty supporting deep learning methods, since those algorithms don’t distribute well. Running only a cluster would severely limit our modeling capability, while running only big GPU instances would be cost and time prohibitive. Being able to shuttle data between the two allows us to take advantage of both worlds.
Reading the Files
2. Validation and Exploration
Getting healthcare data into the system is a trial in itself. Since we can ingest data in any form without predetermined mappings, the first time we get data from any group, there’s a lot to figure out about the structure, crosswalks, pipe delimiters, and whatever new wrinkle appeared for the first time in that dataset.
Separate from the ingest code is all of the messiness of figuring out what the data is, which happens in the Validation and Exploration section. It can get pretty big, depending on how different the data is from what we’re used to seeing, and how many weird issues come up. Because we integrate so directly with the workflows and systems of healthcare providers, we have to manage the custom setups and intricacies of their specific systems. Even after the initial data load, incremental updates can have their own problems. We like to keep our analysis sections separate from the others so there’s less confusion when new data comes into the pipeline. Digging into datasets is going to be messy, but processing that data doesn’t have to be.
These first exploratory checks also give us the “bounds” of our data: the timeframes it covers and its general volume. This is really helpful information to have easily accessible later in the process, when you’re double-checking those timeframes and volumes.
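A minimal sketch of how such bounds could be captured, written in plain Python for brevity. The `rows` records, their field names, and the `data_bounds` helper are hypothetical and only illustrate the idea; they are not the pipeline's actual code.

```python
from datetime import date

# Hypothetical raw records; the field names are assumptions for illustration.
rows = [
    {"member_id": "A", "service_date": date(2018, 1, 5)},
    {"member_id": "B", "service_date": date(2018, 3, 20)},
    {"member_id": "A", "service_date": date(2019, 7, 1)},
]

def data_bounds(records, date_field="service_date"):
    """Summarize the earliest date, latest date, and volume of a dataset."""
    dates = [r[date_field] for r in records]
    return {
        "first_date": min(dates),
        "last_date": max(dates),
        "row_count": len(records),
        "member_count": len({r["member_id"] for r in records}),
    }

bounds = data_bounds(rows)
print(bounds)
```

Storing a summary like this next to the exploration notebooks makes it cheap to check later whether a new extract covers the expected timeframe.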
We put all of that mess and all of that back-and-forth into this one part of our process. If it’s spread out, it could make other things more difficult to manage. It’s also set up for continuous ingest, so once all the intricacies of a dataset are worked out, we can process regular updates in the same way.
Here’s where a lot of pipelines do their data cleaning, but not us. That early in the relationship and the process, we’ve found that data cleaning assumptions often turn out to be incorrect. If there are issues, we like to have an audit trail, and amid all the complexities of loading, cleaning assumptions can be difficult to parse out. We save data cleaning for the data processing phase (see below).
What person is eligible for what service on what days? This simple-seeming table forms the backbone for all the other tables, as it tells us what people we can model on, and when. For example, when did people go on and off this health plan, when are members eligible and ineligible for certain programs? This phase usually requires a lot of discussion with data teams to make sure no one’s counted when they shouldn’t be, and vice versa. It’s worth it, though, because that’s the kind of detail that makes our models actually work in the real world.
For example, certain health plans get data on members before they were enrolled in their programs. Those members should not be counted for any kind of analysis before their actual eligibility date, but that’s difficult to know if you’re just looking at whether or not data is available for that person. Separating this aspect out eliminates having to be careful about it every time you do an analysis and instead makes it trivial to only include members within their membership windows.
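To make the membership-window idea concrete, here is a small sketch in plain Python. The `eligibility` and `events` tables, their field names, and the `within_eligibility` helper are assumptions made up for this example.

```python
from datetime import date

# Hypothetical eligibility table: one row per member per enrollment window.
eligibility = [
    {"member_id": "A", "start": date(2018, 1, 1), "end": date(2018, 12, 31)},
    {"member_id": "B", "start": date(2018, 6, 1), "end": date(2019, 5, 31)},
]

# Hypothetical events, including one dated before member B was enrolled.
events = [
    {"member_id": "A", "service_date": date(2018, 3, 10), "event": "ed_visit"},
    {"member_id": "B", "service_date": date(2018, 2, 1), "event": "inpatient"},
    {"member_id": "B", "service_date": date(2018, 9, 15), "event": "ed_visit"},
]

def within_eligibility(event, windows):
    """True if the event falls inside any enrollment window for its member."""
    return any(
        w["member_id"] == event["member_id"]
        and w["start"] <= event["service_date"] <= w["end"]
        for w in windows
    )

# Member B's February event is dropped because it predates enrollment.
eligible_events = [e for e in events if within_eligibility(e, eligibility)]
print(len(eligible_events))
```

With the windows kept in one table, every downstream analysis can apply the same filter instead of re-deriving who counted when.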
This is where we start to break all of the unique data and structures into a unified format. We munge the usually wide data into the long skinny tables that Spark handles efficiently. In Events, we make sure to pull out information about various clinical events like inpatient admissions and ED visits. The other Aisles are a mix of the other features right from the dataset as well as our “prepared food aisles” where our proprietary libraries process the now-standardized data points and give back more meaningful features. For example, instead of a zip code, get the median income, population density, and average healthcare system utilization. Instead of just a drug name, get what kind of drug it is. This is sometimes called feature engineering, if that’s your preferred nomenclature. This is also where all of the data cleaning happens.
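The wide-to-long reshaping and a “prepared food” style enrichment can be sketched as follows. This is plain Python rather than Spark, and every name here (`wide_rows`, `to_long`, `DRUG_CLASS`, `add_drug_class`) is hypothetical; the one-entry drug lookup stands in for a much richer proprietary library.

```python
# Hypothetical wide records: one row per member, one column per attribute.
wide_rows = [
    {"member_id": "A", "zip_code": "02139", "drug": "metformin"},
    {"member_id": "B", "zip_code": "10001", "drug": None},
]

def to_long(rows, id_field="member_id"):
    """Reshape wide rows into long, skinny (member, feature, value) rows."""
    long_rows = []
    for row in rows:
        for feature, value in row.items():
            if feature == id_field or value is None:
                continue  # skip the key column and missing values
            long_rows.append(
                {"member_id": row[id_field], "feature": feature, "value": value}
            )
    return long_rows

# Illustrative lookup standing in for a feature-engineering library.
DRUG_CLASS = {"metformin": "antidiabetic"}

def add_drug_class(long_rows):
    """Append a derived drug_class row for each drug with a known class."""
    derived = [
        {"member_id": r["member_id"], "feature": "drug_class",
         "value": DRUG_CLASS[r["value"]]}
        for r in long_rows
        if r["feature"] == "drug" and r["value"] in DRUG_CLASS
    ]
    return long_rows + derived

aisle = add_drug_class(to_long(wide_rows))
for row in aisle:
    print(row)
```

Because missing values simply produce no row, the long format stays compact even when the original wide table is sparse.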
Once the data is in this more standardized format, we can dive into the Profiling section, another analysis section. Clinicians have assumptions about their populations, and people on analytics teams know what their numbers look like. Now that the data is in a consistent format, this is the easiest point in our process to perform general analytics and quantitative analysis to make sure we line up with what they expect. That way, we don’t move on to modeling with either the wrong data or an incorrect assumption about how that data should be processed. Every organization has its own set of edge cases. This is also where we do all of our pre-analysis on potential features, to see their distributions and how we might best utilize them in our models.
The inspiration for our pipeline, the supermarket, is a stack of all the Aisles and Events. Our first pass through the pipeline usually results in a one-aisle supermarket we’ve nicknamed “the farmer’s market”. Once a farmer’s market is up and running, the other aisles can be added to it to form a fully fledged supermarket.
Once we have a supermarket, everything else can use it. Different models can be pulled out of the same data in the same structure. Further analysis of the results of those models can be compared to data points from the supermarket. All of this can be done without changing the supermarket at all. It also means that someone can continue to build out the modeling work, while another engineer adds more aisles. When those aisles have been added, they can be loaded into the model in the same way. Everything has a place — Marie Kondo would be proud.
9. Spark models
After the supermarket has been built, you can take a Cart and select the data you need. That way, if you want to model different cohorts, you can take out different carts. It also works if you want to have different models with different data inputs. The underlying supermarket doesn’t change.
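As a rough sketch of the supermarket-and-cart relationship, assume each aisle is already a list of long (member, feature, value) rows. The aisle contents and the `take_cart` helper below are hypothetical, and plain Python lists stand in for distributed tables.

```python
# Hypothetical aisles, each already in long (member, feature, value) form.
demographics_aisle = [
    {"member_id": "A", "feature": "age", "value": 71},
    {"member_id": "B", "feature": "age", "value": 45},
]
events_aisle = [
    {"member_id": "A", "feature": "ed_visit", "value": 1},
    {"member_id": "B", "feature": "inpatient", "value": 1},
]

# The supermarket is the stack (union) of every aisle, tagged by source.
supermarket = [
    dict(row, aisle=name)
    for name, aisle in [("demographics", demographics_aisle),
                        ("events", events_aisle)]
    for row in aisle
]

def take_cart(market, members=None, features=None):
    """Select rows for a cohort and feature set without modifying the market."""
    return [
        row for row in market
        if (members is None or row["member_id"] in members)
        and (features is None or row["feature"] in features)
    ]

# Two different carts drawn from the same, unchanged supermarket.
cohort_cart = take_cart(supermarket, members={"A"})
utilization_cart = take_cart(supermarket, features={"ed_visit", "inpatient"})
print(len(supermarket), len(cohort_cart), len(utilization_cart))
```

Adding a new aisle only extends the stacked table; existing carts keep working because they are just filters over it.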
After selecting your feature ingredients, if the best next step is to build a model using Spark ML, that can be done right out of the cart. If those results are what’s going to be deployed, you can move directly into the Analysis section and skip all the deep-learning specific sections. This structure doesn’t force you to use all the steps if they’re not useful to you. Otherwise, the evaluation metrics of the Spark ML-based model provide a great baseline for comparing against the deep-learning models. It’s possible to guess what kind of algorithm will produce the best results for a given dataset and outcome, but it’s impossible to know for sure. Having a baseline also helps us know if it’s worth it to put in the resources to build a finely-tuned deep learning model.
10. Meal prep (tensorize)
Deep learning models like their datasets formatted into multidimensional matrices called tensors, which is not how Spark likes them. To make the deep learning side happy, we tensorize the data we’ve pulled out of the carts at this stage. Because of the complexity of healthcare data, and its various lags, the tensors generally come out in 4 dimensions, namely: member, service date (the day the thing actually happened), received date (the day we learned of it), and feature. That way, we can use indexing and slicing to create train and test sets efficiently on the GPU. We include a different service and received date to account for forms of lag we often see in healthcare data and make sure we’re building models that mimic what will happen in the real world.
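The four-dimensional layout can be illustrated with a small sketch. NumPy is used here as a stand-in for a GPU tensor library, days are simple integer offsets, and the sample rows and the `known_as_of` helper are assumptions for illustration only.

```python
import numpy as np

members = ["A", "B"]
features = ["ed_visit", "inpatient"]
n_days = 5  # days are indexed 0..4 relative to a hypothetical start date

# Long rows: (member, service day, received day, feature, value).
long_rows = [
    ("A", 0, 2, "ed_visit", 1.0),   # happened on day 0, learned of on day 2
    ("B", 1, 4, "inpatient", 1.0),  # happened on day 1, learned of on day 4
]

member_index = {m: i for i, m in enumerate(members)}
feature_index = {f: i for i, f in enumerate(features)}

# Axes: member, service date, received date, feature.
tensor = np.zeros((len(members), n_days, n_days, len(features)))
for member, service_day, received_day, feature, value in long_rows:
    tensor[member_index[member], service_day, received_day,
           feature_index[feature]] = value

def known_as_of(t, day):
    """Collapse to (member, service date, feature) using only data received by `day`."""
    return t[:, : day + 1, : day + 1, :].sum(axis=2)

snapshot = known_as_of(tensor, 3)
print(snapshot.shape)
```

Slicing on the received-date axis is what keeps a training example from seeing information that had not yet arrived: member B's admission is invisible on day 3 and only shows up once day 4 is included.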
The Cool Stuff
11. Deep learning models
We’ve arrived! All the data is in a nice, clean, reproducible format and we can finally get started on our fanciest of models. These notebooks run on GPU instances that are spun up specifically for this purpose. They also have shared code repositories so multiple people can work on the same model files at once through git. We add utility functions in other files to prevent those notebooks from getting too long and edit the notebooks in tandem. We try to maintain a balance in our master model notebooks to make them as lightweight as possible without being unnecessarily abstracted.
Slicing and Dicing
While we’re building our models, we track a lot of performance metrics along the way. However, we’ve found that there’s always interest in how the model performed on certain cohorts, or timelines, or a host of other things that end up being used for customer discussions and presentations. To easily get answers to those questions, our modeling results can join back up with the supermarket, and the slicing and dicing follows a very similar structure to profiling. This is a really useful place to put all of that kind of data munging, since the deep learning instances have embedded or encoded values, which are not as intuitive to categorize by.
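Joining results back to an unencoded attribute and computing a metric per cohort might look like the sketch below. The `predictions` data, the `age_band` attribute, the 0.5 threshold, and the choice of accuracy as the metric are all assumptions for the example.

```python
from collections import defaultdict

# Hypothetical model output: one score and true label per member.
predictions = [
    {"member_id": "A", "score": 0.9, "label": 1},
    {"member_id": "B", "score": 0.2, "label": 0},
    {"member_id": "C", "score": 0.7, "label": 0},
    {"member_id": "D", "score": 0.6, "label": 1},
]

# Hypothetical unencoded attribute pulled from the supermarket.
age_band = {"A": "65+", "B": "under 65", "C": "65+", "D": "under 65"}

def accuracy_by_cohort(preds, cohort_of, threshold=0.5):
    """Join predictions to a cohort attribute and compute accuracy per cohort."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in preds:
        cohort = cohort_of[p["member_id"]]
        predicted = int(p["score"] >= threshold)
        hits[cohort] += int(predicted == p["label"])
        totals[cohort] += 1
    return {c: hits[c] / totals[c] for c in totals}

by_age = accuracy_by_cohort(predictions, age_band)
print(by_age)
```

Swapping `age_band` for any other supermarket attribute gives a different slice with no change to the model code.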
Once the final model has been set, the pipeline agreed upon, and the cadence established, it’s time to actually give people some results. In our case, that means some Deployment cleanup first. This can be calibrating the results of the model, translating some of the outputs, or blending two models together.
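As one example of such a cleanup step, blending could be as simple as a weighted average of two models' scores. The `blend` helper, the weight, and the sample scores below are hypothetical, and a real blend might be fit on held-out data rather than fixed by hand.

```python
def blend(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two models' scores for members both models scored."""
    shared = scores_a.keys() & scores_b.keys()
    return {
        m: weight_a * scores_a[m] + (1 - weight_a) * scores_b[m]
        for m in shared
    }

# Hypothetical scores from a Spark ML model and a deep learning model.
spark_scores = {"A": 0.8, "B": 0.2}
deep_scores = {"A": 0.6, "B": 0.4}

blended = blend(spark_scores, deep_scores, weight_a=0.25)
print(blended)
```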
Then, our output is generated in the Delivery section. In some cases, that’s a rank-ordered list, with columns that have been agreed-upon in advance to help slot the results into a workflow. These other columns also come from the supermarket, since all the information we need is already there in unencoded form. Attributes like name, which isn’t used in the model by itself, are very useful for clinical teams to know.
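A rank-ordered list with agreed-upon columns could be assembled along these lines. The scores, the made-up member names and `pcp` column, and the `delivery_list` helper are all hypothetical.

```python
# Hypothetical final scores and unencoded attributes from the supermarket.
scores = {"A": 0.35, "B": 0.91, "C": 0.64}
attributes = {
    "A": {"name": "Alex Doe", "pcp": "Dr. Lee"},
    "B": {"name": "Blake Roe", "pcp": "Dr. Kim"},
    "C": {"name": "Casey Poe", "pcp": "Dr. Lee"},
}

def delivery_list(scores, attributes, columns=("name", "pcp")):
    """Rank members by descending score and attach the agreed-upon columns."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"rank": rank, "member_id": member_id, "score": score,
         **{col: attributes[member_id][col] for col in columns}}
        for rank, (member_id, score) in enumerate(ranked, start=1)
    ]

output = delivery_list(scores, attributes)
for row in output:
    print(row)
```

The name never enters the model; it is attached only at this last step, which is why keeping it unencoded in the supermarket is convenient.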
Once that’s done, we also generate reports to let us, and everyone involved, see what is and isn’t working. These are usually PDF reports that are generated right after the lists themselves and draw on any relevant data from the supermarket. It’s all in the same environment and using the same tools, so every metric can be traced all the way back to ingest if need be.
After all this set up and front-loaded work, our pipeline is now ready for continuous deployment. We can ingest data and run it through on whatever cadence is best for the project we’re working on. We usually even set up an automated trigger that runs the pipeline (aside from the analysis sections) when new data for that project hits our system.
In all, this is the pipeline that helps us get from dataset (or database dump) to actual results — but it’s still a work in progress! Every project helps us get better and faster at what we do. Every improvement to our pipeline gives us more time to make better models. Healthcare data is hard, but the Cyft Supermarket makes it manageable.