When building predictive models, model accuracy, measured by metrics like area under the curve (AUC), has traditionally been the primary driver of model design and operationalization. While this leads to high-fidelity model construction at training and testing time, performance in production often degrades, producing results far worse than expected.
As machine learning (ML) matures within organizations, resiliency often overrides raw predictive accuracy as the defining criterion for operationalizing models. Increasingly, ML practitioners are leaning towards operationalizing decently performing, predictable production models rather than those that exhibit high performance at test time but don’t quite deliver on that promise when deployed. This preference for resilient models is evidenced by articles from Unite.ai, Rapidminer, the Software Engineering Institute at Carnegie Mellon, and Towards Data Science, but how do we get there?
Data Drift and Stability
Of the many reasons models in production behave differently than when trained and tested, one of the most significant and most frequently observed is a change in the properties of the data that anchors them. The original data used to create the features on which the model was trained differs from the data that powers the model in deployment, a phenomenon called data drift. Data drift, which happens when the real-world environments contributing data change in unexpected and unplanned ways, is arguably the primary cause of non-resilient models.
Tooling to the Rescue, Kinda
The impact of drift on model resiliency is universally acknowledged, but existing solutions to mitigate it demonstrate only modest effectiveness. Virtually every enterprise ML software toolkit today includes mechanisms to determine data drift, manifested as a drift function like: drift_pkg(distribution1, distribution2) → {Drift_Metric}
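To make the shape of such a function concrete, below is a minimal sketch of a drift metric computed between a reference (training) sample and a comparison (production) sample. The name drift_pkg follows the signature above; the choice of the Population Stability Index (PSI) as the metric, the binning scheme, and the sample data are illustrative assumptions, not any particular toolkit's implementation.

import numpy as np

def drift_pkg(distribution1, distribution2, bins=10):
    # Illustrative drift function: Population Stability Index (PSI) between
    # a reference sample and a comparison sample. A sketch, not a vendor API.
    edges = np.histogram_bin_edges(distribution1, bins=bins)
    ref_counts, _ = np.histogram(distribution1, bins=edges)
    cmp_counts, _ = np.histogram(distribution2, bins=edges)

    # Convert counts to proportions, guarding against empty bins
    eps = 1e-6
    ref_pct = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cmp_pct = np.clip(cmp_counts / cmp_counts.sum(), eps, None)

    # PSI: sum over bins of (cmp - ref) * ln(cmp / ref)
    psi = float(np.sum((cmp_pct - ref_pct) * np.log(cmp_pct / ref_pct)))
    return {"psi": psi}

# Example: compare training-time data against a drifted production sample
train_sample = np.random.normal(0.0, 1.0, 10_000)
prod_sample = np.random.normal(0.3, 1.2, 10_000)
print(drift_pkg(train_sample, prod_sample))  # prints a dict like {"psi": ...}

A larger PSI indicates a larger distributional distance between the two samples; the output is the kind of point drift metric these toolkits report.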
Large cloud providers also offer data drift reporting within their ML suites; for example, Microsoft Azure Machine Learning's dataset monitors and Amazon Web Services' SageMaker Model Monitor.
These are useful tools, but their biggest drawback is that they are reactive. When a deployed model misbehaves, these tools are invoked to check drift, revealing how the data fed into the underperforming model differs from the data used to train it. If drift is detected, the model is corrected (primarily through retraining).
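As a rough sketch of this reactive pattern, consider the check below, run only after production performance has already degraded. The threshold, the function names, and the use of the two-sample Kolmogorov-Smirnov statistic as the drift measure are illustrative assumptions.

from scipy.stats import ks_2samp

DRIFT_ALERT = 0.1  # illustrative threshold; real systems tune this per attribute

def check_after_incident(training_sample, production_sample, retrain_fn):
    # Reactive workflow: invoked only after the deployed model misbehaves.
    # Compares production data against training data and retrains on drift.
    drift = ks_2samp(training_sample, production_sample).statistic
    if drift > DRIFT_ALERT:
        retrain_fn()  # correct the model, typically by retraining on fresh data
    return drift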
Towards Resilient Models, Not Reactive Tooling
Correction is fine and necessary, but it doesn’t address the most critical ML engineering problem of all — how do we build resilient ML models from the start? Achieving resiliency means building models that have predictable behavior, and seldom misbehave. Without resiliency, operationalizing ML models will remain a major challenge — modelers will continue to build models that underperform in production, requiring frequent correction. The continual need to re-engineer these models will raise organizational questions over the operational utility of ML/AI.
I believe it is time to reconsider how drift is used in ML workflows and to envision novel ways of incorporating it into the model-building process so that we prevent, rather than react to, model misbehavior. To do this, we introduce the notion of data stability.
Prioritizing Data Stability
Data drift represents how a target data set is different from a source data set. For time-series data (the most common form of data powering ML models), drift is a measure of the “distance” of data at two different instances in time. The key takeaway is that drift is a singular, or point, measure of the distance of data.
We believe resilient models should be powered by data that exhibits low drift over time; such models, by definition, would exhibit less data-induced misbehavior. To capture this property of drift over time, we introduce the notion of data stability. While drift is a point measure, stability is a longitudinal metric: stable data drifts little over time, whereas unstable data is the opposite. We provide additional color below.
Consider two different attributes: the daily temperature distribution in NYC in November (TEMPNovNYC) and the distribution of the tare weights of aircraft at public airports (AIRKG). It is easy to see that TEMPNovNYC has lower drift than AIRKG; one would expect less variation between November temperatures in NYC across various years than between the weights of aircraft at two airports (compare the large aircraft at, say, JFK to those at a smaller airport, like Montgomery, Alabama). In this example, TEMPNovNYC is a more stable attribute than AIRKG.
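To make this intuition concrete, below is a minimal sketch of how one could quantify stability by applying a point drift metric across many time windows and summarizing the series. The simulated data, the use of the two-sample Kolmogorov-Smirnov statistic, and the summary statistics are illustrative assumptions, not a prescribed definition of stability.

import numpy as np
from scipy.stats import ks_2samp

def stability_profile(reference, windows):
    # Longitudinal view of drift: compute a point drift metric for each time
    # window against a fixed reference, then summarize the resulting series.
    drift_series = [ks_2samp(reference, window).statistic for window in windows]
    return {
        "mean_drift": float(np.mean(drift_series)),
        "drift_volatility": float(np.std(drift_series)),
    }

# A TEMPNovNYC-like attribute: November temperatures vary little year to year
temp_windows = [np.random.normal(8, 3, 5_000) for _ in range(5)]
# An AIRKG-like attribute: tare weights shift sharply across airports and windows
weight_windows = [np.random.normal(mu, 25, 5_000) for mu in (40, 180, 75, 250, 60)]

print(stability_profile(temp_windows[0], temp_windows[1:]))     # low, steady drift
print(stability_profile(weight_windows[0], weight_windows[1:])) # high, erratic drift

Low mean drift and low volatility across many windows is what we mean by a stable attribute; a single point reading of drift tells us much less.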
Data stability, though conceptually simple, can serve as a powerful tool to build resilient models. To see this, consider the following simple expression of data flow in ML:
data → features → model
Base data ingested into the modeling workflow is transformed to produce features, which are then fed into models. Stable data elements are likely to lead to stable features, which in turn are likely to power stable (or resilient) models. Let's assume a scale of stability values from 0 to 4, where 0 indicates "unstable" and 4 denotes "highly stable," and that each data attribute is assigned a stability value. These assignments, provided to modelers during the feature engineering process, help them build features, and in turn models, with known stability characteristics. Knowing how stable each input attribute is, modelers can choose stable data elements, and avoid unstable ones, to build stable features, which in turn will power resilient models. If a modeler deliberately chooses less stable data to construct a feature, that choice alerts MLOps that downstream models (i.e., those using this feature as input) need to be monitored more closely than others. MLOps can now anticipate rather than react.
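Here is a minimal sketch of how such stability assignments might be surfaced during feature engineering. The attribute names, the scores, and the threshold are hypothetical, chosen only to illustrate the workflow.

attribute_stability = {
    "temp_nov_nyc": 4,      # stable: little year-over-year variation
    "aircraft_tare_kg": 1,  # unstable: varies widely across airports
    "user_age": 3,
    "session_count_7d": 2,
}

MIN_STABILITY = 3  # illustrative threshold set by the modeling team

def partition_attributes(stability, threshold=MIN_STABILITY):
    # Split candidate attributes into those safe to build features on and
    # those whose downstream models MLOps should monitor more closely.
    stable = [attr for attr, score in stability.items() if score >= threshold]
    watchlist = [attr for attr, score in stability.items() if score < threshold]
    return stable, watchlist

stable_attrs, monitor_attrs = partition_attributes(attribute_stability)
print("Build features from:", stable_attrs)         # ['temp_nov_nyc', 'user_age']
print("Flag for closer monitoring:", monitor_attrs)  # ['aircraft_tare_kg', 'session_count_7d']

In practice these scores would come from longitudinal drift measurement, as in the earlier sketch, rather than hand assignment.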
A Change in Mindset, a Change in Methodology
I realize that this is a substantive departure from extant methodologies. The current data pipeline for feature construction and model building doesn't incorporate how "drifty" specific data items are, nor does it factor in the notion of data stability. Instead, it's an ad hoc process, driven primarily by modeler intuition and expertise, incorporating analytic procedures (such as exploratory data analysis, or EDA) whose primary objective is to give the modeler insight into the predictive power of individual data elements and features. The primary reason for constructing and eventually using a feature is its contribution to the model's accuracy. The significant drawback of this approach, as we now know, is that high predictivity at model testing doesn't always translate to production, frequently because the data's properties in production differ from those observed at training and testing time. The result is unstable models.
Based on my experience, it is time for an approach informed by data stability. The modeler doesn't need to sacrifice predictive power (i.e., model accuracy) but should be able to trade off accuracy against stability, building resilient, "accurate-enough" models with predictable behavior.
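One simple way to picture such a trade-off is to rank candidate models by a blend of test accuracy and stability rather than by accuracy alone. The weighting scheme, the candidate names, and the numbers below are purely illustrative assumptions.

def combined_score(auc, stability, alpha=0.7):
    # Rank candidates by a weighted blend of test AUC and a stability score
    # normalized to [0, 1]. The weight alpha is a hypothetical knob the
    # modeling team would tune to express its accuracy/stability preference.
    return alpha * auc + (1 - alpha) * stability

candidates = [
    {"name": "model_a", "auc": 0.91, "stability": 0.40},  # accurate but drift-prone
    {"name": "model_b", "auc": 0.87, "stability": 0.85},  # accurate enough and stable
]
best = max(candidates, key=lambda c: combined_score(c["auc"], c["stability"]))
print(best["name"])  # model_b wins once stability is factored in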
Time to Evolve
A final technical challenge that needs addressing to manifest this approach is that ML toolsets, including drift computation packages, need to adapt to measure data stability.
In our earlier example of TEMPNovNYC and AIRKG, the analysis is made easy by the simple semantics of the attributes: the modeler can intuitively estimate stability. However, in the real world, both ingested data and the features created from them have complex, nuanced semantics that don't lend themselves to intuitive assessments. Here, point estimates of drift can be quite misleading to the modeler. To understand data stability, we need to capture drift longitudinally and mathematically determine the long-term change in distributional properties. Existing tools don't naturally enable that.
One can argue that the point drift metric can be adapted to this goal through repeated explicit calls, but that requires complex workarounds instead of native support within an ML toolset, and those workarounds create issues at scale that make them impractical in the real world.
Our Answer - Anovos
At Mobilewalla, we have a dedicated team of data scientists working with one of the world's largest data sets. We also work with some of the most sophisticated data science teams around the globe who use our solutions. We have seen first-hand the challenges that drift and instability bring to models, and we have built our own solutions to address them. Anovos, an open-source project we have built, addresses some of the core inefficiencies in the feature engineering component of the predictive modeling workflow by making data stability a priority. It's built for scale, enabling users to engineer features over terabytes of data, not gigabytes, with data drift and stability at its core. Get started today at anovos.ai and join us in the #oss-anovos channel on the MLOps Community Slack.