I ran 580 model-dataset experiments to show that, even if you try very hard, it is almost impossible to know that a model is degrading just by looking at data drift results

Published June 3, 2024

In my opinion, data drift detection methods are very useful for understanding what went wrong with a model, but they are not the right tools for tracking how a model's performance is doing.

Essentially, using data drift as a proxy for performance monitoring is not a great idea.

I wanted to prove that by giving data drift methods a second chance and trying to get the most out of them. I built a technique that relies on drift signals to estimate model performance and compared its results against the current SoTA performance estimation methods (PAPE [arxiv link] and CBPE [docs link]) to see which technique performs best.

To effectively compare data drift signals against performance estimation methods, I used an evaluation framework that emulates a typical production ML model and ran multiple dataset-model experiments.

For the data, I used datasets from the Folktables package, which preprocesses US census data into a set of binary classification problems. To make sure the results are not biased toward one type of model, I trained different model families (linear and gradient-boosted ensembles) on multiple prediction tasks included in Folktables.
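To make this concrete, here is a minimal sketch of how one such dataset-model pair can be set up, using the ACSIncome task on California data. The exact tasks, survey years, states, and hyperparameters in my experiments differ, so treat this only as an illustration.

```python
# Sketch: load one Folktables task (ACSIncome, California) and train two monitored models.
# The survey year, state, split sizes, and hyperparameters here are illustrative only.
from folktables import ACSDataSource, ACSIncome
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)
X, y, _group = ACSIncome.df_to_numpy(acs_data)

# Reference split (train/test); in the benchmark the "production" data comes from later periods.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "boosting": LGBMClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test roc_auc = {auc:.3f}")
```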

Then I built the drift-based technique itself. It uses univariate and multivariate data drift results as features of a DriftSignal model that estimates the performance of the monitored model. It works as follows:

  1. Fit univariate and multivariate drift detection calculators on the reference data (the test set).
  2. Use the fitted calculators to measure the observed drift in the production set. The univariate methods are the Jensen-Shannon distance, the Kolmogorov-Smirnov test, and the Chi2 test; the multivariate methods are PCA Reconstruction Error and a Domain Classifier.
  3. Build a DriftSignal model that trains a regression algorithm using the drift results from the reference period as features and the monitored model performance as a target.
  4. Estimate the performance of the monitored model on the production set using the trained DriftSignal model.

You can find the full implementation of this method in this GitHub Gist.
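For illustration, here is a simplified sketch of that pipeline. It reuses `X_test`, `y_test`, and `models` from the snippet above, assumes `X_prod` is the (unlabeled) production feature matrix, and swaps the gist's drift calculators for rough scipy/scikit-learn equivalents; the real implementation differs in the details.

```python
# Simplified DriftSignal sketch. Reuses X_test, y_test, models from the previous snippet
# and assumes X_prod is the unlabeled production feature matrix.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def chunk_indices(n_rows, n_chunks):
    """Split row indices into consecutive chunks, emulating batches arriving over time."""
    return np.array_split(np.arange(n_rows), n_chunks)


def drift_features(reference, chunk, scaler, pca):
    """Univariate + multivariate drift signals of one chunk measured against the reference set."""
    feats = []
    for j in range(reference.shape[1]):
        # Univariate: Jensen-Shannon distance over shared histogram bins and the KS statistic.
        # (A Chi2 test would be used instead for categorical features.)
        bins = np.histogram_bin_edges(reference[:, j], bins=20)
        p, _ = np.histogram(reference[:, j], bins=bins)
        q, _ = np.histogram(chunk[:, j], bins=bins)
        feats.append(jensenshannon(p + 1e-9, q + 1e-9))
        feats.append(ks_2samp(reference[:, j], chunk[:, j]).statistic)
    # Multivariate: PCA reconstruction error of the chunk.
    z = scaler.transform(chunk)
    recon = pca.inverse_transform(pca.transform(z))
    feats.append(float(np.mean(np.linalg.norm(z - recon, axis=1))))
    # Multivariate: domain classifier AUC (how well reference and chunk can be told apart).
    X_dc = np.vstack([reference, chunk])
    y_dc = np.concatenate([np.zeros(len(reference)), np.ones(len(chunk))])
    feats.append(cross_val_score(LogisticRegression(max_iter=1000), X_dc, y_dc,
                                 scoring="roc_auc", cv=3).mean())
    return feats


# Steps 1-2: fit the "calculators" (scaler + PCA here) on the reference (test) set.
scaler = StandardScaler().fit(X_test)
pca = PCA(n_components=0.95).fit(scaler.transform(X_test))

# Step 3: DriftSignal model -- drift features of reference chunks as inputs,
# realized roc_auc of the monitored model on those chunks as the target.
monitored = models["boosting"]
ref_feats, ref_targets = [], []
for idx in chunk_indices(len(X_test), n_chunks=20):
    ref_feats.append(drift_features(X_test, X_test[idx], scaler, pca))
    ref_targets.append(roc_auc_score(y_test[idx], monitored.predict_proba(X_test[idx])[:, 1]))
drift_signal_model = RandomForestRegressor(random_state=42).fit(ref_feats, ref_targets)

# Step 4: estimate performance on production chunks -- no production labels needed.
prod_feats = [drift_features(X_test, X_prod[idx], scaler, pca)
              for idx in chunk_indices(len(X_prod), n_chunks=20)]
estimated_roc_auc = drift_signal_model.predict(prod_feats)
```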

Then, for evaluation, I used a modified version of MAE because I needed an aggregated metric that takes the standard deviation of the errors into account. To do that, I scale the absolute (or squared) errors by the standard error (SE) calculated for each evaluation case. The SE-scaled version of MAE is called the mean absolute standard error (MASTE).

[Figure: MASTE formula]
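Since the formula image is not reproduced here, the following is my rendering of MASTE for a single evaluation case with $n$ monitored chunks, based only on the description above:

$$
\mathrm{MASTE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left|\widehat{\mathrm{roc\_auc}}_i - \mathrm{roc\_auc}_i\right|}{SE}
$$

where $\mathrm{roc\_auc}_i$ is the realized metric on chunk $i$, $\widehat{\mathrm{roc\_auc}}_i$ the estimated one, and $SE$ the standard error computed for that evaluation case.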

Then it was a matter of running all 580 experiments and collecting the results.

Since each performance estimation method is trying to estimate the roc_auc of the monitored model, I report the MASTE between the estimated and realized roc_auc.
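In code, and under the same reading of MASTE as above, the number reported for one evaluation case can be computed roughly like this (a hypothetical helper; the aggregation across all 580 experiments in the actual benchmark may differ):

```python
import numpy as np

def maste(realized: np.ndarray, estimated: np.ndarray, standard_error: float) -> float:
    """MAE between realized and estimated roc_auc, scaled by the evaluation case's standard error."""
    return float(np.mean(np.abs(realized - estimated) / standard_error))

# e.g. maste(realized_roc_auc_per_chunk, estimated_roc_auc_per_chunk, se_of_this_case)
```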

[Table: MASTE of the estimated vs. realized roc_auc for each performance estimation method]
Every method in the analysis tries to estimate the roc_auc of the monitored models, so after the experiments we have both the estimated and the realized roc_auc for each case. To compare them, we compute the MASTE between the two, which tells us which performance estimation method worked best. That is what this table reports.

PAPE seems to be the most accurate method, followed by CBPE. Surprisingly, the constant baseline that simply assumes the test set performance holds is third best, closely followed by the random forest versions of the univariate and multivariate drift signal models.

The plot below shows the quality of performance estimation across the different methods, including PAPE and CBPE.

[Figure: Quality of performance estimation (MASTE of roc_auc) vs. absolute performance change (in SE units); the lower, the better]

Here is a specific time series plot of a model's realized ROC AUC (black) compared against all the performance estimation methods. PAPE (red) accurately estimates the direction of the most significant performance change and closely approximates the magnitude.

[Figure: Time series of realized vs. estimated roc_auc for the ACSIncome (California) dataset and a LightGBM model]

The experiments suggest that there are better tools than data drift for detecting performance degradation, even though I tried my best to extract all the meaningful information from the drift signals to build an accurate performance estimation method.

There are better tools for quantifying the impact of data drift on model performance. So, I hope this helps the industry realize that monitoring fine-grained drift metrics on their own rarely leads anywhere, and that a change in an obscure feature might not mean anything for the model. It is better to first estimate model performance and then, if it drops, review the data drift results, not the other way around.

The full experiment setup, datasets, models, benchmarking methods, and the code used in the project can be found in this longer post that I wrote last week.