Automate your data quality monitoring

It should be obvious: data-driven solutions built or run on crappy data will not create business value; at worst, they do actual damage to the business or to people.

Solving the issue is not hard. The problem is that too large a portion of data science sprint deliveries are measured by the quantity of deliveries, not their quality. “How many models did your team deploy?” should be replaced by “What is the benefit of what you deployed?” and “Will it be valuable in the future, and for how long?”

Unpleasant fact: without monitoring, you have no idea. Automated A/B testing is good, but you could see what is happening much earlier. All you need to do is plug DQ monitoring in earlier in the pipeline, so you notice unpleasant events as they happen.

Gather up your team and client (internal or external) data experts, and for starters, ask:

  1. Which data sources affect the predictions/model/metrics the most?
  2. What are the allowed min/max values, typical distributions, seasonal changes, and expected fluctuations for those data sources, based on past metrics and experience? (See the sketch after this list.)
  3. …and that’s it. Yes. That’s it.
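
To make this concrete, here is a minimal sketch of what those answers can turn into in code: hard min/max bounds and a simple distribution comparison. The column names, limits, and the KS-test threshold below are hypothetical placeholders; the real values come from your data experts in step 2.

```python
# Minimal sketch: bounds and drift checks for central data sources.
import pandas as pd
from scipy import stats

EXPECTED_BOUNDS = {
    "order_value": (0.0, 10_000.0),   # hypothetical column and agreed limits
    "items_per_order": (1, 200),
}

def check_bounds(df: pd.DataFrame) -> list[str]:
    """Flag columns whose observed min/max fall outside the agreed limits."""
    issues = []
    for column, (lo, hi) in EXPECTED_BOUNDS.items():
        observed_min, observed_max = df[column].min(), df[column].max()
        if observed_min < lo or observed_max > hi:
            issues.append(
                f"{column}: observed [{observed_min}, {observed_max}], "
                f"expected [{lo}, {hi}]"
            )
    return issues

def check_drift(current: pd.Series, reference: pd.Series,
                p_threshold: float = 0.01) -> str | None:
    """Compare fresh values against a reference sample with a two-sample KS test."""
    statistic, p_value = stats.ks_2samp(current.dropna(), reference.dropna())
    if p_value < p_threshold:
        return f"{current.name}: distribution drift (KS={statistic:.3f}, p={p_value:.4f})"
    return None
```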

Then you just plug in a system that observes those metrics and alerts you when things look suspicious. You can naturally feed the output of this system into your model deployments. If bad data taints your data solution, why keep it in production? So a manager can look at a colorful dashboard? Why waste anyone’s time like that?
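
A sketch of what that plumbing could look like, reusing the checks from the previous sketch. The `send_alert` and `deploy_fn` hooks are placeholders for whatever your stack actually uses (a Slack webhook, a CI/CD step, and so on):

```python
# Sketch of a quality gate in front of a deployment. Reuses check_bounds,
# check_drift and EXPECTED_BOUNDS from the previous sketch.
import pandas as pd

def send_alert(channel: str, issues: list[str]) -> None:
    # Placeholder: wire this to Slack, email, PagerDuty, whatever you use.
    print(f"[{channel}] " + "; ".join(issues))

def run_quality_gate(current: pd.DataFrame, reference: pd.DataFrame) -> list[str]:
    """Run all agreed checks and collect every issue found."""
    issues = check_bounds(current)
    for column in EXPECTED_BOUNDS:
        drift = check_drift(current[column], reference[column])
        if drift:
            issues.append(drift)
    return issues

def quality_gated_deploy(current, reference, deploy_fn) -> bool:
    """Deploy only when the data passes; otherwise alert and block."""
    issues = run_quality_gate(current, reference)
    if issues:
        send_alert("data-quality", issues)
        return False    # bad data: keep the tainted solution out of production
    deploy_fn()         # your actual deployment step goes here
    return True
```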

Noticing the flaws further downstream, after A/B testing or after the predictions/metrics/solution is live, is a failure. Is it a sensible use of a data scientist’s or data engineer’s time to backtrack errors down the pipeline after an A/B test, when you could just send an alert the moment a central data source starts producing distorted data? Even if the data is valid, large fluctuations may well distort the results of an overtrained model, so there really is no excuse for not monitoring your central data sources. I have heard the excuse: “too many alerts will not be noticed anyway.” Then parse the alerts into reports. Suppress aggressive channels after a trigger has fired enough times. Lack of knowledge does not mean things are all right.
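
The suppression idea is a few lines of code, too. A rough sketch, assuming the `send_alert` placeholder from the earlier sketch: page loudly the first few times a check fires within a window, then fold further hits into a daily report instead. The window and page limit here are made-up numbers; tune them to your team.

```python
# Sketch of alert suppression: page loudly up to MAX_PAGES times per check
# per window, then divert further hits into a daily report queue.
from collections import defaultdict
from datetime import datetime, timedelta, timezone

MAX_PAGES = 3                 # hypothetical: loud alerts allowed per window
WINDOW = timedelta(hours=24)

_fire_times: dict[str, list[datetime]] = defaultdict(list)
report_queue: list[str] = []  # folded into a daily summary report

def handle_alert(check_name: str, message: str) -> None:
    """Throttle a noisy check instead of ignoring it."""
    now = datetime.now(timezone.utc)
    recent = [t for t in _fire_times[check_name] if now - t < WINDOW]
    recent.append(now)
    _fire_times[check_name] = recent
    if len(recent) <= MAX_PAGES:
        send_alert("data-quality", [message])  # loud channel
    else:
        report_queue.append(f"{check_name}: {message}")  # quiet, but not lost
```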

Automate your data quality monitoring. It is not hard.

It is worth it.