Solutions built with crap data will be crap

This is obvious. And yet teams keep on building even after they have seen that the data is too crap to answer the questions they have in mind. In an age when anyone with an internet connection can get their AI terminology right, why do people still…

  • train machine learning models with crappy data? The models will be useless. Quantity does not fix quality if the available data is skewed. If the data you build with does not represent reality, how would your solution work in said reality?
  • ignore the quality monitoring of their data? Data rots over time, and that includes data from technical systems. So do the models trained on that data, and so does anything scored with those models. Change is constant. Not much else is.
  • generalize from whatever data they can get, instead of doing the work of finding out what data they actually need to back their claims? Working on limited data can be lethal. For the business, saying “we cannot make decisions based on this data” is OK. You read that right. It is more OK to say “with this data, we don’t know” than to give management flawed numbers or models with which to (destructively) guide their business. “Shipped is better than perfect” is not true if you ship something that runs over a pedestrian or downs an airplane.
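
To make "data rots" a bit more concrete, here is a minimal sketch of one way a team might watch a single numeric input for drift: the Population Stability Index (PSI), which compares the data a model was built on with what is arriving now. This is my illustration, not a prescription. The function name, the equal-width binning, the smoothing constant, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger means more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are counted in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic stand-ins for "data at build time" and "data today".
random.seed(0)
training = [random.gauss(0, 1) for _ in range(5000)]
same_process = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(1, 1) for _ in range(5000)]

psi_identical = population_stability_index(training, training)
psi_same = population_stability_index(training, same_process)
psi_shifted = population_stability_index(training, shifted)
```

A commonly quoted rule of thumb reads a PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, not laws. The point is only that "does the data still look like what we built on?" is a question you can ask automatically, every day, per feature.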

People buying IT solutions have matured quickly. Only a few years ago I was still explaining why fact-based decisions built on relevant data are better than decisions made on a senior manager’s hunch. Now some of these same clients have predictive models live as part of their business. I celebrate it.

But. I think the next step for clients should be to demand that their teams prove the data solutions they built 1) reflected facts when built and 2) continue to do so. Otherwise it is insane to direct one’s business with them.

“Garbage in, garbage out.” This is not going away.