Data teams need to grow up

  • Thine solution shall have at least general architecture documentation.
  • Thine solution shall log what it did.
  • Thine solution shall test its own behaviour for sanity.
  • Thine solution’s results shall be monitored automatically.

Do these seem familiar?

For any software developer, of course they do; they have been around for two decades now. For data scientists rushing their quick-draft Proof-of-Concept into production, they too often seem to be totally alien concepts.

Documentation: Writing 100 pages of documents that become outdated within a week as the software evolves has negative value, because they mislead. But stating exactly what you load and from where, what you output and to exactly where, and where the code that does it lives should be the bare minimum of documentation. It will save money: otherwise every maintainer of the solution ends up reading through thousands of lines of code, multiple times, and combing the datalake for data, sometimes for hours on end, because someone slipped in a creative runtime parameter and now the data sits in some “/roger_test/” path in a corner of the datalake that no one except Roger finds quickly, and Roger is on holiday. Calculate the cost of that. Code is, in the end, the only trustworthy documentation, but a developer is so much faster if you give them even a hint of where to look.
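To make that concrete, the bare minimum can be as small as a module docstring in the job itself. A sketch, with made-up paths and names standing in for your actual datalake:

```python
"""Monthly revenue aggregation job.

Reads:   raw order events from s3://datalake/sales/orders/        (hypothetical path)
Writes:  monthly revenue figures to s3://datalake/reporting/revenue_monthly/
Code:    this module, scheduled by the nightly orchestrator
Owner:   data platform team
"""
```

Five lines, and the next maintainer knows where to look before opening a single line of logic.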

Logging: Logging every calculation is, of course, useless bloat. But a log line saying “source foo is empty!” should be the bare minimum of monitoring for datasource foo. Then you can hook monitoring up to that log, and voilà: the maintainer chasing a glitch through 15 layers of integrations is umpteen times faster at finding which technical hiccup is making Sally from FINA justifiably furious, because the company financial reports show 0 turnover when everyone knows that is bloody well not true.
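A minimal sketch in Python of what that could look like. The source name, path and logger name are made up; the point is the one warning line that monitoring can latch onto:

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline.load_foo")  # hypothetical logger name

def load_source_foo(path: str) -> pd.DataFrame:
    """Load datasource foo and shout in the log if it comes back empty."""
    df = pd.read_csv(path)
    if df.empty:
        # The one log line that saves the maintainer hours of digging.
        logger.warning("source foo is empty! path=%s", path)
    else:
        logger.info("loaded %d rows from source foo (path=%s)", len(df), path)
    return df
```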

Testing: Writing a test for every line of code is, of course, useless bloat of negative value, and TDD ended up being an expensive highway to hell… but if there is a flow that produces something of value, why on earth would you not test what its steps produce… at least test that it produces something.
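Even a pytest-style sketch like the one below is already far better than nothing. The pipeline step and its columns are hypothetical; the point is the two asserts at the end:

```python
import pandas as pd

def monthly_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: turnover aggregated per month."""
    orders = orders.assign(month=pd.to_datetime(orders["order_date"]).dt.to_period("M"))
    return orders.groupby("month", as_index=False)["amount"].sum()

def test_monthly_revenue_produces_something():
    # A tiny fixture instead of the datalake: enough to prove the step yields output.
    orders = pd.DataFrame(
        {"order_date": ["2024-01-03", "2024-01-17", "2024-02-01"],
         "amount": [100.0, 50.0, 25.0]}
    )
    result = monthly_revenue(orders)
    assert not result.empty                  # it produces *something*
    assert result["amount"].sum() == 175.0   # and the totals are not nonsense
```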

Monitoring: Pushing monitoring noise into a channel no one reads is, of course, useless and of negative value… but when you test and log your data pipelines, you can monitor them automatically and be woken immediately when something goes bork in the night. Do you really want the end customer to be the one who finds out something broke? You do care, right?
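A minimal sketch of such a check, with a made-up webhook address standing in for whatever alerting channel your on-call people actually read:

```python
import logging

import requests

logger = logging.getLogger("pipeline.monitoring")

ALERT_WEBHOOK = "https://chat.example.com/hooks/data-alerts"  # hypothetical alert channel

def check_turnover(total_turnover: float) -> None:
    """Raise an alert instead of waiting for Sally from FINA to notice."""
    if total_turnover <= 0:
        message = f"Financial report pipeline produced turnover={total_turnover} - investigate!"
        logger.error(message)
        # Push to the channel someone is actually paged on.
        requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)
```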

sigh.

In brief: data teams should finally adopt the best practices of software development. They have been available long enough.

“If something is worth doing, it is worth doing badly” – maybe, when you are first trying something out. But when you take it to production, you will really see the cost of a team leaving behind non-logging, non-tested, non-monitored, undocumented, undecipherable, unmaintainable spaghetti (with perhaps some bonus hidden niche features and funny quirks of the programmer showing their individuality). The solution will be unstable. Reverse engineering what it does later is expensive and raises its total cost of ownership. Noticing bugs and fixing things will take longer, which makes your business slower to develop. And if you top all that off with a single developer being the only one who knows how the solution operates… is it surprising you are screwed if they get sick or quit?

Data teams need to grow up.

Just… what the actual f**k?
(Photo: Andrea Piacquadio)