Machine learning operations don’t belong with cloudops

By | September 9, 2019

It is Monday morning, and after a long weekend of system trouble the cloud operations staff is discussing what happened. It seems that many systems associated with a new, very advanced inventory management system enabled with machine learning had issues over the weekend. The postmortem concluded the following:

The batch process that transferred raw data from the operational database to the training database failed, as did the auto-recovery procedure. An ops team member who was working over the weekend attempted to resubmit the job but triggered not one but four partial updates, which left the training database in an unstable state.
This caused the knowledge models in the machine learning systems to train on bad data, and required that the new data in the knowledge base be removed and the models rebuilt.

Additionally, several outside data feeds, such as pricing and tax information, were loaded into the training database at the same time. Although those loads worked fine, they too had to be backed out of the knowledge database because the operational data was not in a good state.
The system was unavailable for two days, and the company lost $4 million, counting lost productivity, customer reactions, and PR problems.
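
The partial-update failure in this postmortem is the kind of problem that a transactional, idempotent load step is meant to prevent. What follows is only a minimal sketch, not the system described above: it assumes the training database speaks SQL, uses Python's built-in sqlite3 module as a stand-in, and every name in it (applied_batches, training_rows, load_batch, create_schema) is hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create the hypothetical tables used by this sketch."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS applied_batches (batch_id TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS training_rows (
            batch_id TEXT NOT NULL,
            sku      TEXT NOT NULL,
            quantity INTEGER NOT NULL
        );
        """
    )


def load_batch(conn, batch_id, rows):
    """Apply one batch all-or-nothing. Return False if it was already applied."""
    # The connection context manager commits when the block succeeds and
    # rolls back when any statement raises, so a failure partway through
    # leaves no partial rows behind.
    with conn:
        already_applied = conn.execute(
            "SELECT 1 FROM applied_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already_applied:
            return False  # a resubmission becomes a no-op
        conn.execute(
            "INSERT INTO applied_batches (batch_id) VALUES (?)", (batch_id,)
        )
        conn.executemany(
            "INSERT INTO training_rows (batch_id, sku, quantity) VALUES (?, ?, ?)",
            [(batch_id, sku, quantity) for sku, quantity in rows],
        )
    return True
```

With a step like this, a failed run can be resubmitted safely: a batch is either fully present or fully absent, and submitting the same batch again does nothing. It does not address the harder operational questions the postmortem raises, such as backing out data that models have already trained on.
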
This isn’t 2025; this is today. As businesses find more uses for “cheap and good” cloud-based machine learning systems, we are discovering that systems that leverage machine learning are complex to operate. Ops teams did not anticipate the level of difficulty and complexity, and are finding that they are undertrained, understaffed, and underfunded.

The assumption is that cloud operations teams can manage cloud-based databases, cloud-based storage, and cloud-based compute with a fairly easy transition. For the most part that has been the case, given that cloud-based systems are similar to traditional systems.

But systems based on machine learning have, for the most part, not yet been seen by operations groups. These systems have specialized purposes, as well as specialized components, such as databases and knowledge engines, that have to be monitored and managed in specific ways. This is where current operations teams are falling short.

The fix is easy to understand, but most enterprises aren’t going to like it, because it means either spending more money on ML cloudops or abandoning ML cloudops. Machine learning systems are technological chainsaws. Used carefully, they are highly effective. Mishandled, they can be dangerous. Failures can go unnoticed, and if the system automatically uses the resulting bad knowledge, you could end up with huge problems that are not detected until much harm is done. That sounds like more risk than reward.