Many companies are eager to use artificial intelligence (AI) in production, but struggle to achieve real value from the technology.
What’s the key to success? Creating new services that learn from data and can scale across the enterprise involves three domains: software development, machine learning (ML) and, of course, data. These three domains must be balanced and integrated into a seamless development process.
Most companies have focused on building machine learning muscle – hiring data scientists to create and apply algorithms capable of extracting insights from data. This makes sense, but it’s a rather limited approach. Think of it this way: They’ve built up the spectacular biceps but haven’t paid as much attention to the underlying connective tissues that support the muscle.
Why the disconnect?
Focusing mostly on ML algorithms won’t drive strong AI solutions. It might be good for getting one-off insights, but it isn’t enough to create a foundation for AI apps that consistently generate ongoing insights leading to new ideas for products and services.
AI services have to be integrated into a production environment without risking deterioration in performance. Unfortunately, performance can decline without proper data management, as ML models will degrade quickly unless they’re repeatedly trained with new data (either time-based or event-triggered).
Professionalizing the AI development process
The best approach to getting real and continuous value from AI applications is to professionalize AI development. This approach conforms to machine learning operations (MLOps), a method that integrates the three domains behind AI apps in such a way that solutions can be quickly, easily and intelligently moved from prototype to production.
AI professionalization elevates the role of data scientists and strengthens their development methods. Like all scientists, these professionals bring with them a keen appreciation for experimentation. But often, their dependence on static data for creating machine learning algorithms – developed on local laptops using preferred tools and libraries – keeps production AI solutions from continuously producing value. Data communication and library dependency problems take their toll.
Data scientists can continue to use the tools and methods they prefer, their output accommodated by loosely coupled DevOps and DataOps interfaces. Their ML algorithm development work becomes the centerpiece of a highly professional factory system, so to speak.
Smooth pilot-to-production workflow
Pilot AI solutions become stable production apps in short order. We use DevOps technology and techniques such as continuous integration and continuous delivery (CI/CD), and we have standard templates for automatically deploying model pipelines into production. With model pipelines, training and evaluation can happen automatically when needed – when new data arrives, for instance – without human involvement.
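To make this concrete, here is a minimal sketch of such an automated retrain-and-evaluate gate. The file paths, the 0.85 accuracy bar, and the scikit-learn model choice are illustrative assumptions, not our production templates.

```python
# Minimal sketch of an event-triggered retrain-and-evaluate gate.
# Paths, the 0.85 threshold, and the "label" column are illustrative assumptions.
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_GATE = 0.85  # promote the new model only if it clears this bar

def retrain_on_new_data(csv_path: str, model_path: str) -> bool:
    """Retrain when new data lands; deploy only if the model passes evaluation."""
    df = pd.read_csv(csv_path)                      # triggered by data arrival
    X, y = df.drop(columns=["label"]), df["label"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    score = accuracy_score(y_te, model.predict(X_te))

    if score >= ACCURACY_GATE:                      # evaluation gate before deployment
        joblib.dump(model, model_path)              # "deploy" = publish the artifact
        return True
    return False                                    # keep the current production model
```

The key design point is the evaluation gate: a newly trained model replaces the production artifact only if it clears the quality bar, so automation never silently degrades performance.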
Our versioning and tracking ensure that everything can be reused, reproduced and compared if necessary. Our advanced monitoring provides end-to-end transparency into production AI use cases (including data and model pipelines, data quality and model quality and model usage).
Using our innovative MLOps approach, we were able to bring the pilot-to-production timeline for one U.S. company’s AI app down from six months to less than one week. For a UK company, the window for delivering a stable AI production app shrank from five weeks to one day.
The transparency of AI solutions, and confidence in their agility and stability, is critical. After all, the value lies in the ability to use AI to discover new business models and market opportunities, deliver industry-disrupting products and creatively respond to customer needs.
Before singing MLOps’ praises any further, let me highlight a few lessons that companies across the globe – especially in consumer packaged goods (CPG) – have learned from the post-pandemic crisis.
Digital channels – or at the very least, digitization – are now a requisite. As Yoda said: do or do not, there is no try! CPG companies that toiled for years to see their brands sprout across the market watched sales decline sharply in a matter of months. Logistics became a big problem, yes, but poorly implemented strategies were the actual Gordian knot.
Today, consumers have a plethora of options, and CPG firms can no longer rely on their standard go-to-market strategies. The old question was how to connect with end consumers; now there is an addendum: how to connect with end consumers and win them?
Companies across the world, irrespective of size and market presence, have started moving from offline to online in one way or another – whoever does not think and act ‘online’ is set up for a loss.
Health and wellness have become essential factors for customers.
Millennials shop online, and nothing drives them more than value for cost. They want convenience, a sense of belonging, and all of it at lower prices.
Well, these points are just the picture’s skeleton; the full painting factors in multiple new developments, such as:
The emergence of small and medium-sized companies focusing on target customers.
Manufacturers and distributors sharing data to streamline logistics.
A surge in the usage of automated systems.
A shift towards local consumption.
E-logistics companies collaborating with retail stores.
The list is long.
A quick glimpse of how a product reaches the end consumer:
If you scrutinize each step of that journey, you will find tremendous opportunities hidden in it. Here are a few.
Opportunity 1 – Introduce a forecasting functionality based on new data.
Opportunity 2 – Bring in an integrated system that synchronizes data across the process.
Opportunity 3 – Factor in a self-learning feature that accounts for market changes, customers’ buying behavior, and so on.
You can cash in on the above opportunities by implementing automation systems with various machine learning (ML) algorithms, such as:
Route optimization to make the best of the sales reps’ time.
Product optimization to solve the product mix problems.
NLP to analyze the consumers’ behavior.
Trade promotion optimization to plan and execute your trade spends.
Again, this list is endless.
So, you have the solution – build ML models and deploy them. What are the critical roadblocks in adopting Machine Learning?
Problem 1 – Continuous delivery of value
The team that works on the use case and writes the ML code does not deploy it – or at least lacks expertise in delivery. Staking your success entirely on the data science team can frustrate them and derail your ML journey.
Problem 2 – Composite and complex ML builds
Unlike traditional software builds, ML models make predictions by (indirectly) capturing data patterns rather than following explicit rules. The ML build runs a pipeline that extracts patterns from the data to create model artifacts, which makes the build far more complex and experimental.
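For intuition, here is a hedged sketch of what such a build step looks like: the pipeline ingests data, extracts patterns, and emits a model artifact rather than a deterministic binary. The dataset and file name are illustrative.

```python
# Sketch of what an "ML build" produces: not compiled code, but a model
# artifact distilled from data. Dataset and artifact name are illustrative.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The "build" step: a pipeline that extracts patterns from the data...
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# ...and emits a versionable model artifact instead of a deterministic binary.
joblib.dump(pipeline, "model_artifact_v1.joblib")
```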
Problem 3 – Productionizing ML models
Gartner estimates that 80% of data science projects fail or never make it to production. To run a project successfully in a real-time environment, you need to identify problem situations and resolve them as they occur. You need to continuously monitor the process to find the gap between correct and incorrect predictions (bias) and to know in advance how well your training data will represent real-time data.
Areas to Focus: Identify Where Things Might Go Wrong for You
Beyond ML deployment difficulties and risks in CPG, there are several other key areas where things can go wrong. To stay ahead of them:
Find out the exact use case; if you try solving the wrong problems, things will go wrong.
Do not build models that do not map well to your business processes.
Check if you have any flawed assumptions about the data.
Convert the results of your experimentation into a production-ready model.
There are opportunities, there are problems, and there are ML models. Yet what most often delays model deployments or triggers performance issues is simply the lack of means to deploy them successfully. Anteelo can reduce your effort in solving ML deployment challenges through its state-of-the-art ML Works platform, which gives you the means to run thousands of ML models at scale, all at once.
Most of us are familiar with Continuous Integration (CI) and Continuous Deployment (CD), which are core parts of MLOps/DevOps processes. However, Continuous Monitoring (CM) may be the most overlooked part of the MLOps process, especially when you are dealing with machine learning models.
CI, CD and CM, together, are an integral part of an end-to-end ML model management framework, which not only helps customers to streamline their data science projects, but to also get full value out of their analytics investments. This blog focuses on the Continuous Monitoring aspect of MLOps and gives an overview of how Anteelo is using ML Works, a model monitoring accelerator built on Databricks’ platform, to help customers build a robust model management framework.
Here are a few examples of MLOps customer personas:
1. Business Org – A business team that sponsors an analytics project expects machine learning models to run in the background, helping it extract valuable insights from its data. However, these ML models mostly sit in a black box, and in many cases the business sponsors are not even sure the analytics project will deliver a good ROI.
2. IT/Data Org – A company’s internal IT team that supports business teams usually includes data engineers and data scientists who build ML pipelines. Their core mandate is to build the best ML models and migrate them to production. Once there, they are either too busy building the next best ML model to support it, or managing production model support is simply not the best use of their time. Hence there is no streamlined model monitoring process in production, and IT and data leaders are left wondering how to support their business partners.
3. Support Org – A company’s IT support organization takes care of all IT issues. Such a team likely treats every issue the same, with similar SLAs, and may not differentiate between supporting an ML model and a Java web application. A generic support team may therefore lack the right skills to support ML models and may not meet the expectations of its internal customers.
A well-designed MLOps framework will address the challenges of all three personas.
Anteelo not only has extensive experience in end-to-end MLOps implementations across tech stacks but has also built MLOps accelerators to help customers realize the full potential of their analytics investments.
Let’s drill down on our model monitoring accelerator in the Continuous Monitoring (CM) space and talk about the offer in more detail.
Model monitoring is not easy!
Unlike monitoring a BI dashboard or an ETL pipeline, the biggest challenge with ML models is that their results are probabilistic in nature and carry their own dependencies, such as training data, hyperparameters, model drift, and the need to explain model outputs. Complications multiply, and model monitoring becomes almost impossible, when models are built in unstructured notebook formats used across multiple data science teams. This severely impacts support SLAs and causes business users to gradually lose confidence in the model’s predictions.
ML Works to the rescue
ML Works is our model monitoring accelerator built on Databricks’ unified data analytics platform to augment our MLOps offerings. After evaluating multiple architectural options, we decided to build ML Works on Databricks to leverage Databricks’ offerings like Managed MLflow and Delta Lake. ML Works has been trained on thousands of models and can handle enterprise-scale model monitoring, or it can be used for automated monitoring within a small team of data scientists and analysts. Here is an overview of ML Works’ core offerings:
1. Workflow Graph – Monitoring an ML pipeline along with its relevant data engineering tasks can be daunting for a support engineer. ML Works uses Databricks’ Managed MLflow framework to build a visual end-to-end workflow monitor for easy and efficient model monitoring. This helps support engineers troubleshoot production issues and narrow down root causes faster, significantly reducing support SLAs.
2. Persona-based Monitoring – An ML model monitoring process should not only make the support engineer’s life easier but also give other relevant personas – business users, data scientists, ML engineers, and data engineers – visibility into their respective ML model metrics. Hence, we have built a persona-based monitoring journey using Databricks’ Managed MLflow to make model monitoring easy for all personas.
3. Lineage Tracker – Picking up the task of debugging someone else’s ML code is not a pleasant experience, especially without good documentation. Our Lineage Tracker uses Databricks’ Managed MLflow to help customers start from a dashboard metric and drill all the way down to the base ML model, including its hyperparameter values, training data, and more, giving full visibility into every model’s operations. This puts all relevant details about a model in one place, improving traceability, and is further enhanced by Delta Lake’s Time Travel functionality, which we use to create snapshots of training data (a minimal illustration of this kind of run tracking follows the list).
4. Drift Analyzer – Monitoring a model’s accuracy over time is critical for business users to trust its insights. Unfortunately, accuracy drifts over time for various reasons: production data changes, business requirements shift and make original features irrelevant, or a newly acquired business introduces new data sources and new patterns. Our Drift Analyzer automatically analyzes data drift and concept drift by reviewing data distributions, triggers alerts when drift exceeds a threshold, and ensures that production models are continuously monitored for accuracy and relevance.
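As promised above, here is a hedged sketch of the kind of run tracking that makes such lineage possible with Managed MLflow. The run name, parameters, metric, and the Delta snapshot id are illustrative assumptions, not ML Works internals.

```python
# Illustrative MLflow run tracking: log hyperparameters, a training-data
# version, a metric, and the model artifact so lineage can be traced later.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="demand_model_v2"):
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X, y)

    mlflow.log_params(params)                        # hyperparameter values
    mlflow.log_param("training_data_version", 42)    # e.g., a Delta Lake snapshot id
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")         # the artifact itself

# On Databricks, the logged snapshot id can be replayed with Delta time travel:
#   spark.read.format("delta").option("versionAsOf", 42).load("/mnt/training_data")
```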
Using ML Works, business teams can monitor and track their relevant metrics on the Persona Dashboard and use the Drift Analyzer to understand the impact of model degradation on those metrics. This helps them treat the underlying ML models as white-box solutions. Lineage Tracking gives data engineers and data scientists end-to-end visibility into ML models and their relevant data pipelines, which streamlines development cycles by taking care of dependencies.
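For intuition, here is a minimal sketch of one way a data drift check can work: a two-sample Kolmogorov–Smirnov test comparing a feature’s training distribution with fresh production data. The test choice and the 0.05 significance threshold are our assumptions for illustration; ML Works’ internal method may differ.

```python
# Illustrative data-drift check: compare a feature's training distribution
# against fresh production data with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # baseline data
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted data

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.05:  # distributions differ significantly -> raise a drift alert
    print(f"Data drift detected (KS statistic={statistic:.3f}); alerting on-call.")
```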
Support teams can use Workflow Graph and relevant metrics to troubleshoot production issues faster, significantly reducing Support SLAs. And finally, customers can now get full value from their analytics investments using ML Works, while also ensuring that ML deployments in production really work.
Machine learning – a tech buzz phrase that has been at the forefront of the tech industry for years. It is almost everywhere, from weather forecasts to the news feed on your social media platform of choice. It focuses on developing computer programs that can acquire data and “learn” by recognizing patterns and making decisions with them.
Although data scientists build these models to simplify business processes and make them more efficient, their time is, unfortunately, split and rarely dedicated to modeling. In fact, on average, data scientists spend only 20% of their time on modeling; the other 80% goes to the rest of the machine learning lifecycle.
Building
This exciting step is unquestionably the highlight of the job for most data scientists. It is where they can stretch their creative muscles and design models that best suit the application’s needs. This is where Anteelo believes data scientists ought to spend most of their time to maximize their value to the firm.
Data Preparation
Though information is easily accessible in this day and age, there is no universally accepted format. Data can come from various sources, from hospitals to IoT devices; to feed the data into models, sometimes, transformations are required. For example, machine learning algorithms generally need data to be numbers, so textual data may need to be adjusted. Statistical noise or errors in data may also need to be corrected.
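As a hedged illustration of that “data to numbers” step, the snippet below one-hot encodes a textual column and clips an obviously noisy numeric reading; the column names and the clipping bound are made up for the example.

```python
# Small illustration of data preparation: one-hot encode a textual column
# and clip an outlier-prone numeric one. Column names are made up.
import pandas as pd

df = pd.DataFrame({
    "department": ["cardiology", "oncology", "cardiology"],  # textual source data
    "heart_rate": [72, 310, 65],                             # 310 is likely sensor noise
})

df["heart_rate"] = df["heart_rate"].clip(upper=220)          # correct statistical noise
encoded = pd.get_dummies(df, columns=["department"])         # text -> numeric columns
print(encoded)
```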
Model Training
Training a model means determining good values for all the weights and biases in it. Essentially, data scientists are trying to find an optimal model that minimizes loss – an indication of how bad the model’s prediction is on a single example.
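A bare-bones sketch of that idea, assuming a toy dataset generated from y = 2x + 1: gradient descent nudges the weight and bias to minimize the mean squared loss.

```python
# Bare-bones training: fit a weight and bias by gradient descent on a toy
# dataset (y = 2x + 1 plus noise), minimizing the mean squared loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.05, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2_000):
    pred = w * x + b
    error = pred - y                      # per-example loss signal
    w -= lr * (2 * error * x).mean()      # gradient of mean squared loss w.r.t. w
    b -= lr * (2 * error).mean()          # ...and w.r.t. b

print(f"learned w={w:.2f}, b={b:.2f}")    # should approach 2 and 1
```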
Parameter Selection
During training, it is necessary to select some parameters that will impact the model’s predictions. Although most parameters are learned automatically, some cannot be learned and require expert configuration. These are known as hyperparameters, and experts have to apply various optimization strategies to tune them.
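As one illustration of such an optimization strategy, the sketch below runs an exhaustive grid search with cross-validation; the grid values are arbitrary examples.

```python
# One common hyperparameter optimization strategy: exhaustive grid search
# with cross-validation. The grid values here are arbitrary examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # hyperparameters, not learned

search = GridSearchCV(SVC(), param_grid=grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"cv accuracy={search.best_score_:.3f}")
```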
Transfer Learning
It is quite common to reuse machine learning models across various domains. Although models may not be directly transferrable, some can serve as excellent foundations or building blocks for developing other models.
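A hedged sketch of the building-block idea, assuming TensorFlow/Keras and a hypothetical 10-class target: a pretrained network is frozen and reused as the foundation for a new model.

```python
# Sketch of transfer learning: reuse a pretrained network as a frozen
# building block for a new task (the 10-class head is an assumption).
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the transferred features fixed

model = tf.keras.Sequential([
    base,                                             # reused foundation
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # new task-specific head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```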
Model Verification
At this stage, the trained model is tested to see whether it provides sufficient information to achieve its intended purpose. For example, when presented with new data, can it still maintain its accuracy?
Deployment
At this point, the model has been thoroughly trained and tested and has passed all requirements. This step aims to put the model to work for the firm and ensure that it can continue to perform with a live stream of data.
Monitoring
Now that the model is deployed and live, many businesses consider the process final. Unfortunately, this is far from reality. Like any tool, the model wears out with use; if not tested regularly, it will provide irrelevant information. To make matters worse, since most machine learning models work as a “black box,” they lack the clarity to explain their predictions, making those predictions challenging to defend.
Without this entire process, models would never see the light of day. That said, the process often weighs heavily on data scientists, simply because many steps require direct actions on their end. Enter Machine Learning Operations (MLOps).
MLOps (Machine Learning Operations) is a set of practices, frameworks, and tools that combines machine learning, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently. MLOps solutions provide data engineers, data scientists, and ML engineers with the tools they need to make the entire process a breeze. Next time, find out how Anteelo engineers have developed a tool that targets one of these steps to make data scientists’ lives easier.
‘Efficiency’ stems from processes, solutions, and people. It was one of the main driving forces behind significant changes in the way companies worked in the first decade of the 21st century, and the following decade further accelerated this dynamic. Now, post-COVID, it is vital for us to become efficient, productive, and environmentally friendly.
One of our clients manufactures and sells precast concrete solutions that improve their customers’ building efficiency, reduce costs, increase productivity on construction sites, and reduce carbon footprints. They provide higher quality, consistency, and reliability while maintaining excellent mechanical properties to meet customers’ most stringent requirements. Customers rely on their quality service and punctual delivery, which is possible because their supply chain model is simple: they prepare the order by date, call the driver the day before, and load the concrete the next morning. The driver then delivers the exact product to the specified address.
However, a large percentage of customers cancel orders. One of the main reasons for the cancellation is the weather.
The client turned to Anteelo to provide an analytical solution for flagging such orders so that their employees do not have to prepare for such deliveries.
Let me abridge the journey that led to the creation of a promising solution.
How it all started
One of the client’s business units suffered huge operational losses due to order cancellations. Although the causes were beyond their control, they always had to compensate truck drivers and concrete workers. To improve the efficiency of the demand and supply planning process, they had to confront order cancellation risks. They might have increased resource capacity by adding more people or working in shifts, but that option may not have panned out in the long run; the risks may not have been mitigated as anticipated, further reducing the RoI.
Although they put forward various innovative ideas, the results did not meet expectations, and thousands of driver hours were lost. Before deciding on an analytical solution, they discovered that their existing system had two main shortcomings:
Extensive reliance on conventional methods for dispatch
Absence of a data-driven approach
Thus, they wanted to leverage a powerful ML-enabled solution to empower ‘order dispatching,’ effectively getting ahead of order cancellations and minimizing high labor costs.
Roadmap that led to the solution’s development
The analytics team from Anteelo pitched the idea of developing a pilot solution, executing it in a chosen test market, and then creating a full-blown working solution.
We used retrospective data in a sterile, offline setting (the idea was to solve as many challenges as possible for the proof of concept (POC)). Later, once the field team gave positive feedback, we planned to deploy a cloud-based working model with a real-time front end, and then measure its benefits in hours saved over the next 12 to 24 months.
Proof of Concept (POC)
To reap the maximum benefits and minimize the risks of the analytical initiative, we opted to start with a proof of concept (POC) and execute a lightweight version of the ML tool. We developed a predictive model to flag orders at risk of cancellation (a minimal sketch of the idea appears after the list below) and simulated operational savings based on weather and previous years’ data. We found that:
50% of orders were canceled each year
A staggering percentage of orders were canceled after a specific time the day before the scheduled delivery – ‘Last-minute cancellations.’
Because of these last-minute cancellations, hundreds of thousands of driving hours were lost
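As promised, here is a minimal sketch of the flagging idea: a classifier scores each order’s cancellation risk from weather and order attributes. Every column name, the toy data, and the 0.7 cutoff are hypothetical; the production model’s features and scale were far richer.

```python
# Hedged sketch of flagging at-risk orders: a classifier scores cancellation
# risk from weather and order attributes. All columns and values are toy data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.DataFrame({
    "forecast_temp_c": [2, 18, -1, 25, 4, 20],
    "rain_expected":   [1, 0, 1, 0, 1, 0],
    "order_volume_m3": [12, 40, 8, 55, 10, 35],
    "cancelled":       [1, 0, 1, 0, 1, 0],   # past outcomes
})

model = RandomForestClassifier(random_state=0)
model.fit(history.drop(columns=["cancelled"]), history["cancelled"])

tomorrow = pd.DataFrame(
    {"forecast_temp_c": [3], "rain_expected": [1], "order_volume_m3": [9]}
)
risk = model.predict_proba(tomorrow)[0, 1]
if risk > 0.7:                                # flag so crews are not dispatched
    print(f"Order flagged: cancellation risk {risk:.0%}")
```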
Creating the Minimum Viable Product (MVP)
Before we could go any further or zero in on solution deployment, we had to understand the levers behind cancellations. Once the POC was ready, we evaluated the results against our baselines and expectations and compared them with the original goals. We then proceeded with a pilot test, modifying the solution based on its results: we selected a location, deployed field representatives to provide real-time feedback, and relied on our research. The findings (savings potential) were as follows:
Fewer large orders canceled
More orders canceled on Monday
When the temperature dropped below a certain level, the number of cancellations increased
More than half of the last-minute cancellations were from the same customers
If a certain proportion of the orders were canceled at least one day in advance, the remaining orders were canceled at the last minute
On days with rain, the number of cancellations increased
Overall, order quantity, project, and customer behavior were the essential variables.
The MVP stage surfaced a staggering number: the monetary loss (in millions) associated with last-minute cancellations. Behind such a grim figure were the lack of a data-oriented approach and of a prioritization method.
The deployed MVP helped reduce idle hours, flagging cancellations that our heuristic model would usually have missed. It also quantified the market-wise potential of the rollout we ultimately decided on.
Significant findings (and refinements) in the ML model based on pilot test
Labor planning is a holistic process
An effective labor plan must consider factors other than order quantity, such as the distribution of orders throughout the day, the value of the customer relationship, and so on.
Therefore, the model output was modified to predict the quantity based on the hourly forecast.
Order quantity may vary with resource plan
‘Order quantity’ shows considerable variation between the forward order book and the tickets, making it impossible to use as a predictor variable.
Resources are reasonably fixed during the day
This contradicts one of the POC’s assumptions that resources will be concentrated in the market on a given day. This has led to corresponding changes in forecast reports, accuracy calculations, etc.
Building and Deploying a Full-blown ML-model at Scale
At this stage, we had the cancellation metrics, the levers that worked, and the exact variables to use in the solution. The team now had enough data to build an end-to-end solution comprising intuitive UI screens and functions, automated data flows, and model runs – and, finally, to measure the impact in monetary terms.
Measuring the Benefits (Impact)
To keep the wheel turning and on track, we have to extract the model’s maximum value and evaluate it over time. We settled on two evaluation time metrics for measuring the impact:
Year-on-Year
Month-on-Month
The following table summarizes improvements to key operational KPIs. Based on the TPD change, the estimated savings are calculated from the annual business volume.
| | TPD | Location-specific | US |
| --- | --- | --- | --- |
| Metric value (YoY) | 30% (up) | >$350k | >$3M |
| Metric value (MoM) | 12% (up) | >$150k | >$3M |
*data is speculative and based on the pilot run.
Predictive Model’s Key Features
Visual Insights
Weekly Model Refresh
Modular Architecture for seamless maintenance
Results
Reduced Deadheading
Streamlined dispatch planning
Higher Labor Utilization
Greater Revenue Capture
Why should you consider Anteelo’s ML/AI solutions?
We have successfully tested the pilot solution, and the model has shown annual savings of more than $3 million. Now, we will build and deploy the full version of the model.
Anteelo is one of the top analytics and data engineering companies in the US and APAC regions. If you need to make multi-faceted changes in your business operations, let us understand your top-of-mind concerns and help you with our unique analytics services. Reach out to us at https://anteelo.com/contact/.
Managing ML production requires a combination of data scientists (algorithm procrastinators) and operations (data architects, product owners? Yes, why not?).
Operationalizing ML solutions in on-prem or cloud environments is a challenge for the entire industry. Enterprise customers usually have long and irregular software update cycles, typically once or twice a year, so it is impractical to couple ML model deployment to those cycles. Besides, data scientists have to deal with:
Data governance
Model serving & deployment
System performance drifts
Picking model features
ML model training pipeline
Setting the performance threshold
Explainability
And data architects have enough databases and systems to develop, install, configure, analyze, test, maintain… the list of verbs keeps growing, depending on the ratio of the company’s size to the number of data architects.
This is where MLOps comes in to rescue the team, the solution, and the enterprise!
What is MLOps?
MLOps is a new coinage, and the ML community keeps adding to and refining its definition (as the ML life cycle evolves, so does our understanding of it). In layman’s terms, it is a set of practices and disciplines for standardizing and streamlining ML models in production.
It all started when a data scientist shared his plight with a DevOps engineer. The engineer was equally unhappy with how data and models were included in the development life cycle. In cahoots, they decided to amalgamate the practices and philosophies of DevOps and ML – and lo and behold, MLOps came into existence. This may not be entirely true, but you have to give credit to the growing community of ML and DevOps personnel.
Five years ago, in 2015, a research paper highlighted the shortcomings of traditional ML systems (the third reference on this Wikipedia page). Even then, ML implementations grew exponentially. Three years after the paper’s publication, MLOps became mainstream – 11 years after DevOps! Yes, it took that long to combine the two. The reason is simple: AI itself became mainstream only a few years back, in 2016, 2018, or 2019 (the year is debatable).
MLOps Lifecycle
MLOps brings DevOps principles to your ML workflow. It allows continuous integration into data science workflows, automates code creation and testing, helps create repeatable training pipelines, and provides a continuous deployment workflow to automate packaging, model validation, and deployment to the target server. It then monitors the pipeline, infrastructure, model performance, and new data, and creates a data feedback flow to restart the pipeline.
These practices, involving data engineers, data scientists, and ML engineers, enable the retraining of models.
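A schematic of that feedback loop follows, with every stage stubbed down to a few lines; the function names, toy “model,” and drift trigger are illustrative, not any specific MLOps product’s API.

```python
# Schematic of the MLOps feedback loop described above. Stage contents are
# stubbed; names and the drift trigger are illustrative assumptions.
def ingest(new_data, store):              # continuous integration of data
    store.extend(new_data)

def train(store):                          # repeatable training pipeline
    return {"weights": sum(store) / len(store)}  # toy "model"

def validate(model):                       # gate before continuous deployment
    return model["weights"] is not None

def detect_drift(model, live_batch):       # monitoring step that restarts the loop
    return abs(sum(live_batch) / len(live_batch) - model["weights"]) > 0.5

store, model = [1.0, 1.1, 0.9], None
for live_batch in ([1.0, 1.2], [2.1, 2.3]):  # simulated incoming production data
    if model is None or detect_drift(model, live_batch):
        ingest(live_batch, store)
        candidate = train(store)
        if validate(candidate):
            model = candidate               # deploy the retrained model
```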
All seems hunky-dory at this stage. However, in my numerous encounters with enterprise customers, and after going through several use cases, I have seen MLOps – evolutionary and state-of-the-art as it is – fail several times to deliver the expected result or RoI. The foremost reasons, often discovered, are:
A singular, unmotivated performance monitoring approach
The unavailability of KPIs to set and measure performance
A lack of thresholds for raising model degradation alerts
These technical shortcomings usually trace back to the lack of MLOps standardization. However, business factors – such as a lack of discipline, understanding, or resources – can also slow or disrupt your entire ML operation.
How do you retrace the steps that led to a model’s creation if, say, your data scientists are away for some reason?
How will you reproduce predictions to validate their outcomes if, say, someone shoots the question?
It is not just about resourcing data scientists, software developers, or data engineers to work in isolation on operationalizing and automating the ML lifecycle; it is about how the three can work in tandem as a unit. For this, the data’s quality and availability must remain identical across the process and environment to ensure the model performs on par with the set metrics. Again, the core problem boils down to operation and automation, which MLOps diligently tries to address.
To solve the problem’s crux, you first need to answer a few questions:
How do the three personas, i.e., data engineers, data scientists, and ML engineers, use different tools and techniques?
How do you collaborate on the ML workflow within and between teams?
As you cannot share models like other software packages, you need to share the ML pipeline, which can reproduce and tune the model based on new data specific to the new environment or scenario. A ubiquitous norm in large enterprises is to have independent data science teams, most of whom work, day in and day out, on similar workflows.
Now, how do you collaborate and share the results?
When it comes to enterprise readiness, how do you plan data/ ML model governance while dealing with data & ML?
When you deal with specialized hardware, cost management comes into play: you are computing with large amounts of GPU and memory, on jobs that take a long time to run. Some of these jobs can take days or even weeks to produce a good model. So how do you establish trust?
Having insights, even dismal ones, will help you identify real-time use cases and factor them into an enterprise AI plan.
What Next? Martian Version for Earthling Solution?
ML Works Will Just Do!
Most MLOps toolkits focus on the technical aspects of MLOps while ignoring its real-life impact. Other factors that weigh into a toolkit’s contribution are having a 360-degree view of, and control over, the micro and macro aspects of the data science process.
At Anteelo, we have tried reordering the ML alphabet with our proprietary suite of toolkits, in which we take immense pride. We call it ML Works. The solution, which is cloud-agnostic and scalable, automates the model’s build, deploy, and monitoring processes, thereby reducing the need for larger teams.