Machine learning (ML) has become one of the key elements of digital transformation in organizations around the world. According to Gartner's 2023 report, 70% of organizations worldwide say they are using ML or plan to implement it in the next two years. The global ML market was valued at $21 billion in 2022, and is projected to reach more than $209 billion by 2030, corresponding to an average annual growth rate of 38.8%.
Why do organizations invest in ML?
Companies are investing in ML to gain a competitive advantage. These technologies are used in a variety of scenarios, such as:
- Predicting customer behavior: Churn prediction helps companies understand why customers leave and identify those at risk, giving them a chance to take preventive action.
- Optimizing logistics operations: Dynamic route planning or inventory management can reduce operating costs and increase efficiency.
- Dynamic pricing: Matching prices to changing market conditions in real time helps increase revenue.
- Personalizing the user experience: Real-time recommendations build customer loyalty and increase customer engagement.
However, in order to achieve a successful ML implementation, organizations must overcome many challenges, such as integration with existing processes and ensuring data quality. This is why it is crucial to follow best practices in building and implementing ML models.
Understanding the business problem
Defining the purpose of the ML project
Every ML project should begin with a precise definition of the business objective. The fundamental questions to ask are:
- What decisions do we want to make based on the model results?
- What processes do we want to optimize?
- How is the ML model expected to contribute to business goals?
- How will the specific result delivered by the ML model be used, and what action will it translate into?
For example, if the goal is to increase customer retention, it is worth considering how we should identify a customer at risk of leaving (e.g., how high the probability must be that they will unsubscribe in the near future), what actions we should try to take to retain them, and at what point we want to execute them.
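As a loose illustration, the sketch below shows how such a decision rule could turn churn probabilities into a concrete retention list. It assumes a trained scikit-learn-style classifier, a hypothetical probability threshold of 0.7, and illustrative column names; none of these come from a specific project.

```python
# Minimal sketch: turning churn probabilities into an actionable retention list.
# The model, the 0.7 threshold, and the column names are illustrative assumptions.
import pandas as pd

def select_customers_for_retention(customers: pd.DataFrame, model, threshold: float = 0.7) -> pd.DataFrame:
    """Return customers whose predicted churn probability exceeds the agreed threshold."""
    features = customers.drop(columns=["customer_id"])
    churn_probability = model.predict_proba(features)[:, 1]  # probability of the "churn" class
    customers = customers.assign(churn_probability=churn_probability)
    # Only these customers are passed on to the retention campaign.
    return customers[customers["churn_probability"] >= threshold].sort_values(
        "churn_probability", ascending=False
    )
```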
Understanding business needs
Stakeholders should clearly define the problems they want to solve, while ML teams need to deeply understand the specifics of business and technical processes, taking into account their potential limitations. The model's usage scenario should support timely and accurate business decisions and be focused on effective optimization activities.
A key prerequisite is a thorough understanding of the problem and the acquisition of domain knowledge. To achieve this, a data scientist should actively collaborate with stakeholders, acquiring from them the expertise and know-how necessary to develop an effective ML model.
Working together on an application scenario enables the creation of a solution that not only meets technological requirements, but more importantly addresses real business needs and goals. The harmonious combination of business and technical perspectives is the foundation for achieving tangible benefits.
KPIs and measures of success
A key element of any ML project is defining indicators of success:
- Technical indicators: Accuracy, precision, recall, and F1-score measure the effectiveness of the model from a technical perspective (see the sketch after this list).
- Business indicators: For example, reducing the churn rate by 10% or increasing revenue through better product recommendations.
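To make the distinction concrete, here is a minimal sketch that puts scikit-learn's technical metrics next to a simple business-style estimate of campaign value. The labels, customer value, and contact cost are made up purely for illustration.

```python
# Illustrative only: technical metrics vs. a simple business estimate.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual churners (1) vs. stayers (0)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

precision = precision_score(y_true, y_pred)  # how many flagged customers really churn
recall = recall_score(y_true, y_pred)        # how many churners the model catches
f1 = f1_score(y_true, y_pred)

# A business-level view (assumed values): value of retained customers driven by
# true positives vs. the cost of contacting everyone the model flagged.
value_per_retained_customer = 200.0   # assumed average customer value
cost_per_contact = 5.0                # assumed cost of one retention offer
true_positives = sum(t == p == 1 for t, p in zip(y_true, y_pred))
contacted = sum(p == 1 for p in y_pred)
expected_net_value = true_positives * value_per_retained_customer - contacted * cost_per_contact

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} net_value={expected_net_value:.0f}")
```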
The most important element in evaluating machine learning-based activities is the use of business metrics that reflect the actual impact of the model on the organization. Classical technical metrics are undoubtedly crucial at the stage of training, tuning and selecting a model, as they report on its performance under certain conditions and on selected data sets. However, their role ends with the technical aspect - they do not tell how the model actually affects business processes and whether it supports the achievement of key objectives.
The true success of an ML project should be judged by its integration with business processes. The key questions are: does the model enable better decision-making, does it support the optimization of specific areas, and does its use realistically improve organizational performance? Measures of success should be defined on a case-by-case basis for each business issue, such as by comparing the performance of processes with and without the model. This ensures that the performance evaluation focuses on the real impact on the business, not just the technical performance of the model.
In summary, technical metrics are a valuable tool for data scientists in the model building process, but it is the business metrics that ultimately determine the value of the implementation in practice. Defining these metrics at the beginning of a project helps to clearly define expectations and direction.
Data preparation
Understanding the available data
One of the first steps in any data science project is a thorough understanding of the available data and its collection processes. It is important to identify potential inconsistencies, resulting from errors in ETL processes, differences between the presentation of data in source systems, or simply from imperfections in data sources. Such issues can limit the ability to make full use of the information in the models. Typical challenges include missing data, duplicates, inconsistent identifiers, unexpected or simply wrong values in data columns.
To deal with these challenges, various techniques are used, such as creating dictionaries to align values in text columns, identifying and removing outliers, or filling in missing data using statistical methods, such as medians observed in the segments. It is also crucial to fix data collection processes if problems have been identified in this area.
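A minimal pandas sketch of these techniques might look like the following. The column names, the value dictionary, and the interquartile-range outlier rule are hypothetical and would need to be adapted to the actual data.

```python
# Sketch of common cleaning steps: deduplication, value alignment via a dictionary,
# segment-median imputation, and simple outlier removal. All names are illustrative.
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="customer_id")

    # Dictionary to align inconsistent values in a text column.
    country_map = {"PL": "Poland", "pol": "Poland", "Polska": "Poland"}
    df["country"] = df["country"].replace(country_map)

    # Fill missing values with the median observed within each customer segment.
    df["monthly_spend"] = df.groupby("segment")["monthly_spend"].transform(
        lambda s: s.fillna(s.median())
    )

    # Remove obvious outliers with a simple interquartile-range rule.
    q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["monthly_spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```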
Data mining
Data mining is an important step in which a data scientist analyzes what the data says about the business problem. The goal is to identify key indicators for solving the problem, uncover relevant patterns and relationships, and find segments with similar characteristics. During this phase, expert hypotheses obtained from the business are also verified to check whether the experts' assumptions are reflected in the data and support the solution to the problem.
Such analysis helps understand which data elements are relevant to the problem and which should be included in the model. Data mining combines raw information with business context, ultimately allowing it to be transformed into practical conclusions.
Feature engineering
Feature engineering is an important step that directly determines the model's effectiveness. It uses insights from data mining, translating business insights and relevant patterns into a mathematical language that is easily processed by the model. This process involves creating new features, transforming existing ones and selecting the most important variables that best represent the data in the context of the problem. Well-designed features allow the model to detect relationships more effectively, improving its performance and ability to make accurate decisions.
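For illustration, a small sketch that turns raw transactional data and a business insight ("recently inactive customers churn more often") into model-ready features could look like this. The input columns and aggregations are assumptions for the example.

```python
# Aggregating raw transactions into per-customer features; names are illustrative.
import pandas as pd

def build_customer_features(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate raw transactions into model-ready features per customer."""
    recent = transactions[transactions["transaction_date"] <= as_of]
    features = recent.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        avg_order_value=("amount", "mean"),
        n_orders=("amount", "count"),
        last_purchase=("transaction_date", "max"),
    )
    # Translate the business insight into a numeric feature the model can use.
    features["days_since_last_purchase"] = (as_of - features["last_purchase"]).dt.days
    return features.drop(columns="last_purchase").reset_index()
```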
Data splitting and validation
Dividing the data into training, validation and test sets is the basis for avoiding overfitting and ensuring model reliability. The recommended split is 70%-15%-15%, although this can vary depending on the size of the dataset.
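One common way to obtain such a 70/15/15 split with scikit-learn is two consecutive calls to train_test_split; the synthetic dataset below is only a stand-in for real features and a target.

```python
# 70/15/15 split via two consecutive splits; synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in for real features/target

# First carve out 30%, then split that part half-and-half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)
# Result: 70% train, 15% validation, 15% test.
```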
Construction of the ML model
Selection of methodology
Before moving on to algorithm selection, a critical step in a data science project is choosing the right methodology. This includes decisions about when the model will be run, how to define the target variable, and where in the overall design process the ML model will be embedded. This stage determines the effectiveness of the project far more than the later selection of the algorithm itself.
Selecting a methodology allows you to design a comprehensive solution that addresses real business needs. It requires taking into account key factors such as the time of data availability, the moment when the business should make decisions, and the type of activities the model is to support. Often solutions include not only the ML model, but also additional analytical tools.
Selecting the right algorithm
The choice of ML algorithm should be thoughtful and tailored to the specifics of the project. It is crucial to consider the amount of available data - some algorithms, like neural networks, require large data sets, while others, like logistic regression or decision trees, can perform well even with smaller sets.
Equally important is a common-sense approach: choosing an overly complex model when a simpler solution can achieve sufficient performance leads to an unnecessary increase in the cost of preparing and maintaining the solution. Using a sledgehammer to crack a nut may look impressive, but it is rarely cost-effective.
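A lightweight way to keep this discipline is to benchmark a simple baseline against the more complex candidate before committing to it. The sketch below uses synthetic data and cross-validation purely for illustration; the models and scoring metric are assumptions.

```python
# Sketch: compare a simple baseline with a more complex model before committing to it.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

baseline = LogisticRegression(max_iter=1000)
complex_model = GradientBoostingClassifier()

baseline_f1 = cross_val_score(baseline, X, y, cv=5, scoring="f1").mean()
complex_f1 = cross_val_score(complex_model, X, y, cv=5, scoring="f1").mean()

# Only pay the extra training and maintenance cost if the gain is material.
print(f"baseline F1={baseline_f1:.3f}, complex F1={complex_f1:.3f}")
```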
Explainable AI (XAI)
Explainable AI plays a key role in connecting machine learning models to the business by making them interpretable. Variable importance alone helps to better understand the issue being analyzed, such as user needs and behavior, leading to more accurate decisions and more effective actions. Analysis of variable importance often reveals the key factors influencing customer decisions, supporting product and service optimization.
Linear models, due to their simplicity, offer additional value - their weights can be directly used in a production environment. An example is attribution models, where variable weights help determine which marketing channels have the greatest impact on achieving conversions.
Explainable AI not only increases transparency and trust in the models, but also enables knowledge that is valuable in its own right, regardless of the algorithm's performance. This makes XAI an indispensable tool in achieving business goals.
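As a rough illustration, the sketch below shows two common ways of reading variable importance: directly from the weights of a linear model, and model-agnostically via permutation importance. The data is synthetic and the feature names are assumed.

```python
# Two views of variable importance: linear-model weights and permutation importance.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

feature_names = ["days_since_last_purchase", "n_support_tickets", "avg_order_value", "tenure_months"]
X, y = make_classification(n_samples=1000, n_features=4, random_state=1)  # synthetic stand-in

model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear model weights can be inspected (and reused, e.g. in attribution) directly.
weights = pd.Series(model.coef_[0], index=feature_names).sort_values(key=abs, ascending=False)
print(weights)

# Model-agnostic alternative: how much does shuffling each feature hurt performance?
importance = permutation_importance(model, X, y, n_repeats=10, random_state=1)
print(pd.Series(importance.importances_mean, index=feature_names).sort_values(ascending=False))
```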
Implementation of the model into production
ML pipelines and MLOps
Effective integration of models into existing systems requires automated ML pipelines and implementation of MLOps practices such as data and code versioning. Key elements include:
- Automating data collection and processing.
- Scaling models in response to increased load.
- Versioning models to easily revert to earlier versions in case of problems (see the sketch below).
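As one possible illustration of experiment tracking and model versioning, the MLflow sketch below assumes a tracking backend with a model registry; the experiment name, logged metric, and model object are placeholders, not a prescription.

```python
# A minimal MLflow sketch of experiment tracking and model versioning.
# Assumes an MLflow tracking backend that supports the model registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("churn-model")
with mlflow.start_run():
    mlflow.log_param("algorithm", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering the model creates a new version that can be rolled back later.
    mlflow.sklearn.log_model(model, artifact_path="model", registered_model_name="churn-model")
```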
Monitoring and retraining of models
Monitoring and retraining models in production are key activities to ensure their effectiveness under changing conditions. Models running in a production environment must be regularly reviewed and updated to maintain high quality predictions and meet business objectives.
Monitoring models in production
- Technical indicators: Tracking indicators such as accuracy, precision, or F1-score makes it possible to detect performance degradation (model drift). This can be caused by changes in the input data (data drift) or in the relationships between features and the target (concept drift). Automated monitoring systems make it possible to react quickly to such problems (see the drift-check sketch after this list).
- Business indicators: Models must also be evaluated for their impact on:
  - Reduction of customer churn - the correctness of identifying customers at risk of churn.
  - Conversion growth - effectiveness in increasing sales or clicks.
  - Optimization of operating costs - for example in logistics or inventory management.
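A simple data-drift check can be as basic as comparing feature distributions between the training sample and recent production data. The sketch below uses a Kolmogorov-Smirnov test with an illustrative alert threshold; column names and the threshold are assumptions.

```python
# Compare numeric feature distributions between training and production data.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train: pd.DataFrame, production: pd.DataFrame, threshold: float = 0.05) -> dict:
    """Return the features whose distribution has shifted significantly."""
    drifted = {}
    for column in train.select_dtypes("number").columns:
        result = ks_2samp(train[column].dropna(), production[column].dropna())
        if result.pvalue < threshold:  # small p-value -> distributions likely differ
            drifted[column] = round(result.statistic, 3)
    return drifted
```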
Retraining models
Regular retraining ensures that models are adapted to the changing environment and take full advantage of the latest data.
- Scheduled retraining: Retraining the model at regular intervals so that it always learns from the most recent data.
- Triggered retraining: Retraining initiated when monitoring detects performance degradation or data drift.
- Automation pipelines: Automated pipelines that support the whole process - from data collection, through training and validation, to deployment of the new model - ensure retraining consistency and efficiency.
Data monitoring in production
Once the ML model is implemented, it is crucial to monitor production data, as data sources and ETL processes can change. Data may be delivered in a different structure, change in format, contain new values, and its time of availability may deviate from training assumptions. Such changes can cause discrepancies between training and production data, which significantly reduces the effectiveness of the model.
Monitoring can detect changes in data structure and quality, such as missing values, new categories, or changes in feature distribution. It is also crucial to keep track of ETL processes that may be the source of unexpected problems. Early detection of such anomalies allows quick action to be taken, from adjusting processing processes to re-training the model.
Without proper monitoring, data may no longer match the model's assumptions, leading to erroneous predictions and a decrease in the value of the implementation. Therefore, continuous data monitoring is essential to maintain the quality and effectiveness of the ML solution.
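In practice, even lightweight checks catch many of these issues. The sketch below illustrates the idea with an assumed schema, an assumed list of known categories, and an arbitrary missing-value threshold.

```python
# Lightweight production-data checks against training-time expectations (all assumed).
import pandas as pd

EXPECTED_COLUMNS = {"customer_id": "int64", "country": "object", "monthly_spend": "float64"}
KNOWN_COUNTRIES = {"Poland", "Germany", "France"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues found in a production batch."""
    issues = []
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    for column, dtype in EXPECTED_COLUMNS.items():
        if column in df.columns and str(df[column].dtype) != dtype:
            issues.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    if "country" in df.columns:
        new_values = set(df["country"].dropna().unique()) - KNOWN_COUNTRIES
        if new_values:
            issues.append(f"country: unseen categories {sorted(new_values)}")
    null_share = df.isna().mean()
    issues += [f"{c}: {share:.0%} missing values" for c, share in null_share.items() if share > 0.1]
    return issues
```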
Tools to support monitoring and retraining
Tools such as MLflow (experiment tracking), Evidently AI (data and model monitoring) and Kubeflow Pipelines (process automation) support the entire life cycle of models.
Monitoring and retraining are an investment in maintaining systems that provide accurate predictions and support business goals.
Scalability and performance
The use of cloud or serverless solutions allows models to scale as needed. A key aspect is optimizing the model's response time, which is especially important in real-time systems.
To reduce model response time, the following techniques can be used:
- Reducing the size of the model: Compression by pruning (removing unimportant parameters) or quantization (reducing the precision of the weights).
- Result caching: Storing results for the most frequent queries to avoid repeating the same calculations (see the sketch after this list).
- Infrastructure optimization: Using hardware optimized for matrix computation, such as GPUs or TPUs, and using serverless services with minimal startup latency (e.g., AWS Lambda with provisioned concurrency).
- Model distillation: Creating a smaller, faster model that mimics the performance of the original one.
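As referenced in the caching point above, a minimal in-process illustration can be built with functools.lru_cache. In production a shared cache such as Redis with an expiry policy is more typical, and the model call here is only a placeholder.

```python
# In-process result caching for repeated, identical scoring requests (sketch only).
from functools import lru_cache

def expensive_model_predict(features: tuple[float, ...]) -> float:
    # Stand-in for an actual, slow model inference call.
    return sum(features) / len(features)

@lru_cache(maxsize=10_000)
def cached_score(features: tuple[float, ...]) -> float:
    """Return a cached prediction for an exact, repeated feature vector."""
    return expensive_model_predict(features)

# Usage: repeated calls with the same feature tuple hit the cache.
print(cached_score((0.2, 1.5, 3.0)))
print(cached_score((0.2, 1.5, 3.0)))  # served from cache
```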
The most common pitfalls and challenges
Problems with data quality
Data quality is of fundamental importance. Both initial analysis of data quality and continuous monitoring of production data and its consistency with training data are important.
Overengineering
Avoiding overly complex models is important. Simple solutions are often sufficient and easier to implement. It is important to focus on MVP (Minimum Viable Product) to deliver business value quickly.
Communication of results
The results of the model must be presented in a way that the business can understand. Educating stakeholders on how to interpret the results is essential. Reports should be tailored to the audience and include practical recommendations.
Lack of business training and education
Insufficient business involvement in the process of implementing models results in a lack of understanding of how they work and how they perform. This can lead to poor realization of the models' potential and problems in communication between technical and business teams.
Abandoning the model instead of iterating
Models are sometimes discarded when their results become less relevant, instead of performing analysis and adjusting them to changing data. The lack of an iterative approach can result in a loss of business confidence in ML technology and underutilized investments.
Wrong usage scenario
Even if the model works technically correctly, its results may not be of value to the business if the results are not “actionable” - that is, they do not provide information on which decisions can be made. This is usually due to an inadequate definition of the model's purpose at the planning stage.
Focus on technical metrics instead of business value
Too much focus on metrics such as accuracy or F1-score can distract from the model's actual impact on key business metrics, such as conversion, revenue or cost reduction.
Data leakage
Skipping the analysis of data availability over time can lead to situations where data from the future, or data unavailable at the time of prediction, affects model training. This leads to artificially high performance during development and erroneous predictions in production.
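One simple safeguard is to split by time rather than randomly, so that validation rows are strictly later than training rows. The sketch below assumes a hypothetical event-date column; equally important is ensuring that every feature is computed only from data available before the prediction moment.

```python
# Time-based split: validation data is strictly later than training data.
# The time column name and cutoff are illustrative assumptions.
import pandas as pd

def time_based_split(df: pd.DataFrame, cutoff: str, time_column: str = "event_date"):
    """Train on everything before the cutoff, validate on everything from the cutoff on."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_column] < cutoff_ts]
    validation = df[df[time_column] >= cutoff_ts]
    return train, validation
```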
Non-representative validation
Improper sampling of data validating the model can lead to inadequate results. A sample that does not reflect the actual long-term distribution of production data overestimates the model's effectiveness. As a result, a model that performs well during testing, in a selected narrow range of data, may not perform correctly in real-world conditions, resulting in erroneous predictions and business decisions.
Case studies
ML implementation for an e-learning client
For our client, the Alterdata team implemented an ML model to support user engagement and motivation. Key activities included: analyzing the drivers of student activity, segmenting users, and implementing the XGBoost model in BigQuery ML. Integration of the model with data allowed for fine-tuning of educational recommendations and automation of analytics.
Results: 80% effectiveness in predicting engagement, increased user retention, and simplified data management. The example demonstrates how ML can support experience personalization and learning platform development.
Summary of key steps
Successful implementation of ML models requires a complete understanding of the business problem, attention to data quality, selection of the right methodology and integration with organizational processes. Each step, from data mining to monitoring the model in production, is fundamental to the final outcome.
Need support in implementing a new model or optimizing an existing one? Our experts will share their experience and best practices. Get in touch with us!