Machine learning (ML) has become one of the key elements of digital transformation in organizations around the world. According to Gartner's 2023 report, 70% of organizations worldwide say they are using ML or plan to implement it in the next two years. The global ML market was valued at $21 billion in 2022, and is projected to reach more than $209 billion by 2030, corresponding to an average annual growth rate of 38.8%.
Why do organizations invest in ML?
Companies are investing in ML to gain a competitive advantage. These technologies are used in a variety of scenarios, such as:
- Predicting customer behavior: Churn prediction helps companies understand why customers leave and identify those at risk, giving them a chance to take preventive action.
- Optimizing logistics operations: Dynamic route planning or inventory management can reduce operating costs and increase efficiency.
- Dynamic pricing: Matching prices to changing market conditions in real time helps increase revenue.
- Personalizing the user experience: Real-time recommendations build customer loyalty and increase customer engagement.
However, in order to achieve a successful ML implementation, organizations must overcome many challenges, such as integration with existing processes and ensuring data quality. This is why it is crucial to follow best practices in building and implementing ML models.
Understanding the business problem
Defining the purpose of the ML project
Every ML project should begin with a precise definition of the business objective. The fundamental questions to ask are:
- What decisions do we want to make based on the model results?
- What processes do we want to optimize?
- How is the ML model expected to contribute to business goals?
- How will the specific result delivered by the ML model be used, and what action will it translate into?
For example, if the goal is to increase customer retention, it is worth considering how we should identify a customer at risk of leaving (e.g., how high the probability must be that they will unsubscribe in the near future), what actions we should try to take to retain them, and at what point we want to execute them.
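As a loose illustration, the sketch below shows how such a decision rule could turn churn probabilities into a concrete retention list. It assumes a trained scikit-learn-style classifier, a hypothetical probability threshold of 0.7, and illustrative column names; none of these come from a specific project.

```python
# Minimal sketch: turning churn probabilities into an actionable retention list.
# The model, the 0.7 threshold, and the column names are illustrative assumptions.
import pandas as pd

def select_customers_for_retention(customers: pd.DataFrame, model, threshold: float = 0.7) -> pd.DataFrame:
    """Return customers whose predicted churn probability exceeds the agreed threshold."""
    features = customers.drop(columns=["customer_id"])
    churn_probability = model.predict_proba(features)[:, 1]  # probability of the "churn" class
    customers = customers.assign(churn_probability=churn_probability)
    # Only these customers are passed on to the retention campaign.
    return customers[customers["churn_probability"] >= threshold].sort_values(
        "churn_probability", ascending=False
    )
```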
Understanding business needs
Stakeholders should clearly define the problems they want to solve, while ML teams need to deeply understand the specifics of business and technical processes, taking into account their potential limitations. The model's usage scenario should support timely and accurate business decisions and be focused on effective optimization activities.
A key prerequisite is a thorough understanding of the problem and the acquisition of domain knowledge. To achieve this, a data scientist should actively collaborate with stakeholders, acquiring from them the expertise and know-how necessary to develop an effective ML model.
Working together on an application scenario enables the creation of a solution that not only meets technological requirements, but more importantly addresses real business needs and goals. The harmonious combination of business and technical perspectives is the foundation for achieving tangible benefits.
KPIs and measures of success
A key element of any ML project is defining indicators of success:
- Technical indicators: Accuracy, precision, recall, and F1-score measure the effectiveness of the model from a technical perspective (see the sketch after this list).
- Business indicators: For example, reducing the churn rate by 10% or increasing revenue through better product recommendations.
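To make the distinction concrete, here is a minimal sketch that puts scikit-learn's technical metrics next to a simple business-style estimate of campaign value. The labels, customer value, and contact cost are made up purely for illustration.

```python
# Illustrative only: technical metrics vs. a simple business estimate.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual churners (1) vs. stayers (0)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

precision = precision_score(y_true, y_pred)  # how many flagged customers really churn
recall = recall_score(y_true, y_pred)        # how many churners the model catches
f1 = f1_score(y_true, y_pred)

# A business-level view (assumed values): value of retained customers driven by
# true positives vs. the cost of contacting everyone the model flagged.
value_per_retained_customer = 200.0   # assumed average customer value
cost_per_contact = 5.0                # assumed cost of one retention offer
true_positives = sum(t == p == 1 for t, p in zip(y_true, y_pred))
contacted = sum(p == 1 for p in y_pred)
expected_net_value = true_positives * value_per_retained_customer - contacted * cost_per_contact

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} net_value={expected_net_value:.0f}")
```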
The most important element in evaluating machine learning-based activities is the use of business metrics that reflect the actual impact of the model on the organization. Classical technical metrics are undoubtedly crucial at the stage of training, tuning and selecting a model, as they report on its performance under certain conditions and on selected data sets. However, their role ends with the technical aspect - they do not tell how the model actually affects business processes and whether it supports the achievement of key objectives.
The true success of an ML project should be judged by its integration with business processes. The key questions are: does the model enable better decision-making, does it support the optimization of specific areas, and does its use realistically improve organizational performance? Measures of success should be defined on a case-by-case basis for each business issue, such as by comparing the performance of processes with and without the model. This ensures that the performance evaluation focuses on the real impact on the business, not just the technical performance of the model.
In summary, technical metrics are a valuable tool for data scientists in the model building process, but it is the business metrics that ultimately determine the value of the implementation in practice. Defining these metrics at the beginning of a project helps to clearly define expectations and direction.
Data preparation
Understanding the available data
One of the first steps in any data science project is a thorough understanding of the available data and its collection processes. It is important to identify potential inconsistencies, resulting from errors in ETL processes, differences between the presentation of data in source systems, or simply from imperfections in data sources. Such issues can limit the ability to make full use of the information in the models. Typical challenges include missing data, duplicates, inconsistent identifiers, unexpected or simply wrong values in data columns.
To deal with these challenges, various techniques are used, such as creating dictionaries to align values in text columns, identifying and removing outliers, or filling in missing data using statistical methods, such as medians observed in the segments. It is also crucial to fix data collection processes if problems have been identified in this area.
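A minimal pandas sketch of these techniques might look like the following. The column names, the value dictionary, and the interquartile-range outlier rule are hypothetical and would need to be adapted to the actual data.

```python
# Sketch of common cleaning steps: deduplication, value alignment via a dictionary,
# segment-median imputation, and simple outlier removal. All names are illustrative.
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="customer_id")

    # Dictionary to align inconsistent values in a text column.
    country_map = {"PL": "Poland", "pol": "Poland", "Polska": "Poland"}
    df["country"] = df["country"].replace(country_map)

    # Fill missing values with the median observed within each customer segment.
    df["monthly_spend"] = df.groupby("segment")["monthly_spend"].transform(
        lambda s: s.fillna(s.median())
    )

    # Remove obvious outliers with a simple interquartile-range rule.
    q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["monthly_spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```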
Data mining
Data mining is an important step in which a data scientist analyzes what the data says about the business problem. The goal is to identify key indicators for solving the problem, uncover relevant patterns and relationships, and find segments with similar characteristics. During this phase, expert hypotheses obtained from the business are also verified to check whether the experts' assumptions are reflected in the data and support the solution to the problem.
Such analysis helps understand which data elements are relevant to the problem and which should be included in the model. Data mining combines raw information with business context, ultimately allowing it to be transformed into practical conclusions.
Feature engineering
Feature engineering is an important step that directly determines the model's effectiveness. It uses insights from data mining, translating business insights and relevant patterns into a mathematical language that is easily processed by the model. This process involves creating new features, transforming existing ones and selecting the most important variables that best represent the data in the context of the problem. Well-designed features allow the model to detect relationships more effectively, improving its performance and ability to make accurate decisions.
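For illustration, a small sketch that turns raw transactional data and a business insight ("recently inactive customers churn more often") into model-ready features could look like this. The input columns and aggregations are assumptions for the example.

```python
# Aggregating raw transactions into per-customer features; names are illustrative.
import pandas as pd

def build_customer_features(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate raw transactions into model-ready features per customer."""
    recent = transactions[transactions["transaction_date"] <= as_of]
    features = recent.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        avg_order_value=("amount", "mean"),
        n_orders=("amount", "count"),
        last_purchase=("transaction_date", "max"),
    )
    # Translate the business insight into a numeric feature the model can use.
    features["days_since_last_purchase"] = (as_of - features["last_purchase"]).dt.days
    return features.drop(columns="last_purchase").reset_index()
```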
Data splitting and validation
Dividing the data into training, validation and test sets is the basis for avoiding overfitting and ensuring model reliability. The recommended split is 70%-15%-15%, although this can vary depending on the size of the dataset.
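One common way to obtain such a 70/15/15 split with scikit-learn is two consecutive calls to train_test_split; the synthetic dataset below is only a stand-in for real features and a target.

```python
# 70/15/15 split via two consecutive splits; synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in for real features/target

# First carve out 30%, then split that part half-and-half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)
# Result: 70% train, 15% validation, 15% test.
```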
Construction of the ML model
Selection of methodology
Before moving on to algorithm selection, a critical step in a data science project is choosing the right methodology. This includes decisions about when the model will be run, how to define the target variable, and where in the overall design process the ML model will be embedded. This stage determines the effectiveness of the project far more than the later selection of the algorithm itself.
Selecting a methodology allows you to design a comprehensive solution that addresses real business needs. It requires taking into account key factors such as the time of data availability, the moment when the business should make decisions, and the type of activities the model is to support. Often solutions include not only the ML model, but also additional analytical tools.
Selecting the right algorithm
The choice of ML algorithm should be thoughtful and tailored to the specifics of the project. It is crucial to consider the amount of available data - some algorithms, like neural networks, require large data sets, while others, like logistic regression or decision trees, can perform well even with smaller sets.
Equally important is a common-sense approach: choosing an overly complex model when a simpler solution can achieve sufficient performance leads to an unnecessary increase in the cost of preparing and maintaining the solution. Using a sledgehammer to crack a nut may look impressive, but it is rarely cost-effective.
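A lightweight way to keep this discipline is to benchmark a simple baseline against the more complex candidate before committing to it. The sketch below uses synthetic data and cross-validation purely for illustration; the models and scoring metric are assumptions.

```python
# Sketch: compare a simple baseline with a more complex model before committing to it.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

baseline = LogisticRegression(max_iter=1000)
complex_model = GradientBoostingClassifier()

baseline_f1 = cross_val_score(baseline, X, y, cv=5, scoring="f1").mean()
complex_f1 = cross_val_score(complex_model, X, y, cv=5, scoring="f1").mean()

# Only pay the extra training and maintenance cost if the gain is material.
print(f"baseline F1={baseline_f1:.3f}, complex F1={complex_f1:.3f}")
```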
Explainable AI (XAI)
Explainable AI plays a key role in connecting machine learning models to the business by making them interpretable. Variable importance alone helps to better understand the issue being analyzed, such as user needs and behavior, leading to more accurate decisions and more effective actions. Analysis of variable importance often reveals the key factors influencing customer decisions, supporting product and service optimization.
Linear models, due to their simplicity, offer additional value - their weights can be directly used in a production environment. An example is attribution models, where variable weights help determine which marketing channels have the greatest impact on achieving conversions.
Explainable AI not only increases transparency and trust in the models, but also enables knowledge that is valuable in its own right, regardless of the algorithm's performance. This makes XAI an indispensable tool in achieving business goals.
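As a rough illustration, the sketch below shows two common ways of reading variable importance: directly from the weights of a linear model, and model-agnostically via permutation importance. The data is synthetic and the feature names are assumed.

```python
# Two views of variable importance: linear-model weights and permutation importance.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

feature_names = ["days_since_last_purchase", "n_support_tickets", "avg_order_value", "tenure_months"]
X, y = make_classification(n_samples=1000, n_features=4, random_state=1)  # synthetic stand-in

model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear model weights can be inspected (and reused, e.g. in attribution) directly.
weights = pd.Series(model.coef_[0], index=feature_names).sort_values(key=abs, ascending=False)
print(weights)

# Model-agnostic alternative: how much does shuffling each feature hurt performance?
importance = permutation_importance(model, X, y, n_repeats=10, random_state=1)
print(pd.Series(importance.importances_mean, index=feature_names).sort_values(ascending=False))
```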
Implementation of the model into production
ML pipelines and MLOps
Effective integration of models into existing systems requires automated ML pipelines and implementation of MLOps practices such as data and code versioning. Key elements include:
- Automating data collection and processing.
- Scaling models in response to increased load.
- Versioning models to easily revert to earlier versions in case of problems (see the sketch below).
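As one possible illustration of experiment tracking and model versioning, the MLflow sketch below assumes a tracking backend with a model registry; the experiment name, logged metric, and model object are placeholders, not a prescription.

```python
# A minimal MLflow sketch of experiment tracking and model versioning.
# Assumes an MLflow tracking backend that supports the model registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("churn-model")
with mlflow.start_run():
    mlflow.log_param("algorithm", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering the model creates a new version that can be rolled back later.
    mlflow.sklearn.log_model(model, artifact_path="model", registered_model_name="churn-model")
```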
Monitoring and retraining of models
Monitoring and retraining models in production are key activities to ensure their effectiveness under changing conditions. Models running in a production environment must be regularly reviewed and updated to maintain high quality predictions and meet business objectives.
Monitoring models in production
- Technical indicators: Tracking indicators such as accuracy, precision, or F1-score makes it possible to detect performance degradation (model drift). This can be caused by changes in the input data (data drift) or in the relationships between features and the target (concept drift). Automated monitoring systems make it possible to react quickly to such problems (see the drift-check sketch after this list).
- Business indicators: Models must also be evaluated for their impact on:
  - Reduction of customer churn - the correctness of identifying customers at risk of churn.
  - Conversion growth - effectiveness in increasing sales or clicks.
  - Optimization of operating costs - for example in logistics or inventory management.
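A simple data-drift check can be as basic as comparing feature distributions between the training sample and recent production data. The sketch below uses a Kolmogorov-Smirnov test with an illustrative alert threshold; column names and the threshold are assumptions.

```python
# Compare numeric feature distributions between training and production data.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train: pd.DataFrame, production: pd.DataFrame, threshold: float = 0.05) -> dict:
    """Return the features whose distribution has shifted significantly."""
    drifted = {}
    for column in train.select_dtypes("number").columns:
        result = ks_2samp(train[column].dropna(), production[column].dropna())
        if result.pvalue < threshold:  # small p-value -> distributions likely differ
            drifted[column] = round(result.statistic, 3)
    return drifted
```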
Retraining models
Regular retraining ensures that models are adapted to the changing environment and take full advantage of the latest data.
- Scheduled retraining: Retraining the model at regular intervals so that it always learns from the most recent data.
- Triggered retraining: Retraining initiated when monitoring detects performance degradation or data drift.
- Automation pipelines: Automated pipelines that support the whole process - from data collection, through training and validation, to deployment of the new model - ensure retraining consistency and efficiency.
Data monitoring in production
Once the ML model is implemented, it is crucial to monitor production data, as data sources and ETL processes can change. Data may be delivered in a different structure, change in format, contain new values, and its time of availability may deviate from training assumptions. Such changes can cause discrepancies between training and production data, which significantly reduces the effectiveness of the model.
Monitoring can detect changes in data structure and quality, such as missing values, new categories, or changes in feature distribution. It is also crucial to keep track of ETL processes that may be the source of unexpected problems. Early detection of such anomalies allows quick action to be taken, from adjusting processing processes to re-training the model.
Without proper monitoring, data may no longer match the model's assumptions, leading to erroneous predictions and a decrease in the value of the implementation. Therefore, continuous data monitoring is essential to maintain the quality and effectiveness of the ML solution.
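In practice, even lightweight checks catch many of these issues. The sketch below illustrates the idea with an assumed schema, an assumed list of known categories, and an arbitrary missing-value threshold.

```python
# Lightweight production-data checks against training-time expectations (all assumed).
import pandas as pd

EXPECTED_COLUMNS = {"customer_id": "int64", "country": "object", "monthly_spend": "float64"}
KNOWN_COUNTRIES = {"Poland", "Germany", "France"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues found in a production batch."""
    issues = []
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    for column, dtype in EXPECTED_COLUMNS.items():
        if column in df.columns and str(df[column].dtype) != dtype:
            issues.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    if "country" in df.columns:
        new_values = set(df["country"].dropna().unique()) - KNOWN_COUNTRIES
        if new_values:
            issues.append(f"country: unseen categories {sorted(new_values)}")
    null_share = df.isna().mean()
    issues += [f"{c}: {share:.0%} missing values" for c, share in null_share.items() if share > 0.1]
    return issues
```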
Tools to support monitoring and retraining
Tools such as MLflow (experiment tracking), Evidently AI (data and model monitoring) and Kubeflow Pipelines (process automation) support the entire life cycle of models.
Monitoring and retraining are an investment in maintaining systems that provide accurate predictions and support business goals.
Scalability and performance
The use of cloud or serverless solutions allows models to scale as needed. A key aspect is optimizing the model's response time, which is especially important in real-time systems.
To reduce model response time, the following techniques can be used:
- Reducing the size of the model: Compression by pruning (removing unimportant parameters) or quantization (reducing the precision of the weights).
- Result caching: Storing results for the most frequent queries to avoid repeating the same calculations (see the sketch after this list).
- Infrastructure optimization: Using hardware optimized for matrix computation, such as GPUs or TPUs, and using serverless services with minimal startup latency (e.g., AWS Lambda with provisioned concurrency).
- Model distillation: Creating a smaller, faster model that mimics the performance of the original one.
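As referenced in the caching point above, a minimal in-process illustration can be built with functools.lru_cache. In production a shared cache such as Redis with an expiry policy is more typical, and the model call here is only a placeholder.

```python
# In-process result caching for repeated, identical scoring requests (sketch only).
from functools import lru_cache

def expensive_model_predict(features: tuple[float, ...]) -> float:
    # Stand-in for an actual, slow model inference call.
    return sum(features) / len(features)

@lru_cache(maxsize=10_000)
def cached_score(features: tuple[float, ...]) -> float:
    """Return a cached prediction for an exact, repeated feature vector."""
    return expensive_model_predict(features)

# Usage: repeated calls with the same feature tuple hit the cache.
print(cached_score((0.2, 1.5, 3.0)))
print(cached_score((0.2, 1.5, 3.0)))  # served from cache
```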
The most common pitfalls and challenges
Problems with data quality
Data quality is of fundamental importance. Both initial analysis of data quality and continuous monitoring of production data and its consistency with training data are important.
Overengineering
Avoiding overly complex models is important. Simple solutions are often sufficient and easier to implement. It is important to focus on MVP (Minimum Viable Product) to deliver business value quickly.
Communication of results
The results of the model must be presented in a way that the business can understand. Educating stakeholders on how to interpret the results is essential. Reports should be tailored to the audience and include practical recommendations.
Lack of business training and education
Insufficient business involvement in the process of implementing models results in a lack of understanding of how they work and how they perform. This can lead to poor realization of the models' potential and problems in communication between technical and business teams.
Abandoning the model instead of iterating
Models are sometimes discarded when their results become less relevant, instead of performing analysis and adjusting them to changing data. The lack of an iterative approach can result in a loss of business confidence in ML technology and underutilized investments.
Wrong usage scenario
Even if the model works technically correctly, its results may not be of value to the business if the results are not “actionable” - that is, they do not provide information on which decisions can be made. This is usually due to an inadequate definition of the model's purpose at the planning stage.
Focus on technical metrics instead of business value
Too much focus on metrics such as accuracy or F1-score can distract from the model's actual impact on key business metrics, such as conversion, revenue or cost reduction.
Data leakage
Skipping the analysis of data availability over time can lead to situations where data from the future, or data unavailable at the time of prediction, affects model training. This leads to artificially high performance during development and erroneous predictions in production.
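One simple safeguard is to split by time rather than randomly, so that validation rows are strictly later than training rows. The sketch below assumes a hypothetical event-date column; equally important is ensuring that every feature is computed only from data available before the prediction moment.

```python
# Time-based split: validation data is strictly later than training data.
# The time column name and cutoff are illustrative assumptions.
import pandas as pd

def time_based_split(df: pd.DataFrame, cutoff: str, time_column: str = "event_date"):
    """Train on everything before the cutoff, validate on everything from the cutoff on."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_column] < cutoff_ts]
    validation = df[df[time_column] >= cutoff_ts]
    return train, validation
```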
Non-representative validation
Improper sampling of data validating the model can lead to inadequate results. A sample that does not reflect the actual long-term distribution of production data overestimates the model's effectiveness. As a result, a model that performs well during testing, in a selected narrow range of data, may not perform correctly in real-world conditions, resulting in erroneous predictions and business decisions.
Case studies
ML implementation for an e-learning client
For our client, the Alterdata team implemented an ML model to support user engagement and motivation. Key activities included: analyzing the drivers of student activity, segmenting users, and implementing the XGBoost model in BigQuery ML. Integration of the model with data allowed for fine-tuning of educational recommendations and automation of analytics.
Results: 80% effectiveness in predicting engagement, increased user retention, and simplified data management. The example demonstrates how ML can support experience personalization and learning platform development.
Summary of key steps
Successful implementation of ML models requires a complete understanding of the business problem, attention to data quality, selection of the right methodology and integration with organizational processes. Each step, from data mining to monitoring the model in production, is fundamental to the final outcome.
Need support in implementing a new model or optimizing an existing one? Our experts will share their experience and best practices. Get in touch with us!