Introduction
You trained a model. 98% accuracy on cross-validation. You’re happy. The team lead pats you on the shoulder. The model goes to production. And three days later, the product manager brings a revenue drop chart and quietly asks: “Are you sure this worked?”
Sound familiar? If not, it means either you’re a genius — or you haven’t deployed your first ML project into the real world yet.
The difference between a notebook prototype and a production service is not just “a bit more data.” It’s a gap, at the bottom of which lie the bodies of models killed by simple but systemic mistakes.
According to the Kaggle State of Data Science 2024 survey, 62% of data specialists say the main problem in their work is not algorithms, but data quality and experiment reproducibility. We’ll go further and break down 7 specific technical pitfalls that even mid-level engineers step on.
Mistake №1. Data Leakage — When the Model Sees the Future
This is the king of all problems. Data leakage occurs when information that would be unavailable during real prediction ends up in the training data. The model doesn’t “learn” — it simply peeks at the answers.
What it looks like in real code
Suppose you’re predicting customer churn. You have a feature date — the transaction date. You decide to add a “rolling average” of customer spending over the last three months. And you write this:
python
# 🔴 Anti-pattern: calculating the average across ALL data
df['spending_mean'] = df.groupby('client_id')['spending'].transform('mean')
What’s the problem? Instead of a three-month rolling average, this computes one mean over each client’s entire history, including transactions dated after the row being scored. At prediction time on March 1, you don’t have data for March 15. But your model saw it during training.
Why clean data matters
Leakage isn’t just about code logic; it’s also hidden in raw data. Medical scans or time-series logs often contain invisible “future” information due to improper preprocessing.
For example, at Unidata.pro, you can find annotated brain CT scans where artifacts are already processed, and windows are aligned. With such data, at least the time-split mistake isn’t amplified by noise. If you clean everything yourself, be prepared for surprises. A misaligned window in a CT scan is just another form of leakage: the model sees processed artifacts that won’t exist in a live stream.
The correct solution
python
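# 🟢 Expanding window (only past data)
df = df.sort_values(['client_id', 'date'])
# shift(1) drops the current transaction, so the mean covers earlier rows only;
# transform() keeps the result aligned with the original row index
df['spending_mean'] = (
    df.groupby('client_id')['spending']
    .transform(lambda s: s.shift(1).expanding().mean())
)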
Or even stricter: calculate any aggregates (mean, std, min, max, quantile) ONLY inside cross-validation using TimeSeriesSplit.
How to test yourself
Take any row from the test set. Mentally trace the path: would this feature be available at prediction time? If yes — okay. If not — you’re cheating.
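The mental check above can also be automated on a sample of rows. The sketch below is a minimal illustration on made-up data (the column names follow the churn example, the helper `past_only_mean` is hypothetical): it recomputes the feature from strictly earlier rows and flags every row where the stored value disagrees.

```python
import numpy as np
import pandas as pd

# Toy transaction log; values are invented for illustration.
df = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-15", "2024-01-20", "2024-03-01"]
    ),
    "spending": [100.0, 200.0, 600.0, 50.0, 150.0],
})

def past_only_mean(frame: pd.DataFrame, client_id: int, as_of: pd.Timestamp) -> float:
    """Mean spending of one client using only rows strictly before `as_of`."""
    mask = (frame["client_id"] == client_id) & (frame["date"] < as_of)
    return frame.loc[mask, "spending"].mean()

# Leaky feature: one mean over each client's whole history.
leaky = df.groupby("client_id")["spending"].transform("mean")

# Honest feature: recomputed row by row from the past only (NaN when no history exists).
honest = pd.Series(
    [past_only_mean(df, r.client_id, r.date) for r in df.itertuples()],
    index=df.index,
)

# Rows where the two values differ are rows whose feature used future data.
mismatch = ~np.isclose(leaky, honest, equal_nan=True)
suspicious = df[mismatch]
print(suspicious)
```

The row-by-row loop is slow, so in practice it makes sense to run it on a few hundred sampled rows rather than the full table.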
Real-world example: I once saw a team building a fraud detection model that included a feature “number of chargebacks in the next 30 days.” They didn’t realize it until week two of training. The model had 99.7% accuracy. On fresh data — 51%. Literally a coin flip.
Mistake №2. Incorrect Handling of Categorical Features
Categorical variables are hell. Especially when there are many unique values (high cardinality). Here are the most common bugs.
A) LabelEncoder for unordered categories
python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Assuming df['color'] holds 'red', 'blue', 'green'; classes are sorted alphabetically: blue=0, green=1, red=2
df['color_encoded'] = le.fit_transform(df['color'])
The model will decide that red (2) > green (1) > blue (0). This only makes sense if the categories are actually ordered, for example “bad”, “average”, “excellent”, and even then the integer codes must follow that order rather than the alphabet. In all other cases, it’s a disaster.
What happens under the hood: Tree-based models (Random Forest, XGBoost) can still work because they split based on thresholds. But linear models, SVM, or neural networks will treat this as an ordinal relationship, and your coefficients will be garbage.
B) One-Hot Encoding without protection against new categories
python
pd.get_dummies(df, columns=['city'])  # If 'NewCity' appears only in test, train and test end up with different columns and predict() fails
Solution: Use sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore') or switch to Target Encoding with global smoothing.
C) Target Encoding bug (the most insidious one)
python
# 🔴 Leakage through target
mean_target = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(mean_target)
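You just mixed the target into your feature: every row’s encoding was computed from its own label (and from the test rows, if df contains them). The model will look close to perfect on training data and fall apart on test data.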
python
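# 🟢 Target Encoding fitted on the training set only
# (for out-of-fold encoding of the training rows, scikit-learn 1.3+ also ships
# sklearn.preprocessing.TargetEncoder, whose fit_transform uses internal cross fitting)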
from category_encoders import TargetEncoder
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
Pro tip: For high-cardinality features (like user IDs or ZIP codes), use Target Encoding with smoothing. It prevents overfitting by shrinking rare category estimates toward the global mean.
python
# Manual smoothing example (assumes train_df holds the training rows, including 'target')
global_mean = train_df['target'].mean()
category_counts = train_df.groupby('category').size()
category_means = train_df.groupby('category')['target'].mean()
shrinkage = category_counts / (category_counts + 10)  # 10 is the smoothing factor
encoded = (shrinkage * category_means) + ((1 - shrinkage) * global_mean)
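Smoothing alone does not remove the leak for training rows, because each row still contributes its own label to its category mean. One common remedy is out-of-fold encoding: every row is encoded with statistics from the other folds. The sketch below is a minimal hand-rolled version; the function name `oof_target_encode`, its parameters, and the random demo data are all illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories: pd.Series, target: pd.Series,
                      n_splits: int = 5, smoothing: float = 10.0,
                      seed: int = 42) -> pd.Series:
    """Encode each row using smoothed target means from the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(categories):
        fit_cat, fit_y = categories.iloc[fit_idx], target.iloc[fit_idx]
        prior = fit_y.mean()  # global mean of the fitting folds
        stats = fit_y.groupby(fit_cat).agg(["mean", "count"])
        weight = stats["count"] / (stats["count"] + smoothing)
        smoothed = weight * stats["mean"] + (1 - weight) * prior
        # Categories missing from the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(smoothed).fillna(prior).to_numpy()
        )
    return encoded

# Demo on random data: three categories, binary target.
rng = np.random.default_rng(0)
cats = pd.Series(rng.choice(["a", "b", "c"], size=200))
y = pd.Series(rng.integers(0, 2, size=200))
enc = oof_target_encode(cats, y)
print(enc.head())
```

For the test set, the usual approach is to fit the smoothed means once on the full training set and map them onto the test rows.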
Mistake №3. Metrics — Why 99% Accuracy Is a Red Flag
This is especially true when you have class imbalance. Classic example: detecting a rare disease. Your dataset has 99% healthy people and 1% sick people. A model that says “everyone is healthy” gets 99% accuracy. And it’s completely useless.
What to look at instead of accuracy
Let’s use concrete numbers. You have:
- Actually sick people: 100
- Model found: 80 of them (True Positive = 80, False Negative = 20)
- Model flagged as sick by mistake: 500 healthy people (False Positive = 500, True Negative = 94,400 out of 94,900)
Accuracy = (80 + 94,400) / 95,000 = 99.45% (looks great!)
Recall = 80 / 100 = 80% (missed 20 sick people — maybe okay in some contexts)
Precision = 80 / (80+500) = 13.8% (disaster! 500 false alarms)
If you’re building a system that wakes a doctor at night, 13.8% precision means more than six false alarms for every real case (500 / 80 = 6.25). After two weeks, the doctor will unplug your system.
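The arithmetic is easy to reproduce from the four confusion-matrix counts. The short sketch below uses exactly the counts listed above and derives the total from them rather than assuming it; the F1 line is an extra not computed in the text.

```python
# Counts from the example: 100 sick people, 94,900 healthy people.
tp, fn, fp, tn = 80, 20, 500, 94_400

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
false_alarms_per_hit = fp / tp

print(f"total={total} accuracy={accuracy:.4f} recall={recall:.2f} "
      f"precision={precision:.3f} f1={f1:.3f} "
      f"false alarms per real case={false_alarms_per_hit:.2f}")
```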
Table: Which metric to use when
| Business situation | Which metric | Why |
|---|---|---|
| Spam filter: cannot afford to lose important email | Precision | FP (good email in spam folder) is worse than FN (a spam message reaching the inbox) |
| Cancer screening: every suspicion gets double-checked | Recall | Better to verify ten false alarms than miss one real case |
| Credit approval: one bad loan outweighs the profit from many good ones | Precision (of approvals) | Better to reject ten reliable customers than approve one fraudster |
| Factory defect detection (1000 parts/hour) | F1-Score | Need balance: don’t miss defects, but don’t stop production unnecessarily |
| Any task with equal cost of errors | ROC-AUC | Measures the model’s ability to separate classes independent of threshold choice |
Bonus tip: Always look at the confusion matrix first. Don’t just read a single number. A confusion matrix tells you exactly where your model is failing — false positives, false negatives, and their ratio.
How to choose the threshold: After you have probability predictions, move the decision threshold up or down to optimize your specific business cost.
python
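import numpy as np
from sklearn.metrics import precision_recall_curve
# y_true: ground-truth labels, y_probs: predicted probabilities of the positive class
precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)
# Find threshold that maximizes F1-score (the last precision/recall pair has no threshold, hence [:-1])
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-9)
best_threshold = thresholds[np.argmax(f1_scores)]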
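Maximizing F1 treats both error types symmetrically. When the business cost of a false negative differs from that of a false positive, the threshold can be chosen by minimizing total cost instead. The sketch below assumes hypothetical per-error costs (`cost_fp`, `cost_fn`) and synthetic scores; the function name and the threshold grid are illustrative choices.

```python
import numpy as np

def best_threshold_by_cost(y_true, y_probs, cost_fp=1.0, cost_fn=10.0):
    """Return the threshold on a fixed grid with the lowest total error cost."""
    y_true = np.asarray(y_true)
    y_probs = np.asarray(y_probs)
    grid = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in grid:
        pred = y_probs >= t
        fp = np.sum(pred & (y_true == 0))   # healthy flagged as sick
        fn = np.sum(~pred & (y_true == 1))  # sick missed
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))  # lowest threshold wins on ties
    return grid[best], costs[best]

# Synthetic scores: positives tend to score higher than negatives.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_probs = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0, 1)

threshold, cost = best_threshold_by_cost(y_true, y_probs)
threshold_fn_heavy, _ = best_threshold_by_cost(y_true, y_probs, cost_fn=100.0)
print(threshold, threshold_fn_heavy)
```

Raising `cost_fn` can only push the chosen threshold down or leave it unchanged, which matches the intuition that expensive misses justify more false alarms.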
Mistake №4. Shuffling Time Series
You cannot shuffle time. This is not an opinion — it’s a law of data physics.
python
# 🔴 Random split for time series
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)
What happens here? Data from August ends up in the training set, and data from June ends up in the test set. Your model trains on the future to predict the past. On the test set, it shows brilliant results. But in reality, when you ask it to predict tomorrow, it fails — because tomorrow’s data wasn’t in training.
python
# 🟢 Honest time split
split_date = '2024-07-01'
train = df[df['date'] < split_date]
test = df[df['date'] >= split_date]
Use TimeSeriesSplit from sklearn for cross-validation on time series:
python
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    # Works for NumPy arrays; use X.iloc[...] and y.iloc[...] for pandas objects
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Your training code here
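Common mistake: Even with correct splitting, many people accidentally leak information when creating lag features. A plain lag such as “sales_previous_day” built with shift(1) is safe to compute before the split, because each row only receives a value that already existed. The leak appears when the feature touches later rows: a centered or full-series rolling window, a negative shift, or a mean and standard deviation fitted on the whole series. Compute such features inside the training loop, or before the split but only using data that existed prior to each row.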
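One way to honor the "only data that existed prior to each row" rule in pandas is to shift the series before any window is applied. The sketch below uses an invented six-day sales series; the feature names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2024-06-28", periods=6, freq="D"),
    "sales": [10.0, 12.0, 9.0, 15.0, 11.0, 14.0],
}).sort_values("date")

# shift(1) moves yesterday's value onto today's row, so each row sees the past only.
sales["sales_previous_day"] = sales["sales"].shift(1)

# Mean of the three days before the current one; the current day is excluded by the shift.
sales["sales_mean_prev_3d"] = sales["sales"].shift(1).rolling(window=3).mean()

split_date = "2024-07-01"
train = sales[sales["date"] < split_date]
test = sales[sales["date"] >= split_date]
print(test)
```

The first test row legitimately uses the last training days as its history: those values would be known at prediction time, so this is not a leak.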
And here’s where many people stumble on what seems like solid ground. You can perfectly tune your split, fix leakage, optimize hyperparameters — but if the original data was dirty, everything collapses. This is especially visible in sensitive domains: medical imaging, satellite data, datasets with rare events. Even small noise in labeling destroys the temporal structure.
This is exactly why professionals who work with sensitive data don’t spend months manually cleaning raw data. They start with clean datasets.
Mistake №5. Ignoring Multicollinearity
When two features are highly correlated (e.g., “height in cm” and “height in inches”), your model starts to wobble. Coefficients become unstable, interpretation becomes impossible. For linear models, this is especially critical.
How to detect it
python
import pandas as pd
import numpy as np
corr_matrix = df.corr(numeric_only=True).abs()  # skip non-numeric columns
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr = [column for column in upper.columns if any(upper[column] > 0.95)]
print(f"Multicollinear features: {high_corr}")
A threshold of 0.95 is a good starting point, but in practice with smaller datasets, you might want to flag correlations above 0.8.
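Pairwise correlation can miss a feature that is a combination of several others. A common complement is the variance inflation factor (VIF), which for standardized features equals the diagonal of the inverse correlation matrix. The sketch below computes it with NumPy only; the helper name `vif_table` and the synthetic height/age data are assumptions made for illustration, and the often-quoted cutoff of roughly 5 to 10 is a rule of thumb rather than a hard limit.

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per numeric column, read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr(numeric_only=True)
    # Exactly collinear columns make the matrix singular, so inv may raise LinAlgError.
    vif = np.diag(np.linalg.inv(corr.to_numpy()))
    return pd.Series(vif, index=corr.columns, name="vif")

# Synthetic demo: height in inches is height in cm plus a little measurement noise.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=500)
demo = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=500),
    "age": rng.normal(40, 12, size=500),
})

vif = vif_table(demo)
print(vif.round(1))
```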
What to do next
Option 1 — Drop one feature: Keep the one that makes more sense theoretically or has higher correlation with the target.
Option 2 — Use regularization: Ridge (L2), Lasso (L1), or ElasticNet suppress the influence of correlated features by penalizing large coefficients.
python
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0) # alpha controls regularization strength
Option 3 — Apply PCA: Principal Component Analysis transforms correlated features into uncorrelated components. Warning — you lose interpretability.
python
from sklearn.decomposition import PCA
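pca = PCA(n_components=0.95)  # keep 95% of variance; standardize features first
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)  # reuse the components learned from train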
Real-world example
Suppose you have features: “total_hours_worked”, “total_tasks_completed”, and “average_time_per_task”. Mathematically, if total_hours ≈ total_tasks * average_time, any one of the three is almost fully determined by the other two. Your linear regression can produce coefficients with opposite signs and huge standard errors. Dropping one of them (for example, the derived average) removes the redundancy.
Mistake №6. Random State Instability
You get 87.3% accuracy. You’re happy. The next day you restart your notebook — accuracy 81.1%. What happened? You forgot to fix the random_state everywhere randomness appears.
List of places that need random_state
- train_test_split
- Algorithms: RandomForest, KMeans, SGD, neural networks (initialization, dropout)
- shuffle=True anywhere
- Data augmentation
python
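# Example of fixing reproducibility
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
RANDOM_STATE = 42
X_train, X_test = train_test_split(X, random_state=RANDOM_STATE)
rf = RandomForestClassifier(random_state=RANDOM_STATE)
kmeans = KMeans(random_state=RANDOM_STATE)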
But be aware: even a fixed seed can give different results across different versions of libraries or operating systems. Full reproducibility is a myth — but you must still control for randomness as much as humanly possible.
Pro tip: Use numpy.random.seed() and random.seed() for other random operations:
python
import random
import numpy as np
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
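Deep learning frameworks keep their own random state: PyTorch has torch.manual_seed() and torch.use_deterministic_algorithms(True), TensorFlow has tf.random.set_seed(). TensorFlow also reads an environment variable, and PYTHONHASHSEED controls Python’s hash randomization:
python
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # TensorFlow only; newer releases also provide tf.config.experimental.enable_op_determinism()
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)  # affects child processes only; export it before launching Python to cover the current one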
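These calls are easy to forget one by one, so many projects collect them in a single helper that runs at the top of every script. The sketch below is one possible version; the name `seed_everything` is a convention rather than a library function, and the PyTorch and TensorFlow branches run only if those packages happen to be installed.

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    """Seed the standard RNGs; framework seeding runs only if the library is importable."""
    random.seed(seed)
    np.random.seed(seed)
    # Inherited by subprocesses only; the running interpreter read this value at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass

# Re-seeding reproduces the same draws.
seed_everything(42)
first = (random.random(), np.random.rand())
seed_everything(42)
second = (random.random(), np.random.rand())
print(first == second)
```

scikit-learn estimators still need an explicit `random_state` argument if you want them independent of whatever else consumed the global NumPy stream before them.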
Mistake №7. When Pipeline Is Not a Pipeline
The most common production bug: you apply transformations (scaling, One-Hot, outlier removal) to train and test separately, manually. Then you accidentally call fit_transform on the test set again.
python
# 🔴 Two different transformations — mismatch
scaler_train = StandardScaler().fit(X_train)
X_train_scaled = scaler_train.transform(X_train)
# Mistake: fitting again on test
scaler_test = StandardScaler().fit(X_test)
X_test_scaled = scaler_test.transform(X_test)
Result: training data scaled to its own mean and standard deviation, test data scaled to its own. The model has no idea what’s happening.
The solution: Pipeline
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
# One call: every step is fitted on the training data only
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Pipeline guarantees that fit is called ONLY on training data, and transform (or predict) on test data. No leakage. No mismatch.
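The same guarantee carries over to cross-validation when the whole pipeline, rather than a pre-scaled matrix, is handed to the scorer. The sketch below uses a synthetic dataset from make_classification as a stand-in for real data; the fold count and the ROC-AUC scoring choice are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# For every fold the pipeline is cloned and refitted, so the scaler's mean and
# standard deviation come from that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Scaling X once up front and then cross-validating only the classifier would let every fold's validation rows influence the scaler, which is the mismatch this section warns about.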
Advanced pipelines: combining multiple preprocessing steps
python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
numeric_features = ['age', 'income', 'spending']
categorical_features = ['city', 'device_type']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
full_pipeline.fit(X_train, y_train)
Conclusion
None of the above is theory; these are real-world failure scenarios. Over ten years of watching ML projects, I’ve never seen one where developers didn’t step on at least two or three of these early on.
Your checklist before shipping a model to production
- Check leakage — are there aggregates that use future data?
- Validate split — if it’s a time series, don’t shuffle
- Check metrics — accuracy is not for imbalanced data
- Handle categories properly — no LabelEncoder for unordered data
- Fix randomness — everywhere it appears
- Wrap everything in a Pipeline — no manual transform chains
- Check correlations — add regularization if multicollinearity exists

