Common ML Coding Mistakes Developers Make

Introduction

You trained a model. 98% accuracy on cross-validation. You’re happy. The team lead pats you on the shoulder. The model goes to production. And three days later, the product manager brings a revenue drop chart and quietly asks: “Are you sure this worked?”

Sound familiar? If not, it means either you’re a genius — or you haven’t deployed your first ML project into the real world yet.

The difference between a notebook prototype and a production service is not just “a bit more data.” It’s a gap, at the bottom of which lie the bodies of models killed by simple but systemic mistakes.

According to the Kaggle State of Data Science 2024 survey, 62% of data specialists say the main problem in their work is not algorithms, but data quality and experiment reproducibility. We’ll go further and break down 7 specific technical pitfalls that even mid-level engineers step on.

Mistake №1. Data Leakage — When the Model Sees the Future

This is the king of all problems. Data leakage occurs when information that would be unavailable during real prediction ends up in the training data. The model doesn’t “learn” — it simply peeks at the answers.

What it looks like in real code

Suppose you’re predicting customer churn. You have a feature date — the transaction date. You decide to add a “rolling average” of customer spending over the last three months. And you write this:

python

# 🔴 Anti-pattern: calculating the average across ALL data
df['spending_mean'] = df.groupby('client_id')['spending'].transform('mean')

What’s the problem? The average is computed over every transaction the customer ever made, including ones dated after the row you are predicting for, and it is not even limited to the last three months. At prediction time on March 1, you don’t have data for March 15. But your model saw it during training.

Why clean data matters

Leakage isn’t just about code logic; it’s also hidden in raw data. Medical scans or time-series logs often contain invisible “future” information due to improper preprocessing.

For example, at Unidata.pro, you can find annotated brain CT scans where artifacts are already processed, and windows are aligned. With such data, at least the time-split mistake isn’t amplified by noise. If you clean everything yourself, be prepared for surprises. A misaligned window in a CT scan is just another form of leakage: the model sees processed artifacts that won’t exist in a live stream.

The correct solution

python

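# 🟢 Expanding window (only past data)
df = df.sort_values(['client_id', 'date'])
df['spending_mean'] = (
    df.groupby('client_id')['spending']
      # shift(1) drops the current transaction, so each row sees strictly earlier ones
      .transform(lambda s: s.shift(1).expanding().mean())
)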

Or even stricter: calculate any aggregates (mean, std, min, max, quantile) ONLY inside cross-validation using TimeSeriesSplit.
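To make the “inside cross-validation” idea concrete, here is a minimal sketch. The toy DataFrame, the column names and the fallback to the overall training mean are assumptions made for illustration. The only point is that the aggregate is computed on the training fold and then mapped onto the validation fold, never the other way around.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 3 clients, one transaction per client per week
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "client_id": np.tile([1, 2, 3], 12),
    "date": pd.date_range("2024-01-01", periods=12, freq="W").repeat(3),
    "spending": rng.gamma(2.0, 50.0, size=36),
}).sort_values("date").reset_index(drop=True)

fold_sizes = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(df):
    train, val = df.iloc[train_idx].copy(), df.iloc[val_idx].copy()

    # Aggregate is fitted on the training fold only
    client_mean = train.groupby("client_id")["spending"].mean()
    fallback = train["spending"].mean()  # for clients unseen in this fold

    train["spending_mean"] = train["client_id"].map(client_mean)
    val["spending_mean"] = val["client_id"].map(client_mean).fillna(fallback)

    # The validation fold always lies after the training fold in time
    assert train["date"].max() <= val["date"].min()
    fold_sizes.append((len(train), len(val)))

print(fold_sizes)
```

The validation rows never contribute to the statistic they are scored with, which is the property the self-check in the next section asks about.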

How to test yourself

Take any row from the test set. Mentally trace the path: would this feature be available at prediction time? If yes — okay. If not — you’re cheating.

Real-world example: I once saw a team building a fraud detection model that included a feature “number of chargebacks in the next 30 days.” They didn’t realize it until week two of training. The model had 99.7% accuracy. On fresh data — 51%. Literally a coin flip.

Mistake №2. Incorrect Handling of Categorical Features

Categorical variables are hell. Especially when there are many unique values (high cardinality). Here are the most common bugs.

A) LabelEncoder for unordered categories

python

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# df['color'] holds values like 'red', 'blue', 'green'.
# Classes are sorted alphabetically: blue=0, green=1, red=2
df['color_encoded'] = le.fit_transform(df['color'])

The model will decide that red (2) > green (1) > blue (0). This only makes sense if the categories are actually ordered, for example “bad”, “average”, “excellent”. In all other cases, it’s a disaster.

What happens under the hood: Tree-based models (Random Forest, XGBoost) can still work because they split based on thresholds. But linear models, SVM, or neural networks will treat this as an ordinal relationship, and your coefficients will be garbage.
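A quick way to see the arbitrary ordering for yourself, and what to use when the order is real. This is a small sketch: the color and quality values are just the examples from the text, and OrdinalEncoder with an explicit categories list is one reasonable choice for ordered features, not the only one.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder assigns codes in alphabetical order, not in any meaningful order
codes = LabelEncoder().fit_transform(["red", "blue", "green"])
print(codes.tolist())  # [2, 0, 1] -> blue=0, green=1, red=2

# For genuinely ordered categories, spell the order out explicitly
quality = np.array([["bad"], ["excellent"], ["average"]])
ordinal = OrdinalEncoder(categories=[["bad", "average", "excellent"]])
quality_codes = ordinal.fit_transform(quality).ravel()
print(quality_codes.tolist())  # [0.0, 2.0, 1.0]
```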

B) One-Hot Encoding without protection against new categories

python

# If 'NewCity' appears only in test, train and test end up with different columns
pd.get_dummies(df, columns=['city'])

Solution: Use sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore') or switch to Target Encoding with global smoothing.
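Here is a small sketch of what handle_unknown='ignore' does in practice. The city names are made up for the example. The encoder is fitted on the training frame only, and a category it has never seen becomes an all-zero row instead of a new column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["Paris", "Berlin", "Paris"]})
test = pd.DataFrame({"city": ["Berlin", "NewCity"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # learns the columns: Berlin, Paris

encoded_test = encoder.transform(test).toarray()
print(encoded_test.tolist())  # [[1.0, 0.0], [0.0, 0.0]]
```

The column count stays fixed at whatever the training data defined, so the downstream model always receives the shape it was trained on.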

C) Target Encoding bug (the most insidious one)

python

# 🔴 Leakage through target
mean_target = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(mean_target)

You just mixed each row’s own target into its feature. The model can look near-perfect on training data and then drop to chance level on test.

python

# 🟢 Target Encoding fitted on the training split only
# (for a stricter version, encode the training rows out-of-fold)
from category_encoders import TargetEncoder

encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
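If you want the training rows themselves to be encoded without ever seeing their own target, one common approach is out-of-fold encoding. The sketch below is a hand-rolled version using KFold. The function name oof_target_encode and the toy data are hypothetical, and it leaves out smoothing to keep the idea visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, seed=42):
    """Encode each row with category means computed on the OTHER folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        fit_target = target.iloc[fit_idx]
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        fallback = fit_target.mean()  # for categories missing from the fit folds
        values = categories.iloc[enc_idx].map(fold_means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Hypothetical training data
train = pd.DataFrame({
    "category": ["a", "b"] * 10,
    "target": [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
})
train["category_encoded"] = oof_target_encode(train["category"], train["target"])
```

For the test set you would still fit the category means on the full training split and apply them with a plain map, as in the block above.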

Pro tip: For high-cardinality features (like user IDs or ZIP codes), use Target Encoding with smoothing. It prevents overfitting by shrinking rare category estimates toward the global mean.

python

# Manual smoothing example
# (train is the training split of df; all statistics come from it alone)
global_mean = train['target'].mean()
category_counts = train.groupby('category').size()
category_means = train.groupby('category')['target'].mean()
shrinkage = category_counts / (category_counts + 10)  # 10 is the smoothing factor
encoded = (shrinkage * category_means) + ((1 - shrinkage) * global_mean)

Mistake №3. Metrics — Why 99% Accuracy Is a Red Flag

This is especially true when you have class imbalance. Classic example: detecting a rare disease. Your dataset has 99% healthy people and 1% sick people. A model that says “everyone is healthy” gets 99% accuracy. And it’s completely useless.

What to look at instead of accuracy

Let’s use concrete numbers. You have:

Actually sick people: 100
Model found: 80 of them (True Positive = 80, False Negative = 20)
Model flagged as sick by mistake: 500 healthy people (False Positive = 500, True Negative = 94,400 out of 94,900)

Accuracy = (80 + 94,400) / 95,000 ≈ 99.45% (looks great!)
Recall = 80 / 100 = 80% (missed 20 sick people — maybe okay in some contexts)
Precision = 80 / (80+500) = 13.8% (disaster! 500 false alarms)

If you’re building a system that wakes a doctor at night, 14% precision means more than six false alarms for every real case. After two weeks, the doctor will unplug your system.
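The arithmetic is easy to reproduce from the four counts listed above (80 true positives, 20 false negatives, 500 false positives, 94,400 true negatives). A tiny sketch, with variable names chosen here for readability:

```python
tp, fn, fp, tn = 80, 20, 500, 94_400

total = tp + fn + fp + tn              # 95,000 people in total
accuracy = (tp + tn) / total           # share of all correct answers
recall = tp / (tp + fn)                # share of sick people that were found
precision = tp / (tp + fp)             # share of alarms that were real
false_alarms_per_hit = fp / tp         # false alarms per real case found

print(f"accuracy={accuracy:.4f} recall={recall:.2f} "
      f"precision={precision:.3f} false alarms per hit={false_alarms_per_hit:.2f}")
```

Accuracy comes out at roughly 0.9945 while precision is about 0.138, which is exactly the gap the section is warning about.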

Table: Which metric to use when

Business situation	Which metric	Why
Spam filter: a legitimate email must not get lost	Precision	FP (good email in spam folder) is worse than FN (a spam message reaches the inbox)
Cancer screening: every suspicion gets double-checked	Recall	Better to verify ten false alarms than miss one real case
Credit approval: every approved fraudster is a direct loss	Precision (of approvals)	Better to reject ten reliable customers than approve one fraudster
Factory defect detection (1000 parts/hour)	F1-Score	Need balance: don’t miss defects, but don’t stop production unnecessarily
Any task with equal cost of errors	ROC-AUC	Measures the model’s ability to separate classes independent of threshold choice

Bonus tip: Always look at the confusion matrix first. Don’t just read a single number. A confusion matrix tells you exactly where your model is failing — false positives, false negatives, and their ratio.
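Getting the four cells out of scikit-learn takes one call. A minimal sketch with made-up labels; for binary labels 0/1, confusion_matrix(...).ravel() returns the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=1 FN=1 TP=2
```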

How to choose the threshold: After you have probability predictions, move the decision threshold up or down to optimize your specific business cost.

python

import numpy as np
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)

# Find threshold that maximizes F1-score
# (precisions and recalls have one more entry than thresholds, hence [:-1])
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-9)
best_threshold = thresholds[np.argmax(f1_scores)]
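F1 is only one possible objective. If the business can put a rough price on each kind of error, you can pick the threshold that minimizes total cost instead. The sketch below assumes a false negative costs ten times as much as a false positive; both costs and the function name best_threshold_by_cost are placeholders you would replace with your own.

```python
import numpy as np

def best_threshold_by_cost(y_true, y_probs, cost_fp=1.0, cost_fn=10.0):
    """Try every observed probability as a threshold and return the cheapest."""
    y_true = np.asarray(y_true)
    y_probs = np.asarray(y_probs)
    candidates = np.unique(y_probs)  # sorted ascending

    costs = []
    for t in candidates:
        y_pred = y_probs >= t
        fp = np.sum(y_pred & (y_true == 0))   # healthy flagged as sick
        fn = np.sum(~y_pred & (y_true == 1))  # sick that were missed
        costs.append(cost_fp * fp + cost_fn * fn)

    best = int(np.argmin(costs))
    return candidates[best], costs[best]

threshold, cost = best_threshold_by_cost(
    y_true=[0, 0, 1, 1],
    y_probs=[0.1, 0.4, 0.35, 0.8],
)
print(threshold, cost)  # 0.35 1.0
```

Choose the threshold on a validation set, not on the test set, otherwise the threshold itself becomes a small leak.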

Mistake №4. Shuffling Time Series

You cannot shuffle time. This is not an opinion — it’s a law of data physics.

python

# 🔴 Random split for time series
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)

What happens here? Data from August ends up in the training set, and data from June ends up in the test set. Your model trains on the future to predict the past. On the test set, it shows brilliant results. But in reality, when you ask it to predict tomorrow, it fails, because in production there is no future data to lean on.

python

# 🟢 Honest time split
split_date = '2024-07-01'
train = df[df['date'] < split_date]
test = df[df['date'] >= split_date]

Use TimeSeriesSplit from sklearn for cross-validation on time series:

python

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    # X and y are NumPy arrays sorted by time; with pandas objects use .iloc[...]
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Your training code here

Common mistake: Even with correct splitting, many people accidentally leak information when creating lag and rolling features. A feature like “sales_previous_day” is safe only if every row is built strictly from earlier rows. A rolling mean that includes the current day, a centered window, or a shift that crosses from one store or customer into another all leak. Compute such features inside the training loop, or before the split but only from data that existed prior to each row.
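A minimal sketch of lag features that respect this rule. The sales frame and its column names are invented for the example, and it assumes one row per store per day with no gaps. The two details that matter are the groupby (so one store never borrows another store’s history) and the shift(1) before any rolling window.

```python
import pandas as pd

# Hypothetical daily sales for two stores
sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "sales": [10.0, 12.0, 9.0, 100.0, 90.0, 110.0],
}).sort_values(["store", "date"])

# Yesterday's value, taken within each store
sales["sales_previous_day"] = sales.groupby("store")["sales"].shift(1)

# Mean of the two previous days: shift first, then roll, so today is excluded
sales["sales_mean_prev_2d"] = sales.groupby("store")["sales"].transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).mean()
)

print(sales)
```

The first row of each store has no history, so both features are NaN there. That is the honest answer, and how to fill it is a separate modeling decision.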

And here’s where many people stumble on what seems like solid ground. You can perfectly tune your split, fix leakage, optimize hyperparameters — but if the original data was dirty, everything collapses. This is especially visible in sensitive domains: medical imaging, satellite data, datasets with rare events. Even small noise in labeling destroys the temporal structure.

This is exactly why professionals who work with sensitive data don’t spend months manually cleaning raw data. They start with clean datasets.

Mistake №5. Ignoring Multicollinearity

When two features are highly correlated (e.g., “height in cm” and “height in inches”), your model starts to wobble. Coefficients become unstable, interpretation becomes impossible. For linear models, this is especially critical.

How to detect it

python

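import pandas as pd
import numpy as np

# numeric_only=True skips text columns, which would otherwise raise in recent pandas
corr_matrix = df.corr(numeric_only=True).abs()
# Keep only the upper triangle so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr = [column for column in upper.columns if any(upper[column] > 0.95)]
print(f"Multicollinear features: {high_corr}")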

A threshold of 0.95 is a good starting point, but in practice with smaller datasets, you might want to flag correlations above 0.8.

What to do next

Option 1 — Drop one feature: Keep the one that makes more sense theoretically or has higher correlation with the target.

Option 2 — Use regularization: Ridge (L2), Lasso (L1), or ElasticNet suppress the influence of correlated features by penalizing large coefficients.

python

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0) # alpha controls regularization strength
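To see the effect, here is a small synthetic sketch built on the height example: the same height in centimeters and in inches (with a tiny measurement noise added so the two columns are not exact copies). The data, the noise levels and alpha=1.0 are all assumptions for illustration. Plain least squares is free to put large, offsetting weights on the two columns, while Ridge keeps them small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
height_cm = rng.normal(170, 10, size=n)
height_in = height_cm / 2.54 + rng.normal(0, 1e-3, size=n)  # nearly a duplicate
X = np.column_stack([height_cm, height_in])
y = 0.5 * height_cm + rng.normal(0, 1.0, size=n)            # target depends on height only

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Ridge R^2:", ridge.score(X, y))
```

Both models predict about equally well here. The difference is in how believable the individual coefficients are, which matters as soon as someone tries to interpret them.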

Option 3 — Apply PCA: Principal Component Analysis transforms correlated features into uncorrelated components. Warning — you lose interpretability.

python

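from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep 95% of variance
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)        # reuse the fitted components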

Real-world example

Suppose you have features: “total_hours_worked”, “total_tasks_completed”, and “average_time_per_task”. Since total_hours ≈ total_tasks * average_time, any one of them is almost fully determined by the other two, and in practice the three end up strongly correlated. Your linear regression can produce coefficients with opposite signs and huge standard errors. Dropping one of them is usually enough to fix the problem.

Mistake №6. Random State Instability

You get 87.3% accuracy. You’re happy. The next day you restart your notebook — accuracy 81.1%. What happened? You forgot to fix the random_state everywhere randomness appears.

List of places that need random_state

train_test_split
Algorithms: RandomForest, KMeans, SGD, neural networks (initialization, dropout)
shuffle=True anywhere
Data augmentation

python

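# Example of fixing reproducibility
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

X_train, X_test = train_test_split(X, random_state=RANDOM_STATE)
rf = RandomForestClassifier(random_state=RANDOM_STATE)
kmeans = KMeans(random_state=RANDOM_STATE)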

But be aware: even a fixed seed can give different results across different versions of libraries or operating systems. Full reproducibility is a myth — but you must still control for randomness as much as humanly possible.
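A cheap sanity check is to run the whole experiment twice in the same environment and compare the scores. The sketch below uses a synthetic dataset from make_classification and a hypothetical run_experiment helper; in a real project the function would wrap your own split, training and evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

def run_experiment(seed):
    """Split, train and score with every source of randomness tied to one seed."""
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

score_a = run_experiment(RANDOM_STATE)
score_b = run_experiment(RANDOM_STATE)
print(score_a, score_b, score_a == score_b)
```

If the two numbers differ, some source of randomness is still uncontrolled, and it is better to find out now than after a restart.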

Pro tip: Use numpy.random.seed() and random.seed() for other random operations:

python

import random

import numpy as np

random.seed(RANDOM_STATE)

np.random.seed(RANDOM_STATE)

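Deep learning libraries (like TensorFlow and PyTorch) keep their own random state, and some settings are controlled through environment variables:

python

import os

# Framework seeds are set separately, e.g. tf.random.set_seed(...) or torch.manual_seed(...)
# Set this before importing TensorFlow
os.environ['TF_DETERMINISTIC_OPS'] = '1'
# PYTHONHASHSEED only takes effect if it is set before the interpreter starts
# (e.g. in the shell); setting it here affects child processes only
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)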

Mistake №7. When Pipeline Is Not a Pipeline

The most common production bug: you apply transformations (scaling, One-Hot, outlier removal) to train and test separately, manually. Then you accidentally call fit_transform on the test set again.

python

# 🔴 Two different transformations — mismatch

scaler_train = StandardScaler().fit(X_train)

X_train_scaled = scaler_train.transform(X_train)

# Mistake: fitting again on test

scaler_test = StandardScaler().fit(X_test)

X_test_scaled = scaler_test.transform(X_test)

Result: training data scaled to its own mean and standard deviation, test data scaled to its own. The model has no idea what’s happening.

The solution: Pipeline

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# One call: every step is fitted on X_train only
pipeline.fit(X_train, y_train)
# predict reuses the already fitted scaler on X_test
y_pred = pipeline.predict(X_test)

Used this way, Pipeline guarantees that fit sees ONLY the training data you pass in, and that test data goes through transform (or predict) with the parameters learned from training. No leakage. No mismatch.
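The same object can be handed straight to cross-validation, which is where it pays off most: the scaler is refitted on the training part of every fold. A minimal sketch on synthetic data (make_classification is used here only to have something runnable):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# For each of the 5 folds: fit scaler + model on 4 parts, score on the 5th
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean(), scores.std())
```

Scaling the full dataset first and cross-validating only the model would let every fold see statistics from its own validation part.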

Advanced pipelines: combining multiple preprocessing steps

python

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income', 'spending']
categorical_features = ['city', 'device_type']

# Each group of columns gets its own transformer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

full_pipeline.fit(X_train, y_train)
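Here is the same pipeline run end to end on a tiny made-up DataFrame, including a prediction for a row whose city never appeared in training. All values are invented for the example; only the column names follow the block above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income", "spending"]
categorical_features = ["city", "device_type"]

# Hypothetical training data
X_train = pd.DataFrame({
    "age": [25, 40, 31, 58, 46, 22],
    "income": [30_000, 72_000, 45_000, 90_000, 61_000, 28_000],
    "spending": [200.0, 950.0, 400.0, 1200.0, 800.0, 150.0],
    "city": ["Paris", "Berlin", "Paris", "Rome", "Berlin", "Rome"],
    "device_type": ["ios", "android", "android", "ios", "web", "web"],
})
y_train = [0, 1, 0, 1, 1, 0]

# A new customer from a city the encoder has never seen
X_new = pd.DataFrame({
    "age": [35], "income": [50_000], "spending": [500.0],
    "city": ["NewCity"], "device_type": ["ios"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])

full_pipeline.fit(X_train, y_train)
pred = full_pipeline.predict(X_new)  # no crash on the unseen city
n_features = full_pipeline.named_steps["preprocessor"].transform(X_new).shape[1]
print(pred, n_features)
```

The preprocessor outputs 9 columns (3 numeric, 3 cities, 3 device types) for the new row as well, because the layout was fixed when the pipeline was fitted.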

Conclusion

Everything listed above is not theory. These are real-world failure scenarios. Over ten years of watching ML projects, I’ve never seen one where developers didn’t step on at least two or three of these early on.

Your checklist before shipping a model to production

Check leakage — are there aggregates that use future data?
Validate split — if it’s a time series, don’t shuffle
Check metrics — accuracy is not for imbalanced data
Handle categories properly — no LabelEncoder for unordered data
Fix randomness — everywhere it appears
Wrap everything in a Pipeline — no manual transform chains
Check correlations — add regularization if multicollinearity exists