Missing values
Most machine learning models cannot handle missing values directly. So when missing values are present, you’ll need to decide whether to fill them in (imputation) or remove the affected samples entirely.
|   | A | B | C | D | E | target |
|---|---|---|---|---|---|--------|
| 0 | NaN | 0.020584 | 0.611853 | 0.607545 | 0.122038 | 0 |
| 1 | 0.950714 | 0.969910 | 0.139494 | 0.170524 | NaN | 1 |
| 2 | 0.731994 | 0.832443 | 0.292145 | 0.065052 | 0.034389 | 1 |
| 3 | 0.598658 | 0.212339 | 0.366362 | 0.948886 | 0.909320 | 1 |
| 4 | 0.156019 | 0.181825 | 0.456070 | 0.965632 | 0.258780 | 1 |
Missing values are not always explicitly marked as NaN.
Sometimes, they're represented by placeholder values like 0, -1, or a very large or small number. This often happens because some data types don't support native missing values; for example, int and bool types (see the section on data types). In contrast, floating-point columns can represent missing values as NaN.
If you suspect this is the case, you can replace such placeholder values with actual missing values like so:
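A minimal sketch in pandas, assuming a DataFrame df in which 0 and -1 act as placeholders in a (hypothetical) column 'age':

import numpy as np

# Replace placeholder values with proper missing values (NaN)
# Note: this converts an integer column to float, since int cannot hold NaN
df['age'] = df['age'].replace({0: np.nan, -1: np.nan})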
A few exceptions
Some models can handle missing values almost natively, especially tree-based models such as Decision Trees, XGBoost, LightGBM, and CatBoost. These models can often work well with missing values, or with very simple imputation.
Missing data mechanism
The nature of missing values plays a key role in deciding how to handle them [1][2]. In some cases, the very fact that a value is missing can hold useful information, and may even be treated as a feature in itself.
To better understand how missingness behaves, let’s explore the three common mechanisms, each with a brief example to illustrate.
Missing completely at random (MCAR)
The examiner accidentally spilled ink on part of the questionnaire, making an answer unreadable.
In this case, the probability of a value being missing is completely unrelated to both observed and unobserved data. Missingness occurs purely by chance, and there is no systematic pattern behind it.
This is the strongest and most desirable assumption, because it means the missing data are essentially harmless.
- Key point: Probability of missingness is the same for all cases, entirely due to randomness.
Missing at random (MAR)
People with a certain belief or political view choose not to answer a question — and their belief is recorded in the dataset.
With MAR, the probability of a value being missing depends on observed data, but not on the missing value itself. In other words, missingness is systematic, but predictable — based on information we already have.
Example: In a university survey, students from a specific faculty tend to skip a particular question. Within each faculty, the probability of missingness is consistent.
- Key point: Probability of missingness varies across groups, but can be explained by known variables.
Missing not at random (MNAR)
People with a certain belief or political view choose not to answer a question — and their belief is not recorded.
This is the most difficult type to deal with. Here, the probability of a value being missing depends on unobserved data — often the missing value itself.
Example:
If individuals with higher incomes are less likely to report their income, and we have no other way to estimate that income, the data are MNAR. This mechanism can introduce serious bias and distort model performance.
- Key point: Probability of missingness depends on information we don’t observe — making it hard to handle or correct.
Handling missing values
When dealing with missing values, there are a few strategies you can take:
- Remove the missing values: by dropping rows (samples) or columns (features)
- Simple imputation: fill in missing values using simple statistics from the same feature (e.g. the mean, median, mode, or a fixed value)
- Conditional imputation: estimate missing values using information from other features
- Do nothing: let the model handle missing values natively (supported by some algorithms like XGBoost, LightGBM, or CatBoost)
Each of these approaches comes with its own advantages and limitations. Let’s take a closer look at how they work — and when to use them.
Removing features
If a feature contains too many missing values, it may be best to remove the feature entirely. This can be a good choice when:
- The feature is missing in most rows
- It provides little or no predictive value
- There is no information in the value being missing (not MNAR)
- There are better alternatives in the dataset
While there's no strict threshold, it's common to consider dropping a feature when more than 50–70% of its values are missing. Always weigh the context and importance of the feature before removing it. A quick way to screen for such features is sketched below.
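A minimal sketch, assuming X_train is a pandas DataFrame; the 60% threshold is an arbitrary choice within that range:

# Fraction of missing values per feature
missing_fraction = X_train.isna().mean()

# Features where more than 60% of the values are missing
candidates = missing_fraction[missing_fraction > 0.6].index.tolist()

# Review the candidates manually before dropping them
X_train_reduced = X_train.drop(columns=candidates)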
Dropping samples
The simplest way to deal with missing values is to remove the entire row that contains them. This approach is often perfectly fine when only a small number of values are missing relative to the dataset, and the missing value does not hold information (not MNAR).
However, if not done carefully, this can introduce bias into your model.
For example:
Let’s say you’re analysing patient data, and only 1% of the samples are missing a particular feature. It might be tempting to simply remove those rows.
But what if those missing values occurred because the patients were in such a weakened state that collecting the measurement was too invasive or unsafe? By dropping those rows, you'd be systematically removing a specific subgroup; potentially leading to a model that doesn't account for those most at risk.
In this case, the missingness itself might contain important information about the target — and removing it could result in a biased or incomplete model.
Missing indicator feature
In many cases, the fact that a value is missing can actually carry meaningful information. For example, missing values might correlate with a certain class, behaviour, or outcome.
When this is the case, it can be helpful to add a missing indicator feature (sometimes called a mask) — a new column that flags whether the original value was missing.
This allows your model to learn patterns from the presence or absence of data itself, not just the values.
When is this useful? Adding a missing indicator is especially helpful when your missing values follow a:
- MAR (Missing at Random) pattern — where missingness depends on other observed features
- MNAR (Missing Not at Random) pattern — where missingness depends on unobserved data, often the missing value itself
In both cases, the fact that a value is missing can be informative — and the indicator gives the model access to that signal.
On the other hand, if the data are MCAR (Missing Completely at Random) — meaning the missingness is purely random — then the indicator likely won’t help and may just introduce noise.
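As a minimal sketch in pandas, assuming X_train contains a (hypothetical) column 'income' with missing values:

# Flag whether the original value was missing
X_train['income_missing'] = X_train['income'].isna()

scikit-learn offers the same functionality through its MissingIndicator class and the add_indicator option of its imputers; see the code section below.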
Simple imputer
The simplest strategy is to replace missing values in each column with a single value: the mean, the median, the most frequent value (mode), or a constant such as -999 (commonly used with tree-based models).
Although this technique is simple, it can be surprisingly effective — especially when the amount of missing data is small.
- For numerical features, the mean or median is often used.
- For categorical features, the most frequent value is a good default.
- For tree-based models, using an extreme value (like -999) can help the model treat "missing" as a separate condition.
- The choice between mean and median follows the usual rule of thumb: Use the mean by default, and switch to the median when your data is heavily skewed or contains extreme outliers.
There are, however, some downsides:
- Mean/median imputation reduces variance and can distort correlations, which might bias models sensitive to data distribution (e.g., linear regression).
- It assumes the data are MCAR; if missingness is MAR or MNAR, it may bias results.
- For categorical imputation with the mode, it can inflate the frequency of the most common category.
K-nearest neighbours imputer
The KNN imputer estimates missing values based on the values of the k nearest neighbours (rows with similar feature values).
This approach is more flexible and often yields better estimates than simple imputers — especially when the dataset contains natural clusters or patterns.
Iterative imputer
The most advanced imputation method is the Iterative Imputer. Here’s how it works:
- Start by filling missing values with a simple strategy (e.g. mean).
- For each column with missing data, use the other columns to predict the missing values. Use any regression model you like (e.g. linear regression, XGBoost, etc.).
- Repeat the process iteratively, refining the estimates step by step until results stabilise.
This method is often more accurate, especially when relationships between variables are strong and non-linear.
MICE: Multivariate Imputation by Chained Equations
We've looked at single imputation techniques, where each missing value is filled in once, resulting in a single completed dataset.
However, in statistical analysis, it's common to use multiple imputation methods. This typically involves predicting missing values multiple times, often using a bootstrapping approach, to generate several plausible versions of the dataset.
Each version is analysed separately, and the results are then pooled to account for the uncertainty introduced by the missing data.
This approach is especially valuable in inference-based tasks, where it's important to:
- Capture variability due to imputation
- Estimate uncertainty more accurately
- Avoid underestimating confidence intervals
While multiple imputation is widely used in statistics, it's less common in machine learning, where the focus is usually on prediction rather than statistical inference.
This approach is known as Multivariate Imputation by Chained Equations (MICE) [3].
You can enable this in sklearn by setting sample_posterior=True in the IterativeImputer() constructor.
The iterative imputer is actually inspired by MICE, and not the other way around.
Do nothing
Sometimes, you don’t need to handle missing values at all, especially when working with tree-based models.
Why? Because tree algorithms make decisions by splitting the data based on feature values — deciding whether a sample goes to the left or right child node.
If a value is missing, the algorithm simply determines whether sending it left or right improves performance — and learns that rule during training. This allows the model to handle missing values natively, with no need for explicit imputation.
It’s a clever and efficient strategy (sketched below), especially useful when:
- You're using models like XGBoost, LightGBM, or CatBoost
- Missingness might carry information
- You want to avoid preprocessing overhead
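A minimal sketch using scikit-learn's HistGradientBoostingClassifier, which supports missing values natively and is inspired by LightGBM (assuming training data X_train and y_train, and test data X_test, with NaN values present):

from sklearn.ensemble import HistGradientBoostingClassifier

# No imputation needed: during training, the model learns for each split
# whether samples with missing values should go left or right
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

XGBoost, LightGBM, and CatBoost can be used in the same way: fit them directly on data containing missing values.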
MIA – Missing Incorporated in Attributes
Some tree-based algorithms (like CatBoost) support an enhanced version of this idea known as Missing Incorporated in Attributes (MIA) [4].
Rather than simply deciding whether missing values go left or right, the model can learn to treat “missing” as its own condition — essentially creating a separate split for missing vs. not missing.
What to use?
We’ve explored a variety of methods for handling missing values — but when should you use one over the other?
Tree-based models
Tree-based models such as XGBoost, LightGBM, and CatBoost can natively handle missing values. In fact, trying to impute missing values manually may harm performance if it erases informative patterns.
So if you’re using a tree model, you can usually leave the missing values in and let the model handle them for you.
Missing values contain information
If your missing values are not missing completely at random (MCAR) — which is almost always the case — then the fact that a value is missing may be informative in itself.
In such cases, the absence of a value might reflect an underlying condition or decision. A good strategy is to add a missing value indicator: a new column that marks whether a value was missing or not. Note: this introduces extra features, which could lead to overfitting and hurt your model.
Predictive vs inference goals
In this course, we’re focused on prediction — where accuracy matters most.
However, if your goal is inference (e.g. estimating effects, testing hypotheses, or drawing conclusions from the data), then accurate imputation of missing values becomes more important. You’re not just interested in the target — you also want to analyse the features.
In such cases, use methods like:
- MICE (Multivariate Imputation by Chained Equations), especially in R
- IterativeImputer(sample_posterior=True) in scikit-learn
These approaches preserve uncertainty and are more appropriate for inference tasks.
Complexity of preprocessing
Using methods like iterative imputation or KNN imputation might sound great in theory, but they can add significant complexity to your workflow. For many practical purposes, the added effort isn’t worth it — especially when a simpler method performs just as well.
Complexity of the model
If you’re using a simple model (e.g. linear regression), it can help to use a more advanced imputation method (e.g. iterative imputation with a tree-based estimator) to compensate. But if your predictive model is already highly flexible (e.g. trees or neural networks), a simple imputation strategy combined with a missing value indicator often works just as well, and is much easier to maintain.
Bootstrapping
Where does bootstrapping fit into all this?
It’s another way to reduce imputation variance and make models more robust to missingness. It’s a fairly involved technique, so it’s left out for now. You can read more about it in [5] and [6].
Code
Dropping values
# Drop features
X_train_dropped = X_train.drop(columns=['Nitric oxides concentraion', 'High way accessibility'],
                               inplace=False)

# Drop rows
X_train_dropped = X_train.dropna(axis=0,      # Drop rows (axis=1 would drop columns)
                                 # how='any', # Drop if any NA values are present (alternative: 'all')
                                 thresh=9,    # Keep only rows with at least 9 non-NA values
                                 # subset=['Crime rate', 'Residential land zoned for lots', 'non-retail business'], # Only consider these columns
                                 inplace=False)
Simple imputer
Using pandas
# Mean imputation: compute statistics on the training set only
X_train_imputed = X_train.fillna(X_train.mean())
X_test_imputed = X_test.fillna(X_train.mean()) # Use the train mean to prevent data leakage!

# Alternatives
X_train_imputed = X_train.fillna(X_train.median()) # Median
X_train_imputed = X_train.fillna(X_train.mode().iloc[0]) # Most frequent value
X_train_imputed = X_train.fillna(-999) # Constant value
Using sklearn
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='mean', # 'median', 'most_frequent', 'constant'
    fill_value=-999, # Only used for the 'constant' strategy
    add_indicator=False # If True, adds a boolean column for each feature with missing values
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=imputer.get_feature_names_out())
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns=imputer.get_feature_names_out())
KNN imputer
Applying it directly:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(
    missing_values=np.nan,
    n_neighbors=5,
    weights='uniform', # Neighbour weighting: 'uniform' or 'distance'
    metric='nan_euclidean', # Distance metric that ignores missing values
    add_indicator=False # If True, adds a boolean column for each feature with missing values
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
With scaling on the numerical features
Since KNN relies on distances between samples, the numerical features should be brought to comparable scales before imputing.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# numerical_features is assumed to be a list of the numerical column names

# First the scaling
preprocessing = ColumnTransformer([
    ('scaler', scaler, numerical_features)
], remainder='passthrough')

# Then the imputation
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('imputer', KNNImputer())
])
pipeline.fit(X_train)

# ColumnTransformer outputs the scaled numerical features first, then the passthrough columns
columns = list(numerical_features) + [c for c in X_train.columns if c not in numerical_features]
X_train_processed = pd.DataFrame(pipeline.transform(X_train), columns=columns)
X_test_processed = pd.DataFrame(pipeline.transform(X_test), columns=columns)
Iterative imputer
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer # Note: this is required because the class is experimental
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(
    estimator=BayesianRidge(), # Any regressor works, e.g. RandomForestRegressor(); the default is BayesianRidge()
    missing_values=np.nan,
    sample_posterior=False, # Only for probabilistic models that can sample from the posterior
    max_iter=10,
    tol=1e-3,
    n_nearest_features=None,
    initial_strategy='mean',
    imputation_order='ascending', # 'descending', 'roman', 'arabic', 'random': the order in which features are imputed in each round
    skip_complete=True, # Skip iterative imputation for features that had no missing values during fitting
    min_value=None,
    max_value=None,
    verbose=0,
    random_state=42,
    add_indicator=False
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
Sampling posterior
If you set sample_posterior=True and use an estimator whose predict method supports return_std=True (such as BayesianRidge), the imputer samples from the predictive posterior instead of using a point estimate.
As a result, every transformation with the imputer will produce different imputed values.
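A minimal sketch of drawing several plausible completed datasets this way, reusing the imports from the block above and assuming X_train from earlier (the number of draws, 5, is arbitrary):

imputed_datasets = []
for seed in range(5):
    imputer = IterativeImputer(
        estimator=BayesianRidge(), # Supports return_std=True
        sample_posterior=True,     # Sample from the predictive posterior
        random_state=seed,         # Different seed for each draw
    )
    X_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
    imputed_datasets.append(X_imputed)

# Each completed dataset can now be analysed separately and the results pooled, as in MICE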
Adding indicator
This only returns the indicator columns:
import pandas as pd
from sklearn.impute import MissingIndicator

indicator = MissingIndicator()
indicator.fit(X_train)
X_train_missing_indicator = pd.DataFrame(indicator.transform(X_train), columns=indicator.get_feature_names_out())
All the imputation classes in sklearn have an add_indicator setting. Setting add_indicator=True is roughly equivalent to doing:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, MissingIndicator

preprocessing = ColumnTransformer(transformers=[
    ('imputer', SimpleImputer(), list(X_train.columns)),
    ('missing_indicator', MissingIndicator(), list(X_train.columns))
])
preprocessing.fit(X_train)
columns = list(X_train.columns) + list(preprocessing.named_transformers_['missing_indicator'].get_feature_names_out())
X_train_preprocessed = pd.DataFrame(preprocessing.transform(X_train), columns=columns)
Footnotes
1. Little, R. J. A., & Rubin, D. B. Statistical Analysis with Missing Data. Wiley.
2. van Buuren, S. Flexible Imputation of Missing Data. CRC Press.
3. van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
4. Twala, B. E. T. H., Jones, M. C., & Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7), 950–956.
5. Khan, S. S., Ahmad, A., & Mihailidis, A. Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data. arXiv.
6. Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., & Poline, J.-B. (2022). Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, giac013.