Missing values
Most machine learning models cannot handle missing values directly. So when missing values are present, you’ll need to decide whether to fill them in (imputation) or remove the affected samples entirely.
|   | A | B | C | D | E | target |
|---|---|---|---|---|---|--------|
| 0 | NaN | 0.020584 | 0.611853 | 0.607545 | 0.122038 | 0 |
| 1 | 0.950714 | 0.969910 | 0.139494 | 0.170524 | NaN | 1 |
| 2 | 0.731994 | 0.832443 | 0.292145 | 0.065052 | 0.034389 | 1 |
| 3 | 0.598658 | 0.212339 | 0.366362 | 0.948886 | 0.909320 | 1 |
| 4 | 0.156019 | 0.181825 | 0.456070 | 0.965632 | 0.258780 | 1 |
Missing values are not always explicitly marked as NaN.
Sometimes, they're represented by placeholder values like 0, -1, or a very large or small number. This often happens because some data types don't support native missing values; for example, int and bool types (see the section on data types). In contrast, floating-point columns can represent missing values as NaN.
If you suspect this is the case, you can replace such placeholder values with actual missing values like so:
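A minimal sketch in pandas, assuming a DataFrame df in which 0 and -1 act as placeholders in a (hypothetical) column 'age':

import numpy as np

# Replace placeholder values with proper missing values (NaN)
# Note: this converts an integer column to float, since int cannot hold NaN
df['age'] = df['age'].replace({0: np.nan, -1: np.nan})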
A few exceptions
Some models can handle missing values almost natively, especially tree-based models such as Decision Trees, XGBoost, LightGBM, and CatBoost. These models can often work well with missing values, or with very simple imputation.
Missing data mechanism
The nature of missing values plays a key role in deciding how to handle them [1][2]. In some cases, the very fact that a value is missing can hold useful information, and may even be treated as a feature in itself.
To better understand how missingness behaves, let’s explore the three common mechanisms, each with a brief example to illustrate.
Missing completely at random (MCAR)
The examiner accidentally spilled ink on part of the questionnaire, making an answer unreadable.
In this case, the probability of a value being missing is completely unrelated to both observed and unobserved data. Missingness occurs purely by chance, and there is no systematic pattern behind it.
This is the strongest and most desirable assumption, because it means the missing data are essentially harmless.
- Key point: Probability of missingness is the same for all cases, entirely due to randomness.
Missing at random (MAR)
People with a certain belief or political view choose not to answer a question — and their belief is recorded in the dataset.
With MAR, the probability of a value being missing depends on observed data, but not on the missing value itself. In other words, missingness is systematic, but predictable — based on information we already have.
Example: In a university survey, students from a specific faculty tend to skip a particular question. Within each faculty, the probability of missingness is consistent.
- Key point: Probability of missingness varies across groups, but can be explained by known variables.
Missing not at random (MNAR)
People with a certain belief or political view choose not to answer a question — and their belief is not recorded.
This is the most difficult type to deal with. Here, the probability of a value being missing depends on unobserved data — often the missing value itself.
Example:
If individuals with higher incomes are less likely to report their income, and we have no other way to estimate that income, the data are MNAR. This mechanism can introduce serious bias and distort model performance.
- Key point: Probability of missingness depends on information we don’t observe — making it hard to handle or correct.
Handling missing values
When dealing with missing values, there are a few strategies you can take:
- Remove the missing values: by dropping rows (samples) or columns (features)
- Simple imputation: fill in missing values using simple statistics from the same feature (e.g. the mean, median, mode, or a fixed value)
- Conditional imputation: estimate missing values using information from other features
- Do nothing: let the model handle missing values natively (supported by some algorithms like XGBoost, LightGBM, or CatBoost)
Each of these approaches comes with its own advantages and limitations. Let’s take a closer look at how they work — and when to use them.
Removing features
If a feature contains too many missing values, it may be best to remove the feature entirely. This can be a good choice when:
- The feature is missing in most rows
- It provides little or no predictive value
- There is no information in the value being missing (not MNAR)
- There are better alternatives in the dataset
While there's no strict threshold, it's common to consider dropping a feature when more than 50–70% of its values are missing. Always weigh the context and importance of the feature before removing it. A quick way to screen for such features is sketched below.
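A minimal sketch, assuming X_train is a pandas DataFrame; the 60% threshold is an arbitrary choice within that range:

# Fraction of missing values per feature
missing_fraction = X_train.isna().mean()

# Features where more than 60% of the values are missing
candidates = missing_fraction[missing_fraction > 0.6].index.tolist()

# Review the candidates manually before dropping them
X_train_reduced = X_train.drop(columns=candidates)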
Dropping samples
The simplest way to deal with missing values is to remove the entire row that contains them. This approach is often perfectly fine when only a small number of values are missing relative to the dataset, and the missing value does not hold information (not MNAR).
However, if not done carefully, this can introduce bias into your model.
For example:
Let’s say you’re analysing patient data, and only 1% of the samples are missing a particular feature. It might be tempting to simply remove those rows.
But what if those missing values occurred because the patients were in such a weakened state that collecting the measurement was too invasive or unsafe? By dropping those rows, you'd be systematically removing a specific subgroup; potentially leading to a model that doesn't account for those most at risk.
In this case, the missingness itself might contain important information about the target — and removing it could result in a biased or incomplete model.
Missing indicator feature
In many cases, the fact that a value is missing can actually carry meaningful information. For example, missing values might correlate with a certain class, behaviour, or outcome.
When this is the case, it can be helpful to add a missing indicator feature (sometimes called a mask) — a new column that flags whether the original value was missing.
This allows your model to learn patterns from the presence or absence of data itself, not just the values.
When is this useful? Adding a missing indicator is especially helpful when your missing values follow a:
- MAR (Missing at Random) pattern — where missingness depends on other observed features
- MNAR (Missing Not at Random) pattern — where missingness depends on unobserved data, often the missing value itself
In both cases, the fact that a value is missing can be informative — and the indicator gives the model access to that signal.
On the other hand, if the data are MCAR (Missing Completely at Random) — meaning the missingness is purely random — then the indicator likely won’t help and may just introduce noise.
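As a minimal sketch in pandas, assuming X_train contains a (hypothetical) column 'income' with missing values:

# Flag whether the original value was missing
X_train['income_missing'] = X_train['income'].isna()

scikit-learn offers the same functionality through its MissingIndicator class and the add_indicator option of its imputers; see the code section below.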
Simple imputer
The simplest strategy is to replace missing values in each column with a single value: the mean, the median, the most frequent value (mode), or a constant such as -999 (commonly used with tree-based models).
Although this technique is simple, it can be surprisingly effective — especially when the amount of missing data is small.
- For numerical features, the mean or median is often used.
- For categorical features, the most frequent value is a good default.
- For tree-based models, using an extreme value (like -999) can help the model treat "missing" as a separate condition.
- The choice between mean and median follows the usual rule of thumb: Use the mean by default, and switch to the median when your data is heavily skewed or contains extreme outliers.
There are, however, some downsides:
- Mean/median imputation reduces variance and can distort correlations, which might bias models sensitive to data distribution (e.g., linear regression).
- It assumes the data are MCAR; if missingness is MAR or MNAR, it may bias results.
- For categorical imputation with the mode, it can inflate the frequency of the most common category.
K-nearest neighbours imputer
The KNN imputer estimates missing values based on the values of the k nearest neighbours (rows with similar feature values).
This approach is more flexible and often yields better estimates than simple imputers — especially when the dataset contains natural clusters or patterns.
Iterative imputer
The most advanced imputation method is the Iterative Imputer. Here’s how it works:
- Start by filling missing values with a simple strategy (e.g. mean).
- For each column with missing data, use the other columns to predict the missing values. Use any regression model you like (e.g. linear regression, XGBoost, etc.).
- Repeat the process iteratively, refining the estimates step by step until results stabilise.
This method is often more accurate, especially when relationships between variables are strong and non-linear.
MICE: Multivariate Imputation by Chained Equations
We've looked at single imputation techniques, where each missing value is filled in once, resulting in a single completed dataset.
However, in statistical analysis, it's common to use multiple imputation methods. This typically involves predicting missing values multiple times, often using a bootstrapping approach, to generate several plausible versions of the dataset.
Each version is analysed separately, and the results are then pooled to account for the uncertainty introduced by the missing data.
This approach is especially valuable in inference-based tasks, where it's important to:
- Capture variability due to imputation
- Estimate uncertainty more accurately
- Avoid underestimating confidence intervals
While multiple imputation is widely used in statistics, it's less common in machine learning, where the focus is usually on prediction rather than statistical inference.
This approach is known as Multivariate Imputation by Chained Equations (MICE) [3].
You can enable this in sklearn by setting sample_posterior=True in the IterativeImputer() constructor.
The iterative imputer is actually inspired by MICE, and not the other way around.
Do nothing
Sometimes, you don’t need to handle missing values at all, especially when working with tree-based models.
Why? Because tree algorithms make decisions by splitting the data based on feature values — deciding whether a sample goes to the left or right child node.
If a value is missing, the algorithm simply determines whether sending it left or right improves performance — and learns that rule during training. This allows the model to handle missing values natively, with no need for explicit imputation.
It’s a clever and efficient strategy (sketched below), especially useful when:
- You're using models like XGBoost, LightGBM, or CatBoost
- Missingness might carry information
- You want to avoid preprocessing overhead
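A minimal sketch using scikit-learn's HistGradientBoostingClassifier, which supports missing values natively and is inspired by LightGBM (assuming training data X_train and y_train, and test data X_test, with NaN values present):

from sklearn.ensemble import HistGradientBoostingClassifier

# No imputation needed: during training, the model learns for each split
# whether samples with missing values should go left or right
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

XGBoost, LightGBM, and CatBoost can be used in the same way: fit them directly on data containing missing values.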
MIA – Missing Incorporated in Attributes
Some tree-based algorithms (like CatBoost) support an enhanced version of this idea known as Missing Incorporated in Attributes (MIA) [4].
Rather than simply deciding whether missing values go left or right, the model can learn to treat “missing” as its own condition — essentially creating a separate split for missing vs. not missing.
What to use?
We’ve explored a variety of methods for handling missing values — but when should you use one over the other?
Tree-based models
Tree-based models such as XGBoost, LightGBM, and CatBoost can natively handle missing values. In fact, trying to impute missing values manually may harm performance if it erases informative patterns.
So if you’re using a tree model, you can usually leave the missing values in and let the model handle them for you.
Missing values contain information
If your missing values are not missing completely at random (MCAR) — which is almost always the case — then the fact that a value is missing may be informative in itself.
In such cases, the absence of a value might reflect an underlying condition or decision. A good strategy is to add a missing value indicator: a new column that marks whether a value was missing or not. Note: this introduces extra features, which could lead to overfitting and hurt your model.
Predictive vs inference goals
In this course, we’re focused on prediction — where accuracy matters most.
However, if your goal is inference (e.g. estimating effects, testing hypotheses, or drawing conclusions from the data), then accurate imputation of missing values becomes more important. You’re not just interested in the target — you also want to analyse the features.
In such cases, use methods like:
- MICE (Multivariate Imputation by Chained Equations), especially in R
- IterativeImputer(sample_posterior=True) in scikit-learn
These approaches preserve uncertainty and are more appropriate for inference tasks.
Complexity of preprocessing
Using methods like iterative imputation or KNN imputation might sound great in theory, but they can add significant complexity to your workflow. For many practical purposes, the added effort isn’t worth it — especially when a simpler method performs just as well.
Complexity of the model
If you’re using a simple model (e.g. linear regression), it can help to use a more advanced imputation method (e.g. iterative imputation with a tree-based estimator) to compensate. But if your predictive model is already highly flexible (e.g. trees or neural networks), a simple imputation strategy combined with a missing value indicator often works just as well, and is much easier to maintain.
Bootstrapping
Where does bootstrapping fit into all this?
It’s another way to reduce imputation variance and make models more robust to missingness. It’s a fairly involved technique, so it’s left out for now. You can read more about it in [5] and [6].
Code
Dropping values
# Drop features
X_train_dropped = X_train.drop(columns=['Nitric oxides concentraion', 'High way accessibility'],
                               inplace=False)

# Drop rows
X_train_dropped = X_train.dropna(axis=0,      # Drop rows (axis=1 would drop columns)
                                 # how='any', # Drop if any NA values are present (alternative: 'all')
                                 thresh=9,    # Keep only rows with at least 9 non-NA values
                                 # subset=['Crime rate', 'Residential land zoned for lots', 'non-retail business'], # Only consider these columns
                                 inplace=False)
Simple imputer
Using pandas
# Mean imputation: compute statistics on the training set only
X_train_imputed = X_train.fillna(X_train.mean())
X_test_imputed = X_test.fillna(X_train.mean()) # Use the train mean to prevent data leakage!

# Alternatives
X_train_imputed = X_train.fillna(X_train.median()) # Median
X_train_imputed = X_train.fillna(X_train.mode().iloc[0]) # Most frequent value
X_train_imputed = X_train.fillna(-999) # Constant value
Using sklearn
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='mean', # 'median', 'most_frequent', 'constant'
    fill_value=-999, # Only used for the 'constant' strategy
    add_indicator=False # If True, adds a boolean column for each feature with missing values
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=imputer.get_feature_names_out())
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns=imputer.get_feature_names_out())
KNN imputer
Applying it directly:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(
    missing_values=np.nan,
    n_neighbors=5,
    weights='uniform', # Neighbour weighting: 'uniform' or 'distance'
    metric='nan_euclidean', # Distance metric that ignores missing values
    add_indicator=False # If True, adds a boolean column for each feature with missing values
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
With scaling on the numerical features
Since KNN relies on distances between samples, the numerical features should be brought to comparable scales before imputing.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# numerical_features is assumed to be a list of the numerical column names

# First the scaling
preprocessing = ColumnTransformer([
    ('scaler', scaler, numerical_features)
], remainder='passthrough')

# Then the imputation
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('imputer', KNNImputer())
])
pipeline.fit(X_train)

# ColumnTransformer outputs the scaled numerical features first, then the passthrough columns
columns = list(numerical_features) + [c for c in X_train.columns if c not in numerical_features]
X_train_processed = pd.DataFrame(pipeline.transform(X_train), columns=columns)
X_test_processed = pd.DataFrame(pipeline.transform(X_test), columns=columns)
Iterative imputer
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer # Note: this is required because the class is experimental
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(
    estimator=BayesianRidge(), # Any regressor works, e.g. RandomForestRegressor(); the default is BayesianRidge()
    missing_values=np.nan,
    sample_posterior=False, # Only for probabilistic models that can sample from the posterior
    max_iter=10,
    tol=1e-3,
    n_nearest_features=None,
    initial_strategy='mean',
    imputation_order='ascending', # 'descending', 'roman', 'arabic', 'random': the order in which features are imputed in each round
    skip_complete=True, # Skip iterative imputation for features that had no missing values during fitting
    min_value=None,
    max_value=None,
    verbose=0,
    random_state=42,
    add_indicator=False
)
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
Sampling posterior
If you set sample_posterior=True and use an estimator whose predict method supports return_std=True (such as BayesianRidge), the imputer samples from the predictive posterior instead of using a point estimate.
As a result, every transformation with the imputer will produce different imputed values.
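A minimal sketch of drawing several plausible completed datasets this way, reusing the imports from the block above and assuming X_train from earlier (the number of draws, 5, is arbitrary):

imputed_datasets = []
for seed in range(5):
    imputer = IterativeImputer(
        estimator=BayesianRidge(), # Supports return_std=True
        sample_posterior=True,     # Sample from the predictive posterior
        random_state=seed,         # Different seed for each draw
    )
    X_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
    imputed_datasets.append(X_imputed)

# Each completed dataset can now be analysed separately and the results pooled, as in MICE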
Adding indicator
This only returns the indicator columns:
import pandas as pd
from sklearn.impute import MissingIndicator

indicator = MissingIndicator()
indicator.fit(X_train)
X_train_missing_indicator = pd.DataFrame(indicator.transform(X_train), columns=indicator.get_feature_names_out())
All the imputation classes in sklearn have an add_indicator setting. Setting add_indicator=True is roughly equivalent to doing:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, MissingIndicator

preprocessing = ColumnTransformer(transformers=[
    ('imputer', SimpleImputer(), list(X_train.columns)),
    ('missing_indicator', MissingIndicator(), list(X_train.columns))
])
preprocessing.fit(X_train)
columns = list(X_train.columns) + list(preprocessing.named_transformers_['missing_indicator'].get_feature_names_out())
X_train_preprocessed = pd.DataFrame(preprocessing.transform(X_train), columns=columns)
Footnotes
1. Little, R. J. A., & Rubin, D. B. Statistical Analysis with Missing Data. Wiley.
2. van Buuren, S. Flexible Imputation of Missing Data. CRC Press.
3. van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
4. Twala, B. E. T. H., Jones, M. C., & Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29(7), 950–956.
5. Khan, S. S., Ahmad, A., & Mihailidis, A. Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data. arXiv.
6. Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., & Poline, J.-B. (2022). Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, giac013.