
Categorical features

Machine learning models rely on maths; and maths relies on numbers. But real-world data often includes all sorts of non-numerical values, such as:

  • City names stored as text
  • Words or labels in general
  • Postcodes, which may look like numbers but shouldn’t be treated as such

These are known as categorical features, and they require a different approach when preparing your data for modelling.

Introduction

Examples of categorical features

Emotional state: Values like Happy, Neutral, and Sad could be placed along a spectrum; so they can be treated as ordinals. However, when dealing with a larger range of emotional states that don’t fit neatly into a linear order, it’s better to treat them as nominal categories.

Country: Countries are highly diverse and don’t follow any inherent order, so they should be handled as nominal features.

Colours: In a visual context, colours are usually considered nominal. However, in some situations they can be seen as ordinal (e.g. light to dark), or even represented by their RGB values, especially in fields like image recognition.

Words: Words can be thought of as categorical features, but they’re often handled as sequences of ordinal tokens (especially in NLP tasks). Each word, or even each character, can be encoded based on position or frequency, depending on the approach.

This table showcases three types of categorical data: Education Level (an ordinal variable reflecting academic attainment), Department (a nominal variable with a limited set of categories), and City (a nominal variable with a broader range of distinct values).
   Education_Level   Department    City
0  High School       HR            London
1  Bachelor          Engineering   Manchester
2  Master            Marketing     Birmingham
3  PhD               HR            Liverpool
4  High School       Sales         Leeds
5  Master            Engineering   Bristol
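
For reference, this table could be built as a pandas DataFrame, with Education_Level marked as an ordered categorical (a minimal sketch; the DataFrame name df and the chosen ordering are assumptions):

import pandas as pd

df = pd.DataFrame({
    'Education_Level': ['High School', 'Bachelor', 'Master', 'PhD', 'High School', 'Master'],
    'Department': ['HR', 'Engineering', 'Marketing', 'HR', 'Sales', 'Engineering'],
    'City': ['London', 'Manchester', 'Birmingham', 'Liverpool', 'Leeds', 'Bristol'],
})

# Make the ordinal nature of Education_Level explicit (assumed ordering)
df['Education_Level'] = pd.Categorical(
    df['Education_Level'],
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    ordered=True,
)

print(df.dtypes)  # Department and City stay as plain object (nominal) columns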

Category value counts

It’s not just the number of categories that matters; the frequency of each category is also important.

You can’t learn much from just one example.

If a category appears only once or twice in your dataset, it may be difficult (or even impossible) for your model to learn anything meaningful from it. In some cases, depending on the technique you’re using, you might be able to keep rare categories; for example, with regularised target encoding.

But more often than not, it’s better to remove rare categories, group them together, or replace them with a placeholder like "Other" to improve the model's generalisation and stability.
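
As a rough sketch, grouping rare categories with pandas could look like this (the DataFrame name df, the column name City, and the threshold of 10 are all illustrative assumptions):

# Count how often each category occurs
counts = df['City'].value_counts()
rare = counts[counts < 10].index  # categories seen fewer than 10 times

# Replace rare categories with a single placeholder so each remaining value has enough support
df['City'] = df['City'].where(~df['City'].isin(rare), other='Other')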

Methods


Two line charts show test R² vs. training set size (log-scaled) for Linear Regression and XGBoost under three categorical encoders (Ordinal, One-Hot, and Target), with shaded error bars across repeats.

Effect of categorical encoding on model performance across different sample sizes. For linear regression, one-hot encoding underperforms at small sample sizes due to the expanded feature space but improves as the dataset grows. Target encoding performs well in the low-data regime but tends to degrade as sample size increases. For XGBoost, both ordinal and one-hot encoding achieve strong performance when sufficient data is available.

Label encoding (ordinal encoding)

Label encoding categorical data

When working with ordinal features, you can assign them a ranked numerical order. For example:

\text{happy} \rightarrow 1,\quad \text{neutral} \rightarrow 0,\quad \text{sad} \rightarrow -1

The benefit of this approach is that it’s compact — you don’t create extra features, which helps keep your dataset simpler and denser.

If you're using a linear model, it's important that the steps between values are meaningful and reflect actual differences. In contrast, tree-based models don’t care about the exact numerical distance; only the order matters. Neural networks fall somewhere in between: they can learn these differences but may still be influenced by the scale.
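
For ordinal features, such a mapping can be applied by hand with pandas. A minimal sketch, assuming a DataFrame df with a hypothetical mood column:

# Manual label (ordinal) encoding: the numeric steps encode the assumed order
mood_order = {'sad': -1, 'neutral': 0, 'happy': 1}
df['mood_encoded'] = df['mood'].map(mood_order)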

This technique isn’t suitable for nominal features, and here’s why:

Let’s say you assign:

\text{good} \rightarrow 83,\quad \text{bad} \rightarrow 84,\quad \text{awesome} \rightarrow 85

Your model would treat "bad" as if it sits between "good" and "awesome" — which makes no sense in this context. That’s because label encoding implies an order, even if none exists.

You could manually assign more meaningful values, or use methods that better reflect the nature of the data; we’ll cover those in the next sections.

Target encoding (mean encoding)

Target encoding data

A slightly more advanced technique is target encoding, also known as mean encoding. Instead of assigning arbitrary or ordered numbers to categories, this method uses the average value of the target for each category.

Unlike ordinal encoding, target encoding works well for both ordinal and nominal features — including those without any natural order, like postal codes or product IDs.

Simple example

Let’s walk through a simple example to see how it works.

Suppose we’re trying to predict someone’s happiness based on the city they live in, their income, and their age. In this case, the city feature is clearly nominal — there’s no meaningful order to the values. Using target encoding, we’d calculate the average happiness for each city in the training data and use that value to represent the city.

For example:

    city  happiness_score  income_level  age
0  Tokyo                7         86844   19
1  Paris                6         85586   47
2  Paris                4         40502   65
3  Paris                4         54585   69
4  Tokyo                8         48375   59

We can then calculate the average happiness score per city, and use that value as the input for our model instead of the raw city name.

However, this step is part of model training, so it’s important that the test data is kept completely separate. If we include test data when calculating the averages, we risk data leakage, which can lead to overly optimistic performance and unreliable results.
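
A minimal sketch of this step, assuming the rows above live in a training DataFrame called train_df and the held-out rows in test_df (both names are placeholders):

# Mean happiness per city, computed on the training data only
city_means = train_df.groupby('city')['happiness_score'].mean()

# Map those means back onto the training and test rows
train_df['city_happiness'] = train_df['city'].map(city_means)
test_df['city_happiness'] = test_df['city'].map(city_means)

# Cities that never appear in the training data end up as NaN;
# one option (an assumption, not covered above) is to fall back to the global mean
test_df['city_happiness'] = test_df['city_happiness'].fillna(train_df['happiness_score'].mean())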

city
Amsterdam    6.250000
New York     6.411765
Paris        6.360000
Tokyo        5.769231
Name: happiness_score, dtype: float64

    city  happiness_score  income_level  age  city_happiness
0  Tokyo                7         86844   19        5.769231
1  Paris                6         85586   47        6.360000
2  Paris                4         40502   65        6.360000
3  Paris                4         54585   69        6.360000
4  Tokyo                8         48375   59        5.769231

Smoothing

Target encoding works well when each category has enough data (support). However, for rare categories, the average target value can be unreliable — leading to high variance and potential overfitting.

A common way to handle this is by combining both a local component (the category-specific mean) and a global component (the overall mean across all data). These are blended using a smoothing factor, which controls how much we trust the category's own data versus the global average.

For a binary classification target, the smoothed target encoding for a category can be calculated as:

\begin{align*}
S_i &= \lambda_i (\text{category component}) + (1 - \lambda_i) (\text{global component}) \\
    &= \lambda_i \frac{n_{iY}}{n_i} + (1 - \lambda_i) \frac{n_Y}{n}
\end{align*}

where

  • S_i is the encoding for category i,
  • n_{iY} is the number of observations with Y = 1 and category i,
  • n_i is the number of observations with category i,
  • n_Y is the number of observations with Y = 1,
  • n is the total number of observations, and
  • \lambda_i is a shrinkage factor for category i.

The shrinkage factor is given by:

\lambda_i = \frac{n_i}{m + n_i}

where

  • m is a smoothing factor, which is controlled with the smooth parameter in TargetEncoder.
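
To make the formula concrete, here is a rough hand-rolled version (the function name, the default m, and the use of pandas are assumptions; in practice scikit-learn’s TargetEncoder handles this via its smooth parameter):

import pandas as pd

def smoothed_target_encoding(categories: pd.Series, y: pd.Series, m: float = 10.0) -> pd.Series:
    # Blend the per-category mean with the global mean, weighted by lambda_i = n_i / (m + n_i)
    global_mean = y.mean()                                # n_Y / n for a binary target
    stats = y.groupby(categories).agg(['mean', 'count'])  # per-category mean and n_i
    lam = stats['count'] / (m + stats['count'])
    encoding = lam * stats['mean'] + (1 - lam) * global_mean
    return categories.map(encoding)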
Two-panel plot showing the effect of the smoothing parameter in target encoding on model performance. The left panel displays multiple lines per setting, while the right panel shows average trends for different sample sizes.

Effect of smoothing in target encoding for categorical variables. This figure shows how the smoothing parameter influences model performance when encoding cut, color, and clarity in the diamonds dataset. The left panel shows individual model scores across multiple random seeds, revealing variability for small datasets. The right panel summarises average performance by sample size. Smoothing balances between using class-specific and global target means, helping avoid overfitting in small samples and underfitting in large ones. In this case the effect is mostly minimal, but it can have a big impact in other settings.

K-fold

When you use the target to improve your features — as in target encoding — you need to be careful. You're introducing information from the target into the input, which can lead to data leakage.

This problem becomes especially severe when dealing with small category sizes. Let’s take an extreme example: a category with only one sample. In that case, the category’s encoded value will be exactly the same as the target — which essentially gives the model the answer.

Even with a dummy variable and slightly larger categories (say, 5 samples per category), the encoded value can still leak a lot of information from the target. As a result, this feature might appear far more predictive than it truly is, and the model can overfit.

To prevent this, we can use a K-fold target encoding approach. Here’s the idea:

  1. You divide the training data into K folds.
  2. For each fold, you compute the target encoding using only the other folds (i.e. excluding the fold you're currently encoding).
  3. This way, no row ever contributes to the calculation of its own encoded value.

This technique removes data leakage at the cost of introducing a bit of noise into the feature; but that’s usually a good trade-off.
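
A rough sketch of the idea (the helper name and the fallback to the global mean are assumptions; scikit-learn’s TargetEncoder applies a similar cross-fitting scheme internally when you call fit_transform):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(categories: pd.Series, y: pd.Series, n_splits: int = 5) -> pd.Series:
    # Encode each row using target means computed on the *other* folds only
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(categories):
        fold_means = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = categories.iloc[enc_idx].map(fold_means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean
    return encoded.fillna(y.mean())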

Line plot showing model performance versus sample size for target encoding with and without K-fold logic. Two curves represent the encoding strategies, with performance gaps largest at small sample sizes.

Effect of K-fold logic in target encoding on model performance. This figure compares model scores when encoding categorical features using target encoding, with and without K-fold logic. K-fold encoding avoids target leakage by ensuring that target values used for encoding are not also used for fitting. The gap between methods is most pronounced at small sample sizes, where leakage has a greater impact. Using K-fold encoding leads to more reliable generalisation estimates, especially in low-data regimes.

One hot encoding

One hot encoding data

Some categorical values can’t be mapped to a linear scale without losing important information. In such cases, a better approach is to assign each category its own dimension.

For example:

\text{happy} \rightarrow [1, 0, 0],\quad \text{neutral} \rightarrow [0, 1, 0],\quad \text{sad} \rightarrow [0, 0, 1]

The main advantage of one hot encoding is that there is no information loss, and the order of the categories doesn’t matter — since they’re treated independently.

The main drawback, however, is that it creates very sparse data, especially when there are many categories.

Sparsity and density
  • Sparsity: A property of a dataset or matrix where most elements are zero or empty. Sparse structures are common in high-dimensional data like text, images, or recommender systems.
  • Dense: The opposite of sparse; most elements are non-zero or filled.
  • Sparse matrix: A matrix in which the majority of elements are zero. Special storage and computation techniques are often used to handle these efficiently.
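
As a small illustration of the happy/neutral/sad mapping above (the mood column and values are hypothetical), scikit-learn’s encoder returns a sparse matrix by default:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

moods = pd.DataFrame({'mood': ['happy', 'neutral', 'sad', 'happy']})

encoder = OneHotEncoder()                # returns a SciPy sparse matrix by default
one_hot = encoder.fit_transform(moods)

print(encoder.get_feature_names_out())   # ['mood_happy' 'mood_neutral' 'mood_sad']
print(one_hot.toarray())                 # dense view: one column per category, mostly zeros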

Dropping one feature

It’s common practice to drop one of the encoded features after applying one hot encoding. This is because one of the columns is usually redundant: its value can be inferred from the others.

For example, suppose you have a column for RSVP responses with three options: yes, no, and maybe (and let’s assume everyone responds, for simplicity). If you one-hot encode it into two columns, yes and no, then [0, 0] must mean maybe, so there’s no need for a third column.

However, this logic breaks down if you have missing values. In that case, a [0, 0] row might mean missing, not maybe — and dropping the third category could result in information loss. So always consider the context before dropping a column.
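
A quick sketch of the RSVP example with pandas (the column name rsvp is illustrative); get_dummies with drop_first=True drops one of the encoded columns:

import pandas as pd

rsvp = pd.DataFrame({'rsvp': ['yes', 'no', 'maybe', 'yes']})
print(pd.get_dummies(rsvp, drop_first=True))
# Only rsvp_no and rsvp_yes remain; a row with both set to 0 means 'maybe'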

Embedding

One hot encoding data

One hot encoding is nice, but it results in a matrix with many 0 values (a sparse matrix). Could we make it more dense? In other words, can we express every city as a list of numbers (a vector) that tells us something about that city?

Well, you probably know where this is going: yes, you can! It’s called embedding. It’s frequently used in deep learning, but not so much in combination with other models, so we won’t go into more depth about this technique in this course.
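
Just to give a flavour (PyTorch is not used elsewhere in this course, and the sizes below are placeholders), an embedding layer maps each category index to a learned dense vector:

import torch
import torch.nn as nn

# Hypothetical setup: 4 cities, each mapped to a learned 3-dimensional vector
city_to_index = {'Amsterdam': 0, 'New York': 1, 'Paris': 2, 'Tokyo': 3}
embedding = nn.Embedding(num_embeddings=4, embedding_dim=3)

indices = torch.tensor([city_to_index['Paris'], city_to_index['Tokyo']])
print(embedding(indices))  # a (2, 3) tensor; the vector values are learned during training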

Scatter plot showing a 2D projection of word embeddings learned from the Reuters news dataset. Each point represents a word positioned according to its embedding values. A few words are annotated for clarity, with extreme values in the first or second embedding dimension highlighted in green, red, blue, or orange to indicate their relative positions in the embedding space.

Example of representing categorical features in an embedding space, here using words from the Reuters dataset. Each point corresponds to a word embedded into a multi-dimensional vector space, with two dimensions shown. Words positioned close together are more similar in the learned representation. This is comparable to label encoding, but in multiple dimensions. The colours are just for visual purposes and do not carry any meaning.

Code


Label encoding

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Assumes X_train and categorical_features (here: cut, color, clarity from the diamonds dataset) are already defined

# This is optional but allows you to set the order
cuts = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
colors = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarities = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
custom_order = [cuts, colors, clarities]

preprocessing = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(categories=custom_order), categorical_features)
    ],
    remainder='passthrough'
)

preprocessing.fit(X_train)

X_train_encoded = pd.DataFrame(
    preprocessing.transform(X_train),
    # columns=preprocessing.get_feature_names_out(),  # Normally you would do this, but the columns are unchanged
    columns=X_train.columns,  # So you can do this instead
    index=X_train.index
)

One hot encoding

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first', sparse_output=False)

preprocessing = ColumnTransformer(
    transformers=[
        ('cat', encoder, categorical_features)
    ],
    remainder='passthrough'
)

preprocessing.fit(X_train)

pd.DataFrame(preprocessing.transform(X_train),
             columns=preprocessing.get_feature_names_out(),
             index=X_train.index).head()

Target encoding

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder(target_type='continuous')  # Use 'continuous' only if the target is numerical

preprocessing = ColumnTransformer(
    transformers=[
        ('cat', encoder, categorical_features)
    ],
    remainder='passthrough'
)

preprocessing.fit_transform(X_train, y_train)
