Categorical features
Machine learning models rely on maths, and maths relies on numbers. But real-world data often includes all sorts of non-numerical values, such as:
- City names stored as text
- Words or labels in general
- Postcodes, which may look like numbers but shouldn’t be treated as such
These are known as categorical features, and they require a different approach when preparing your data for modelling.
Introduction
Examples of categorical features
Emotional state
Values like Happy, Neutral, and Sad could be placed along a spectrum; so they can be treated as ordinals.
However, when dealing with a larger range of emotional states that don’t fit neatly into a linear order, it’s better to treat them as nominal categories.
Country
Countries are highly diverse and don’t follow any inherent order, so they should be handled as nominal features.
Colours
In a visual context, colours are usually considered nominal. However, in some situations they can be seen as ordinal (e.g. light to dark), or even represented by their RGB values, especially in fields like image recognition.
Words
Words can be thought of as categorical features, but they’re often handled as sequences of ordinal tokens (especially in NLP tasks). Each word, or even each character, can be encoded based on position or frequency, depending on the approach.
| | Education_Level | Department | City |
|---|---|---|---|
| 0 | High School | HR | London |
| 1 | Bachelor | Engineering | Manchester |
| 2 | Master | Marketing | Birmingham |
| 3 | PhD | HR | Liverpool |
| 4 | High School | Sales | Leeds |
| 5 | Master | Engineering | Bristol |
Category value counts
It’s not just the number of categories that matters; the frequency of each category is also important.
You can’t learn much from just one example.
If a category appears only once or twice in your dataset, it may be difficult (or even impossible) for your model to learn anything meaningful from it. In some cases, depending on the technique you’re using, you might be able to keep rare categories; for example, with regularised target encoding.
But more often than not, it’s better to remove rare categories, group them together, or replace them with a placeholder like "Other" to improve the model's generalisation and stability.
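A minimal sketch of the grouping approach, assuming your data is in a pandas DataFrame df with a city column (the threshold of 10 is an arbitrary choice):

# Find categories that occur fewer than 10 times and replace them with "Other"
counts = df['city'].value_counts()
rare = counts[counts < 10].index
df['city'] = df['city'].mask(df['city'].isin(rare), 'Other')

scikit-learn can also handle this for you: OneHotEncoder has a min_frequency parameter that groups infrequent categories together automatically.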
Methods
Effect of categorical encoding on model performance across different sample sizes. For linear regression, one-hot encoding underperforms at small sample sizes due to the expanded feature space but improves as the dataset grows. Target encoding performs well in the low-data regime but tends to degrade as sample size increases. For XGBoost, both ordinal and one-hot encoding achieve strong performance when sufficient data is available.
Label encoding (ordinal encoding)
When working with ordinal features, you can assign them a ranked numerical order. For example:
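A minimal sketch, assuming the table above is loaded as a DataFrame df (the exact integers are just an illustrative choice):

# Map each education level to its rank, from lowest to highest
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['Education_Level'] = df['Education_Level'].map(education_order)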
The benefit of this approach is that it’s compact — you don’t create extra features, which helps keep your dataset simpler and denser.
If you're using a linear model, it's important that the steps between values are meaningful and reflect actual differences. In contrast, tree-based models don’t care about the exact numerical distance; only the order matters. Neural networks fall somewhere in between: they can learn these differences but may still be influenced by the scale.
This technique isn’t suitable for nominal features, and here’s why:
Let’s say you assign: awesome = 0, bad = 1, good = 2 (the alphabetical order a naive label encoder would produce).
Your model would treat "bad" as if it sits between "good" and "awesome" — which makes no sense in this context. That’s because label encoding implies an order, even if none exists.
You could manually assign more meaningful values, or use methods that better reflect the nature of the data; we’ll cover those in the next sections.
Target encoding (mean encoding)
A slightly more advanced technique is target encoding, also known as mean encoding. Instead of assigning arbitrary or ordered numbers to categories, this method uses the average value of the target for each category.
Unlike ordinal encoding, target encoding works well for both ordinal and nominal features — including those without any natural order, like postal codes or product IDs.
Simple example
Let’s walk through a simple example to see how it works.
Suppose we’re trying to predict someone’s happiness based on the city they live in, their income, and their age. In this case, the city feature is clearly nominal — there’s no meaningful order to the values. Using target encoding, we’d calculate the average happiness for each city in the training data and use that value to represent the city.
For example:
| | city | happiness_score | income_level | age |
|---|---|---|---|---|
| 0 | Tokyo | 7 | 86844 | 19 |
| 1 | Paris | 6 | 85586 | 47 |
| 2 | Paris | 4 | 40502 | 65 |
| 3 | Paris | 4 | 54585 | 69 |
| 4 | Tokyo | 8 | 48375 | 59 |
We can then calculate the average happiness score per city, and use that value as the input for our model instead of the raw city name.
However, this process is part of model training, so it’s important that the test data is kept completely separate during this step. If we include test data when calculating the averages, we risk data leakage, which can lead to overly optimistic performance and unreliable results.
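On the training data only, a minimal sketch looks like this (assuming the table above is a DataFrame called train):

# Average happiness per city, computed on the training data only
city_means = train.groupby('city')['happiness_score'].mean()
print(city_means)

# Map the per-city averages back onto each row as the encoded feature
train['city_happiness'] = train['city'].map(city_means)
train.head()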
city
Amsterdam 6.250000
New York 6.411765
Paris 6.360000
Tokyo 5.769231
Name: happiness_score, dtype: float64
| | city | happiness_score | income_level | age | city_happiness |
|---|---|---|---|---|---|
| 0 | Tokyo | 7 | 86844 | 19 | 5.769231 |
| 1 | Paris | 6 | 85586 | 47 | 6.360000 |
| 2 | Paris | 4 | 40502 | 65 | 6.360000 |
| 3 | Paris | 4 | 54585 | 69 | 6.360000 |
| 4 | Tokyo | 8 | 48375 | 59 | 5.769231 |
Smoothing
Target encoding works well when each category has enough data (support). However, for rare categories, the average target value can be unreliable — leading to high variance and potential overfitting.
A common way to handle this is by combining both a local component (the category-specific mean) and a global component (the overall mean across all data). These are blended using a smoothing factor, which controls how much we trust the category's own data versus the global average.
For a binary classification target, the smoothed target encoding for a category can be calculated as:

$$S_i = \lambda_i \frac{n_{iY}}{n_i} + (1 - \lambda_i) \frac{n_Y}{n}$$

where
- $S_i$ is the encoding for category $i$,
- $n_{iY}$ is the number of observations with $Y = 1$ and category $i$,
- $n_i$ is the number of observations with category $i$,
- $n_Y$ is the number of observations with $Y = 1$,
- $n$ is the total number of observations, and
- $\lambda_i$ is a shrinkage factor for category $i$.

The shrinkage factor is given by:

$$\lambda_i = \frac{n_i}{n_i + m}$$

where
- $m$ is a smoothing factor, which is controlled with the smooth parameter in TargetEncoder.
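As a rough sketch of the formula above, assuming categories and y are aligned pandas Series (the function name and the value of m are illustrative):

def smoothed_target_encode(categories, y, m=10.0):
    # Global component: the overall target mean (n_Y / n)
    global_mean = y.mean()
    # Local component: per-category mean and count (n_iY / n_i and n_i)
    stats = y.groupby(categories).agg(['mean', 'count'])
    # Shrinkage factor lambda_i = n_i / (n_i + m)
    lam = stats['count'] / (stats['count'] + m)
    # Blend the category mean with the global mean
    encoding = lam * stats['mean'] + (1 - lam) * global_mean
    return categories.map(encoding)

In scikit-learn, the same blending is controlled with TargetEncoder(smooth=...), or smooth='auto' for an empirical Bayes estimate.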
Effect of smoothing in target encoding for categorical variables. This figure shows how the smoothing parameter influences model performance when encoding cut, color, and clarity in the diamonds dataset. The left panel shows individual model scores across multiple random seeds, revealing variability for small datasets. The right panel summarises average performance by sample size. Smoothing balances between using class-specific and global target means, helping avoid overfitting in small samples and underfitting in large ones. In this case the effect is mostly minimal, but it can have a big impact in other cases.
K-fold
When you use the target to improve your features — as in target encoding — you need to be careful. You're introducing information from the target into the input, which can lead to data leakage.
This problem becomes especially severe when dealing with small category sizes. Let’s take an extreme example: a category with only one sample. In that case, the category’s encoded value will be exactly the same as the target — which essentially gives the model the answer.
Even for a dummy variable with slightly larger categories (say, 5 samples per category), the encoded value can still leak a lot of information from the target. As a result, this feature might appear far more predictive than it truly is, and the model can overfit.
To prevent this, we can use a K-fold target encoding approach. Here’s the idea:
- You divide the training data into K folds.
- For each fold, you compute the target encoding using only the other folds (i.e. excluding the fold you're currently encoding).
- This way, no row ever contributes to the calculation of its own encoded value.
This technique removes data leakage at the cost of introducing a bit of noise into the feature; but that’s usually a good trade-off.
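A minimal sketch of the idea, assuming X is a DataFrame and y an aligned Series (the function name, column argument, and number of folds are illustrative):

from sklearn.model_selection import KFold
import pandas as pd

def kfold_target_encode(X, y, column, n_splits=5):
    # Each row is encoded with target means computed on the other folds only
    encoded = pd.Series(index=X.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(X):
        fold_means = y.iloc[fit_idx].groupby(X[column].iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = X[column].iloc[enc_idx].map(fold_means).to_numpy()
    # Categories not seen in the other folds fall back to the global mean
    return encoded.fillna(y.mean())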
Effect of K-fold logic in target encoding on model performance. This figure compares model scores when encoding categorical features using target encoding, with and without K-fold logic. K-fold encoding avoids target leakage by ensuring that target values used for encoding are not also used for fitting. The gap between methods is most pronounced at small sample sizes, where leakage has a greater impact. Using K-fold encoding leads to more reliable generalisation estimates, especially in low-data regimes.
One hot encoding
Some categorical values can’t be mapped to a linear scale without losing important information. In such cases, a better approach is to assign each category its own dimension.
For example:
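A minimal illustration with pandas, reusing the city values from earlier (the scikit-learn version is shown in the Code section below):

import pandas as pd

cities = pd.DataFrame({'city': ['Tokyo', 'Paris', 'Paris', 'Amsterdam']})
print(pd.get_dummies(cities, columns=['city'], dtype=int))
#    city_Amsterdam  city_Paris  city_Tokyo
# 0               0           0           1
# 1               0           1           0
# 2               0           1           0
# 3               1           0           0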
The main advantage of one hot encoding is that there is no information loss, and the order of the categories doesn’t matter — since they’re treated independently.
The main drawback, however, is that it creates very sparse data, especially when there are many categories.
Sparsity and density
- Sparsity: A property of a dataset or matrix where most elements are zero or empty. Sparse structures are common in high-dimensional data like text, images, or recommender systems.
- Dense: The opposite of sparse; most elements are non-zero or filled.
- Sparse matrix: A matrix in which the majority of elements are zero. Special storage and computation techniques are often used to handle these efficiently.
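A small sketch of why this matters for storage (exact sizes depend on your setup):

import numpy as np
from scipy import sparse

dense = np.eye(1000)           # 1000 x 1000 matrix that is almost entirely zeros
sp = sparse.csr_matrix(dense)  # stores only the non-zero values and their positions
print(dense.nbytes)            # roughly 8 MB as a dense float64 array
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # far smaller in CSR format

This is also why scikit-learn's OneHotEncoder returns a sparse matrix by default; the code below sets sparse_output=False to get a regular array instead.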
Dropping one feature
It’s common practice to drop one of the encoded features after applying one hot encoding. This is because one of the columns is redundant: its value can be inferred from the others.
For example, suppose you have a column for RSVP responses with three options: yes, no, and maybe (and let’s assume everyone responds for simplicity).
If you one-hot encode it into two columns, yes and no, then a row with [0, 0] must mean maybe.
So there’s no need for a third column.
However, this logic breaks down if you have missing values. In that case, a [0, 0] row might mean missing, not maybe — and dropping the third category could result in information loss. So always consider the context before dropping a column.
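A quick sketch of the RSVP example with pandas (drop_first happens to drop maybe here, the alphabetically first category):

import pandas as pd

rsvp = pd.Series(['yes', 'no', 'maybe', 'yes'], name='rsvp')
print(pd.get_dummies(rsvp, drop_first=True, dtype=int))
#    no  yes
# 0   0    1
# 1   1    0
# 2   0    0   <- both zero, so this row must be 'maybe'
# 3   0    1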
Embedding
One hot encoding is nice, but it results in a matrix with many 0 values (a sparse matrix). Could we make it denser? In other words, can we express every city as a list of numbers (a vector) that tells us something about the city?
Well, I think you know where this is going: yes, you can! It’s called embedding. It’s frequently used in deep learning and not so much in combination with other models, so we won’t go into more depth about this technique in this course.
Example of representing categorical features in an embedding space, here using words from the Reuters dataset. Each point corresponds to a word embedded into a multi-dimensional vector space, with two dimensions shown. Words positioned close together are more similar in the learned representation. This is comparable with label encoding, but in multiple dimensions. The colours are just for visual purposes and do not carry any meaning.
Code
Label encoding
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, TargetEncoder

# This is optional but allows you to set the order of the categories explicitly
cuts = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
colors = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarities = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
custom_order = [cuts, colors, clarities]
preprocessing = ColumnTransformer(
transformers=[
('cat', OrdinalEncoder(categories=custom_order), categorical_features)
],
remainder='passthrough'
)
preprocessing.fit(X_train)
X_train_encoded = pd.DataFrame(preprocessing.transform(X_train),
                               # columns=preprocessing.get_feature_names_out(), # Normally you would do this, but here the column names are unchanged
                               columns=X_train.columns, # So you can do this instead (only safe if the categorical features come first, since ColumnTransformer outputs the transformed columns before the passthrough ones)
                               index=X_train.index)
One hot encoding
encoder = OneHotEncoder(drop='first', sparse_output=False)
preprocessing = ColumnTransformer(
transformers=[
('cat', encoder, categorical_features)
],
remainder='passthrough'
)
preprocessing.fit(X_train)
pd.DataFrame(preprocessing.transform(X_train), columns=preprocessing.get_feature_names_out(), index=X_train.index).head()
Target encoding
encoder = TargetEncoder(target_type='continuous')  # Use 'continuous' only when the target is numerical
preprocessing = ColumnTransformer(
transformers=[
('cat', encoder, categorical_features)
],
remainder='passthrough'
)
preprocessing.fit_transform(X_train, y_train)
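Note that TargetEncoder applies an internal cross-fitting scheme when you call fit_transform on the training data, which is the same K-fold logic described earlier; calling fit followed by transform on the same data would skip that protection, so fit_transform is the right pattern here.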