
Objectives

There are many use cases for applying machine learning to data. Whether you aim to extract insights or build powerful predictive models, understanding your objective is key.

Understanding vs. Predicting

Machine learning can be applied with two primary objectives in mind:

Explaining the data – You want to gain insights from the data to guide decisions or change policies.
Example: Which factors contribute most to customers leaving our service?

Making the best predictions – You want to use data to make highly accurate forecasts or automate tasks.
Example: Predicting customer churn with the highest precision.

The goal of applying machine learning to your data might be to understand how features influence a target (left) or to create the best predictor (right).

Some machine learning models, especially deep learning models, function as "black boxes." They may achieve high accuracy but provide little interpretability. While explainable models might perform slightly worse in prediction, they can be more valuable when understanding the underlying data is crucial.
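
The contrast can be made concrete in code. Below is a minimal sketch on synthetic data (the feature names are invented for illustration): a linear model whose coefficients you can read off, versus a more opaque model trained purely for accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset; the feature names are made up.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Objective 1: explain the data -- inspect the coefficients of a linear model.
linear = LogisticRegression().fit(X_train, y_train)
for name, coef in zip(["tenure", "monthly_fee", "support_calls", "usage"],
                      linear.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # sign and size hint at each feature's influence

# Objective 2: predict as well as possible -- accept a black box.
black_box = GradientBoostingClassifier().fit(X_train, y_train)
print("accuracy:", black_box.score(X_test, y_test))
```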

Supervised vs. Unsupervised Learning

When you have a very specific target you want your model to predict, you will probably need a lot of labeled data. However, this is not always possible, or even the goal.

Supervised learning – You have labeled examples (input-output pairs) and want to train a model to predict labels.
Examples:

  • Predicting if a customer will churn.
  • Diagnosing diseases based on patient data.
  • Estimating house prices based on features like location and size.
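
A minimal supervised-learning sketch: labeled (input, output) pairs train a model that then predicts labels for unseen inputs. The numbers below are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Inputs: [monthly_fee, support_calls]; labels: 1 = churned, 0 = stayed.
X = [[20, 0], [25, 1], [80, 5], [90, 7], [30, 1], [85, 6]]
y = [0, 0, 1, 1, 0, 1]

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(model.predict([[75, 4]]))  # predicted label for a new customer
```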

Unsupervised learning – You have no predefined labels and want to explore patterns in the data.
Examples:

  • Grouping similar customers for targeted marketing.
  • Generating new samples from an existing dataset, such as image generation.

Sometimes, semi-supervised learning is used when labeling data is expensive. This approach combines a small amount of labeled data with a large amount of unlabeled data to improve predictions.
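
As a sketch of the semi-supervised idea, scikit-learn's SelfTrainingClassifier treats samples labeled -1 as unlabeled and pseudo-labels them during training. The data here is synthetic, and the 90% "unlabeled" fraction is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Pretend 90% of the labels were too expensive to obtain.
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1  # -1 marks a sample as unlabeled

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
print("accuracy on the true labels:", model.score(X, y))
```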

Predictions

A major use of machine learning is to make predictions based on historical data. This can be broadly categorized into:

Classification

Classification models predict discrete categories based on input features.
For example, in the Iris dataset, we predict the species of a flower based on its petal and sepal measurements.

A scatter plot of sepal width against petal width. The different iris species are visibly separable.
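
A short classification sketch on the Iris dataset mentioned above, using a decision tree (the choice of classifier here is just one reasonable option):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("predicted species:", clf.predict(X_test[:3]))  # discrete class indices
```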

Regression

Regression models predict continuous numerical values instead of categories.
For example, estimating house prices based on features like location, square footage, number of bedrooms, and the average income of residents in the neighborhood.

A scatter plot showing house prices as a function of median income.
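
A short regression sketch matching the plot above, using the California housing dataset (which includes median income; scikit-learn downloads it on first use):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print("R^2 on test set:", reg.score(X_test, y_test))
print("predicted price:", reg.predict(X_test[:1]))  # a continuous number, not a class
```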

Data Exploration

Machine learning is not just for prediction—it’s also a powerful tool for exploring and understanding data.

Clustering

Clustering algorithms group data points based on similarity.
For example:

  • Customer segmentation – Grouping users for targeted advertising.
  • Product recommendation – Identifying similar items based on purchase history.

Unlike classification, clusters do not have predefined labels. They need to be interpreted manually to understand their meaning.

It's also crucial to choose the right features for clustering.
For instance, eye color is probably irrelevant if you're trying to cluster customers based on shopping behavior.
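
A minimal clustering sketch: KMeans groups customers by two behavioral features (the numbers are invented for illustration), and the resulting clusters carry no labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: [orders_per_month, average_basket_value]
customers = np.array([[1, 20], [2, 25], [1, 22],
                      [8, 120], [9, 110], [10, 130]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster index per customer, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the "typical" customer of each cluster
# Interpreting the clusters (e.g. "occasional" vs. "frequent big spenders")
# is up to you -- the algorithm only groups, it does not name.
```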

Dimensionality Reduction

High-dimensional data can be difficult to interpret. Dimensionality reduction simplifies data by reducing the number of features while preserving its structure.
For example: reducing 50 product categories to a few meaningful dimensions in order to compare the similarity of products.

A UMAP dimensionality reduction plot of products. Products bought by the same customers are placed closer together, helping to create meaningful customer segments.
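
A minimal dimensionality-reduction sketch. The plot above uses UMAP (available separately in the umap-learn package); PCA from scikit-learn is used here to keep the example dependency-free, and the purchase matrix is invented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
purchases = rng.poisson(1.0, size=(100, 50))  # 100 products x 50 features, invented

pca = PCA(n_components=2)
embedding = pca.fit_transform(purchases)  # 50 features squeezed into 2
print(embedding.shape)                    # (100, 2): easy to plot and compare
print(pca.explained_variance_ratio_)      # how much structure the 2 axes preserve
```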

Generative Models

Machine learning is also used for generation—creating new content that resembles existing data.
Examples include:

  • Generating images (e.g., AI-generated artwork or faces).
  • Creating synthetic training data when real data is limited.
  • Text generation (e.g., chatbots and AI-powered writing assistants).
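
A very simple generative model, sketched below: fit a kernel density estimate to existing data and sample new, similar points from it. The "real" data here is synthetic (invented height/weight measurements).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
real_data = rng.normal(loc=[170, 70], scale=[10, 12], size=(200, 2))  # invented

kde = KernelDensity(bandwidth=5.0).fit(real_data)
synthetic = kde.sample(5, random_state=0)  # 5 brand-new samples resembling the data
print(synthetic)
```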

Recommendations

Recommender systems suggest relevant items to users based on their preferences and behavior.
For example:

  • Content recommendations – Suggesting videos, films, or posts, as done by YouTube, Netflix, Instagram, and other streaming and social media services.
  • Product suggestions – Recommending similar products, or products bought by similar users.
  • Drug recommendations – Suggesting a drug that worked well for comparable patients.
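
A minimal item-based recommender sketch: items bought by the same users get a high cosine similarity. The purchase matrix is invented for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 1 means the user bought the item.
purchases = np.array([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1]])

similarity = cosine_similarity(purchases.T)  # item-to-item similarity
item = 0
ranked = np.argsort(similarity[item])[::-1]  # most similar items first
print(ranked[1:3])  # recommend the top items, skipping the item itself
```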

Outlier Detection and Predictive Maintenance

Machine learning can also be used for detecting anomalies in data. Outliers may indicate errors, fraud, or rare but significant events that require attention. In predictive maintenance, for example, anomalies in sensor readings can signal that a machine needs servicing before it breaks down.

Outlier detection is crucial in many domains, such as:

  • Fraud detection – Identifying unusual credit card transactions that may indicate fraud.
  • Network security – Detecting suspicious network activity that could be an attack.
  • Medical diagnostics – Spotting rare but important health conditions in patient data.
Fraud detection with outlier detection. On the left is the data with the true fraud/no-fraud labels. On the right are the results of outlier detection using an Isolation Forest on 29 features (2 of which are plotted here). As you can see, some of the outliers overlap with fraudulent activity.
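
A minimal outlier-detection sketch with an Isolation Forest, as in the plot above, on invented transaction data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=[50, 1], scale=[20, 0.5], size=(500, 2))  # [amount, frequency], invented
fraud = rng.normal(loc=[900, 4], scale=[100, 1], size=(10, 2))    # unusually large transactions
transactions = np.vstack([normal, fraud])

model = IsolationForest(contamination=0.02, random_state=0).fit(transactions)
flags = model.predict(transactions)  # -1 = outlier, 1 = inlier
print("flagged as outliers:", np.sum(flags == -1))
```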
