Data

Data comes in various shapes, sizes, types, and structures—there is no one-size-fits-all approach when building a machine learning model. Before creating a model, it is crucial to understand your data and how it might impact the learning process.

This involves both standard preprocessing techniques, which we will cover in this course, and domain-specific adjustments tailored to the particular dataset and problem.

For example, in some cases, data may become outdated over time, which needs to be considered when designing a model. Take the housing market: real estate data from 100 years ago is largely irrelevant for predicting today’s prices. On the other hand, astronomical measurements of objects in deep space remain valid for centuries since these objects change on much longer timescales.

Understanding these nuances ensures that your machine learning approach is both relevant and effective for the problem at hand.

Value types

Values in a dataset can take different forms. They can be numerical, categorical, or binary.

In the table below, you can see various types of values from the Titanic dataset:

  • Numerical values include age, fare, sibsp (number of siblings/spouses aboard), and parch (number of parents/children aboard). These values represent measurable quantities: age and fare are continuous values, while sibsp and parch are counts.
  • Ordinal categorical values include pclass (passenger class: 1st, 2nd, 3rd), which has an inherent order—first class is ranked higher than third class.
  • Nominal categorical values include sex, embarked (port of embarkation), and who (classification as man, woman, or child). These categories have no natural ranking.
  • Binary values include survived (0 = did not survive, 1 = survived) and adult_male (True = adult male, False = not an adult male). These values can also be considered categorical but are commonly treated as numerical (0 and 1) in machine learning.
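
  | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class | who   | adult_male | deck | embark_town | alive | alone
0 | 0        | 3      | male   | 22.0 | 1     | 0     | 7.2500  | S        | Third | man   | True       | NaN  | Southampton | no    | False
1 | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First | woman | False      | C    | Cherbourg   | yes   | False
2 | 1        | 3      | female | 26.0 | 0     | 0     | 7.9250  | S        | Third | woman | False      | NaN  | Southampton | yes   | True
3 | 1        | 1      | female | 35.0 | 1     | 0     | 53.1000 | S        | First | woman | False      | C    | Southampton | yes   | False
4 | 0        | 3      | male   | 35.0 | 0     | 0     | 8.0500  | S        | Third | man   | True       | NaN  | Southampton | no    | True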
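
The grouping above can be sketched in code. The following is a minimal illustration, not a standard method: guess_value_type is a hypothetical helper with hand-picked rules, and only a few columns of the five rows are included. Note that a heuristic like this cannot tell ordinal from nominal categories, and a count column that happens to contain only 0 and 1 would be mistaken for binary; that still requires knowledge of the data.

```python
# Five rows from the table above, reduced to a few columns for brevity.
rows = [
    {"survived": 0, "pclass": 3, "sex": "male", "age": 22.0, "fare": 7.25},
    {"survived": 1, "pclass": 1, "sex": "female", "age": 38.0, "fare": 71.2833},
    {"survived": 1, "pclass": 3, "sex": "female", "age": 26.0, "fare": 7.925},
    {"survived": 1, "pclass": 1, "sex": "female", "age": 35.0, "fare": 53.1},
    {"survived": 0, "pclass": 3, "sex": "male", "age": 35.0, "fare": 8.05},
]


def guess_value_type(values):
    """Rough heuristic (an assumption, not a rule): guess a column's value type."""
    distinct = set(values)
    if distinct <= {0, 1}:  # also matches True/False, since True == 1
        return "binary"
    if all(isinstance(v, str) for v in values):
        return "categorical"
    if all(isinstance(v, float) for v in values):
        return "numerical"
    # Integers with few distinct values are often category codes (e.g. pclass).
    return "categorical (integer codes)" if len(distinct) <= 3 else "numerical"


column_types = {
    name: guess_value_type([row[name] for row in rows]) for name in rows[0]
}
for name, value_type in column_types.items():
    print(f"{name}: {value_type}")
```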

Numerical values

The easiest values to work with in machine learning are numerical values, which can be positive or negative. They can be whole numbers (integers) or decimal numbers (floats).

Numerical values can be further categorized into integers, long integers, floats, double precision, and half precision, depending on their range and level of precision. For most machine learning models these distinctions matter little: what matters is that the values are numeric, and the type only determines the range and the number of significant digits that can be stored.

Type                      | Description                                 | Precision                     | Example values
Integer (int)             | Whole numbers, no decimal points.           | Fixed                         | ..., -2, -1, 0, 1, 2, ...
Long Integer (long)       | Extended precision for large whole numbers. | Fixed (larger range than int) | ..., -10^12, -10^6, 0, 10^6, 10^12, ...
Floating Point (float)    | Decimal numbers with single precision.      | Single (32-bit)               | -3.14, 0.0, 2.71, 100.5
Double Precision (double) | Higher precision decimal numbers.           | Double (64-bit)               | -3.1415926535, 0.0, 2.7182818284
Half Precision (half)     | Lower precision decimal numbers.            | Half (16-bit)                 | -1.5, 0.0, 3.25
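
To see what the precision column means in practice, the sketch below stores the same number in half, single, and double precision and reads it back. It uses Python's standard struct module; the helper name round_trip and the choice of 0.1 as the test value are only for illustration.

```python
import struct


def round_trip(value, fmt):
    """Store value in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
half = round_trip(value, "<e")    # 16-bit half precision
single = round_trip(value, "<f")  # 32-bit single precision
double = round_trip(value, "<d")  # 64-bit double precision (Python's own float)

for name, stored in [("half", half), ("single", single), ("double", double)]:
    print(f"{name:>6}: {stored!r}  error = {abs(stored - value):.2e}")
```

The fewer bits a format has, the further the stored value drifts from 0.1. For most tabular features this error is negligible, which is why the exact float type rarely matters.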

Categorical values

Categorical values are discrete variables that represent different categories or groups rather than continuous numbers. These values are often used to label or classify data points based on qualitative attributes.

Categorical values can be further divided into two types:

Nominal Values (No Order)

  • These categories have no inherent order or ranking.
  • Example: Colors (red, blue, green), city names (New York, London, Tokyo), or types of pets (dog, cat, rabbit).
  • The order of the categories does not affect their meaning.

Ordinal Values (Ordered Categories)

  • These categories have a meaningful order or ranking, but the differences between them are not necessarily uniform.
  • Example: Emotions (sad, neutral, happy), education levels (high school, bachelor's, master's, PhD), or customer satisfaction ratings (poor, average, good, excellent).
  • The order is important, but numerical differences between the levels are not necessarily equal (e.g., the difference between neutral and happy is not necessarily the same as sad and neutral).
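
Because most models expect numbers, the two kinds of categories are usually converted differently. The sketch below is one possible approach under simple assumptions: ordinal_encode and one_hot_encode are hypothetical helper names, and the category order is supplied by hand.

```python
def ordinal_encode(values, ordered_categories):
    """Map each category to its rank in the given order (0 = lowest)."""
    rank = {category: i for i, category in enumerate(ordered_categories)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1; no order is implied."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


ratings = ["good", "poor", "excellent", "average"]
rating_codes = ordinal_encode(ratings, ["poor", "average", "good", "excellent"])
print(rating_codes)  # [2, 0, 3, 1]

colors = ["red", "blue", "green", "blue"]
color_names, color_vectors = one_hot_encode(colors)
print(color_names)    # ['blue', 'green', 'red']
print(color_vectors)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Keep in mind that the integer codes 0, 1, 2, 3 suggest equal spacing between levels, which, as noted above, ordinal categories do not guarantee.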

Boolean

Boolean values are a special type of categorical data that represent binary states (e.g., True/False or 1/0). Most machine learning models can handle them directly as numerical features, making them useful for representing conditions like survival (1 = survived, 0 = not survived) or loan approval (1 = approved, 0 = denied). While they usually require no transformation, they may need conversion from text ("Yes"/"No" → 1/0).
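
The text-to-number conversion mentioned above could look like the following sketch. The function to_binary and the sets of accepted labels are assumptions chosen for this example; a real dataset may use other spellings.

```python
TRUE_LABELS = {"yes", "true", "1"}
FALSE_LABELS = {"no", "false", "0"}


def to_binary(value):
    """Convert a boolean-like value to 1 or 0; raise on anything unexpected."""
    if isinstance(value, bool):
        return int(value)
    label = str(value).strip().lower()
    if label in TRUE_LABELS:
        return 1
    if label in FALSE_LABELS:
        return 0
    raise ValueError(f"not a recognised boolean label: {value!r}")


alive = ["no", "yes", "yes", "yes", "no"]       # the alive column of the Titanic table
adult_male = [True, False, False, False, True]  # the adult_male column
alive_codes = [to_binary(v) for v in alive]
adult_male_codes = [to_binary(v) for v in adult_male]
print(alive_codes)       # [0, 1, 1, 1, 0]
print(adult_male_codes)  # [1, 0, 0, 0, 1]
```

Raising an error on unknown labels is a deliberate choice here: silently mapping a typo such as "yse" to 0 would corrupt the feature.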

Sequential data

Many types of data follow a sequential structure, meaning their values depend on their position in the sequence. Examples include:

  • Stock market data – price movements over time
  • Heart rate monitoring – beats per minute recorded continuously
  • Weather data – temperature, humidity, and pressure trends

Beyond these, some specialized sequential data types include text, audio, and video, which add complexity by having multiple dimensions (e.g., language structure, frequency variations, or frame sequences). Even images can be treated as sequences of pixels in certain models, like recurrent image generation.

What all sequential data has in common is directionality—earlier values influence later ones, making order essential in analysis and prediction.

[Figure: examples of sequential data. From left to right: a stock market chart and a mel spectrogram (audio frequency versus time).]
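
One common way to make use of this directionality is to cut a sequence into windows of past values paired with the value that follows. The sketch below assumes a made-up heart-rate series; make_windows is a hypothetical helper, not part of any library.

```python
def make_windows(sequence, window_size):
    """Split a sequence into (past values, next value) training pairs."""
    pairs = []
    for start in range(len(sequence) - window_size):
        past = sequence[start:start + window_size]
        target = sequence[start + window_size]
        pairs.append((past, target))
    return pairs


heart_rate = [62, 64, 67, 71, 70, 68]  # made-up beats per minute
windows = make_windows(heart_rate, window_size=3)
for past, target in windows:
    print(past, "->", target)
```

Shuffling the values inside a window would change what the model sees, which is exactly why order cannot be ignored for this kind of data.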

Text data

A unique type of sequential data that deserves its own section is text data. Text holds valuable information, but extracting meaning from it is often challenging.

Major breakthroughs in deep learning have made it possible to unlock this knowledge with the rise of Large Language Models (LLMs), enabling AI systems to process and understand text like never before. Models like ChatGPT, Gemini, Llama, Mistral, Claude, and DeepSeek have had a huge impact across industries, some of which have been transformed forever.
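
Before a model can process text, the text has to become a sequence of numbers. The sketch below shows the idea with a deliberately simple whitespace split; this is an assumption for illustration, since large language models use more elaborate subword tokenizers. The helper names build_vocabulary and encode are hypothetical.

```python
def build_vocabulary(sentences):
    """Assign an integer ID to every distinct lowercase word, in order of first use."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def encode(sentence, vocabulary):
    """Turn a sentence into a sequence of word IDs; unknown words become -1."""
    return [vocabulary.get(word, -1) for word in sentence.lower().split()]


sentences = ["the dog bites the man", "the man bites the dog"]
vocabulary = build_vocabulary(sentences)
print(vocabulary)                        # {'the': 0, 'dog': 1, 'bites': 2, 'man': 3}
print(encode(sentences[0], vocabulary))  # [0, 1, 2, 0, 3]
print(encode(sentences[1], vocabulary))  # [0, 3, 2, 0, 1]
```

Both sentences contain the same words, yet their ID sequences differ and so does their meaning, which illustrates why text is treated as sequential data.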

Graph data

Some types of data are best represented as a graph or network, where entities (nodes) are connected by relationships (edges). These structures capture complex interactions and dependencies. Examples include:

  • Social networks – users connected by friendships or interactions
  • Knowledge graphs – concepts linked based on semantic relationships
  • Transportation networks – cities connected by roads, railways, or flight routes
  • Biological networks – protein interactions, neural connections in the brain
  • Recommendation systems – products connected to users based on purchase history
  • Bank transactions – transactions between bank accounts
[Figure: network with users linked to products]
An example of a graph network of products and users. These data could be used for recommending products to users.

Graphs are incredibly flexible. Nodes can have extra attributes, and edges can not only connect nodes but also have types or additional properties that add more meaning to the relationships.
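
A small graph with node attributes and typed edges can be sketched with plain dictionaries and tuples. Everything below is made up for illustration: the users, products, prices, and the recommend function, which uses a very simple "users who share a purchase" rule rather than any established recommendation algorithm.

```python
# Nodes carry attributes; every edge has a type. All names are made up.
nodes = {
    "alice": {"kind": "user"},
    "bob": {"kind": "user"},
    "laptop": {"kind": "product", "price": 999},
    "mouse": {"kind": "product", "price": 25},
    "keyboard": {"kind": "product", "price": 60},
}
edges = [
    ("alice", "bought", "laptop"),
    ("alice", "bought", "mouse"),
    ("bob", "bought", "laptop"),
    ("bob", "bought", "keyboard"),
]


def purchases(user):
    """Products reachable from a user over a 'bought' edge."""
    return {dst for src, kind, dst in edges if src == user and kind == "bought"}


def recommend(user):
    """Suggest products bought by users who share a purchase with this user."""
    own = purchases(user)
    suggestions = set()
    for other, attributes in nodes.items():
        if attributes["kind"] == "user" and other != user and own & purchases(other):
            suggestions |= purchases(other) - own
    return sorted(suggestions)


print(recommend("alice"))  # ['keyboard']
print(recommend("bob"))    # ['mouse']
```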

Image data

Images are a rich and complex type of data, containing patterns, textures, and structures that require specialized techniques to analyze. Images have spatial relationships, meaning the position of pixels matters just as much as their values.

Each image is a matrix of pixel values, where colors are stored as intensity values (grayscale) or as RGB channels. Efficiently extracting the patterns requires specialized mathematics.
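
The matrix view of an image can be made concrete with a tiny example. The sketch below builds a made-up 2x2 RGB image from nested lists and converts it to grayscale using the common luma weights 0.299, 0.587, and 0.114; real projects would typically use an image library instead.

```python
# A 2x2 RGB image: each pixel is a (red, green, blue) triple in 0..255.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def to_grayscale(image):
    """Collapse the three colour channels into one intensity per pixel.

    Uses the common luma weights 0.299, 0.587, 0.114 (ITU-R BT.601).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


gray_image = to_grayscale(rgb_image)
print(gray_image)  # [[76, 150], [29, 255]]
```

The row and column of each value encode where the pixel sits, so rearranging the matrix would destroy the spatial relationships mentioned above.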

Breakthroughs in computer vision have enabled AI to extract meaningful insights from images. Models such as Convolutional Neural Networks (CNNs) power applications like:

  • Image classification – Identifying objects in images (e.g., cat vs. dog).
  • Object detection – Locating multiple objects within an image (e.g., self-driving cars detecting pedestrians).
  • Image segmentation – Classifying each pixel (e.g., medical imaging for tumor detection).
  • Image generation – Creating artistic images using AI (e.g., Stable Diffusion, DALL·E).

Image-based AI has transformed fields such as:

  • Healthcare – Automated X-ray and MRI analysis.
  • Retail – Visual search and personalized recommendations.
  • Security – Face recognition and surveillance.
  • Autonomous Systems – Self-driving cars and robotic vision.

With advancements in deep learning, the ability to process and generate images is rapidly evolving, making AI-powered vision systems an essential part of modern technology.

Geographic data

Geographic data is data that is linked to a location, for example a pair of latitude and longitude coordinates.
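
A typical operation on location-linked data is computing the distance between two points. The sketch below uses the haversine formula, which assumes a perfectly spherical Earth with a radius of 6371 km; the coordinates are approximate and only for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate coordinates (latitude, longitude), for illustration only.
amsterdam = (52.37, 4.90)
london = (51.51, -0.13)
distance_km = haversine_km(*amsterdam, *london)
print(f"{distance_km:.0f} km")
```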

Multimodal

In some cases a model needs several types of data at once, for example text and images, in order to reach a meaningful conclusion.

[Figure: a model accepting both text and image data to make a prediction]
Models that accept multiple types of data (for example both image and text data) are called multimodal.
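
One simple way to feed two data types into a single model is to turn each into a feature vector and concatenate them. The sketch below uses toy, hand-written features as stand-ins; real multimodal models learn their text and image representations, so treat the function names and features here as assumptions for illustration.

```python
def text_features(text):
    """Toy text features: word count and average word length."""
    words = text.split()
    return [len(words), sum(len(w) for w in words) / len(words)]


def image_features(image):
    """Toy image features: mean and maximum pixel intensity (0..255)."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def multimodal_features(text, image):
    """Concatenate both feature vectors into one model input."""
    return text_features(text) + image_features(image)


caption = "a black cat"
picture = [[0, 50], [100, 250]]  # made-up 2x2 grayscale image
features = multimodal_features(caption, picture)
print(features)  # [3, 3.0, 100.0, 250]
```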

Environments

In some situations it is hard or impossible to obtain the required data. In that case you can train an AI in an environment, which can be a simulation of the real world or the real world itself. The AI (called an agent) performs actions and learns directly from the responses. This method of training an AI is called reinforcement learning.

You can train a model directly on an environment if you apply a reinforcement learning strategy.

Use cases include:

  • Chess AI
  • Improving recommendations in the real world
  • Robotics
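
The loop of acting in an environment and learning from its responses can be sketched with a toy example. Below, a made-up Corridor environment rewards the agent for reaching the rightmost cell, and the agent learns with tabular Q-learning, which is one of several reinforcement learning methods. The class, the reward scheme, and all parameter values are assumptions chosen for illustration.

```python
import random


class Corridor:
    """Toy environment: positions 0..4, start at 0, reward 1 for reaching 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        """Action 0 moves left, action 1 moves right; returns (state, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        return self.position, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: learn action values from the environment's responses."""
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore sometimes (and on ties); otherwise take the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Read off the preferred action for every non-terminal position.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

No dataset is involved at any point: the agent only ever sees states and rewards returned by the environment, and the learned policy is read from the table of action values.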

Controversies

Image generation comes with its own controversies, particularly around ownership and intellectual property rights. The companies behind these models profit from data that is often sourced from the public, raising ethical questions about fair use, attribution, and compensation.

Similarly, data used to train large language models has been obtained with the help of poorly paid labor in Africa, where workers in non-ideal working conditions manually labeled racist, violent, and sexual texts.
