Data

Data comes in various shapes, sizes, types, and structures—there is no one-size-fits-all approach when building a machine learning model. Before creating a model, it is crucial to understand your data and how it might impact the learning process.

This involves both standard preprocessing techniques, which we will cover in this course, and domain-specific adjustments tailored to the particular dataset and problem.

For example, in some cases, data may become outdated over time, which needs to be considered when designing a model. Take the housing market: real estate data from 100 years ago is largely irrelevant for predicting today’s prices. On the other hand, astronomical measurements of objects in deep space remain valid for centuries since these objects change on much longer timescales.

Understanding these nuances ensures that your machine learning approach is both relevant and effective for the problem at hand.

Value types

Values in a dataset can take different forms. They can be numerical, categorical, or binary.

In the table below, you can see various types of values from the Titanic dataset:

Numerical values include age, fare, sibsp (number of siblings/spouses aboard), and parch (number of parents/children aboard). These values represent measurable quantities, either continuous (age, fare) or counts (sibsp, parch).
Ordinal categorical values include pclass (passenger class: 1st, 2nd, 3rd), which has an inherent order—first class is ranked higher than third class.
Nominal categorical values include sex, embarked (port of embarkation), and who (classification as man, woman, or child). These categories have no natural ranking.
Binary values include survived (0 = did not survive, 1 = survived) and adult_male (True = adult male, False = not an adult male). These values can also be considered categorical but are commonly treated as numerical (0 and 1) in machine learning.

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
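To make the distinction concrete, here is a minimal sketch that guesses the value type of a few columns from the rows above. The helper `value_kind` and its rules are illustrative assumptions, not a standard API, and only a subset of the columns is included.

```python
# A few columns from the Titanic rows shown above.
rows = [
    {"survived": 0, "pclass": 3, "sex": "male", "age": 22.0, "fare": 7.25, "adult_male": True},
    {"survived": 1, "pclass": 1, "sex": "female", "age": 38.0, "fare": 71.2833, "adult_male": False},
    {"survived": 1, "pclass": 3, "sex": "female", "age": 26.0, "fare": 7.925, "adult_male": False},
    {"survived": 1, "pclass": 1, "sex": "female", "age": 35.0, "fare": 53.1, "adult_male": False},
    {"survived": 0, "pclass": 3, "sex": "male", "age": 35.0, "fare": 8.05, "adult_male": True},
]


def value_kind(values):
    """Guess a value type from raw values (hypothetical heuristic)."""
    distinct = set(values)
    # Only 0/1 or True/False present: treat the column as binary.
    if distinct <= {0, 1}:
        return "binary"
    # Only numbers present: treat the column as numerical.
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    # Anything else (here: strings) is treated as categorical.
    return "categorical"


kinds = {column: value_kind([row[column] for row in rows]) for column in rows[0]}
for column, kind in kinds.items():
    print(f"{column}: {kind}")
```

Note that this heuristic labels pclass as numerical because it is stored as integers, even though it is an ordinal category. Raw values alone do not reveal this, so knowledge of the dataset is still needed.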

Numerical values

The easiest values to work with in machine learning are numerical values, which can be positive or negative. They can be whole numbers (integers) or decimal numbers (floats).

Numerical values can also be further categorized into integers, long integers, floats, double precision, and half precision, depending on their range and level of precision. However, for most machine learning models, these distinctions don’t matter much—what's important is that they represent continuous numerical data, with differences in precision affecting only the range and number of decimal places that can be stored.

Type	Description	Precision	Example Values
`Integer (int)`	Whole numbers, no decimal points.	Fixed	..., -2, -1, 0, 1, 2, ...
`Long Integer (long)`	Extended range for large whole numbers.	Fixed (larger range than int)	..., -10¹², -10⁶, 0, 10⁶, 10¹², ...
`Floating Point (float)`	Decimal numbers with single precision.	Single (32-bit)	-3.14, 0.0, 2.71, 100.5
`Double Precision (double)`	Higher precision decimal numbers.	Double (64-bit)	-3.1415926535, 0.0, 2.7182818284
`Half Precision (half)`	Lower precision decimal numbers.	Half (16-bit)	-1.5, 0.0, 3.25
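The effect of these types can be sketched with Python's standard `struct` module, which packs a number into a fixed-width binary format. The helper names `round_trip` and `fits` are made up for this example, and mapping "int" to 32 bits and "long" to 64 bits is an assumption, since the exact sizes depend on the language and platform.

```python
import struct


def round_trip(value, fmt):
    """Store a number in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def fits(value, fmt):
    """Return True if the integer can be stored in the given format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


pi = 3.141592653589793

half = round_trip(pi, "<e")    # 16-bit float, stored as 3.140625
single = round_trip(pi, "<f")  # 32-bit float
double = round_trip(pi, "<d")  # 64-bit float, same as Python's own float

print("half:  ", half)
print("single:", single)
print("double:", double)

# Assumed sizes: "<i" is a 32-bit integer, "<q" is a 64-bit integer.
print("10**12 fits in 32 bits:", fits(10**12, "<i"))
print("10**12 fits in 64 bits:", fits(10**12, "<q"))
```

The lower the precision, the further the stored value drifts from the original number.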

Categorical values

Categorical values are discrete variables that represent different categories or groups rather than continuous numbers. These values are often used to label or classify data points based on qualitative attributes.

Categorical values can be further divided into two types:

Nominal Values (No Order)

These categories have no inherent order or ranking.
Example: Colors (red, blue, green), city names (New York, London, Tokyo), or types of pets (dog, cat, rabbit).
The order of the categories does not affect their meaning.

Ordinal Values (Ordered Categories)

These categories have a meaningful order or ranking, but the differences between them are not necessarily uniform.
Example: Emotions (sad, neutral, happy), education levels (high school, bachelor's, master's, PhD), or customer satisfaction ratings (poor, average, good, excellent).
The order is important, but numerical differences between the levels are not necessarily equal (e.g., the difference between neutral and happy is not necessarily the same as sad and neutral).
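One common way to hand these two kinds of categories to a model is sketched below: ordinal values are mapped to integer ranks, while nominal values are one-hot encoded. Both techniques are an illustrative choice here rather than something this section prescribes, and the helper `one_hot` is written just for this example.

```python
# Ordinal: the list order defines the ranking.
education_levels = ["high school", "bachelor's", "master's", "PhD"]
rank = {level: position for position, level in enumerate(education_levels)}

# Comparisons are meaningful, but the gaps between ranks are not real distances.
print(rank["PhD"] > rank["bachelor's"])

# Nominal: no ranking, so each category gets its own 0/1 slot instead.
colors = ["red", "blue", "green"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of the matching category."""
    return [1 if value == category else 0 for category in categories]


print(one_hot("blue", colors))
```

One-hot encoding avoids implying that green is somehow "more" than red, which an integer code would suggest.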

Boolean

Boolean values are a special type of categorical data that represent binary states (e.g., True/False or 1/0). Most machine learning models can handle them directly as numerical features, making them useful for representing conditions like survival (1 = survived, 0 = not survived) or loan approval (1 = approved, 0 = denied). While they usually require no transformation, they may need conversion from text ("Yes"/"No" → 1/0).
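The text-to-number conversion mentioned above can be sketched as follows. The function name `to_binary` and the accepted spellings are assumptions for this example, so adjust them to the labels in your own data.

```python
def to_binary(text):
    """Convert a yes/no style label to 1 or 0."""
    mapping = {"yes": 1, "no": 0, "true": 1, "false": 0}
    key = text.strip().lower()
    if key not in mapping:
        # Fail loudly instead of silently guessing a value.
        raise ValueError(f"Unexpected label: {text!r}")
    return mapping[key]


raw_labels = ["Yes", "No", " yes ", "NO"]
approved = [to_binary(label) for label in raw_labels]
print(approved)  # [1, 0, 1, 0]
```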

Sequential data

Many types of data follow a sequential structure, meaning their values depend on their position in the sequence. Examples include:

Stock market data – price movements over time
Heart rate monitoring – beats per minute recorded continuously
Weather data – temperature, humidity, and pressure trends

Beyond these, some specialized sequential data types include text, audio, and video, which add complexity by having multiple dimensions (e.g., language structure, frequency variations, or frame sequences). Even images can be treated as sequences of pixels in certain models, like recurrent image generation.

What all sequential data has in common is directionality—earlier values influence later ones, making order essential in analysis and prediction.
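A minimal sketch of how this ordering is often used in practice: the sequence is cut into windows of past values, each paired with the value that follows. The function `make_windows` and the heart rate numbers are made up for illustration.

```python
def make_windows(series, window_size):
    """Pair each run of `window_size` past values with the next value."""
    pairs = []
    for start in range(len(series) - window_size):
        history = series[start:start + window_size]
        target = series[start + window_size]
        pairs.append((history, target))
    return pairs


heart_rate = [62, 64, 63, 67, 70, 72]  # beats per minute, in recorded order
pairs = make_windows(heart_rate, window_size=3)

for history, target in pairs:
    print(history, "->", target)
```

Shuffling `heart_rate` before windowing would destroy exactly the information a sequence model relies on.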

Examples of sequential data, from left to right: stock market prices and a mel spectrogram (an image of audio frequency versus time).

Text data

A unique type of sequential data that deserves its own section is text data. Text holds valuable information, but extracting meaning from it is often challenging.

Major breakthroughs in deep learning, and the rise of large language models (LLMs) in particular, have made it possible to unlock this knowledge, enabling AI systems to process and understand text like never before. Models like ChatGPT, Gemini, Llama, Mistral, Claude, and DeepSeek have had a huge impact across industries, some of which have been transformed permanently.

Graph data

Some types of data are best represented as a graph or network, where entities (nodes) are connected by relationships (edges). These structures capture complex interactions and dependencies. Examples include:

Social networks – users connected by friendships or interactions
Knowledge graphs – concepts linked based on semantic relationships
Transportation networks – cities connected by roads, railways, or flight routes
Biological networks – protein interactions, neural connections in the brain
Recommendation systems – products connected to users based on purchase history
Bank transactions – transfers between bank accounts

Network with users linked to products: an example of a graph of products and users. Such data could be used to recommend products to users.

Graphs are incredibly flexible. Nodes can have extra attributes, and edges can not only connect nodes but also have types or additional properties that add more meaning to the relationships.
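As a rough sketch, a graph with node attributes and typed edges can be stored in plain Python dictionaries and lists. The users, products, prices, and the helper `neighbors` are invented for this example. Dedicated graph libraries offer richer tooling, but the idea is the same.

```python
# Nodes with extra attributes.
nodes = {
    "alice": {"kind": "user"},
    "bob": {"kind": "user"},
    "laptop": {"kind": "product", "price": 999},
    "mouse": {"kind": "product", "price": 25},
}

# Edges as (source, target, properties); the "type" gives the relationship.
edges = [
    ("alice", "laptop", {"type": "purchased"}),
    ("alice", "mouse", {"type": "purchased"}),
    ("bob", "mouse", {"type": "purchased"}),
    ("alice", "bob", {"type": "friend"}),
]


def neighbors(node, edge_type):
    """Return nodes connected to `node` by edges of the given type.

    Edges are treated as undirected in this sketch.
    """
    result = []
    for source, target, properties in edges:
        if properties["type"] != edge_type:
            continue
        if source == node:
            result.append(target)
        elif target == node:
            result.append(source)
    return result


print(neighbors("alice", "purchased"))  # ['laptop', 'mouse']
print(neighbors("mouse", "purchased"))  # ['alice', 'bob']
```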

Image data

Images are a rich and complex type of data, containing patterns, textures, and structures that require specialized techniques to analyze. Images have spatial relationships, meaning the position of pixels matters just as much as their values.

Each image is a matrix of pixel values, where colors are stored either as a single intensity value per pixel (grayscale) or as three values per pixel (RGB channels). Extracting patterns from these matrices efficiently requires specialized mathematics.
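The matrix view of an image can be sketched with nested lists. The pixel values are made up, and averaging the three channels is a simplification chosen for this example. Real conversions usually weight the channels differently.

```python
# A 2x3 grayscale image: one intensity (0-255) per pixel.
gray_image = [
    [0, 128, 255],
    [64, 192, 32],
]

# A 2x2 RGB image: three intensities (red, green, blue) per pixel.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def to_grayscale(image):
    """Average the RGB channels of every pixel (simplified conversion)."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]


height, width = len(gray_image), len(gray_image[0])
print("grayscale size:", height, "x", width)
print(to_grayscale(rgb_image))  # [[85, 85], [85, 255]]
```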

Breakthroughs in computer vision have enabled AI to extract meaningful insights from images. Models such as Convolutional Neural Networks (CNNs) power applications like:

Image classification – Identifying objects in images (e.g., cat vs. dog).
Object detection – Locating multiple objects within an image (e.g., self-driving cars detecting pedestrians).
Image segmentation – Classifying each pixel (e.g., medical imaging for tumor detection).
Image generation – Creating artistic images using AI (e.g., Stable Diffusion, DALL·E).

Image-based AI has transformed fields such as:

Healthcare – Automated X-ray and MRI analysis.
Retail – Visual search and personalized recommendations.
Security – Face recognition and surveillance.
Autonomous Systems – Self-driving cars and robotic vision.

With advancements in deep learning, the ability to process and generate images is rapidly evolving, making AI-powered vision systems an essential part of modern technology.

Geographic data

Geographic data is data that is linked to a location, such as a pair of coordinates or an address.
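A typical operation on location data is measuring the distance between two points. The sketch below uses the haversine formula, which treats the Earth as a perfect sphere. The function name `haversine_km` and the radius of 6371 km are assumptions of this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, assuming a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# From the equator to the North Pole: a quarter of the circumference.
equator_to_pole = haversine_km(0.0, 0.0, 90.0, 0.0)
print(round(equator_to_pole, 1), "km")
```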

Multimodal

In some cases, a model needs several types of data at once, for example text and images, to reach a meaningful conclusion.

A model accepting both text and image data to make a prediction. Models that accept multiple types of data (for example, both image and text data) are called multimodal.

Environments

In some situations it is hard or impossible to obtain the required data. In that case you can train an AI in an environment, which can be a simulation of the real world or the real world itself. The AI (called an agent) performs actions and learns directly from the responses it receives. This method of training an AI is called reinforcement learning.

You can train a model directly on an environment if you apply a reinforcement learning strategy.
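The loop of acting and learning from the response can be sketched with a toy environment. Everything here is hypothetical: `BanditEnvironment` returns a fixed reward per action, and `train_agent` uses a simple epsilon-greedy strategy, which is only one of many reinforcement learning approaches.

```python
import random


class BanditEnvironment:
    """Toy environment: every action returns a fixed reward."""

    def __init__(self, rewards):
        self.rewards = rewards

    def step(self, action):
        return self.rewards[action]


def train_agent(env, n_actions, steps=100, epsilon=0.1, seed=0):
    """Learn the average reward of each action by trial and error."""
    rng = random.Random(seed)
    estimates = [0.0] * n_actions
    counts = [0] * n_actions
    for step in range(steps):
        if step < n_actions:
            action = step  # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore a random action
        else:
            # Exploit the action that looks best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = env.step(action)
        counts[action] += 1
        # Incremental running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


env = BanditEnvironment([0.2, 1.0, 0.5])
estimates = train_agent(env, n_actions=3)
best_action = estimates.index(max(estimates))
print("learned estimates:", estimates)
print("best action:", best_action)
```

No dataset was prepared in advance here: the agent generated its own training data by interacting with the environment.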

Use cases include:

Chess AI
Improving recommendations in the real world
Robotics

Controversies

Image generation comes with its own controversies, particularly around ownership and intellectual property rights. The companies behind these models profit from data that is often sourced from the public, raising ethical questions about fair use, attribution, and compensation.

Similarly, data used to train large language models has been obtained with the help of poorly paid labor in Africa, where workers manually labeled racist, violent, and sexual texts under far from ideal working conditions.
