Dr. Owns

February 11, 2025

Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak.

Many cloud services rely on virtualization. But other technologies, such as containerization and serverless computing, have become increasingly important.

Without virtualization, many of the digital services we use every day would not be possible. Of course, this is a simplification, as some cloud services also use bare-metal infrastructures.

In this article, you will learn how to set up your own virtual machine on your laptop in just a few minutes — even if you have never heard of cloud computing or containers before.

Table of Contents
1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture
2 — Understanding Virtualization: Why It’s the Basis of Cloud Computing
3 — What Data Scientists Should Know About Containers and VMs
4 — Create a Virtual Machine with VirtualBox
Final Thoughts
Where Can You Continue Learning?

1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture

Cloud computing has fundamentally changed the IT landscape — but its roots go back much further than many people think. In fact, the history of the cloud began back in the 1950s with huge mainframes and so-called dumb terminals.

  • The era of mainframes in the 1950s: Companies used mainframes so that several users could access them simultaneously via dumb terminals. The central mainframes were designed for high-volume, business-critical data processing. Large companies still use them today, even if cloud services have reduced their relevance.
  • Time-sharing and virtualization: In the next decade (1960s), time-sharing made it possible for multiple users to access the same computing power simultaneously — an early model of today’s cloud. Around the same time, IBM pioneered virtualization, allowing multiple virtual machines to run on a single piece of hardware.
  • The birth of the internet and web-based applications in the 1990s: Six years before I was born, Tim Berners-Lee developed the World Wide Web, which revolutionized online communication and our entire working and living environment. Can you imagine our lives today without the internet? At the same time, PCs were becoming increasingly popular. In 1999, Salesforce revolutionized the software industry with Software as a Service (SaaS), allowing businesses to use CRM solutions over the internet without local installations.
  • The big breakthrough of cloud computing in the late 2000s and 2010s: The modern cloud era began in 2006 with Amazon Web Services (AWS): Companies could flexibly rent infrastructure with S3 (storage) and EC2 (virtual servers) instead of buying their own servers. Microsoft Azure and Google Cloud followed with PaaS and IaaS services.
  • The modern cloud-native era: The next innovation was containerization. Docker made containers popular in 2013, followed by Kubernetes in 2014, which simplified the orchestration of containers. Then came serverless computing with AWS Lambda and Google Cloud Functions, enabling developers to write code that automatically responds to events, while the infrastructure is fully managed by the cloud provider.

Cloud computing is more the result of decades of innovation than a single new technology. From time-sharing to virtualization to serverless architectures, the IT landscape has continuously evolved. Today, cloud computing is the foundation for streaming services like Netflix, AI applications like ChatGPT and global platforms like Salesforce.

2 — Understanding Virtualization: Why It’s the Basis of Cloud Computing

Virtualization means abstracting physical hardware, such as servers, storage or networks, into multiple virtual instances.

Several independent systems can be operated on the same physical infrastructure. Instead of dedicating an entire server to a single application, virtualization enables multiple workloads to share resources efficiently. For example, Windows, Linux or another environment can be run simultaneously on a single laptop — each in an isolated virtual machine.

This saves costs and resources.

Even more important, however, is scalability: Infrastructure can be flexibly adapted to changing requirements.

Before cloud computing became widely available, companies often had to maintain dedicated servers for different applications, leading to high infrastructure costs and limited scalability. If more performance was suddenly required, for example because webshop traffic increased, new hardware was needed. The company had to add more servers (horizontal scaling) or upgrade existing ones (vertical scaling).

This is different with virtualization: For example, I can simply upgrade my virtual Linux machine from 8 GB to 16 GB RAM or assign 4 cores instead of 2. Of course, this only works if the underlying infrastructure supports it. More on this later.
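To make this concrete, here is a minimal sketch of such a resize using VirtualBox’s VBoxManage command-line tool from Python. It assumes VBoxManage is on your PATH, that the VM is named Ubuntu VM 2025 (the name we will use in section 4), and that the VM is powered off:

```python
import subprocess

VM_NAME = "Ubuntu VM 2025"  # assumed name of the VM we create in section 4

# Resize the VM to 16 GB RAM and 4 CPU cores.
# The VM must be powered off before VBoxManage can modify it.
subprocess.run(
    ["VBoxManage", "modifyvm", VM_NAME, "--memory", "16384", "--cpus", "4"],
    check=True,
)
```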

And this is exactly what cloud computing makes possible: The cloud consists of huge data centers that use virtualization to provide flexible computing power — exactly when it is needed. So, virtualization is a fundamental technology behind cloud computing.

How does serverless computing work?

What if you didn’t even have to manage virtual machines anymore?

Serverless computing goes one step further than virtualization and containerization. The cloud provider handles most infrastructure tasks — including scaling, maintenance and resource allocation. Developers can focus on writing and deploying code.

But does serverless really mean that there are no more servers?

Of course not. The servers are still there, but they are invisible to the user. Developers no longer have to worry about them. Instead of manually provisioning a virtual machine or container, you simply deploy your code, and the cloud automatically executes it in a managed environment. Resources are only provided while the code is running. For example, you can use AWS Lambda, Google Cloud Functions or Azure Functions.

What are the advantages of serverless?

As a developer, you don’t have to worry about scaling or maintenance. This means that if traffic spikes during a particular event, the resources are adjusted automatically. Serverless computing can be cost-efficient, especially in Function-as-a-Service (FaaS) models: if nothing is running, you pay nothing. However, some serverless services have baseline costs (e.g. Firestore).

Are there any disadvantages?

You have much less control over the infrastructure and no direct access to the servers. There is also a risk of vendor lock-in, since applications become strongly tied to one cloud provider.

A concrete example of serverless: API without your own server

Imagine you have a website with an API that provides users with the current weather. Normally, a server runs around the clock — even at times when no one is using the API.

With AWS Lambda, things work differently: A user enters ‘Mexico City’ on your website and clicks on ‘Get weather’. This request triggers a Lambda function in the background, which retrieves the weather data and sends it back. The function is then stopped automatically. This means you have no permanently running server and no unnecessary costs — you only pay when the code is executed.
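Here is a minimal sketch of what such a Lambda function could look like in Python. The weather endpoint is hypothetical (api.example.com stands in for whatever weather API you actually use), and the event structure assumes the function sits behind an API Gateway HTTP endpoint:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical weather endpoint -- replace with a real weather API of your choice.
WEATHER_API = "https://api.example.com/weather?city={city}"


def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes for each 'Get weather' request."""
    # API Gateway passes query parameters in the event; default to Mexico City.
    params = event.get("queryStringParameters") or {}
    city = params.get("city", "Mexico City")

    # Fetch the current weather from the (hypothetical) API.
    url = WEATHER_API.format(city=urllib.parse.quote(city))
    with urllib.request.urlopen(url) as resp:
        weather = json.load(resp)

    # Return an API-Gateway-style response; afterwards the function shuts down
    # again, so you pay only for this invocation.
    return {"statusCode": 200, "body": json.dumps(weather)}
```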

3 — What Data Scientists Should Know About Containers and VMs — What’s the Difference?

You’ve probably heard of containers. But what is the difference from virtual machines — and what is particularly relevant for data scientists?

Both containers and virtual machines are virtualization technologies.

Both make it possible to run applications in isolation.

Both offer advantages depending on the use case: While VMs provide strong security, containers excel in speed and efficiency.

The main difference lies in the architecture:

  • Virtual machines virtualize the entire hardware — including the operating system. Each VM has its own operating system (OS). This in turn requires more memory and resources.
  • Containers, on the other hand, share the host operating system and only virtualize the application layer. This makes them significantly lighter and faster.

Put simply, virtual machines simulate entire computers, while containers only encapsulate applications.

Why is this important for data scientists?

As a data scientist, you will come into contact with machine learning, data engineering and data pipelines, so it is important to understand the basics of containers and virtual machines. Of course, you don’t need the in-depth knowledge of a DevOps Engineer or a Site Reliability Engineer (SRE).

Virtual machines are used in data science, for example, when a complete operating system environment is required — such as a Windows VM on a Linux host. Data science projects often need specific environments. With a VM, it is possible to provide exactly the same environment — regardless of which host system is available.

A VM is also needed when training deep learning models with GPUs in the cloud. With cloud VMs such as AWS EC2 or Azure Virtual Machines, you have the option of training the models with GPUs. VMs also completely separate different workloads from each other to ensure performance and security.

Containers are used in data science for data pipelines, for example, where tools such as Apache Airflow run individual processing steps in Docker containers. This means that each step can be executed in isolation and independently of the others — regardless of whether it involves loading, transforming or saving data. Even if you want to deploy machine learning models via Flask / FastAPI, a container ensures that everything your model needs (e.g. Python libraries, framework versions) runs exactly as it should, as shown in the sketch below. This makes it super easy to deploy the model on a server or in the cloud.
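As an illustration, here is a minimal sketch of such a model-serving API with FastAPI. The model file model.joblib is a hypothetical pre-trained scikit-learn model, and the endpoint and field names are likewise just examples (assumes fastapi, uvicorn, scikit-learn and joblib are installed):

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained scikit-learn model


class Features(BaseModel):
    values: list[float]  # one flat feature vector per request


@app.post("/predict")
def predict(features: Features):
    # scikit-learn expects a 2D array: one row per sample.
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}
```

Started with `uvicorn main:app`, the service accepts POST requests at /predict. Packaged into a container image together with its dependencies, it behaves identically on your laptop, a server or in the cloud.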

4 — Create a Virtual Machine with VirtualBox

Let’s make this a little more concrete and create an Ubuntu VM. 🚀

I use the VirtualBox software on my Windows Lenovo laptop. The virtual machine runs in isolation from your main operating system, so no changes are made to your actual system. If you have the Windows Pro edition, you can also enable Hyper-V (pre-installed by default, but disabled). On an Intel Mac, you should also be able to use VirtualBox. On an Apple Silicon Mac, Parallels Desktop or UTM is apparently the better alternative (not tested myself).

1) Install VirtualBox

The first step is to download the installation file from the official VirtualBox website and install VirtualBox. The installer includes all necessary drivers.

You can ignore the note about the missing dependencies Python Core / win32api, as long as you do not want to automate VirtualBox with Python scripts.

Then we start the Oracle VirtualBox Manager:

Screenshot taken by the author

2) Download the Ubuntu ISO file

Next, we download the Ubuntu ISO file from the Ubuntu website. An Ubuntu ISO file is a disk image of the Ubuntu operating system, meaning it contains a complete copy of the installation data. I download the LTS version because it receives security and maintenance updates for 5 years (Long Term Support). Note the location of the .iso file, as we will use it later in VirtualBox.
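Optionally, you can verify that the download is intact by comparing its SHA256 checksum against the SHA256SUMS file Ubuntu publishes next to the ISO. A minimal sketch in Python (the filename is an assumption — adjust it to wherever your browser saved the ISO):

```python
import hashlib

# Assumed filename -- adjust to your actual download.
ISO_PATH = "ubuntu-24.04.1-desktop-amd64.iso"

sha256 = hashlib.sha256()
with open(ISO_PATH, "rb") as f:
    # Read in 1 MB chunks so the multi-GB image never sits in memory at once.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

# Compare this digest with the matching line in Ubuntu's SHA256SUMS file.
print(sha256.hexdigest())
```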

Screenshot taken by the author

3) Create a virtual machine in VirtualBox

Next, we create a new virtual machine in the VirtualBox Manager and give it the name Ubuntu VM 2025. Here we select Linux as the type and Ubuntu (64-bit) as the version. We also select the previously downloaded Ubuntu ISO file as the ISO image. It would also be possible to add the ISO file later in the Storage settings.

Screenshot taken by the author

Next, we select a user name vboxuser2025 and a password for access to the Ubuntu system. The hostname is the name of the virtual machine within the network or system. It must not contain any spaces. The domain name is optional and would be used if the network has multiple devices.

We then assign the appropriate resources to the virtual machine. I choose 8 GB (8192 MB) of RAM, as my host system has 64 GB of RAM. I recommend 4 GB (4096 MB) as a minimum. I assign 2 processors, as my host system has 8 cores and 16 logical processors. It would also be possible to assign 4 cores, but this way I have enough resources left for my host system. You can find out how many cores your host system has by opening the Task Manager in Windows and looking at the number of cores on the Performance tab under CPU.
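If you prefer, you can also check this from Python — a one-liner that works on any host OS:

```python
import os

# Number of logical processors visible to the OS
# (Task Manager shows this as "Logical processors" under CPU).
print(os.cpu_count())
```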

Screenshot taken by the author

Next, we click on ‘Create a virtual hard disk now’. A VM requires its own virtual hard disk to install the OS (e.g. Ubuntu, Windows). All programs, files and configurations of the VM are stored on it — just like on a physical hard disk. The default value is 25 GB. If you want to use the VM for machine learning or data science, more storage space (e.g. 50–100 GB) would be useful to have room for large data sets and models. I keep the default setting.
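Should you later need more space, VirtualBox can also create an additional virtual disk from the command line. A minimal sketch via VBoxManage from Python (the filename and size are assumptions; the new disk still has to be attached to the VM in the Storage settings afterwards):

```python
import subprocess

# Create a 50 GB (51200 MB) dynamically allocated virtual disk.
# The filename is hypothetical -- choose your own path.
subprocess.run(
    ["VBoxManage", "createmedium", "disk",
     "--filename", "UbuntuVM2025-data.vdi", "--size", "51200"],
    check=True,
)
```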

We can then see that the virtual machine has been created and can be used:

Screenshot taken by the author

4) Use Ubuntu VM

We can now use the newly created virtual machine like a normal separate operating system. The VM is completely isolated from the host system. This means you can experiment in it without changing or jeopardizing your main system.

If you are new to Linux, you can try out basic commands like ls, cd, mkdir or sudo to get to know the terminal. As a data scientist, you can set up your own development environments and install Python with pandas and scikit-learn to develop data analysis and machine learning models. Or you can install PostgreSQL and run SQL queries without having to set up a local database on your main system. You can also use Docker to create containerized applications.
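Once Python and the libraries are installed inside the VM, a quick sanity check like the following confirms that everything works (assumes pandas and scikit-learn have been installed, e.g. with pip):

```python
# Quick sanity check of the data science setup inside the VM.
import pandas as pd
import sklearn

print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)

# A tiny end-to-end smoke test: fit a model on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```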

Final Thoughts

Since the VM is isolated, we can install programs, experiment and even destroy the system without affecting the host system.

Let’s see if virtual machines remain relevant in the coming years. As companies increasingly use microservice architectures (instead of monoliths), containers with Docker and Kubernetes will certainly become even more important. But knowing how to set up a virtual machine and what it is used for is certainly useful.

I simplify tech for curious minds. If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my Substack.

Where Can You Continue Learning?
