![Life is like a box of chocolate. Generated using DALL-E](https://towardsdatascience.com/wp-content/uploads/2025/02/1XmTQcwzHna-bACZuPSTg6A.png)
My momma always said “Life was like a box of chocolates. You never know what you’re gonna get.”
- F. Gump (fictional philosopher and entrepreneur)
This is the second article in a series on information quantification – an essential framework for data scientists. Learning to measure information unlocks powerful tools for improving statistical analyses and refining decision criteria in machine learning.
In this article we focus on entropy – a fundamental concept that quantifies “on average, how surprising is an outcome?” As a measure of uncertainty, it bridges probability theory and real-world applications, offering insights into everything from data diversity to decision-making.
We’ll start with intuitive examples, like coin tosses and rolls of dice, to build a solid foundation. From there, we’ll explore entropy’s diverse applications, such as evaluating decision tree splits and quantifying DNA diversity. Finally, we’ll dive into fun puzzles like the Monty Hall problem, and I’ll refer to a tutorial for optimisation of the addictive WORDLE game.
No prior knowledge is required – just a basic understanding of probabilities. If you’d like to revisit the foundations, I’ll briefly recap key concepts from the first article and encourage you to explore its nuances further. Whether you’re here to expand your machine learning or data analysis toolkit, or to deepen your understanding of Information Theory, this article will guide you through the concept of entropy and its practical applications.
Throughout I provide Python code and try to keep formulas as intuitive as possible. If you have access to an integrated development environment (IDE), you might want to plug and play around with the numbers to gain a better intuition.
About the Series
Note: This section is mostly copied from the previous article; feel free to skip to the next section.
This series is divided into four articles, each exploring a key aspect of information theory:
Quantifying Surprise: In the opening article, you learnt how to quantify the “surprise” of an event using self-information and understand its units of measurement, such as bits and nats. Mastering self-information is essential for building intuition about the subsequent concepts, as all later heuristics are derived from it.
Quantifying Uncertainty:
YOU ARE HERE. Building on self-information, in this article we shift focus to the uncertainty – or “average surprise” – associated with a variable, known as entropy. We’ll dive into entropy’s wide-ranging applications, from Machine Learning and data analysis to solving fun puzzles, showcasing its adaptability.
Quantifying Misalignment: In the third article we’ll explore how to measure the distance between two probability distributions using entropy-based metrics like cross-entropy and KL-divergence. These measures are particularly valuable for tasks like comparing predicted versus true distributions, as in classification loss functions and other alignment-critical scenarios.
Quantifying Gain: Expanding from single-variable measures, this final article investigates the relationships between two. You’ll discover how to quantify the information gained about one variable (e.g, target Y) by knowing another (e.g, predictor X). Applications include assessing variable associations, feature selection, and evaluating clustering performance.
Each article is crafted to stand alone while offering cross-references for deeper exploration. Together, they provide a practical, data-driven introduction to information theory, tailored for data scientists, analysts and machine learning practitioners.
Disclaimer: Unless otherwise mentioned the formulas analysed are for categorical variables with c≥2 classes (2 meaning binary). Continuous variables will be addressed in a separate article.
Articles (3) and (4) are currently under construction. I will share links once available. Follow me to be notified
Fundamentals: Self-Information and Bits
Note: This section is a brief recap of the first article.
Self-information is considered the building block of information quantification. It is a way of quantifying the amount of “surprise” of a specific outcome.
![Surprise! Generated using Gemini.](https://towardsdatascience.com/wp-content/uploads/2025/02/1UPNLeXVIUQ7QdlXC8Tj9RQ.jpeg)
Formally self-information, denoted here as _h_ₓ, quantifies the surprise of an event x occurring based on its probability, p(x):
![Self-information _h_ₓ is the information of event x that occurs with probability p(x).](https://towardsdatascience.com/wp-content/uploads/2025/02/1N0rrXBj4rq6XnujTCqMRTw.png)
The units of measure are called bits. One bit (binary digit) is the amount of information for an event x that has probability p(x)=½. Let’s plug in to verify: hₓ=-log₂(½)= log₂(2)=1.
The choice of -log₂(p) was made as it satisfies several key axioms of information quantification:
- An event with probability 100% is not surprising and hence does not yield any information. This becomes clear when we see that if p(x)=1, then hₓ=0. A useful analogy is a trick coin (where both sides show HEAD).
- Less probable events are more surprising and provide more information. This is apparent in the other extreme: p(x) → 0 causes hₓ → ∞.
- The property of Additivity – where the total self-information of two independent events equals the sum of individual contributions – will be explored further in the upcoming Mutual Information article.
In this series I choose to use bits as the unit of measure because of the intuitive notion of a 50% chance of an event happening. In the section Visual Intuition of Bits with a Box of Chocolate below we illustrate its usefulness in the context of entropy.
An alternative commonly used in machine learning is the natural logarithm, which introduces a different unit of measure called nats (short for natural units of information). One nat corresponds to the information gained from an event occurring with a probability of 1/e where e is Euler’s number (≈2.71828). In other words, 1 nat = -ln(p=(1/e)).
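As a quick sanity check in code (a minimal sketch using numpy; the probabilities are just the defining cases above):

import numpy as np

p_half = 0.5          # an event with a 50% chance
p_inv_e = 1 / np.e    # an event with probability 1/e

print(f"{-np.log2(p_half):.0f} bit")     # 1 bit
print(f"{-np.log(p_inv_e):.0f} nat")     # 1 nat
print(f"1 nat = {1 / np.log(2):.3f} bits")  # conversion factor log₂(e) ≈ 1.443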
For further interesting details, examples and python code about self-information and bits please refer to the first article.
Quantifying Surprise – A Data Scientist’s Intro To Information Theory – Part 1/4: Foundations
Entropy: Quantifying Average Information
![Entropy - Quantifying how much "I don't know"](https://towardsdatascience.com/wp-content/uploads/2025/02/1Ol2eFEvwr68UzDLzaeHbpA.png)
So far we’ve discussed the amount of information in bits per event. This raises the question – what is the average amount of information we may learn from the distribution of a variable?
This is called entropy – and may be considered as the uncertainty or average “element of surprise” ¯\_(ツ)_/¯. In other words, it is how much information may be learnt, on average, when the variable’s value is determined (i.e, the average self-information).
Formally: given a categorical random variable X, with c possible outcomes xᵢ , i∈{1…c}, each with probability _p_ₓᵢ(xᵢ), the entropy _H_ₓ is
![Intuition: Entropy _H_ₓ is the average of the self-information hₓ of all possible outcomes xᵢ of variable X.](https://towardsdatascience.com/wp-content/uploads/2025/02/1IlL29pHzUtgQdD50k9heSg.png)
Note that here we use capital _H_ₓ (i.e, of variable X) for entropy and lower case _h_ₓᵢ for self-information of event xᵢ. (From here on I will drop the ᵢ both for convenience and also because Medium does not handle LaTeX.)
The intuition here is that each term combines:
- _h_ₓ = -log₂(_p_ₓ(x)): the self-information of each event x (as discussed in the previous section), multiplied by
- _p_ₓ(x): the weight of its expected occurrence (i.e, think of the _p_ₓ(x) not under the log as a weight _w_ₓ of event x).
A naïve pythonic calculation would look something like this
import numpy as np
pxs = [0.5, 0.5] # fair coin distribution: 50% tails, 50% heads
np.sum([-px * np.log2(px) for px in pxs])
# yields 1 bit
However, it is more pragmatic to use the scipy module:
from scipy.stats import entropy
entropy(pxs, base=2) # note the base keyword!
# yields 1 bit
This function is preferable because it addresses practical issues that the naïve script above doesn’t, e.g:

- Handling of zero values. This is crucial – try plugging `pxs=[1., 0.]` into the numpy version and you will obtain `nan` due to a `RuntimeWarning`. This is because 0 is not a valid input to log functions. If you plug it into the scipy version you will obtain the correct 0 bits.
- Normalised counts. You can feed counts instead of frequencies, e.g, if you plug in `pxs=[14, 14]` you will still get 1 bit.
Developing an Intuition for Entropy
To gain an intuition it’s instructive to examine a few examples.
Plugging in `pxs=[0.75, 0.25]` yields 0.81 bits, i.e, less than the 1 bit obtained with `pxs=[0.5, 0.5]`. Can you guess what we should expect if we reverse the values to `pxs=[0.25, 0.75]`?

We said that the outcome of `pxs=[1., 0.]` is zero bits. How about `pxs=[0., 1.]`?
To address these it’s imperative to examine a spectrum of outcomes, which we can do with a simple script.
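A minimal version of such a script (a sketch assuming numpy, matplotlib and the scipy `entropy` function from above) might be:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy

# entropy of a Bernoulli trial for p ranging from 0 to 1
ps = np.linspace(0, 1, 101)
hs = [entropy([p, 1 - p], base=2) for p in ps]

plt.plot(ps, hs, color="purple")
plt.xlabel("p")
plt.ylabel("entropy (bits)")
plt.title("Entropy of a Bernoulli trial")
plt.show()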
We obtain an insightful figure that all data scientists should have ingrained:
![The Bernoulli trial entropy curve](https://towardsdatascience.com/wp-content/uploads/2025/02/1NTbePyqId5UeSdzmunDUYA.png)
There are many learnings from this graph:
- Max Uncertainty: The maximum entropy (uncertainty) occurs in the case of a fair coin: p=½ → H=1 bit. The intuition is: if all potential outcomes are equally likely, uncertainty is at its maximum. This may be extrapolated to any categorical variable with c dimensions.
- In the binary case this max uncertainty equals 1 bit. If we examine a categorical variable with c≥3 classes (e.g, a roll of a die) we will get a larger number of bits, since we are even less certain than in a coin flip. E.g, in the case of c=3, the fair distribution yields -log₂(⅓)=1.58 bits. In the sections in which we discuss applications we will address the option of standardising to bound entropy between 0 and 1. It involves setting `base=c`, but realising that the units of measure are no longer bits (although related by a factor of log₂(c)); see the short check after this list.
- Zero valued uncertainty points: We see that in the cases of p=0 and p=1 there is no uncertainty. In other words, H=0 bits means full certainty of the outcome of the variable. For the c dimensional case this is when all classes have _p_ₓ(x)=0 except for one category, which we call x*, which has 100% probability _p_ₓ(x*)=1. (And yes, you are correct if you are thinking about classification; we’ll address this below.) The intuition is: the variable outcome is certain and nothing is learnt in a random draw. (This is the first axiom.)
- Symmetry: By definition H(x) is symmetric around p=1/c. The graph above demonstrates this in the binary case around p=½.
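As a quick illustration of the `base=c` standardisation mentioned above (a short sketch using the fair three-class distribution as input):

from scipy.stats import entropy

# a fair three-class variable, e.g, a fair three-sided die
print(entropy([1/3, 1/3, 1/3], base=2))  # ≈ 1.585 bits (maximum for c=3)
print(entropy([1/3, 1/3, 1/3], base=3))  # = 1.0, i.e, standardised between 0 and 1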
To make the concept more tangible let’s pick a point on the graph and assume that we are analysing simplistic weather reports of sun and rain, e.g, p(sun)=95%, p(rain)=5% (`pxs=[0.95, 0.05]`).

Using entropy we calculate that these weather reports contain on average 0.286 bits of information.

The intuition is:

- Rain would be very surprising (quantified as h(rain)=-log₂(0.05)=4.32 bits),
- but this only happens p(rain)=5% of the time,
- and p(sun)=95% of the time it is sunny, which provides only h(sun)=-log₂(0.95)=0.07 bits.
- Hence on average we don’t expect to be as surprised as we would if p(sun)=p(rain)=½.

From here:

H = p(sun)·h(sun) + p(rain)·h(rain) = 0.286 bits
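A two-line check with scipy (a small sketch reusing the `entropy` function from above) confirms the weighted average:

import numpy as np
from scipy.stats import entropy

p_sun, p_rain = 0.95, 0.05
manual = p_sun * -np.log2(p_sun) + p_rain * -np.log2(p_rain)  # weighted self-information
print(f"{manual:.3f} bits")                        # 0.286 bits
print(f"{entropy([p_sun, p_rain], base=2):.3f}")   # 0.286 bits, same result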
I realised that it would be neat to create an entropy graph for the c=3 scenario. Visualising 3D is always tricky, but a key insight that makes it possible is that, defining x, y and z to be the outcomes of a three-sided die, we can take advantage of the relationship p(x)+p(y)+p(z)=1.
Et voilà:
![Entropy as a function of probability p(x,y,z) of ternary events x,y,z](https://towardsdatascience.com/wp-content/uploads/2025/02/1up8f8qdZ4I3ngR2WV6MklQ.png)
Here we see that the maximum entropy is, as expected, at p(x)=p(y)=p(z)=⅓, and that as we get closer to p(x)=1, p(y)=1 or p(z)=1 (i.e, the corners where the other two probabilities are 0), H(p) drops to 0. The symmetry around the maximum entropy point also holds, but that’s harder to gauge by eye.
The script to generate this involves iterating over a mesh of probabilities (full disclosure: since I don’t enjoy handling mesh grids, it’s based on iterations between a generative model and my fine tuning).
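A simplified sketch of that approach (assuming matplotlib’s 3D toolkit, and evaluating the entropy on a plain grid of p(x) and p(y) with p(z)=1-p(x)-p(y)) might look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy

# grid over p(x) and p(y); p(z) is fixed by the constraint p(x)+p(y)+p(z)=1
step = 0.02
pxs, pys, hs = [], [], []
for px in np.arange(0, 1 + step, step):
    for py in np.arange(0, 1 - px + step / 2, step):
        pz = max(1 - px - py, 0)  # guard against floating point residue
        pxs.append(px)
        pys.append(py)
        hs.append(entropy([px, py, pz], base=2))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_trisurf(pxs, pys, hs, cmap="viridis")
ax.set_xlabel("p(x)")
ax.set_ylabel("p(y)")
ax.set_zlabel("entropy (bits)")
ax.set_title("Entropy of a three-outcome variable")
plt.show()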
To summarise this intro to entropy, the main takeaway is:
Entropy reaches its peak when all outcome probabilities are equal – i.e, maximum uncertainty. As certainty grows, entropy decreases, reaching zero at 100% predictability.
For this reason it is also used as a measure of purity of a system. We will discuss this further in applications of decision trees and DNA pooling.
Hopefully, you have gained an intuition about entropy and its relationship with self-information.
While we’ve touched on bits as the unit of measure for information, there’s another way to deepen our understanding. Inspired by 3blue1brown, a YouTube channel renowned for its mathematical visualisations, we will explore a fresh perspective on bits and their significance in quantifying information.
Visual Intuition of Bits with a Box of Chocolate 🍫
![You never know what you're gonna get. Credit: Wikipedia](https://towardsdatascience.com/wp-content/uploads/2025/02/1Qnd15D9qLRvefePubRZqrA.png)
Since a bit is logarithmic in nature, it can be counterintuitive for our linear-thinking brains to grasp. Earlier, we touched on bits within the context of self-information.
Here, we explore bits from a different perspective – through the lens of entropy, which represents the average self-information.
While researching for this series, I was inspired by visual explanations crafted by Grant Sanderson, the creator of the outstanding 3blue1brown mathematics tutorials. Building on his insights, I offer an interpretation that sheds new light on understanding bits.
In brief he demonstrates that
[The number of bits expresses] “how many times have you cut down the possibilities by half?” Grant Sanderson (3Blue1Brown)
In the Resources section you can watch his amazing video¹ explaining how to use entropy to optimise solving for a popular word game called WORDLE .
Here I demonstrate a similar interpretation using a set of observations that fictional philosopher Forrest Gump can relate to: 256 pieces of emoji-shaped chocolate.
For simplicity we’ll assume that all are equally distributed. Our first question of interest is:
“Which one chocolate emoji did Forrest get?”
Let’s visualise and quantify:
![256 chocolate emojis. Considering all things even means entropy of 8 bits.](https://towardsdatascience.com/wp-content/uploads/2025/02/13HA_rMtMtSY4zou7ioHtCQ.png)
Assuming all bonbons are equal:
- Each has a probability p=1/256 to be chosen
- Meaning each has self-information of h=-log₂(1/256)= 8 bits
- The entropy of the system is the average over all the self-information which is also 8 bits. (Remember that by all emojis being equal we are at peak uncertainty.)
Let’s assume that an observation has been made: “the chosen chocolate piece has an emoji shape with an odd ascii value” (e.g, has a hex representation: `1F642`).
This changes the possibility space in the following way:
![Left: Same full possibility space of 256 objects → Entropy=8 bits. Right: reducing the possibility space by ½ means 1 bit of information has been gained.](https://towardsdatascience.com/wp-content/uploads/2025/02/1JnkN2Y6maNYCXTwHUPvurA.png)
From here we learn:
- The possibility set was reduced by 2 (p=½).
- The subspace entropy is 7 bits.
- Compared to the 8 bits of the full possibility space we have gained 1 bit of information.
What would the picture look like if the observation was “the ascii value of the chosen emoji has a modulo 4 of zero“?
![Left: Same as before. Right: reducing the possibility space by ¼ means 2 bits of information have been gained.](https://towardsdatascience.com/wp-content/uploads/2025/02/1DbzqytxVx5jhk_utoypkDg.png)
- The possibility was reduced by 4: p=¼.
- The subspace entropy is 6 bits.
- We have gained 2 bits of information.
Let’s continue cutting down possibilities until we are left with only 8 emojis (2³ → 3 bits). We can see
![Top Left: Same as before. Top Right: Reducing the possibility space by 1/32 means 5 bits of information have been gained. Bottom: For comparison, demonstrating all examples of information gain from 1–4 bits.](https://towardsdatascience.com/wp-content/uploads/2025/02/1NxcSjH4gcBJ0UyQ1-JO9Lw.png)
These examples clearly illustrate that, assuming all emojis are equally likely, the bit count represents the number of times the possibility space must be halved to identify the selected emoji. Each step can also be understood through a c-sided die analogy (c = 256, 128, …, 8).
Both perspectives emphasise the logarithmic nature of bits, as expressed in the self-information formula hₓ=-log₂(p), which is averaged to compute entropy.
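In code, the halving interpretation is just repeated division of the possibility space (a small sketch; the 256, 128, …, 8 counts follow the chocolate-box example above):

import numpy as np

full_space = 256  # all equally likely emojis → 8 bits
for remaining in [256, 128, 64, 32, 16, 8]:
    subspace_bits = np.log2(remaining)            # entropy of the uniform subspace
    gained = np.log2(full_space) - subspace_bits  # bits gained by the observation
    print(f"{remaining:>3} emojis left: {subspace_bits:.0f} bits, gained {gained:.0f} bits")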
Analogy: Bits as Information Currency 🏦
Here’s another way to think about bits (last one, I promise!). Since we’ve seen that entropy decreases with information gain, it can be useful to consider bits as a form of currency ($, £, etc.; no bitcoin ₿ pun intended…).
Imagine having an “uncertainty account” 🏦. When a specific question in probability is posed, an account is opened, holding a balance of “uncertainty capital”. As new information is received, this balance decreases, much like making a withdrawal from the account. You can think of each piece of information gained as reducing uncertainty (or increasing certainty), akin to a negative deposit.
Unlike traditional bank accounts, this one cannot go into debt – there is no overdraft. The lowest possible balance is 0 bits, representing complete certainty. This corresponds to the case where you have full knowledge about a situation, i.e, entropy H=0 → 100% certainty. Recall the first axiom: An event with a 100% probability is perfectly unsurprising and provides no new information.
This is the same idea we saw above with the 256 piece chocolate box. We effectively opened an account with a capital of 8 bits. Each time the possibility space was reduced, we made an uncertainty withdrawal.
While not a perfect analogy (transactions are only one way and cannot be negative), it offers an intuitive way to grasp the exchange of bits between entropy and information gain.
Note that in these last two sections, I’ve simplified the examples by using powers of 2 and assuming equal probabilities for each event. In real-world scenarios, however, distributions are rarely so neat or balanced, and many applications don’t neatly align with factors of 2. We’ll dive into more complex, non-uniform distributions in the upcoming sections.
Entropy Applications
Now that we have a solid grasp of entropy and bits as its unit of measure, let’s explore some real-world applications – ranging from machine learning and applied sciences to math puzzles – that leverage these concepts in practical and often unexpected ways.
ML Application: Purity of Decision Tree Splits
![Image created using DALL-E](https://towardsdatascience.com/wp-content/uploads/2025/02/11VH78qeNNseSKSlwzFx3RQ.png)
Decision trees are a fundamental component of popular supervised machine learning algorithms, such as Random Forests and Gradient Boosting Trees, often used for tabular data.
A decision tree follows a top-down, greedy search strategy. It uses recursive partitioning, meaning it repeatedly splits the data into smaller subsets. The objective is to create clusters that are increasingly homogeneous, or “pure.”
To achieve this, the algorithm asks a series of questions about the data at each step to decide how to divide the dataset. In classification tasks, the goal is to maximise the increase in purity from parent nodes to child nodes with each split. (For those who don’t mind double negatives: this corresponds to a decrease in impurity.)
The impurity after each split may be quantified as the weighted average of the children’s entropies, which may, very loosely, be written as:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1zfSucX_0CEc7J95RIG6EYw.png)
where:
- L and R represent the left and right sides of a splitting, respectively.
- _n_ⁱ are the children node sizes for i=L,R.
- _n=nᴸ+n_ᴿ is the parent node size.
- Hⁱ are the entropies for each child i=L,R
Let’s learn by example, where the parent node has 1,000 entries with a 9:1 split between positive and negative target entries, respectively. A candidate parameter split creates two children with the following distributions:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1i1M0gJZnqYeZ75TSx-fD4g.png)
The left child has a 7:1 split (_H_ᴸ=0.54 bits) and the right child is purely positive (_H_ᴿ=0).
The result is an average children impurity of:
Children Impurity = 800/1000 × 0.54 + 200/1000 × 0 = 0.43 bits
One important feature of using entropy is that the children’s average impurity is lower than their parent’s.
Children Average Entropy < Parent’s Entropy
In our example we obtained a children’s average of 0.43 bits compared to their parent’s 0.47 bits.
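A quick check with scipy (a sketch plugging in the class counts from the example above) reproduces these numbers:

from scipy.stats import entropy

H_parent = entropy([900, 100], base=2)   # ≈ 0.47 bits
H_left   = entropy([700, 100], base=2)   # ≈ 0.54 bits
H_right  = entropy([200, 0],   base=2)   # = 0 bits (pure node)

avg_children = (800 / 1000) * H_left + (200 / 1000) * H_right
print(f"parent: {H_parent:.2f}, children average: {avg_children:.2f}")  # 0.47 vs 0.43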
The reason for this assertion is the concave shape of the entropy curve.
To understand this let’s revisit the c=2 entropy graph and add points of interest. We’ll use a slightly different numerical example that nicely visualises the point of purity increase (impurity decrease).
First, let’s script up the impurity calculation.
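A minimal `split_impurity` along these lines (a sketch: it takes the two children’s class counts and returns their size-weighted average entropy in bits, consistent with how it is called below) would be:

from scipy.stats import entropy

def split_impurity(counts_left, counts_right):
    """Size-weighted average entropy (in bits) of two child nodes, given their class counts."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    n_total = n_left + n_right
    return (n_left / n_total) * entropy(counts_left, base=2) + \
           (n_right / n_total) * entropy(counts_right, base=2)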
Our working example will include a parent with [700, 300] that splits into a near even node [300, 290] and its complementary near pure one [400, 10]:
# node class frequencies
ps_parent = [700, 300]
ps_childL = [300, 290]
ps_childR = [ps_parent[0] - ps_childL[0], ps_parent[1] - ps_childL[1]]
# node entropies
H_parent = entropy(ps_parent, base=2)
H_childL = entropy(ps_childL, base=2)
H_childR = entropy(ps_childR, base=2)
H_childrenA = split_impurity(ps_childL, ps_childR)
print(f"parent entropy: {H_parent:0.2f}")
print(f"childL entropy: {H_childL:0.4f}")
print(f"childR entropy: {H_childR:0.2f}")
print("-" * 20)
print(f"average child impurity: {H_childrenA:0.2f}")
# Yields
# parent entropy: 0.88
# childL entropy: 0.9998 # nearly even
# childR entropy: 0.17 # nearly deterministic
# --------------------
# average child impurity: 0.66
We can visualise this as the following:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1TNivEAOZC79JfuC-zvLZOQ.png)
The purple solid line is the continuous entropy, as before. The parent node entropy is the black dot over the orange dashed line. The children node entropies are the orange dots at the extremes of the dashed line. We see that even though one child has an undesired higher entropy (less purity; higher impurity) than the parent, this is compensated by the much higher purity of the other child. The x on the orange dashed line is the average child entropy, where the arrow indicates how much purer their average is than that of their parent.
The reduction in impurity from parent to child node average is a consequence of the concave shape of the entropy curve. In the Resources section, I’ve included an article² that highlights this feature, which is shared with another heuristic called the Gini index. This characteristic is often cited as a key reason for choosing entropy over other metrics that lack this property.
For the visual above I used this script:
import matplotlib.pyplot as plt

# calculating entropies for all values of p ranging from 0 to 1 in intervals of 0.01
entropies = {p_: entropy([p_, 1-p_], base=2) for p_ in np.arange(0, 1.01, 0.01)}
# plotting
plt.plot(list(entropies.keys()), entropies.values(), color='purple')
plt.title('Entropy of a Bernoulli trial')
plt.xlabel(r'$p$')
plt.ylabel('bits')
# node frequencies
p_parent = ps_parent[0] / sum(ps_parent)
p_childL = ps_childL[0] / sum(ps_childL)
p_childR = ps_childR[0] / sum(ps_childR)
plt.scatter([p_parent,], [H_parent], color='black', label='parent')
plt.scatter([p_childL, p_childR], [H_childL, H_childR], color='orange', label='children')
plt.plot([p_childL, p_childR], [H_childL, H_childR], color='orange', linestyle='--')
plt.scatter([p_parent], [H_childrenA], color='green', label='children average', marker="x", linewidth=2)
# draw narrow arrow between parent and children average
plt.annotate('', xy=(p_parent, H_childrenA + 0.01), xytext=(p_parent, H_parent - 0.015), arrowprops=dict(facecolor='black', linewidth=1, arrowstyle="-|>, head_width=0.3, head_length=0.5"))
plt.legend(title="Nodes")
In this section, I’ve demonstrated how entropy can be used to evaluate the purity of the leaf (child) nodes in a Decision Tree. Those paying close attention will notice that I focused solely on the target variable and ignored the predictors and their splitting values, assuming this choice as a given.
In practice, each split is determined by optimising based on both the predictors and target variables. As we’ve seen, entropy only addresses one variable. To account for both a predictor and a target variable simultaneously, we need a heuristic that captures the relationship between them. We’ll revisit this Decision Tree application in the upcoming Mutual Information article.
Next we’ll continue exploring the concept of population purity, as it plays a key role in scientific applications.
Diversity Application: DNA Library Verification 🧬
![Biology inspired dice that Kelly Boukra 3D printed during our time in a biotech startup LabGenius. They display symbols of the building blocks of DNA (nucleotide bases A-T-C-G) and proteins (20 Amino Acids). From left to right eight sided A-T-C-G (X 2), 20 sided Amino Acids and four sided A-T-C-G.](https://towardsdatascience.com/wp-content/uploads/2025/02/1uMeaP1HpbX3ISbqNwpfqjg.png)
In the decision tree example, we saw how entropy serves as a powerful tool for quantifying im/purity. Similarly, in the sciences, entropy is often used as a diversity index to measure the variety of different types within a dataset. In certain fields this application is also referred to as Shannon’s diversity index or the Shannon-Wiener index.
An interesting implementation of diversity assessment arises in DNA sequencing. When testing candidate molecules for therapeutics, biologists quantify the diversity of DNA segments in a collection known as a DNA library – an essential step in the process.
These libraries consist of DNA strands that represent slightly different versions of a gene, with variations in their building blocks, called nucleotide bases (or for short nucleotides or bases), at specific positions. These bases are symbolised by the letters A, C, T, and G.
Protein engineers have various requirements for the diversity of nucleotide bases at a given position, e.g, full degeneracy (i.e, an even distribution) or non-degeneracy (i.e, purity). (There are other requirements, but they are beyond the scope of this article. Also out of scope: how nucleotide bases are measured in practice with a device called a DNA sequencer; I briefly discuss this in a Supplementary section below.)
My former colleague Staffan Piledahl at LabGenius (now Briefly-Bio), sought to quantify the diversity of a DNA library, and we realised that entropy is an excellent tool for the task.
He aimed to classify the diversity at each position, e.g, either full degeneracy or non-degeneracy. (For completeness I mention that he also worked on partial-degeneracy but will ignore for simplicity.)
Let’s examine an example position that requires full degeneracy, which has the ideal distribution p(🅐)=p(🅒)=p(🅣)=p(🅖)=¼. From what we have learnt so far this would mean that the self-information of each base is -log₂(¼)=2 bits. Since all four are equally distributed, the entropy is the average of all of these, yielding H=2 bits. This is, of course, the maximum entropy, since all possibilities are equally likely.
Since we have a system of four, it may be beneficial to work in base 4, i.e, use log₄ instead of log₂. The advantage of this is standardising the entropy between 0 (no degeneracy) and 1 (full degeneracy). One last note before continuing: by using a different base we are no longer using bits as the unit of measure but rather four-digits, which for lack of creativity I will call fits for short.
In other words, in the ideal full-degeneracy case the entropy is at its maximum: H=2 bits=1 fit.
In reality, biological data can be a bit messy and one should set tolerance bounds. We can imagine accepting [0.3, 0.2, 0.25, 0.25] (→ H=0.993 fits) or other permutations like [0.3, 0.2, 0.3, 0.2] (→ H=0.985 fits). Setting the boundaries of entropy H to define full degeneracy is beyond the scope of this article and is case specific.
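These fit values are easy to reproduce with scipy’s `base` keyword (a short check using the example distributions above):

from scipy.stats import entropy

print(entropy([0.25, 0.25, 0.25, 0.25], base=4))  # 1.000 fits: full degeneracy
print(entropy([0.3, 0.2, 0.25, 0.25], base=4))    # ≈ 0.993 fits
print(entropy([0.3, 0.2, 0.3, 0.2], base=4))      # ≈ 0.985 fits
print(entropy([0.95, 0.05, 0.0, 0.0], base=4))    # ≈ 0.143 fits: nearly pure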
Staffan also pointed out that it is not enough to have a reasonable H range, but also have the “right diversity”. This is apparent in the non-degenerate (or pure) case.
Let’s say that at a given position we ideally want p(🅐)=1, p(🅒)=p(🅣)=p(🅖)=0, or [1, 0, 0, 0] (→ H=0 fits). This means that reasonable upper boundaries may be [0.95, 0.05, 0, 0] (target base at 95% and another at 5% → H=0.143 fits) or slightly higher at [0.95, 0.016, 0.016, 0.016] (target base at 95% and the rest equally distributed → H=0.177 fits), depending on the use case.
However, even though [0, 1, 0, 0] (i.e, p(🅒)=1, p(🅐)=p(🅣)=p(🅖)=0) also yields the desired H=0 fits, it is the wrong target diversity (our target is 🅐 not 🅒).
The takeaway is that entropy is a great tool to quantify the diversity, but context is king and should be incorporated in the decision making process, especially when this is automated.
For those interested in the details I will share Staffan Piledahl‘s article when it’s ready. Working title: Easy verification of DNA libraries using Sanger Sequencing and Shannon Diversity index.
In our final example we’ll apply entropy to better understand information in a popular puzzle.
Math Application: 🚪🚪 The Monty Hall Problem 🎩
The Monty Hall problem is one of the most well-known statistics brain teasers. It has captured a generation’s imagination due to the simplicity of its setup and … how easy it is to be fooled by it. It even eluded leading statisticians.
The premise is a game show in which a contestant has to choose one of three closed doors to win a prize car. Behind the other two are goats. After choosing a door the host opens one of the remaining two revealing one of the goats. The contestant then needs to decide:
Switch or stay?
![The contestant chooses door ☝ and the host 🎩 reveals door C showing a goat.](https://towardsdatascience.com/wp-content/uploads/2025/02/08fA8iP4wqdIHqa-I.png)
Spoiler alert: Below I’m going to reveal the answer.
If you are interested in guessing for yourself and learning more about why this is confusing you may want to check out my deep dive article. Alternatively you could skip to the next section and after reading the other article return to this section which adds information theory context which the deep dive does not.
The correct answer is that (last chance to look away!) it is better to switch. The reason is that when switching, the probability of winning the prize is ⅔, whereas when staying it remains ⅓, as it was before the host’s intervention.
At first glance, most people, including myself, incorrectly think that it doesn’t matter if one switches or stays arguing that the choice is a 50:50 split.
Why does the initial chosen door still have ⅓ after the intervention and the remaining one has ⅔?
To understand this we have to examine the probability distributions prior to and after the host’s intervention. Even though prior to the intervention each door has a ⅓ probability of concealing the prize ([⅓, ⅓, ⅓]), it is better to think in terms of a “macro probability distribution” before the intervention, classified as [chosen; not chosen] = [⅓; ⅔]. The following visual illustrates this:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/0UQoaAFCqtP0QSP_Y.png)
This is useful because this macro distribution does not change after the intervention:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/0lTcrILtFWI7ycdbD.png)
It is crucial to realise that the micro distribution does change to [⅓, ⅔, 0]. Armed with self-information and entropy we learn that:
- The self-information of the doors shifted from [1.58, 1.58, 1.58] (all -log₂(⅓)=1.58) to [1.58, 0.58, 0].
- The entropy is lowered from the maximum of 1.585 bits (even distribution) to 0.918 bits (skewed distribution).
This puzzle is more interesting when posing this question with many doors.
Imagine 100 doors where, after the contestant chooses one, the host reveals 98 goats, leaving two doors to choose from.
Examine this scenario and decide if the contestant should remain with the original choice or switch.
![The 100 Door Monty Hall problem after the host intervention. Should you stick with your door 👇 or switch?](https://towardsdatascience.com/wp-content/uploads/2025/02/0LZanaKPFx5qr-zH0.png)
The switch option is now quite obvious.
Why is it so obvious that one should switch when the number of doors is large but not when there are only three?
This is because by opening a door (or multiple doors if c>3), the host provides a lot of information.
Let’s use entropy to quantify how much. We’ll explore with some Python code but first, the following visual shows the “macro probability distribution” similar to above but for 100 doors:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/0T3myF2PtMDbgTD4x.png)
The chosen door remains with p=1/100 but the remaining door gets all the probabilities from the revealed ones p=99/100. So we moved from the maximum entropy (even distribution) to a much lower one (highly skewed distribution).
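To put numbers on the 100-door case (a short sketch; the distributions follow directly from the description above):

from scipy.stats import entropy

p_before = [1/100] * 100              # before: car equally likely behind any door
p_after = [1/100, 99/100] + [0] * 98  # after: chosen door vs the single remaining door

H_before = entropy(p_before, base=2)  # ≈ 6.64 bits
H_after = entropy(p_after, base=2)    # ≈ 0.08 bits
print(f"information gain: {H_before - H_after:.2f} bits")  # ≈ 6.56 bits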
We’ll now use Python scripts to quantify and visualise as well as describe analytically for any number of doors c (as an analogy to a c-sided die).
We’ll start with the standard c=3 problem:
p_before = [1./3, 1./3, 1./3] # probability of car being behind door A,B,C before revealing
p_after = [1./3, 2./3, 0.] # probability of car being behind door A,B,C after revealing
By now this sort of setup should look familiar.
We can calculate the entropy before and after from and infer the information gain:
entropy_before = entropy(p_before, base=2) # for 3 doors yields 1.58
entropy_after = entropy(p_after, base=2) # for 3 doors yields 0.92
information_gain = entropy_before - entropy_after
# yields 2/3 bits
If we do a similar calculation for the 4 door problem, with `p_before=[1/4,1/4,1/4,1/4]` and `p_after=[1/4,3/4,0,0]`, the host reveals about 1.19 bits – i.e, even more than in the three door case.
For the general c door problem we should use:
p_before = [1/c] * c
p_after = [1/c, (c-1)/c] + [0] * (c-2)
In the following graph we calculate for c=3–60, where the tail of the arrow is the entropy before the host reveals a door and the arrow head after. The length of the arrow is the information gain. I also added the horizontal entropy=1 line which indicates the incorrect assumption that the choice is a 50:50 split between the two remaining doors.
![Arrow tails are the Monty Hall problem entropy before the host intervention. The arrow heads are the entropy after the interventions where the host opens all but two doors revealing only goats. The arrow lengths are the information gain per scenario.](https://towardsdatascience.com/wp-content/uploads/2025/02/10JsQHSiyxKi1T5hJOCEuwA.png)
We see a few trends of interest:
- The entropy before the host’s intervention (arrow tails) grows with c. In a supplementary section I show this grows as log₂(c).
- The entropy after the host’s intervention decreases with c. In a supplementary section I show this function is log₂(c) – (c-1)/c * log₂(c-1).
- As such the difference between them (the arrow length), i.e, the information gain, grows as (c-1)/c * log₂(c-1).
The dotted line visualises that in the c=3 case the information gain is quite small (compared to the 50:50 split assumption) but gradually increases with the number of doors c, which makes switching the more obvious choice.
Entropy Everywhere
Once I started thinking about writing on Information Theory, entropy became my personal Roy Kent from Ted Lasso (for the uninitiated: it’s a show about British footballers and their quirky American coach):
It’s here, it’s there it’s every ****-ing-where.
For example, when attending my toddler’s gym class, entropy seemed to emerge from distributions of colourful balls in hoops:
![Very low entropy (highly skewed towards blue balls)](https://towardsdatascience.com/wp-content/uploads/2025/02/1Qrs8pi_sIZUjBh9BggndRg.png)
![Green and yellow are pure (i.e, H=0), and red is nearly pure.](https://towardsdatascience.com/wp-content/uploads/2025/02/1rWJovX5TqiqknI0QgmF5QA-scaled.jpeg)
Clearly, most toddlers’ internal colour-classification neural networks are functioning reasonably well.
Another example is conversations with my partner. She once told me that whenever I start with, “Do you know something interesting?”, she has no idea what to expect: science, entertainment, sports, philosophy, work related, travel, languages, politics. I’m thinking – high entropy conversation topics.
But when we discuss income that is in the low entropy zone: I tend to stick to the same one liner “it’s going straight to the mortgage”.
Finally, some might be familiar with the term entropy from physics. Is there a connection? Shannon’s choice for the term entropy is borrowed from statistical mechanics because it closely resembles a formula used in fields such as thermodynamics – both involve summing terms weighted by probability in the form _p_log(p). A thorough discussion is beyond the scope of this article, but while the connection is mathematically elegant in practical terms they don’t appear to be relatable³.
Conclusion
In this article, we explored entropy as a tool to quantify uncertainty – the “average surprise” expected from a variable. From estimating fairness in coin tosses and dice rolls to assessing the diversity of DNA libraries and evaluating the purity of decision tree splits, entropy emerged as a versatile concept bridging probability theory and real-world applications.
However, as insightful as entropy is, it focuses on problems where the true distribution is known. What happens when we need to compare a predicted distribution to a ground truth?
In the next article, we’ll tackle this challenge by exploring cross-entropy and KL-divergence. These heuristics quantify the misalignment between predicted and true distributions, forming the foundation for critical machine learning tasks, such as classification loss functions. Stay tuned as we continue our journey deeper into the fascinating world of information theory.
Loved this post? ❤️🍫 Follow me here, join me on LinkedIn, or buy me some chocolate!
Credits
Unless otherwise noted, all images were created by the author.
Many thanks to Staffan Piledahl and Will Reynolds for their useful comments.
About This Series
Even though I have twenty years of experience in data analysis and predictive modelling I always felt quite uneasy about using concepts in information theory without truly understanding them. For example last year I wanted to calculate the information gain in the Monty Hall problem and wasn’t sure how to do this in practice.
The purpose of this series was to put me more at ease with concepts of information theory and hopefully provide for others the explanations I needed.
Quantifying Surprise – A Data Scientist’s Intro To Information Theory – Part 1/4: Foundations
Check out my other articles which I wrote to better understand Causality and Bayesian Statistics:
- Lessons in Decision Making from the Monty Hall Problem – A journey into three intuitions: Common, Bayesian and Causal
- Causality – Mental Hygiene for Data Science – Harness The Power of Why with Causal Tools
- Start Asking Your Data “Why?” – A Gentle Intro To Causality – A beginner’s guide to thinking beyond correlations
- Mastering Simpson’s Paradox – My Gateway to Causality – Warning: You’ll never look at data in the same way
Resources and Footnotes
¹ 3blue1brown tutorial showing how to solve WORDLE using entropy.⁴
² Decision Tree Splitting: Entropy vs. Misclassification Error by Pooja Tambe
³ Entropy (information theory)/Relationship to thermodynamic entropy (Wikipedia)
⁴I’m quite proud of my WORDLE stats – no entropy optimisation applied!
![As of 2025–01–30. Hoping to make the half year mark :-)](https://towardsdatascience.com/wp-content/uploads/2025/02/1A39t3CK5X6CriFkZqxBUvw.png)
Supplementary: 🚪🚪 Monty Hall Information Gain Calculations 🎩
Here I briefly demonstrate the derivation for the Information Gain in the Monty Hall problem.
As mentioned above the underlying assumptions are that for c doors:
- The probability distribution before the host intervention: p(before) = [1/c, …, 1/c], which is of length c. E.g, for c=3 this would be [⅓, ⅓, ⅓].
- The probability distribution after the host intervention of revealing c-2 doors that have goats: p(after) = [1/c, (c-1)/c, 0, …, 0] (with c-2 zeros). E.g, for c=3 this would be [⅓, ⅔, 0] and for c=8 [⅛, ⅞, 0, 0, 0, 0, 0, 0].
Let’s apply p(before) and p(after) to the standard entropy equation:
![Entropy of probability distribution of variable X with c outcomes xᵢ, i=1...c.](https://towardsdatascience.com/wp-content/uploads/2025/02/1E_66_x1t9qwrrn8Pbho9CQ.png)
In the case of p(before) all the outcomes have the same probability:
![Monty Hall problem entropy prior to opening any of the c doors.](https://towardsdatascience.com/wp-content/uploads/2025/02/144Lj_jSveijDjDZWvwNVqw.png)
In the case of p(after) only the first two outcomes have non zero probabilities:
![Monty Hall problem entropy after the host revealed c-2 doors with goats 🐐 .](https://towardsdatascience.com/wp-content/uploads/2025/02/1_DKxR6DRICvzwJdpWKl0zA.png)
![🐐](https://s.w.org/images/core/emoji/15.0.3/72x72/1f410.png)
Here, in the square brackets, we keep only the first two non-zero terms, and the “…” represents many cancellations of terms to obtain the final expression.
The Information Gain is the difference H(before)-H(after). We see that the log₂(c) term cancels out and we remain with:
![The amount of information conveyed by the Monty Hall host when by revealing c-2 doors with goats 🐐 .](https://towardsdatascience.com/wp-content/uploads/2025/02/1V3zFyOoNErKd7znwY4Kwrg.png)
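As a sanity check, here is a short sketch comparing this closed-form expression with the direct entropy calculation for a few door counts:

import numpy as np
from scipy.stats import entropy

for c in [3, 4, 10, 100]:
    p_before = [1/c] * c
    p_after = [1/c, (c-1)/c] + [0] * (c-2)
    numeric = entropy(p_before, base=2) - entropy(p_after, base=2)
    closed_form = (c - 1) / c * np.log2(c - 1)  # the derived information gain
    print(f"c={c:>3}: numeric {numeric:.3f} bits, closed form {closed_form:.3f} bits")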
This information gain is shown as the length of arrows in the graph which I copied here:
![(Copied from above.) Arrow tails are the Monty Hall problem entropy before the host intervention. The arrow heads are the entropy after the interventions where the host opens all but two doors revealing only goats. The arrow lengths are the information gain per scenario.](https://towardsdatascience.com/wp-content/uploads/2025/02/10JsQHSiyxKi1T5hJOCEuwA.png)
For completeness below I provide the scripts used to generate the image.
Generating data:
import pandas as pd
from scipy.stats import entropy

n_stats = {}
for n_ in range(3, 61):
    p_before = [1./n_] * n_
    p_after = [1./n_, (n_-1)/n_] + [0.] * (n_-2)
    system_entropy_before = entropy(pd.Series(p_before), base=2)
    system_entropy_after = entropy(pd.Series(p_after), base=2)
    information_gain = system_entropy_before - system_entropy_after
    #print(f"before: {system_entropy_before:0.2f}, after {system_entropy_after:0.2f} bit ({information_gain:0.2}) ")
    n_stats[n_] = {"bits_before": system_entropy_before, "bits_after": system_entropy_after, "information_gain": information_gain}

df_n_stats = pd.DataFrame(n_stats).T
df_n_stats.index.name = "doors"
Visualising:
import matplotlib.pyplot as plt

alpha_ = 0.1

# Plot arrows from before to after
for i in df_n_stats.index:
    label = None
    if i == 3:
        label = "Information gain"
    information_gain = df_n_stats.loc[i, "bits_after"] - df_n_stats.loc[i, "bits_before"]
    plt.arrow(i, df_n_stats.loc[i, "bits_before"], 0, information_gain,
              head_width=0.5, head_length=0.1, fc='gray', ec='gray',
              alpha=alpha_ * 3, label=label)

plt.xlabel('Number of Doors')
plt.ylabel('Bits')
plt.title('Entropy Before and After Intervention')
plt.plot([3, 60.1], [1, 1], ':', color="gray", label="50:50 chance")
plt.legend()
# Alternative scripts to plot the analytical equations:
# Aux
#ln2_n = np.log2(df_n_stats.index)
#n_minus_1_div_n_logn_minus_1 = (df_n_stats.index - 1)/df_n_stats.index * np.log2(df_n_stats.index - 1)
# H(before)
# h_before = ln2_n
#plt.plot(df_n_stats.index, h_before, ':')
# h_after = ln2_n - n_minus_1_div_n_logn_minus_1
#plt.plot(df_n_stats.index, h_after, ':')
Supplementary: Sanger DNA Sequencing 🧬
In the main section I’ve discussed distributions of DNA building blocks called nucleotide bases. Here I describe a bit how these are measured in practice. (Full disclaimer: I don’t have a formal background in biology, but learnt a bit on the job in a biotech startup. This explanation is based on communications with a colleague and some clarification using a generative model.)
Devices known as DNA sequencers measure proxies for each building block nucleotide base and tally their occurrences. For example, a Sanger sequencer detects fluorescence intensities for each base at specific positions.
The sequencer output is typically visualised as in this diagram:
![Credit: Wikipedia](https://towardsdatascience.com/wp-content/uploads/2025/02/0hA-sh92jMpIRfrRq.png)
At each position, the sequencer provides an intensity measurement, colour-coded by base: A (green), T (red), C (blue), and G (black). These intensities represent the relative abundance of each nucleotide at that position. In most cases, the positions are monochromatic, indicating the presence of a single dominant base – what we called above non-degenerate, or pure.
However, some positions show multiple colours, which reflect genetic variation within the sample, indicating diversity. (For simplicity, we’ll assume perfect sequencing and disregard any potential artefacts.)
As an example of a partially degenerate base combination, in the following visual we can see that at the middle position shown there is a 50:50 split between C and T:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1GjQbOyCzr2ld2JAq7V4aZg.png)
This reads DNA strands with either C🅒TT or C🅣TT.