DATA PREPROCESSING
Transforming adult-sized data for child-like models
Numerical features in raw datasets are like adults in a world built for grown-ups. Some tower like skyscrapers (think billion-dollar revenues), while others are barely visible (like 0.001 probabilities). But our machine learning models? They’re children, struggling to make sense of this adult world.
Data scaling (including what some call “normalization”) is the process of transforming these adult-sized numbers into child-friendly proportions. It’s about creating a level playing field where every feature, big or small, can be understood and valued appropriately.
We’re gonna see five distinct scaling techniques, all demonstrated on one little dataset (complete with some visuals, of course). From the gentle touch of normalization to the mathematical acrobatics of Box-Cox transformation, you’ll see why picking the right scaling method can be the secret sauce in your machine learning recipe.
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.
Understanding Which Data Needs Transformation
Before we get into the specifics of scaling techniques, it’s good to understand which types of data benefit from scaling and which don’t:
Data That Usually Doesn’t Need Scaling:
- Categorical variables: These should typically be encoded rather than scaled. This includes both nominal and ordinal categorical data.
- Binary variables: Features that can only take two values (0 and 1, or True and False) generally don’t need scaling.
- Count data: Integer counts often make sense as they are, and scaling may make them harder to interpret. Treat them as categorical instead. There are some exceptions, especially with very wide ranges of counts.
- Cyclical features: Data with a cyclical nature (like days of the week or months of the year) often benefit more from cyclical encoding than from standard scaling techniques (see the sketch after this list).
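To make the cyclical encoding idea concrete, here is a minimal sketch of sine/cosine encoding for a day-of-week feature. The days DataFrame is a hypothetical illustration, not part of the golf dataset we’ll use below:
import numpy as np
import pandas as pd

# Hypothetical day-of-week feature: 0 = Monday, ..., 6 = Sunday
days = pd.DataFrame({'day_of_week': [0, 1, 2, 3, 4, 5, 6]})

# Map each day onto a circle so that day 6 and day 0 end up close together
days['day_sin'] = np.sin(2 * np.pi * days['day_of_week'] / 7)
days['day_cos'] = np.cos(2 * np.pi * days['day_of_week'] / 7)
print(days)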
Data That Usually Needs Scaling:
- Continuous numerical features with wide ranges: Features that can take on a wide range of values often benefit from scaling to prevent them from dominating other features in the model.
- Features measured in different units: When your dataset includes features measured in different units (e.g., meters, kilograms, years), scaling helps put them on a comparable scale.
- Features with significantly different magnitudes: If some features have values in the thousands while others are between 0 and 1, scaling can help balance their influence on the model.
- Percentage or ratio features: While these are already on a fixed scale (typically 0–100 or 0–1), scaling might still be beneficial, especially when they are used alongside features with much larger ranges.
- Bounded continuous features: Features with a known minimum and maximum often benefit from scaling, especially if their range is significantly different from other features in the dataset.
- Skewed distributions: Features with highly skewed distributions often benefit from certain types of scaling or transformation to make them more normally distributed and improve model performance.
Why Scale Your Data?
Now, you might be wondering, “Why bother scaling at all? Can’t we just let the data be?” Well, many machine learning algorithms perform best when all features are on a similar scale. Here’s why scaling is needed:
- Equal Feature Importance: Unscaled features can accidentally dominate the model. For instance, wind speed (0–50 km/h) might overshadow temperature (10–35°C) simply because of its larger scale, not because it’s more important (see the small sketch after this list).
- Faster Convergence: Many optimization algorithms used in machine learning converge faster when features are on a similar scale.
- Improved Algorithm Performance: Some algorithms, like K-Nearest Neighbors and Neural Networks, explicitly require scaled data to perform well.
- Interpretability: Scaled coefficients in linear models are easier to interpret and compare.
- Avoiding Numerical Instability: Very large or very small values can lead to numerical instability in some algorithms.
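To make the first point concrete, here is a small sketch (with made-up temperature and wind speed values) showing how the feature with the larger spread dominates a squared Euclidean distance until the features are scaled:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical days: [temperature_celsius, wind_speed_kmh]
X = np.array([[10.0, 5.0],
              [35.0, 50.0]])

# Raw scale: the wind speed gap contributes far more to the squared distance
print((X[1] - X[0]) ** 2)   # [ 625. 2025.]

# After Min-Max scaling (fit on this tiny sample just for illustration)
X_scaled = MinMaxScaler().fit_transform(X)
print((X_scaled[1] - X_scaled[0]) ** 2)   # [1. 1.] -> both features contribute equally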
Now that we understand which numerical data need scaling and why, let’s take a look at our dataset and see how we can scale its numerical variables using five different scaling methods. It’s not just about scaling — it’s about scaling right.
The Dataset
Before we get into the scaling techniques, let’s see our dataset. We’ll be working with data from this fictional golf club.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from scipy import stats
# Create the dataset
data = {
    'Temperature_Celsius': [15, 18, 22, 25, 28, 30, 32, 29, 26, 23, 20, 17],
    'Humidity_Percent': [50, 55, 60, 65, 70, 75, 80, 72, 68, 62, 58, 52],
    'Wind_Speed_kmh': [5, 8, 12, 15, 10, 7, 20, 18, 14, 9, 6, 11],
    'Golfers_Count': [20, 35, 50, 75, 100, 120, 90, 110, 85, 60, 40, 25],
    'Green_Speed': [8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 11.0, 10.5, 10.0, 9.5, 9.0]
}
df = pd.DataFrame(data)
This dataset is perfect for our scaling tasks because it contains features with different units, scales, and distributions.
Let’s get into all the scaling methods now.
Method 1: Min-Max Scaling
Min-Max Scaling transforms all values to a fixed range, typically between 0 and 1, by subtracting the minimum value and dividing by the range (the maximum minus the minimum).
📊 Common Data Types: Features with a wide range of values, where a specific range is desired.
🎯 Goals:
– Constrain features to a specific range (e.g., 0 to 1).
– Preserve the original relationships between data points.
– Ensure interpretability of scaled values.
In Our Case: We apply this to Temperature because temperature has a natural minimum and maximum in our golfing context. It preserves the relative differences between temperatures, making 0 the coldest day, 1 the hottest, and 0.5 a day right in the middle of the temperature range.
# 1. Min-Max Scaling for Temperature_Celsius
min_max_scaler = MinMaxScaler()
df['Temperature_MinMax'] = min_max_scaler.fit_transform(df[['Temperature_Celsius']])
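As an optional sanity check, we can reproduce the scaler’s output by applying the formula by hand; the result should match what MinMaxScaler produced:
# Manual check: (x - min) / (max - min)
t = df['Temperature_Celsius']
manual_minmax = (t - t.min()) / (t.max() - t.min())
print(np.allclose(manual_minmax, df['Temperature_MinMax']))   # True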
Method 2: Standard Scaling
Standard Scaling centers data around a mean of 0 and scales it to a standard deviation of 1, achieved by subtracting the mean and dividing by the standard deviation.
📊 Common Data Types: Features with varying scales and distributions.
🎯 Goals:
– Standardize features to have a mean of 0 and a standard deviation of 1.
– Ensure features with different scales contribute equally to a model.
– Prepare data for algorithms sensitive to feature scales (e.g., SVM, KNN).
In Our Case: We use this for Wind Speed because wind speed often follows a roughly normal distribution. It allows us to easily identify exceptionally calm or windy days by how many standard deviations they are from the mean.
# 2. Standard Scaling for Wind_Speed_kmh
std_scaler = StandardScaler()
df['Wind_Speed_Standardized'] = std_scaler.fit_transform(df[['Wind_Speed_kmh']])
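A similar hand calculation shows what StandardScaler does under the hood. Note that scikit-learn divides by the population standard deviation, so we pass ddof=0 to match it:
# Manual check: (x - mean) / population standard deviation
w = df['Wind_Speed_kmh']
manual_z = (w - w.mean()) / w.std(ddof=0)
print(np.allclose(manual_z, df['Wind_Speed_Standardized']))   # True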
Method 3: Robust Scaling
Robust Scaling centers data around the median and scales it using the interquartile range (IQR).
📊 Common Data Types: Features with outliers or noisy data.
🎯 Goals:
– Handle outliers effectively without being overly influenced by them.
– Maintain the relative order of data points.
– Achieve a stable scaling in the presence of noisy data.
In Our Case: We apply this to Humidity because humidity readings can have outliers due to extreme weather conditions or measurement errors. This scaling keeps the scaled values from being overly influenced by those outliers.
# 3. Robust Scaling for Humidity_Percent
robust_scaler = RobustScaler()
df['Humidity_Robust'] = robust_scaler.fit_transform(df[['Humidity_Percent']])
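And once more by hand: with its default settings, RobustScaler subtracts the median and divides by the interquartile range (75th percentile minus 25th percentile):
# Manual check: (x - median) / IQR
h = df['Humidity_Percent']
iqr = np.percentile(h, 75) - np.percentile(h, 25)
manual_robust = (h - h.median()) / iqr
print(np.allclose(manual_robust, df['Humidity_Robust']))   # True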
So far, we’ve looked at a few ways to scale data using scikit-learn’s built-in scalers. Now, let’s explore a different approach: using transformations to achieve scaling, starting with the common log transformation.
Method 4: Log Transformation
The log transformation applies a logarithmic function to the data, compressing the scale of very large values.
📊 Common Data Types:
– Right-skewed data (long tail).
– Count data.
– Data with multiplicative relationships.
🎯 Goals:
– Address right-skewness and normalize the distribution.
– Stabilize variance across the feature’s range.
– Improve model performance for data with these characteristics.
In Our Case: We use this for Golfers Count because count data often follows a right-skewed distribution. It makes the difference between 10 and 20 golfers more significant than between 100 and 110, aligning with the real-world impact of these differences.
# 4. Log Transformation for Golfers_Count
df['Golfers_Log'] = np.log1p(df['Golfers_Count'])
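To see the compression described above in numbers, compare the two gaps after applying np.log1p (log of 1 plus the value, which also handles zero counts gracefully):
# The 10 -> 20 gap is much larger in log space than the 100 -> 110 gap
print(np.log1p(20) - np.log1p(10))     # ~0.65
print(np.log1p(110) - np.log1p(100))   # ~0.09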
Method 5: Box-Cox Transformation
The Box-Cox transformation is a family of power transformations (which includes the log transformation as a special case) that aims to make a feature’s distribution more normal. It applies a power transformation with a parameter lambda (λ) that is optimized to achieve the desired normality.
📊 Common Data Types: Features needing normalization to approximate a normal distribution.
🎯 Goals:
– Normalize the distribution of a feature.
– Improve the performance of models that assume normally distributed data.
– Stabilize variance and potentially enhance linearity.
In Our Case: We apply this to Green Speed because it might have a complex distribution not easily normalized by simpler methods. It allows the data to guide us to the most appropriate transformation, potentially improving its relationships with other variables.
# 5. Box-Cox Transformation for Green_Speed
df['Green_Speed_BoxCox'], lambda_param = stats.boxcox(df['Green_Speed'])
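As a quick check of what stats.boxcox returns, the transformed values follow the Box-Cox formula (x^λ - 1) / λ for the fitted λ (assuming the fitted λ is not zero, in which case the formula becomes a plain log):
# Manual check of the Box-Cox formula for the fitted lambda (valid when lambda != 0)
gs = df['Green_Speed']
manual_boxcox = (gs**lambda_param - 1) / lambda_param
print("Box-Cox lambda:", lambda_param)
print(np.allclose(manual_boxcox, df['Green_Speed_BoxCox']))   # True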
After applying a transformation, it is also common to scale the result further so it follows a particular distribution (such as a standard normal). We can do this to both of the transformed columns we created.
from sklearn.preprocessing import PowerTransformer

# Standardize the log-transformed golfer counts
df['Golfers_Log_std'] = std_scaler.fit_transform(df[['Golfers_Log']])

# Box-Cox with built-in standardization (standardize=True by default)
box_cox_transformer = PowerTransformer(method='box-cox')
df['Green_Speed_BoxCox'] = box_cox_transformer.fit_transform(df[['Green_Speed']])
print("\nBox-Cox lambda parameter:", lambda_param)
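If you go the PowerTransformer route, the fitted lambda is also stored on the transformer itself, and the transformation can be inverted to recover the original green speeds. A small sketch, assuming the box_cox_transformer fitted above:
# One fitted lambda per transformed column
print(box_cox_transformer.lambdas_)

# Round-trip: invert the standardized Box-Cox values back to the original scale
recovered = box_cox_transformer.inverse_transform(df[['Green_Speed_BoxCox']])
print(np.allclose(recovered.ravel(), df['Green_Speed']))   # True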
Conclusion: The Power of Scaling
So, there you have it. Five different scaling techniques, all applied to our golf course dataset. Now, all numerical features are transformed and ready for machine learning models.
Here’s a quick recap of each method and its application:
- Min-Max Scaling: Applied to Temperature, normalizing values to a 0–1 range for better model interpretability.
- Standard Scaling: Used for Wind Speed, standardizing the distribution to reduce the impact of extreme values.
- Robust Scaling: Applied to Humidity to handle potential outliers and reduce their effect on model performance.
- Log Transformation: Used for Golfers Count to normalize right-skewed count data and improve model stability.
- Box-Cox Transformation: Applied to Green Speed to make the distribution more normal-like, which is often required by machine learning algorithms.
Each scaling method serves a specific purpose and is chosen based on the nature of the data and the requirements of the machine learning algorithm. By applying these techniques, we’ve prepared our numerical features for use in various machine learning models, potentially improving their performance and reliability.
🌟 Scaling Numerical Data, Code Summarized
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer
# Create the dataset
data = {
    'Temperature_Celsius': [15, 18, 22, 25, 28, 30, 32, 29, 26, 23, 20, 17],
    'Humidity_Percent': [50, 55, 60, 65, 70, 75, 80, 72, 68, 62, 58, 52],
    'Wind_Speed_kmh': [5, 8, 12, 15, 10, 7, 20, 18, 14, 9, 6, 11],
    'Golfers_Count': [20, 35, 50, 75, 100, 120, 90, 110, 85, 60, 40, 25],
    'Green_Speed': [8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 11.0, 10.5, 10.0, 9.5, 9.0]
}
df = pd.DataFrame(data)
# 1. Min-Max Scaling for Temperature_Celsius
min_max_scaler = MinMaxScaler()
df['Temperature_MinMax'] = min_max_scaler.fit_transform(df[['Temperature_Celsius']])
# 2. Standard Scaling for Wind_Speed_kmh
std_scaler = StandardScaler()
df['Wind_Speed_Standardized'] = std_scaler.fit_transform(df[['Wind_Speed_kmh']])
# 3. Robust Scaling for Humidity_Percent
robust_scaler = RobustScaler()
df['Humidity_Robust'] = robust_scaler.fit_transform(df[['Humidity_Percent']])
# 4. Log Transformation for Golfers_Count (followed by standardization)
df['Golfers_Log'] = np.log1p(df['Golfers_Count'])
df['Golfers_Log_std'] = std_scaler.fit_transform(df[['Golfers_Log']])
# 5. Box-Cox Transformation for Green_Speed (standardize=True by default)
box_cox_transformer = PowerTransformer(method='box-cox')
df['Green_Speed_BoxCox'] = box_cox_transformer.fit_transform(df[['Green_Speed']])
# Display the results
transformed_data = df[[
    'Temperature_MinMax',
    'Humidity_Robust',
    'Wind_Speed_Standardized',
    'Green_Speed_BoxCox',
    'Golfers_Log_std',
]]
transformed_data = transformed_data.round(2)
print(transformed_data)
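The summary above transforms each column in place, which is fine for exploration. In a real modeling workflow you would usually bundle the same five steps into a single preprocessing object so they can be fit once and reapplied consistently. Here is one possible sketch using scikit-learn’s ColumnTransformer; the column groupings mirror the choices made in this article, and the step names are just illustrative:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

preprocessor = ColumnTransformer(transformers=[
    ('temp_minmax', MinMaxScaler(), ['Temperature_Celsius']),
    ('wind_standard', StandardScaler(), ['Wind_Speed_kmh']),
    ('humidity_robust', RobustScaler(), ['Humidity_Percent']),
    ('golfers_log_std', Pipeline([('log', FunctionTransformer(np.log1p)),
                                  ('standardize', StandardScaler())]), ['Golfers_Count']),
    ('green_boxcox', PowerTransformer(method='box-cox'), ['Green_Speed']),
])

features = ['Temperature_Celsius', 'Humidity_Percent', 'Wind_Speed_kmh',
            'Golfers_Count', 'Green_Speed']
X_prepared = preprocessor.fit_transform(df[features])
print(X_prepared.shape)   # (12, 5)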
⚠️ Clarifying “Scaling,” “Normalization,” and “Transformation”
As these terms are often used inconsistently in data science, let me clarify the distinctions:
Scaling: This is a broader term that refers to changing the range of values. It includes techniques like:
- Min-Max scaling (scales to a fixed range, often 0–1)
- Standard scaling (scales to mean 0 and standard deviation 1)
Normalization: In a strict statistical sense, this typically refers to adjusting values measured on different scales to a common scale, often to make features have the properties of a normal distribution. Techniques include:
- Z-score normalization (same as standard scaling)
- Log normalization
- Box-Cox transformation
Transformation: This is the broadest term, referring to any mathematical operation applied to change the values or distribution of a dataset. It includes both scaling and normalization, as well as other operations like:
– Power transformations (e.g., square root, cube root)
– Logarithmic transformations
– Exponential transformations
But, in practice:
– Some people use “normalization” specifically to mean scaling to a [0,1] interval (Min-Max scaling).
– Others use “normalization” and “scaling” almost interchangeably.
– “Transformation” is sometimes used interchangeably with both “scaling” and “normalization,” but it’s actually a more general term.
Given this overlap and inconsistent usage, for a beginner-focused article, I decided to use “Scaling” for simplicity. It’s better to focus on what each technique does rather than getting caught up in the terminology debate.
Further Reading
For a detailed explanation of MinMaxScaler, StandardScaler, and RobustScaler and their implementations in scikit-learn, readers can refer to the official documentation [1], which provides comprehensive information on their usage and parameters.
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.
For a concise visual summary, check out the companion Instagram post.
Reference
[1] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.