Normalization
In data preprocessing, normalization usually means rescaling the values in your data so that they fall within a fixed range, often [0, 1] or [-1, 1]. This puts different features (or datasets) on a comparable scale, which can improve the performance of machine learning models, particularly ones that are sensitive to feature magnitude, such as distance-based or gradient-based methods.
There are a few different ways to normalize data, depending on what you want to achieve:
- Min-Max Normalization: Adjusts the values in the dataset so that the smallest value becomes 0 and the largest value becomes 1.
- Z-score Normalization (Standardization): Subtracts the mean and divides by the standard deviation, so the result has a mean of 0 and a standard deviation of 1.
- Rescaling: Linearly maps the data to any specified range, such as [0, 100]. Min-max normalization is the special case where the range is [0, 1].
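As a rough illustration of the last two methods, here is a minimal sketch using only the Python standard library. The function names `z_score` and `rescale` are made up for this example, and `z_score` assumes the population standard deviation (`statistics.pstdev`); swap in `statistics.stdev` if you want the sample version. How to handle a list whose values are all equal is also an assumption made here.

```python
import statistics

def z_score(values):
    """Standardize values to mean 0 and standard deviation 1 (population std assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical: there is no spread to scale by, so return zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

def rescale(values, new_min=0.0, new_max=100.0):
    """Linearly map values so the minimum becomes new_min and the maximum becomes new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to new_min (an arbitrary choice for this sketch).
        return [float(new_min) for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (x - lo) * scale for x in values]

row = [34, 63, 88, 71, 29]
standardized = z_score(row)      # mean 0, standard deviation 1
rescaled = rescale(row, 0, 100)  # smallest value -> 0, largest -> 100
```

Calling `rescale(row, 0, 1)` gives the same result as min-max normalization.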
Simple Example: Take the following dataset, a list of lists (or a 2D array):
dataset = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18]
]
To normalize each sublist, we rescale its numbers so that they fall within the range [0, 1].
For each sublist, you would:
- Find the minimum and maximum values in the sublist.
- Adjust each value using the formula:
    normalized value = (value - min) / (max - min)
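As a worked example, take the first sublist, [34, 63, 88, 71, 29]. Its minimum is 29 and its maximum is 88, so the denominator is 88 - 29 = 59. Applying the formula to each value (results rounded to three decimals):

```latex
\begin{aligned}
34 &\mapsto \frac{34 - 29}{88 - 29} = \frac{5}{59} \approx 0.085 \\
63 &\mapsto \frac{63 - 29}{88 - 29} = \frac{34}{59} \approx 0.576 \\
88 &\mapsto \frac{88 - 29}{88 - 29} = \frac{59}{59} = 1 \\
71 &\mapsto \frac{71 - 29}{88 - 29} = \frac{42}{59} \approx 0.712 \\
29 &\mapsto \frac{29 - 29}{88 - 29} = \frac{0}{59} = 0
\end{aligned}
```

The smallest value lands on 0, the largest on 1, and everything else falls in between in proportion to its distance from the minimum.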
Here's how you can do this in Python:
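def normalize(lst):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    min_val = min(lst)
    max_val = max(lst)
    span = max_val - min_val
    if span == 0:
        # All values are equal, so the formula would divide by zero.
        # Returning zeros is one reasonable choice for that case.
        return [0.0 for _ in lst]
    return [(x - min_val) / span for x in lst]

dataset = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18]
]

# using map (map returns a lazy iterator, so wrap it in list() to get the values)
normalized_data = list(map(normalize, dataset))

# using a list comprehension (gives the same result)
normalized_data = [normalize(sublist) for sublist in dataset]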
Now the numbers in each sublist lie between 0 and 1: the smallest value in a sublist becomes 0 and the largest becomes 1.
This is min-max normalization, a common method and the first one in the list of approaches above.
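The example above normalizes each row on its own. If the rows were samples and the columns were features (an assumption, since the dataset here has no stated meaning), you would usually apply the same formula per column instead, so that each feature ends up on the [0, 1] scale. A minimal sketch, with hypothetical helper names `min_max` and `normalize_columns`:

```python
def min_max(values):
    """Scale values to [0, 1]; a constant sequence maps to all zeros (a choice made for this sketch)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

def normalize_columns(rows):
    """Apply min-max normalization to each column of a list of equal-length rows."""
    columns = zip(*rows)                                # transpose: rows -> columns
    scaled_columns = [min_max(col) for col in columns]  # normalize each column separately
    return [list(row) for row in zip(*scaled_columns)]  # transpose back to rows

dataset = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18]
]

by_column = normalize_columns(dataset)
```

Here the first column is [34, 90, 63, 51], so 34 maps to 0 and 90 maps to 1, regardless of the other values in their rows.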