Normalization

In the context of data preprocessing, normalization typically refers to adjusting the values in your data so that they fall within a fixed range, often [0, 1] or [-1, 1]. This puts different features (or datasets) on a similar scale, which can improve the performance of machine learning models that are sensitive to feature magnitude, such as distance-based methods or models trained with gradient descent.

There are a few different ways to normalize data, depending on what you want to achieve:

  • Min-Max Normalization: Adjusts the values in the dataset so that the smallest value becomes 0 and the largest value becomes 1.
  • Z-score Normalization (Standardization): Shifts and scales the data so that the mean becomes 0 and the standard deviation becomes 1. Unlike min-max normalization, the results are not confined to a fixed range.
  • Rescaling: Adjusts the data to a new specified range, like [0, 100].
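The second and third methods in the list can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names z_score and rescale are hypothetical, the population standard deviation (statistics.pstdev) is an assumption (a sample standard deviation would also be a reasonable choice), and mapping constant input to 0.0 or new_min is likewise an assumption made to avoid dividing by zero.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values so their mean is 0 and standard deviation is 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical; assumption: map them all to 0.0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def rescale(values, new_min=0.0, new_max=100.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; assumption: map them all to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (x - lo) * scale for x in values]


row = [34, 63, 88, 71, 29]
standardized = z_score(row)     # mean 0, standard deviation 1
rescaled = rescale(row, 0, 100)  # smallest value -> 0, largest -> 100
```

Min-max normalization to [0, 1] is the special case of rescale with new_min=0 and new_max=1.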

Simple Example: Take the following dataset, a list of lists (or a 2D array):

dataset = [ [34, 63, 88, 71, 29], [90, 78, 51, 27, 45], [63, 37, 85, 46, 22], [51, 22, 34, 11, 18] ]

To normalize the values in each sublist, we rescale its numbers so that they fall within the range [0, 1].

For each sublist, you would:

  1. Find the minimum and maximum values in the sublist.
  2. Adjust each value using the formula:
    • normalized value = (value - min) / (max - min)

Here's how you can do this in Python:

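def normalize(lst):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    min_val = min(lst)
    max_val = max(lst)
    if max_val == min_val:
        # All values are equal, so avoid dividing by zero.
        return [0.0 for _ in lst]
    return [(x - min_val) / (max_val - min_val) for x in lst]

dataset = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18],
]

# Using map (in Python 3, map returns a lazy iterator, so wrap it in list)
normalized_data = list(map(normalize, dataset))

# Using a list comprehension (same result)
normalized_data = [normalize(sublist) for sublist in dataset]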

Now all the numbers in each sublist are scaled to the range [0, 1]: the smallest value in a sublist becomes 0 and the largest becomes 1.
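As a quick check of that claim, the sketch below normalizes the dataset and verifies each sublist. The normalize function is repeated so the snippet runs on its own. Returning 0.0 for a sublist whose values are all equal is an assumption made here to avoid dividing by zero, and rounding to three decimals is only for display.

```python
def normalize(lst):
    """Min-max normalize a list of numbers to [0, 1] (same formula as above)."""
    min_val, max_val = min(lst), max(lst)
    if max_val == min_val:
        return [0.0 for _ in lst]  # assumption: constant lists map to 0.0
    return [(x - min_val) / (max_val - min_val) for x in lst]


dataset = [
    [34, 63, 88, 71, 29],
    [90, 78, 51, 27, 45],
    [63, 37, 85, 46, 22],
    [51, 22, 34, 11, 18],
]

normalized_data = [normalize(sublist) for sublist in dataset]

for row in normalized_data:
    # Each normalized sublist should have minimum 0 and maximum 1.
    assert min(row) == 0.0 and max(row) == 1.0
    print([round(v, 3) for v in row])

# First line printed: [0.085, 0.576, 1.0, 0.712, 0.0]
```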

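This is Min-Max Normalization, the first method listed above. Because each sublist is scaled by its own minimum and maximum, the normalized values are comparable within a sublist but not across sublists.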