Feature scaling is an important step in feature engineering. Its purpose is to transform feature values so that ML models can process them more easily, e.g., by improving convergence speed or reducing sensitivity to outliers. Here are several common feature scaling approaches used in machine learning, including when to use them and their pros and cons.

### 1. **Min-Max Normalization (Scaling to a Range)**

- **Description**: Scales features to a fixed range, typically between 0 and 1.

  $X' = \frac{X - X_{min}}{X_{max} - X_{min}}$

- **When to Use**: When you need all features within a specific range, which is often required for models sensitive to the scale of the data (e.g., neural networks).
- **Pros**:
  - Preserves the shape of the original distribution.
  - Ensures all features are in the same range, which can improve convergence for gradient-based models.
- **Cons**:
  - Sensitive to outliers, since the range depends directly on the min and max values.
  - Can distort the data if there are extreme values.

**Why feature scaling can improve convergence for gradient-based models**: When feature scales vary widely, the gradients with respect to the corresponding parameters also vary widely. Gradients along certain directions can then dominate the optimization and slow convergence. After scaling, the loss surface is better conditioned, so the path to the optimum is easier to follow. It is also hard to choose a single learning rate that suits all parameters when features sit on very different scales.

### 2. **Standardization (Z-score Normalization)**

- **Description**: Centers the data around the mean (0) with a standard deviation of 1.

  $X' = \frac{X - \mu}{\sigma}$

- **When to Use**: When the dataset contains features with different units, or the features are roughly Gaussian. Commonly used for models that assume normality (e.g., logistic regression, linear regression).
- **Pros**:
  - Puts features on a common scale regardless of their original mean and variance.
  - Helps algorithms that rely on distance metrics (e.g., k-nearest neighbors, support vector machines).
- **Cons**:
  - Sensitive to outliers.
  - May not work well with strongly non-Gaussian features.

### 3. **Robust Scaling**

- **Description**: Uses the median and interquartile range (IQR) to scale features.

  $X' = \frac{X - \text{median}}{IQR}$

- **When to Use**: When your data contains many outliers that could heavily influence the scaling.
- **Pros**:
  - Robust to outliers.
  - Does not depend on the min and max values.
- **Cons**:
  - Offers little benefit over standardization when the data has few or no outliers.
  - Downweights outliers, which may carry important information.

### 4. **Max Absolute Scaling**

- **Description**: Scales the data by dividing by the maximum absolute value.

  $X' = \frac{X}{\max(|X|)}$

- **When to Use**: When you have sparse data and want to maintain sparsity. Commonly used for sparse linear models and matrix factorization.
- **Pros**:
  - Preserves the sparsity of the data (zeros stay zero).
  - Simple to implement.
- **Cons**:
  - Sensitive to outliers.
  - Only scales to the range [-1, 1], which may not be sufficient for certain applications.

### 5. **Log Transformation**

- **Description**: Applies a logarithm to compress large values and make the data distribution more uniform.

  $X' = \log(X + 1)$

- **When to Use**: When features have a skewed (e.g., heavy-tailed) distribution and you want to reduce skewness.
- **Pros**:
  - Reduces the impact of outliers.
  - Makes skewed distributions more Gaussian-like.
- **Cons**:
  - The plain logarithm is not defined for zero or negative values.
  - May require a shift to handle non-positive values.
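The scalers in Sections 1–5 all have drop-in implementations in scikit-learn. Below is a minimal, illustrative sketch; the toy feature matrix and its values are made up for this example, and it assumes NumPy and scikit-learn are installed:

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Toy feature matrix, invented for illustration:
# column 0 contains an outlier (100.0), column 1 is mostly zeros.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 5.0],
              [100.0, -2.0]])

# 1. Min-max normalization: each column mapped to [0, 1].
print(MinMaxScaler().fit_transform(X))

# 2. Standardization: each column centered to mean 0, scaled to unit variance.
print(StandardScaler().fit_transform(X))

# 3. Robust scaling: centered on the median, scaled by the IQR,
#    so the outlier barely affects the other rows.
print(RobustScaler().fit_transform(X))

# 4. Max absolute scaling: each column divided by its max |value|;
#    zeros stay zero, so sparsity is preserved.
print(MaxAbsScaler().fit_transform(X))

# 5. Log transformation: log1p computes log(X + 1) elementwise
#    (applied only to the non-negative first column here).
print(np.log1p(X[:, [0]]))
```

In practice, fit the scaler on the training split only and apply it to validation/test data with `transform`, so the scaling statistics do not leak from held-out data.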
### 6. **Power Transformation (Box-Cox or Yeo-Johnson)**

- **Description**: Applies a power function to stabilize variance and make the data more Gaussian.
- **When to Use**: When features have skewed distributions and you need to normalize them while retaining non-linear relationships.
- **Pros**:
  - Useful for non-normal features with non-linear relationships.
  - Can reduce skewness and stabilize variance.
- **Cons**:
  - Box-Cox requires strictly positive values (Yeo-Johnson does not).
  - May require parameter tuning (e.g., estimating the power parameter) to find the best transformation.

### 7. **Quantile Transformation (Rank Scaling)**

- **Description**: Transforms the features to follow a specific distribution, often a uniform or Gaussian distribution.
- **When to Use**: When you want all features to have the same distribution; useful in non-linear models or for improving robustness to outliers.
- **Pros**:
  - Ensures the data follows the desired target distribution.
  - Reduces the effect of outliers by mapping them into the range of the target distribution.
- **Cons**:
  - Can distort the relationships between variables.
  - Computationally expensive for large datasets.

### 8. **L2 Normalization**

- **Description**: Scales the feature vector of each sample to have unit L2 (Euclidean) norm.

  $X' = \frac{X}{\|X\|_2}$

- **When to Use**: When you need to normalize the feature vector of each sample. Often used in text classification (e.g., TF-IDF features).
- **Pros**:
  - Suits models that rely on cosine similarity (e.g., text retrieval, k-means on normalized vectors).
  - Removes the influence of magnitude differences between samples.
- **Cons**:
  - Only suitable when each sample represents a vector of features.
  - May not be appropriate if magnitude differences are relevant to what the model should learn.

### Summary Table

| **Scaling Technique** | **When to Use** | **Pros** | **Cons** |
|-----------------------------|-----------------------------------------|------------------------------------------------|------------------------------------------|
| **Min-Max Normalization** | Gradient-based models | Preserves distribution shape, effective for NNs | Sensitive to outliers |
| **Standardization** | Features with Gaussian distributions | Suitable for distance-based models | Sensitive to outliers |
| **Robust Scaling** | Data with many outliers | Robust to outliers | Downweights possibly useful outliers |
| **Max Absolute Scaling** | Sparse data | Maintains sparsity | Sensitive to outliers |
| **Log Transformation** | Heavy-tailed distributions | Reduces skewness | Not defined for non-positive values |
| **Power Transformation** | Non-linear, skewed data | Reduces skewness | Requires parameter tuning |
| **Quantile Transformation** | Enforcing a specific distribution | Handles outliers, enforces a target distribution | Distorts variable relationships |
| **L2 Normalization** | Normalizing vector lengths | Effective for cosine-similarity-based models | Inappropriate for non-vector samples |

These scaling techniques should be selected based on the model requirements, the data distribution, and specific challenges such as outliers or widely varying feature scales.
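For completeness, Sections 6–8 also map directly onto scikit-learn transformers. The sketch below uses a synthetic log-normal feature (invented purely for illustration) and assumes NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic heavy-tailed (log-normal) feature, 1,000 samples.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# 6. Power transformation: Yeo-Johnson handles any real values;
#    method="box-cox" would require strictly positive inputs.
X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)

# 7. Quantile transformation: maps values through the empirical CDF
#    onto a target distribution (here a standard normal).
X_quantile = QuantileTransformer(
    output_distribution="normal", n_quantiles=100
).fit_transform(X)

# 8. L2 normalization: rescales each *row* (sample) to unit Euclidean norm,
#    unlike the column-wise scalers and transformers above.
V = np.array([[3.0, 4.0],
              [1.0, 1.0]])
print(Normalizer(norm="l2").fit_transform(V))  # first row becomes [0.6, 0.8]
```

Note that `Normalizer` works per sample (row), whereas the other scalers here work per feature (column); that distinction is exactly the "vector of features per sample" caveat from Section 8.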