ResNet, short for Residual Network, is a highly successful deep learning architecture that won the ImageNet (ILSVRC) competition in 2015. Designed by researchers at Microsoft Research, ResNet revolutionized deep neural networks by introducing a novel approach to the challenges of training very deep networks.

### 1. **Architecture of ResNet**

ResNet's architecture revolves around the concept of **residual learning**. Traditional neural networks become difficult to train as they grow deeper, often suffering from vanishing/exploding gradients and a degradation in performance. ResNet addresses these problems with the **residual block**, the key building block of the architecture.

- **Residual Block**: Each residual block adds a shortcut or **skip connection**, allowing the input of one layer to bypass one or more layers and feed directly into a later layer. Mathematically, a residual block can be written as:

  $ y = F(x) + x $

  where $F(x)$ is the transformation applied by the block's convolutional layers and $x$ is the input, which is added back to $F(x)$ at the end of the block. This skip connection lets the network learn the residual (the difference between the input and the desired output) rather than the full transformation.

- **Stacking Residual Blocks**: ResNet stacks these residual blocks to form deep architectures, with popular versions such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The number in each name refers to the count of weight layers in the model. For example, ResNet-50 has 50 weight layers (49 convolutional layers plus a final fully connected layer) and uses bottleneck blocks for efficiency in deeper networks.

- **Bottleneck Design**: Deeper ResNet models, ResNet-50 and beyond, use a "bottleneck" block with three convolutions: 1x1, 3x3, and 1x1. The first 1x1 convolution reduces the channel dimensionality, the 3x3 convolution operates on this reduced representation, and the final 1x1 convolution restores the original width, cutting computation while still learning meaningful features. (Both block types are sketched in code below, just before the derivation.)

**Diagram of a Basic Residual Block**:

![[Pasted image 20241027125524.png]]

### 2. **How ResNet Works**

The core of ResNet lies in **learning residuals** instead of the full mapping. Here's how it works:

- **Skip Connections for Better Gradient Flow**: In traditional deep networks, information must pass through every layer sequentially. With skip connections, information can bypass certain layers, reducing the **difficulty of optimization**. Gradients can flow directly through the skip paths, mitigating vanishing gradients in very deep networks.

- **Learning Identity Mappings**: When the residual function $F(x)$ is close to zero, the output $y$ becomes approximately equal to $x$. Each block can therefore easily learn an identity function if required. This flexibility lets deeper networks avoid adding unnecessary complexity by effectively "skipping" layers that do not improve the model.

- **Improved Convergence**: The residual connections make it easier to propagate information, allowing very deep networks (such as ResNet-152, with 152 layers) to converge better than comparable architectures without residual learning.

### 3. **Why ResNet Works**

ResNet's effectiveness comes from its ability to **address the degradation problem**. In plain deep networks, performance tends to plateau and then degrade as more layers are added, due to vanishing/exploding gradients and diminishing returns on learning complex features. Let's go through the **mathematical derivation** of how ResNet works, focusing on the **residual learning** mechanism and how it allows very deep networks to be trained effectively.
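Before moving to the derivation, here is a minimal sketch of the two block types described in Section 1. The framework choice (PyTorch) and class names are illustrative, not taken from the original paper; the layout follows the common conv-BN-ReLU pattern, with a 1x1 projection on the shortcut whenever the shape changes.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity (or projected) shortcut: y = F(x) + x."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut with a 1x1 convolution when the spatial size or channel count changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = self.shortcut(x)               # x, possibly projected
        out = self.relu(self.bn1(self.conv1(x)))  # first conv of F(x)
        out = self.bn2(self.conv2(out))           # second conv of F(x)
        return self.relu(out + identity)          # y = F(x) + x


class Bottleneck(nn.Module):
    """Bottleneck block (ResNet-50/101/152): 1x1 reduce -> 3x3 -> 1x1 expand."""
    expansion = 4  # output width is 4x the bottleneck width

    def __init__(self, in_channels, width, stride=1):
        super().__init__()
        out_channels = width * self.expansion
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.shortcut(x))


# Quick shape check on a dummy feature map.
x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 64)(x).shape)   # torch.Size([1, 64, 56, 56])
print(Bottleneck(64, 64)(x).shape)   # torch.Size([1, 256, 56, 56])
```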
#### Traditional Deep Neural Network Transformation

In a traditional deep neural network, each layer applies some transformation to its input. Denote the input to a given layer as $\mathbf{x}$. The layer applies a transformation $F(\mathbf{x}, W)$, where $W$ represents the weights and biases of the layer. This transformation can include operations such as convolution, batch normalization, and activation functions. The output of the layer is:

$ \mathbf{y} = F(\mathbf{x}, W) $

Where:
- $F(\mathbf{x}, W)$ is the transformation learned by the network.
- $\mathbf{y}$ is the output of the layer.

In a deep network, $F$ must learn a **complex mapping** as depth increases, which can lead to problems such as **vanishing gradients** and **degradation** of accuracy.

#### Residual Learning: Motivation

ResNet addresses these issues by rethinking what each layer has to represent. Instead of learning a direct mapping $F(\mathbf{x})$, ResNet forces each block to learn the **residual function**, i.e. the difference between the input and the desired output. The output of a residual block is defined as:

$ \mathbf{y} = F(\mathbf{x}, W) + \mathbf{x} $

Where:
- $F(\mathbf{x}, W)$ is now the **residual function** to be learned.
- $\mathbf{x}$ is the input, added to the output of $F$.

In a residual block, $F(\mathbf{x}, W)$ typically consists of two or more convolutional layers. The key idea is that the block does not need to learn a full transformation, only an **incremental adjustment** to the input.

#### Mathematical Analysis of Gradient Flow

Residual connections help with gradient flow and address the **vanishing gradient problem**. Let's analyze how they affect the gradients during backpropagation.

Consider a deep network in which an input $\mathbf{x}$ propagates through multiple residual blocks. Unrolling the recursion $\mathbf{y}_i = \mathbf{x}_i + F_i(\mathbf{x}_i, W_i)$ (with $\mathbf{x}_1 = \mathbf{x}$ and $\mathbf{x}_{i+1} = \mathbf{y}_i$) gives:

$ \mathbf{y}_L = \mathbf{x} + \sum_{i=1}^{L} F_i(\mathbf{x}_i, W_i) $

where $L$ is the number of residual blocks. The output $\mathbf{y}_L$ is the original input $\mathbf{x}$ plus the sum of all residuals learned by the blocks.

During backpropagation, we are interested in the gradient of the loss $\mathcal{L}$ with respect to the input $\mathbf{x}$:

$ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} $

Using the chain rule, for a traditional deep network without residual connections the gradient is:

$ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_L} \cdot \frac{\partial \mathbf{y}_L}{\partial \mathbf{x}} $

For a very deep network, $\frac{\partial \mathbf{y}_L}{\partial \mathbf{x}}$ involves the product of many per-layer gradient terms, which can easily lead to **vanishing or exploding gradients**.

In a residual block, however, the gradient calculation changes because of the **skip connection**. The output of each residual block is:

$ \mathbf{y} = F(\mathbf{x}, W) + \mathbf{x} $

so the gradient of the loss with respect to the input $\mathbf{x}$ becomes:

$ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \left( \frac{\partial F(\mathbf{x}, W)}{\partial \mathbf{x}} + 1 \right) $

The **identity term (1)** in the gradient means that the gradient can be propagated directly back through the skip connection, avoiding the issue of vanishing gradients.
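As a quick numerical check of this derivation (an illustrative sketch, not from the source), consider a toy residual block $y = F(x) + x$ with $F(x) = w \cdot x$ and a tiny weight $w$. Autograd confirms that the plain layer's gradient is $w$, while the residual block's gradient is $w + 1$, so the identity term keeps it from vanishing.

```python
import torch

# Toy residual function F(x) = w * x with a tiny weight, so dF/dx = w ~ 0.
w = torch.tensor(1e-4)

# Plain layer: y = F(x)            -> dy/dx = w      (vanishes as w -> 0)
x_plain = torch.tensor(2.0, requires_grad=True)
y_plain = w * x_plain
(grad_plain,) = torch.autograd.grad(y_plain, x_plain)

# Residual block: y = F(x) + x     -> dy/dx = w + 1  (identity term keeps it near 1)
x_res = torch.tensor(2.0, requires_grad=True)
y_res = w * x_res + x_res
(grad_res,) = torch.autograd.grad(y_res, x_res)

print(f"plain    dy/dx = {grad_plain.item():.4f}")   # 0.0001
print(f"residual dy/dx = {grad_res.item():.4f}")     # 1.0001
```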
Essentially, even if $\frac{\partial F(\mathbf{x}, W)}{\partial \mathbf{x}}$ becomes very small, the identity mapping guarantees a direct path for the gradient to flow back to earlier layers.

#### Why Residual Learning Works

- **Identity Mapping Simplifies Learning**: Instead of forcing every layer to learn a completely new transformation from scratch, residual learning lets each block learn only the residual, i.e. the small change required to go from the input to the desired output. Each block "adjusts" the information rather than having to build a complex representation on its own.
- **Gradient Flow and Stability**: As the gradient derivation shows, the identity connections allow gradients to flow easily through the network, reducing the risk of vanishing gradients and making it feasible to train very deep networks.
- **Flexibility to Learn Identity Functions**: When residual blocks are stacked, they can easily learn identity functions if necessary. If adding more layers does not improve performance, the extra blocks can simply pass the input through unchanged, avoiding the drop in accuracy often seen in very deep plain models.

#### Key reasons for ResNet's success

- **Enabling Training of Very Deep Networks**: The skip connections enable gradient flow, allowing ResNet models with over 100 layers to be trained effectively. ResNet was the first architecture to demonstrate that increasing network depth does not have to degrade performance.
- **Fewer Parameters and Better Generalization**: ResNet models use fewer parameters than earlier deep architectures such as VGG, largely because bottleneck blocks and global average pooling replace large fully connected layers, making them less prone to overfitting and more robust.
- **Flexible Architecture**: ResNet serves as a base for many computer-vision architectures, such as Faster R-CNN for object detection and Mask R-CNN for segmentation, where residual backbones have shown strong performance and generalization.

### 4. **Importance of ResNet**

ResNet is one of the most widely used architectures in computer vision and beyond, for the following reasons:

- **State-of-the-Art Results in Image Classification**: ResNet set new benchmarks for image classification, beating previous models on ImageNet and achieving a top-5 error rate of 3.57%.
- **Foundation for Modern Architectures**: ResNet is the backbone of many advanced models. In tasks such as object detection (Faster R-CNN) and segmentation (Mask R-CNN, U-Net variants with ResNet encoders), ResNet commonly serves as the feature extractor thanks to its strong performance and efficiency.
- **Broader Applications**: Beyond image classification, residual learning has inspired methods in other fields, such as natural language processing (NLP) and time-series analysis.

### 5. **Caveats of Using ResNet**

While ResNet has become a cornerstone model, there are some caveats to consider:

- **Increased Computational Cost**: Deep ResNet models, especially ResNet-101 or ResNet-152, require considerable computational resources, making them challenging to deploy on devices with limited processing power.
- **Risk of Overfitting on Small Datasets**: Although ResNet is parameter-efficient, it can still overfit on smaller datasets. In such cases, transfer learning (fine-tuning a pre-trained model) or using a smaller version such as ResNet-18 can help; a minimal transfer-learning sketch appears after this list.
- **Optimization Challenges with Very Deep Versions**: Extremely deep ResNets (1000+ layers) are rarely used because training time and computational requirements grow significantly, while the gains are marginal beyond a certain depth.
- **Architecture Choices in Applications**: Depending on the problem, simpler models or variants (e.g., ResNeXt or DenseNet) may be more appropriate, offering comparable or better results at lower computational cost.
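To make the transfer-learning suggestion concrete, here is a minimal sketch assuming PyTorch with a recent torchvision (0.13+ weights API); the 10-class task and the dummy batch are placeholders standing in for a real dataset and DataLoader. It loads an ImageNet-pretrained ResNet-18, freezes the backbone, and trains only a new classification head.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 10-class task.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new layer is trainable by default

# Only the parameters of the new head are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (replace with a real DataLoader).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"dummy-batch loss: {loss.item():.3f}")
```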
### Summary

- **Architecture**: ResNet uses skip (residual) connections to enable deep networks that learn residual mappings.
- **How It Works**: Residual blocks enable efficient training by alleviating vanishing gradients and allowing identity mappings.
- **Why It Works**: Residual learning enables deep architectures without performance degradation, allowing effective feature learning.
- **Importance**: ResNet is a foundation of modern computer vision, enabling breakthroughs in classification, detection, and segmentation.
- **Caveats**: High computational demands, risk of overfitting on small datasets, and diminishing returns with extremely deep versions.

ResNet remains fundamental for both research and real-world applications, setting the stage for a new generation of deep learning architectures.