Gray to Glory: Performance Analysis of Image Coloration Models

Tushar Deshpande
Fall 2024 ECE 4554/5554 Computer Vision: Course Project
Virginia Tech

Abstract

Automatic image coloration is a challenging problem in computer vision, motivated by its potential applications in image restoration, enhancement, and creative tasks. This project explores the performance of various deep learning architectures in transforming grayscale images into realistic color versions, aiming to understand the theoretical underpinnings and practical performance of these models. The approach involves experimenting with a progression of architectures: a simple CNN-based encoder-decoder, a U-Net with a pre-trained ResNet backbone, and a conditional GAN with the aforementioned U-Net as the generator, incorporating adversarial loss. Preliminary results indicate that as model complexity increases, the generated images exhibit improved color fidelity and realism, highlighting the potential of GANs in achieving context-aware and high-quality colorization. We observe an improvement in Mean Absolute Error (MAE) from 0.0848 for the CNN-based encoder-decoder to 0.0807 for the cGAN. Similarly, we observe an improvement in the Learned Perceptual Image Patch Similarity (LPIPS) score (using VGG) from 0.148 for the CNN-based encoder-decoder to 0.132 for the cGAN. Lower is better for both metrics.

Introduction

Image colorization—the process of converting grayscale images into plausible color versions—has garnered significant interest due to its wide-ranging applications, including the restoration of historical photographs, enhancement of medical imaging, and the generation of visually enriched content for creative industries. This task is inherently challenging, as it requires a model to infer appropriate colors based on contextual and semantic information present in the grayscale input. Traditional methods for image colorization often relied on manual intervention or heuristic algorithms, which were labor-intensive and lacked generalizability across diverse image datasets. The advent of deep learning has revolutionized this field, enabling the development of models that can learn complex mappings from grayscale to color images directly from data. Notable approaches include convolutional neural networks (CNNs) and generative adversarial networks (GANs), which have demonstrated remarkable success in producing realistic colorizations. For instance, Zhang et al. proposed a deep learning approach that combines a CNN with high-level features extracted from a pre-trained model, achieving impressive results in automatic colorization tasks [1]. Similarly, Deshpande et al. introduced a variational approach to produce diverse and contextually appropriate colorizations, highlighting the potential of probabilistic models in this domain [3]. Despite these advancements, challenges remain, particularly in generating contextually accurate and diverse colorizations. Recent research has focused on leveraging generative models to address these issues. Wu et al. introduced a method that utilizes a pretrained GAN to provide rich and diverse color priors, enabling the production of vivid and varied colorizations [2]. 
In this project, we aim to conduct a comparative study of various deep learning architectures for image colorization, including simple CNN-based encoder-decoders, U-Net architectures with pre-trained backbones, and GAN-based frameworks. By systematically evaluating these models on a subset of the COCO dataset, we seek to identify which architecture produces the most realistic and high-quality colorizations and to understand the trade-offs between them.

Approach

To solve the problem of automatic image coloration, we employed a progressive approach involving multiple architectures. This approach was designed to analyze and compare the effectiveness of various models in transforming grayscale images into realistic colorized versions. The process included data preprocessing and three main modeling phases: baseline CNN encoder-decoder, U-Net architecture with a ResNet-18 backbone, and a Conditional GAN (cGAN) with a PatchGAN discriminator [4].

Data Preprocessing

The images from the COCO dataset were converted into the Lab color space, where the L* channel (lightness) represents grayscale information, and the A* and B* channels encode color information.
The L* channel was used as the input to the model, while the A* and B* channels served as the ground-truth outputs for supervised learning.
To ensure consistency, all images were resized to 256x256 pixels, and both the input L channel and the output AB channels were normalized to the range [-1, 1].
Random horizontal flipping was applied to the images before they were split into L and AB channels, to improve model generalization and robustness.
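
The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration rather than the project's actual pipeline: it assumes the image has already been resized and converted to Lab by an image library, and the scaling constants (L* in [0, 100], A* and B* within roughly [-110, 110]) are assumptions chosen to map the channels into [-1, 1]. The function names are hypothetical.

```python
import numpy as np

def random_hflip(img, rng, p=0.5):
    """Flip an H x W x C image left-right with probability p."""
    return img[:, ::-1, :] if rng.random() < p else img

def split_and_normalize_lab(lab):
    """Split an H x W x 3 Lab image into a model input (L) and target (AB).

    Assumed ranges: L* in [0, 100], A*/B* within about [-110, 110].
    """
    lab = np.asarray(lab, dtype=np.float32)
    L = lab[..., :1] / 50.0 - 1.0   # [0, 100]    -> [-1, 1]
    ab = lab[..., 1:] / 110.0       # [-110, 110] -> [-1, 1]
    return L, ab

def denormalize_lab(L, ab):
    """Invert the normalization and reassemble an H x W x 3 Lab image."""
    return np.concatenate([(L + 1.0) * 50.0, ab * 110.0], axis=-1)
```

The flip is applied to the full three-channel image so that the L input and the AB target stay aligned. `denormalize_lab` is the inverse step needed before converting a predicted image back to RGB.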

Model Architectures

CNN Encoder-Decoder: We started with a simple encoder-decoder CNN architecture [8] as our baseline. The encoder extracted latent grayscale features, while the decoder reconstructed the A* and B* color channels. This model served as a foundational benchmark for further comparisons.
- Encoder: This module captures spatial features by applying multiple convolution layers followed by downsampling operations such as max pooling. This reduces the spatial dimensions while extracting the important features.
- Latent Representation: The encoder outputs a compact feature representation that carries the most vital spatial and semantic information.
- Decoder: The decoder takes the encoder output and upsamples it with operations such as transposed convolutions or interpolation. It thereby increases the spatial dimensions and applies convolutions to refine the features into the final output.
- Output layer: A final convolution layer produces the desired output, a two-channel AB prediction. This output is concatenated with the L channel to produce a Lab image, which is then converted to an RGB image.

U-Net with ResNet-18 Backbone: The U-Net architecture [6], with a ResNet-18 backbone in the encoder, was the second phase of our approach. Pre-trained weights from ImageNet were loaded into the encoder to leverage transfer learning, enhancing the model's ability to extract meaningful features. The pre-trained U-Net was fine-tuned on our dataset for 20 epochs to adapt to the specific task of image colorization. The U-Net model was implemented with the DynamicUnet class in fastai [9].
- Encoder: As in the encoder-decoder model, the encoder in the U-Net captures important spatial features, reducing the image size while extracting those features. Here we replace the traditional encoder with a ResNet-18 architecture. ResNet-18 consists of convolutional layers and skip connections (in the form of residual blocks) that help it learn deeper representations without suffering from the vanishing gradient problem.
- Connection: The deepest encoder layer, which holds the most compact representation of the important spatial features, is connected to the decoder.
- Decoder: The decoder performs upsampling and recovers the image resolution by concatenating the upsampled features with the corresponding feature maps from the encoder.
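
The decoder's skip connection can be illustrated with a small NumPy sketch. This is not the fastai DynamicUnet code: nearest-neighbour upsampling, the channel-first layout, and the function names are assumptions made for illustration only.

```python
import numpy as np

def upsample_nearest_2x(x):
    """Double the spatial size of a C x H x W feature map by repeating pixels."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_skip_step(decoder_feat, encoder_feat):
    """Upsample the decoder features and concatenate the matching encoder features.

    Both inputs are C x H x W arrays; the encoder map must have twice the
    spatial size of the decoder map.
    """
    up = upsample_nearest_2x(decoder_feat)
    if up.shape[1:] != encoder_feat.shape[1:]:
        raise ValueError("spatial sizes of decoder and encoder features must match")
    # Concatenate along the channel axis; a real U-Net follows this
    # with convolutions that mix the two sources.
    return np.concatenate([up, encoder_feat], axis=0)
```

For example, an 8 x 4 x 4 decoder map combined with a 4 x 8 x 8 encoder map yields a 12 x 8 x 8 result, which gives later layers access to both coarse semantic features and fine spatial detail.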

Conditional GAN: The pre-trained U-Net was integrated as the generator in a Conditional GAN (cGAN) framework. cGANs were chosen because they condition the output (colorized image) on the input (grayscale image), enabling the generator to produce outputs that are contextually consistent with the input image. Our implementation is inspired by the pix2pix implementation [4].
- PatchGAN Discriminator: We used a PatchGAN discriminator, which evaluates the realism of small image patches rather than the entire image. This approach ensures local coherence, as the discriminator classifies each patch of the generated image as real or fake. PatchGAN was chosen because it is effective in preserving fine details and textures [5], which are critical for realistic image colorization.
- Generator: The cGAN uses the U-Net module as its generator. The U-Net produces images while preserving spatial detail through its skip connections, which enables accurate mappings from grayscale inputs to colorized outputs.
- Discriminator: This is a custom CNN module that uses convolutions to assess authenticity, differentiating between real and generated colorizations.
- Training: We trained the cGAN using BCE loss for 20 epochs. We again used the Adam optimizer for training the discriminator, with the decay rate set to 0.9.
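
The adversarial objective can be sketched in NumPy as below. This is a hedged illustration in the style of pix2pix [4], not the project's training code: the L1 weight `lambda_l1=100.0`, the factor of 0.5 in the discriminator loss, and all function names are assumptions. The logits are the per-patch outputs of the PatchGAN discriminator, averaged over all patches.

```python
import numpy as np

def bce_with_logits(logits, target):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    x = np.asarray(logits, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0.0) - x * target + np.log1p(np.exp(-np.abs(x)))))

def discriminator_loss(real_logits, fake_logits):
    """Real patches should be classified as 1, generated patches as 0."""
    return 0.5 * (bce_with_logits(real_logits, 1.0) + bce_with_logits(fake_logits, 0.0))

def generator_loss(fake_logits, fake_ab, real_ab, lambda_l1=100.0):
    """Adversarial term (fool the discriminator) plus a weighted L1 term."""
    adversarial = bce_with_logits(fake_logits, 1.0)
    l1 = float(np.mean(np.abs(np.asarray(fake_ab) - np.asarray(real_ab))))
    return adversarial + lambda_l1 * l1
```

With such a weighting, the generator loss is dominated by the L1 term and is on a different scale from the discriminator loss, so the two values are not directly comparable.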

Implementation Details

Dataset: A subset of the COCO dataset was used, consisting of 10,000 images. The dataset was split into 8,000 images for training and 2,000 for testing.
Training Setup: The CNN encoder-decoder was trained for 100 epochs as a baseline. The U-Net architecture, pre-trained on ImageNet, was fine-tuned on our dataset for 20 epochs. The Conditional GAN was trained by integrating this U-Net as the generator with the PatchGAN discriminator.
Pre-trained Models: The ImageNet pre-trained weights for the ResNet-18 backbone were used for pre-training the U-Net generator, to expedite training and to obtain better performance than a model trained from scratch.

Design Choices

Why Conditional GAN: Whereas GANs learn a generative model of the data [10], conditional GANs (cGANs) learn a conditional generative model [4]. Conditioning the output on the input ensures that the generated image aligns with the semantic content of the grayscale input. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image [4].
Why PatchGAN Discriminator: PatchGAN discriminates at the level of image patches, which is effective for ensuring local coherence in generated images. This is particularly useful for image colorization, where local color consistency is critical.

Experiments and Results

Experimental Setup

Datasets: A subset of the COCO dataset with 10,000 images. The training set consisted of 8,000 images, and the test set included 2,000 images.
Metrics: L1 Error was used to evaluate pixel-wise accuracy during training. Mean Absolute Error (MAE) and Learned Perceptual Image Patch Similarity (LPIPS) score (using latent features from pretrained VGG) [7] are used to quantitatively compare the model performances.
Hardware: The experiments were performed on Google Colab with the L4 GPU.
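
For reference, MAE can be computed as in the following minimal NumPy sketch. The function name is hypothetical, and the source does not state whether the error is measured on the normalized AB channels or on the reconstructed images; the sketch works for either. LPIPS depends on a pretrained network and is not reproduced here.

```python
import numpy as np

def mean_absolute_error(pred, target):
    """Average absolute difference over all pixels and channels."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.mean(np.abs(pred - target)))
```

This is the same quantity as the L1 training criterion, averaged over the evaluation set.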

Results

CNN Encoder-Decoder: Achieved baseline results with limited ability to capture complex features, showing noticeable artifacts in the colorized outputs. We trained the model using the L1 loss criterion with the learning rate set to 0.001, and obtained a validation loss of 0.0848 after 100 epochs.
U-Net with ResNet-18 Backbone: Improved performance compared to the baseline, with enhanced feature extraction and better contextual colorization. The U-Net was trained for 20 epochs with the learning rate set to 0.0001 and achieved an L1 loss of 0.07586.
Conditional GAN: Produced the most realistic and visually consistent outputs. The incorporation of adversarial loss with the PatchGAN discriminator significantly reduced artifacts and improved local color coherence. We achieved a discriminator loss of 0.59404 and a generator loss of 11.37 after training for 20 epochs.

Table 1: Results

Model	MAE	LPIPS
CNN Encoder-Decoder	0.0848	0.148
U-Net (ResNet-18 backbone)	0.0846	0.141
cGAN with U-Net	0.0807	0.132
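
To put the differences in Table 1 in perspective, the short script below computes each model's percentage reduction relative to the CNN baseline. It uses only the numbers reported in the table; the helper name is hypothetical.

```python
results = {
    "CNN Encoder-Decoder": {"MAE": 0.0848, "LPIPS": 0.148},
    "U-Net (ResNet-18 backbone)": {"MAE": 0.0846, "LPIPS": 0.141},
    "cGAN with U-Net": {"MAE": 0.0807, "LPIPS": 0.132},
}

def relative_improvement(baseline, value):
    """Percentage reduction relative to the baseline (lower is better for both metrics)."""
    return 100.0 * (baseline - value) / baseline

baseline = results["CNN Encoder-Decoder"]
improvements = {
    name: {metric: round(relative_improvement(baseline[metric], value), 1)
           for metric, value in scores.items()}
    for name, scores in results.items()
}
for name, gains in improvements.items():
    print(name, gains)
```

Relative to the baseline, the U-Net reduces MAE by about 0.2% and LPIPS by about 4.7%, while the cGAN reduces MAE by about 4.8% and LPIPS by about 10.8%. The perceptual metric therefore separates the models more clearly than the pixel-wise one.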

Qualitative Results

Outputs of each model on 5 sample images from the test dataset are shown below:
[Figure: colorized outputs of each model for 5 test images]

Parameter Analysis

Adjusting the patch size in the PatchGAN discriminator affected the realism of local features. Larger patches reduced sensitivity to finer details, while smaller patches improved local consistency at the expense of global structure.
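
The patch size of a PatchGAN discriminator is the receptive field of one output unit, which follows from the kernel sizes and strides of its convolution layers. The sketch below computes it. The layer list is the default 70x70 configuration described in pix2pix [4] and is an assumption here, since the exact layers used in this project are not listed; the function name is hypothetical.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of convolutions.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the field by its
    # kernel size after scaling by its stride.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed pix2pix-style discriminator: three stride-2 convs, then two stride-1 convs.
patchgan_default = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan_default))      # 70
print(receptive_field(patchgan_default[1:]))  # 34 (one stride-2 layer removed)
```

Removing or adding stride-2 layers is therefore one way to shrink or grow the patch that each discriminator output judges.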

Epoch: We chose the number of epochs to improve accuracy while making sure we did not overfit any model. For the U-Net we used 20 epochs, after which the results had converged with no further reduction in the losses. Similarly, for the cGAN we kept the number of epochs at 20 to limit overfitting in both the generator and the discriminator.
Learning Rate: The learning rate is a vital parameter in training neural networks because it controls how much the weights are adjusted at each step. We kept it at 0.01 to avoid irregular and unstable updates that would have destabilized the training.
Beta1: Beta1 is also an important parameter of the Adam optimizer: it controls how strongly past gradients are taken into account in the current weight update. We kept it at 0.5 to prevent rapid updates, which allowed for stable training, especially in the earlier stages.
Criterion: We used L1 loss as the training criterion because it enforces pixel-wise accuracy, which helps in generating realistic images. For the GAN we used BCE loss, which is a common choice: the discriminator learns to differentiate between real and fake images, and the generator is given the task of fooling the discriminator into classifying the generated images as real ones.
Batch Size: The batch size sets how many samples are processed before each weight update, and therefore how frequently the weights are updated. Since larger batch sizes require more memory and smaller ones lead to noisier gradients, we set the batch size to 32.
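
The role of Beta1 can be seen in a single Adam update, sketched below in NumPy. This is the standard Adam rule, not code from the project. `beta2=0.999` and `eps=1e-8` are the optimizer's usual defaults and are assumptions here, and the learning rate is left as a required argument.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

A gradient from k steps ago enters the running mean with weight (1 - beta1) * beta1^k. With beta1 = 0.5 that weight halves at every step, so the optimizer reacts quickly to the shifting GAN objective, whereas beta1 = 0.9 keeps a much longer memory of past gradients.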

Trends

The progression from traditional CNN-based encoder-decoder architectures to more advanced models, such as U-Nets and conditional GANs (cGANs), demonstrates notable advancements in image colorization quality. CNN-based encoder-decoders often produce desaturated, brownish tones, because minimizing L1 loss favors a conservative average over the plausible colors. U-Nets, equipped with a ResNet-18 backbone and skip connections, show improved colorization with greater semantic fidelity but still lean toward grayish hues. The integration of U-Nets as generators within a cGAN framework, coupled with a PatchGAN discriminator, significantly enhances colorization quality by leveraging adversarial training. Adversarial training gives the U-Net generator feedback on realism that the L1 loss alone does not provide, which is consistent with the cGAN outperforming both the baseline encoder-decoder and the standalone U-Net.

Conclusion

This report has described the design, implementation, and evaluation of various deep learning architectures for automatic image coloration. Starting with a baseline CNN encoder-decoder, progressing to a U-Net with a ResNet-18 backbone, and culminating with a Conditional GAN incorporating a PatchGAN discriminator, the study highlighted the strengths and limitations of each approach. The results demonstrated that increasing model complexity, along with the integration of adversarial loss, significantly improved the realism and coherence of the colorized outputs. The U-Net architecture with pre-trained ResNet-18 weights provided a strong foundation for feature extraction and contextual understanding, while the Conditional GAN produced the most visually realistic results by leveraging adversarial training and patch-level discrimination. These findings validated the effectiveness of combining local and global feature learning to achieve high-quality colorizations.

Future Improvements

To further enhance the approach, several improvements could be considered:

Data Augmentation: Expanding the dataset with more diverse images and applying advanced augmentation techniques could help the model generalize better to unseen images.
Loss Functions: Experimenting with perceptual loss or combining MSE with adversarial loss might improve the model’s ability to generate outputs that are both quantitatively and qualitatively superior.
Higher-Resolution Models: Incorporating super-resolution techniques or adapting the model for higher-resolution inputs could enhance the fine details in the colorized images.
Diffusion Models: Diffusion models are state-of-the-art image generation models that iteratively refine noisy or partially incomplete images to produce a high-quality output.
- Models like Denoising Diffusion Probabilistic Models (DDPMs) [11] can be used. They add noise to images in the forward process and learn to reverse this process during training. DDPM and Guided Diffusion [12] are frameworks that can be explored in the future.
- Another family of these models is Latent Diffusion Models (LDMs) [13], which perform the diffusion in a compressed latent space, making them more computationally efficient. Stable Diffusion, a widely used state-of-the-art (SOTA) model built on this approach, could be used to generate more realistic images.
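
As a rough illustration of the DDPM forward process mentioned above, the NumPy sketch below adds Gaussian noise to a clean image according to the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise from [11]. The linear schedule from 1e-4 to 0.02 over 1000 steps follows that paper. The function names are hypothetical, and how such a model would be conditioned on the grayscale input is left open here.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one shot; returns the noisy image and the noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # fraction of the original signal kept at each step
```

The denoising network is trained to predict `noise` from `x_t` and `t`. At the final step almost none of the original signal remains, so sampling can start from pure Gaussian noise.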

Overall, this study serves as a foundation for understanding and improving automatic image coloration, paving the way for more robust and application-specific advancements in the field. We believe that, going forward, it may be possible to build a SOTA model that takes any grayscale image and produces its colorized version, with potential applications such as image compression. Since a grayscale image stores only one channel instead of three, reliable colorization could substantially reduce storage requirements.

References

[1] R. Zhang, P. Isola, and A. A. Efros, "Colorful Image Colorization," Proceedings of the European Conference on Computer Vision (ECCV), pp. 649–666, 2016. [Online]. Available: https://arxiv.org/abs/1603.08511
[2] Y. Wu, P. Zhang, and C. Zhang, "Generative Colorization for Diverse Images," arXiv preprint arXiv:2108.08826, 2021. [Online]. Available: https://arxiv.org/abs/2108.08826
[3] A. Deshpande, J. Lu, M. Yeh, M. Jin, and D. Forsyth, "Learning Diverse Image Colorization," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6837–6845, 2017. [Online]. Available: https://arxiv.org/abs/1612.01958
[4] Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[5] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016
[6] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015
[7] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
[8] basu369victor. “Image Colorization Basic Implementation with CNN.” Kaggle, 28 Mar. 2020, www.kaggle.com/code/basu369victor/image-colorization-basic-implementation-with-cnn. Accessed 15 Nov. 2024.
[9] Howard J, Gugger S. Fastai: A Layered API for Deep Learning. Information. 2020; 11(2):108. https://doi.org/10.3390/info11020108
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[11] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.
[12] Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in neural information processing systems 34 (2021): 8780-8794.
[13] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.