filename     : Boz24a.pdf
entry        : inproceedings
conference   : SIGGRAPH 2024, Denver, United States, 28 July - 1 August, 2024
pages        : 94:1-94:11
year         : 2024
month        : July
title        : Versatile Vision Foundation Model for Image and Video Colorization
subtitle     :
author       : Vukasin Bozic, Abdelaziz Djelouah, Yang Zhang, Radu Timofte, Markus Gross, Christopher Schroers
booktitle    : 
ISSN/ISBN    :
editor       : 
publisher    : Association for Computing Machinery
publ.place   :
volume       :
issue        :
language     : English
keywords     : Colorization, Image and Video Colorization, Image Restoration
abstract     : Image and video colorization are among the most common problems in image restoration. This is an ill-posed problem, and a wide variety of methods have been proposed, ranging from traditional computer vision strategies to recent developments with transformer-based or generative neural network models. In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be fine-tuned for image colorization and provide a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results; user-guided colorization through color hints, text prompts, or a reference image; and finally video colorization. Some works have already investigated using diffusion models for colorization; however, the proposed solutions are often more complex and require training a side model to guide the denoising process (à la ControlNet). Not only does this approach increase the number of parameters and compute time, it also results in suboptimal colorization, as we show. Our evaluation demonstrates that our model is the only approach offering wide flexibility while matching or outperforming existing methods specialized in each sub-task, by proposing a group of universal, architecture-agnostic mechanisms that could be applied to any pre-trained diffusion model.