ControlNet: Conditional Control to Text-to-Image Diffusion Models

Written on . Posted in Stable Diffusion API.

Stable Diffusion has changed the way images are generated with AI. It quickly became popular, and many models have been built on it by fine-tuning Stable Diffusion on custom images with the aim of producing realistic, defect-free images. However, fine-tuning alone has had limited success when we want images in a specific style or pose. The generated images often contain small defects, and results are inconsistent from one generation to the next. ControlNet is a newer technique for getting more controlled and accurate images.


ControlNet lets us control diffusion models with inputs such as sketches, depth maps, human poses, or Canny edges, so that images are generated in a consistent style from an input image and a prompt. In this article, we discuss ControlNet and the different inputs it can use to generate images.

What is ControlNet?

ControlNet is a neural network architecture that adds task-specific conditions to pre-trained image diffusion models. The weights of each neural network block are copied into a ‘locked’ copy and a ‘trainable’ copy. The ‘locked’ copy preserves the original model, while the ‘trainable’ copy learns the condition. A ControlNet block looks like the one below:

[Figure: ControlNet]
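To make the ‘locked’ and ‘trainable’ copies more concrete, here is a minimal sketch in plain Python. It is a toy, not the actual ControlNet implementation: `LinearBlock`, `zero_block`, `make_controlnet_block`, and `controlled_forward` are hypothetical names, and a small matrix multiplication stands in for a real network block. The zero-initialized connection layers follow the ControlNet paper (where they are called zero convolutions) and are not described in this article.

```python
import copy


class LinearBlock:
    """Toy stand-in for one network block: computes y = W x (no bias)."""

    def __init__(self, weights, trainable=True):
        self.weights = [list(row) for row in weights]
        # A real framework would freeze gradients; here we only keep a flag.
        self.trainable = trainable

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def zero_block(size):
    """Connection layer whose weights all start at zero (hypothetical helper)."""
    return LinearBlock([[0.0] * size for _ in range(size)])


def make_controlnet_block(pretrained):
    """Split a pretrained block into a locked copy and a trainable copy."""
    locked = copy.deepcopy(pretrained)
    locked.trainable = False      # preserves the original model
    trainable = copy.deepcopy(pretrained)
    trainable.trainable = True    # learns the condition
    size = len(pretrained.weights)
    return locked, trainable, zero_block(size), zero_block(size)


def controlled_forward(locked, trainable, zero_in, zero_out, x, condition):
    """Locked output plus the trainable branch, joined through zero layers."""
    base = locked(x)
    conditioned_input = [a + b for a, b in zip(x, zero_in(condition))]
    control = zero_out(trainable(conditioned_input))
    return [b + c for b, c in zip(base, control)]


pretrained = LinearBlock([[1.0, 2.0], [0.5, -1.0]])
locked, trainable, zero_in, zero_out = make_controlnet_block(pretrained)
x, condition = [1.0, 2.0], [3.0, 4.0]
output = controlled_forward(locked, trainable, zero_in, zero_out, x, condition)
# Before any training, the zero layers cancel the control branch,
# so the output matches the original pretrained block.
print(output == pretrained(x))
```

Because the connection layers start at zero, adding the trainable branch does not change the model's behaviour until training begins. That is one way to read the claim that the ‘locked’ copy preserves the model.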

The ControlNet structure described above is applied to the U-Net in Stable Diffusion for controlled image generation. It clones the blocks of Stable Diffusion into a ‘locked’ copy and a ‘trainable’ copy. The ‘locked’ copy preserves what the model has already learned, whereas the ‘trainable’ copy is trained on different types of conditioning (Canny edges, Hough lines, depth maps, scribbles, etc.). The ControlNet in Stable Diffusion looks like the one below.

Different kinds of Conditioning in ControlNet

The ‘trainable’ copy in a ControlNet can be trained on several types of conditioning:

  • Canny Edge
  • Hough Lines
  • Scribble
  • HED Edge
  • Segmentation Map
  • Human Pose
  • Normal Maps
  • Depth Maps
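Each conditioning type in the list above is an image derived from the input image. As a rough illustration of what an edge-based condition looks like, the sketch below builds a binary edge map by thresholding simple pixel differences. This is a deliberately simplified stand-in, not the Canny algorithm, which also smooths the image, thins edges, and applies hysteresis thresholding. The function name `edge_map` and the `threshold` default are assumptions made for this example.

```python
def edge_map(gray, threshold=0.5):
    """Return a binary edge map (1 = edge) from a 2D list of grayscale values."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Forward differences to the right and downward, clamped at the border.
            right = gray[y][min(x + 1, width - 1)]
            down = gray[min(y + 1, height - 1)][x]
            gx = right - gray[y][x]
            gy = down - gray[y][x]
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[y][x] = 1 if magnitude > threshold else 0
    return edges


# A tiny 4x4 "image": dark on the left half, bright on the right half.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
for row in edge_map(image):
    print(row)
```

The resulting map marks only the column where the brightness changes. An edge image of this kind, produced by a proper edge detector, is what would be passed to the model as the condition alongside the text prompt.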

Some results of using these types of conditioning with Stable Diffusion are shown below.

Controlling Stable Diffusion with Canny Edges

Controlling Stable Diffusion with Hough Lines

Controlling Stable Diffusion with User Scribbles (human-made sketches)

Controlling Stable Diffusion with HED Edge

Controlling Stable Diffusion with Human Pose

If the diffusion process is masked, these models can be used for pen-based image editing. When the object to be generated is simple, the model can control the details of the image accurately. One limitation of ControlNet with Stable Diffusion is that if the semantic interpretation of the input image is wrong, the model may have difficulty generating accurate images.
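The article does not say how the diffusion process is masked. One common approach in diffusion-based editing, assumed here purely for illustration, is to blend two images through a mask: pixels inside the mask come from the newly generated image, and pixels outside it are kept from the original. The helper `masked_blend` below is a hypothetical name, and the nested lists stand in for image arrays.

```python
def masked_blend(original, generated, mask):
    """Keep `original` where mask is 0 and take `generated` where mask is 1."""
    return [
        [m * g + (1 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[1, 2], [3, 4]]
generated = [[9, 9], [9, 9]]
mask = [[0, 1], [1, 0]]  # 1 marks the region drawn over with the pen
edited = masked_blend(original, generated, mask)
print(edited)
```

In an actual pipeline a blend like this would typically be applied during denoising rather than once on finished images, so that the edited region stays consistent with its surroundings.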


In conclusion, ControlNet lets us generate more accurate images by conditioning on different representations of the input image. We can use it to generate images with accurate human poses while preserving spatial consistency. It gives us better control over the image generation process with Stable Diffusion.