
ComfyUI - An Image GenAI Tool

  • jacketzkt
  • Mar 24
  • 4 min read

Updated: Mar 25



A Mona Lisa Made of My Face - Generated by ComfyUI


The Rise of AI-Generated Images


Back in 2023, the realism of AI-generated image tools like MidJourney, DALL·E, and Stable Diffusion took the world by storm. I remember my first experience using these tools - to get the background image I wanted, I generated a Japanese anime character with one text-to-image prompt, used another prompt to create sakura trees in a park, and then blended the two into one image with my conventional design tool (Affinity Photo).


Background Created with AI

While AI image generation was revolutionary, I quickly noticed its limitations. The tools were powerful, enabling me to create images I could never draw from scratch, yet they lacked precise control for fine-tuning results. That experience led me to wonder: What should an AI image generation tool truly offer?


As I kept an eye on new AI design tools, ComfyUI emerged - providing a more flexible and user-controlled approach to generative AI.

ComfyUI

What is ComfyUI?


Prompt-based GenAI tools like MidJourney and Stable Diffusion are widely popular, but they offer limited user control over the generation process. ComfyUI takes a different approach by offering a node-based UI, allowing users to build their own customized AI workflows. Having used Blender for modeling and rendering, I was already familiar with this kind of node graph, where modular nodes are wired together through inputs and outputs.



A Deeper Dive into ComfyUI


ComfyUI provides a visual interface where users can manipulate and configure the key components of a GenAI model, effectively creating a personalized AI image generation pipeline with one or multiple processes and one or multiple models. With a little bit of technical knowledge, I began exploring ComfyUI, searching for and testing a variety of models from Hugging Face. Along the way, I discovered some insights that were not covered in the ComfyUI documentation.


Key Components of ComfyUI
  1. Model - The core of image generation. Some models bundle all components into a single safetensors file, while others separate them out, leaving only the U-Net model, which is responsible for generating images. For example, below is a node that loads the Stable Diffusion 3.5 Large model (this model doesn't include the CLIP encoder).

    Model Loader
  2. CLIP Encoder - Converts text prompts into a format the model understands. If the model bundles a CLIP encoder, the model loader's "CLIP" output is what processes the text prompts. Since the official Stable Diffusion 3.5 model mentioned above doesn't bundle one, a separate CLIP encoder node has to be loaded. (Download t5xxl_fp8_e4m3fn.safetensors from the Stable Diffusion repository on Hugging Face. Note that macOS can't handle the t5xxl_fp16.safetensors CLIP encoder.)


    CLIP Encoder Node for the SD3.5 Model Quoted Above

  3. VAE (Variational Autoencoder) - Decodes the latent image representation into the final, human-readable, high-quality image or video. If the model doesn't bundle a VAE, a separate VAE node has to be loaded before an image can be generated.


    VAE Node


With these modular components, ComfyUI offers deeper, more granular control, letting users refine and adjust each stage of the AI image generation process. Thanks to its node-based structure, ComfyUI lets users connect multiple model components in a modular fashion to achieve the desired outcome. (Much like neurons in an AI architecture, this modular approach is a key element of AI design.)
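
To make the wiring between these components more concrete, here is a minimal sketch of a basic text-to-image graph in ComfyUI's API-format JSON, written as a Python dict. The node class names (CheckpointLoaderSimple, CLIPTextEncode, KSampler, EmptyLatentImage, VAEDecode, SaveImage) are ComfyUI built-ins, but the checkpoint filename is a placeholder and exact input fields may vary between ComfyUI versions; for a model like SD3.5 that ships without a CLIP encoder, the "clip" input would come from a separate CLIP loader node instead, as described above.

# Minimal text-to-image graph in ComfyUI's API format, sketched as a Python dict.
# Each key is a node id; connections are written as [source_node_id, output_index].
workflow = {
    "1": {  # Model: loads a bundled checkpoint (outputs MODEL, CLIP, VAE)
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "my_checkpoint.safetensors"},  # placeholder filename
    },
    "2": {  # CLIP Encoder: positive prompt -> conditioning
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "sakura trees in a park, anime style", "clip": ["1", 1]},
    },
    "3": {  # CLIP Encoder: negative prompt
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "blurry, low quality", "clip": ["1", 1]},
    },
    "4": {  # Empty latent canvas for the sampler to denoise into
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1},
    },
    "5": {  # Sampler: runs the diffusion process in latent space
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
            "latent_image": ["4", 0], "seed": 42, "steps": 20, "cfg": 5.0,
            "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0,
        },
    },
    "6": {  # VAE: decodes the latent into a viewable image
        "class_type": "VAEDecode",
        "inputs": {"samples": ["5", 0], "vae": ["1", 2]},
    },
    "7": {  # Writes the decoded image to ComfyUI's output folder
        "class_type": "SaveImage",
        "inputs": {"images": ["6", 0], "filename_prefix": "example"},
    },
}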


Here is an example of how I generated the Mona Lisa image shown at the beginning from one photo of my face and one photo of the Mona Lisa. It makes use of two models, IPAdapter and Stable Diffusion: IPAdapter preserves the style of the two source photos, and Stable Diffusion generates the final image.


Use a photo of my face and a photo of the Mona Lisa to generate a Mona Lisa of me 😁

Here is the workflow as a JSON file. To reuse it, make sure you have downloaded the IPAdapter and Stable Diffusion models, along with each model's required files, for ComfyUI.
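
If you prefer to queue the workflow from a script instead of the browser, ComfyUI's local server accepts API-format workflow JSON on its /prompt endpoint. Below is a rough sketch that assumes the server is running at the default 127.0.0.1:8188 and that the workflow was exported in API format ("Save (API Format)"); the filename is a placeholder for the file attached above.

import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"  # default address of a local ComfyUI server

def queue_workflow(path):
    # Load an API-format workflow (a dict of node_id -> {"class_type", "inputs"}).
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)

    # POST it to /prompt; ComfyUI queues the job and returns a prompt_id.
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    # "monalisa_ipadapter.json" is a placeholder name for the exported workflow file.
    print(queue_workflow("monalisa_ipadapter.json"))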





Thoughts


Interaction Pattern

ComfyUI introduces a new interaction paradigm for GenAI tools, moving away from the dialog-based interface. This approach opens up more possibilities but still has room for improvement. Specifically, it lacks a more direct way to manipulate images or video. Additionally, requiring users to download and manage models and to build workflows from scratch may be intimidating for non-professional users, making it less user-friendly.


Moreover, since AI models generate images based on text prompts, the interaction is inherently intention-based, meaning users need to describe what they want in words. While this works well for some, professionals who know the specific actions required to modify an image might find it more convenient if the system could accept those commands directly, rather than relying solely on text descriptions of the desired output.


AI Models & Hardware

AI models are emerging and evolving rapidly in 2025. Yet each model has its own strengths, which come from its pre-training and fine-tuning. It takes quite a bit of time to search for and test each model before you figure out which one best fits your purpose.


Besides, these AI models are largely optimized for NVIDIA CUDA. When running ComfyUI and the models on my M1 MacBook Pro (16GB of unified memory shared by CPU and GPU), both processing the text prompt and generating the image take much more time. In the worst case, a model can't run on macOS at all, due to limited GPU memory or incompatibility with Metal Performance Shaders (MPS), even though Apple provides a way to convert models to be MPS compatible. For optimal performance at present, the options are either running the models locally on a high-spec PC with a dedicated GPU or deploying them in the cloud.
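
As a quick sanity check before loading a large model, you can ask PyTorch (which ComfyUI is built on) which backend is actually available on your machine. The snippet below is a simple sketch of that check; on my M1 MacBook Pro it would pick MPS, while a PC with an NVIDIA card would pick CUDA.

import torch

def pick_device():
    # Prefer NVIDIA CUDA, then Apple's Metal Performance Shaders (MPS), then CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        # Some operators are still missing on MPS; setting the environment variable
        # PYTORCH_ENABLE_MPS_FALLBACK=1 lets those ops fall back to the CPU.
        return torch.device("mps")
    return torch.device("cpu")

print("Running on:", pick_device())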


The widespread adoption of AI models in design and industry will still take patience and further progress.

 
 