FaceChain-FACT: Open-source 10-second portrait generation that reuses massive LoRA styles, a base-model-friendly portrait application.

Community Article Published May 31, 2024

GitHub open-source repository: https://github.com/modelscope/facechain

ModelScope Space demo: https://modelscope.cn/studios/CVstudio/FaceChain-FACT/summary

Experimental Results

The code and model checkpoints of FaceChain FACT are now open-sourced on both GitHub and ModelScope Space. FaceChain FACT features a simple interactive interface: with just one character image and a few clicks, it can generate portraits in infinite styles or from fixed templates. FaceChain FACT also supports more advanced functions, including specifying human poses, customizing LoRA style models, and multi-person template portraits. Specific examples are as follows:

Infinite Style Portrait:

Choose a style and upload a character image, and a portrait in the corresponding style is generated. The operation interface is as follows:

The generation result is as follows:

Specifying Human Pose:

Building upon Infinite Style Portrait, upload a pose reference image in the advanced options. The operation interface is as follows:

The generation result is as follows:

Fixed Templates Portrait:

Upload a template image and a character image, and select the number of the face to be redrawn; the corresponding portrait is then generated. The operation interface is as follows:

The generation result is as follows:

Multi-person Templates Portrait:

Building upon Fixed Templates Portrait, different faces in a multi-person template are processed according to their face numbers. The operation interface is as follows:

The generation result is as follows:

Robust Portrait Generation Cases:

Compared with the training-based FaceChain, the portrait generation experience of FaceChain FACT has made a qualitative leap.

1. Generation speed: FaceChain FACT eliminates the lengthy and cumbersome training phase, cutting the time to generate a custom portrait from about 5 minutes to about 10 seconds and giving users a much smoother experience.

2. Generation quality: FaceChain FACT further improves how faithfully the facial ID is preserved while keeping realistic, high-quality portrait textures. It remains highly compatible with FaceChain's massive collection of exquisite styles and with extensions such as pose control, and it can accurately decouple character ID information even from low-quality inputs such as face images with undesirable lighting or exaggerated expressions, ensuring the generated portraits retain strong artistic expressiveness.

Undesirable Lighting:


Exaggerated Expressions:


Methodology

The capability of AI portrait generation comes from large generative models such as Stable Diffusion and their fine-tuning techniques. Thanks to the strong generalization ability of large models, downstream tasks can be handled by fine-tuning on task-specific data while preserving the model's overall text-following and image-generation ability. The technical foundations of train-based and train-free AI portrait generation correspond to applying different fine-tuning tasks to the generative model. Currently, most existing AI portrait tools adopt a two-stage "train then generate" pipeline, where the fine-tuning task is "generate portrait photos of a fixed character ID" and the training data are multiple images of that fixed character ID. The effectiveness of such a train-based pipeline depends on the scale of the training data, so it requires a certain amount of image data and training time, which also raises the cost for users.
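To make the contrast concrete, here is a minimal pseudocode-style sketch of the two pipelines. All function names (finetune_lora, encode_face, generate) are hypothetical placeholders for illustration, not an actual FaceChain or diffusers API.

```python
# Pseudocode-style sketch; `finetune_lora`, `encode_face`, and `generate` are
# hypothetical placeholders standing in for real training / inference code.

def finetune_lora(base_model, images):
    """Placeholder: per-user LoRA fine-tuning (takes minutes and needs several photos)."""
    raise NotImplementedError

def encode_face(photo):
    """Placeholder: frozen image encoder + lightweight feature projection."""
    raise NotImplementedError

def generate(base_model, prompt, lora=None, face_condition=None):
    """Placeholder: Stable Diffusion sampling with an optional LoRA or face condition."""
    raise NotImplementedError

def train_based_portrait(user_photos, prompt):
    # Two-stage "train then generate": fine-tune on multiple photos of one person,
    # then sample from the personalized model.
    lora = finetune_lora("stable-diffusion", user_photos)
    return generate("stable-diffusion", prompt, lora=lora)

def train_free_portrait(user_photo, prompt):
    # Train-free: a single face photo becomes an extra condition at inference time;
    # the adapter was fine-tuned offline once, so no per-user training is needed.
    return generate("stable-diffusion + face adapter", prompt,
                    face_condition=encode_face(user_photo))
```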

Different from the train-based pipeline, the train-free pipeline changes the fine-tuning task to "generate portrait photos of a specified character ID": the character ID image (a face photo) is used as an additional input, and the output is a portrait photo preserving the input ID. Such a pipeline completely separates offline training from online inference, allowing users to generate portraits directly with the fine-tuned model from a single photo in about 10 seconds, avoiding the cost of extensive data collection and training time. The fine-tuning task of train-free AI portrait generation is built on an adapter module, whose basic structure is as follows. Face photos are processed by an image encoder with fixed weights and a parameter-efficient feature projection layer to obtain aligned features, which are then fed into the U-Net of Stable Diffusion through an attention mechanism, in the same way as text conditions. The face information thus enters the model as an independent conditioning branch alongside the text information during inference, enabling the generated images to maintain ID fidelity.
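A minimal PyTorch-style sketch of such a face adapter is shown below. It only illustrates the data flow (frozen face features, lightweight projection, an extra cross-attention into a U-Net block); the module names, dimensions, and the residual form are assumptions for illustration, not FaceChain FACT's actual implementation.

```python
import torch
import torch.nn as nn

class FaceAdapter(nn.Module):
    """Illustrative face adapter: projects frozen face features into a few tokens
    and injects them into a U-Net block via an extra cross-attention layer."""

    def __init__(self, face_feat_dim=512, unet_dim=768, num_tokens=16):
        super().__init__()
        # Parameter-efficient projection: map face-encoder features to a small
        # set of tokens living in the U-Net's cross-attention space.
        self.proj = nn.Linear(face_feat_dim, unet_dim * num_tokens)
        self.num_tokens = num_tokens
        self.unet_dim = unet_dim
        # Extra cross-attention: U-Net hidden states attend to the face tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim=unet_dim, num_heads=8,
                                                batch_first=True)

    def forward(self, unet_hidden, face_feat, scale=1.0):
        # face_feat: (B, face_feat_dim) from a frozen face/image encoder
        face_tokens = self.proj(face_feat).view(-1, self.num_tokens, self.unet_dim)
        # unet_hidden: (B, L, unet_dim) hidden states inside a U-Net attention block
        attn_out, _ = self.cross_attn(unet_hidden, face_tokens, face_tokens)
        # The adapter output is added as a residual; `scale` plays the role of the
        # user-adjustable face adapter weight mentioned later in the article.
        return unet_hidden + scale * attn_out
```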

The basic face-adapter algorithm is capable of train-free AI portrait generation, but it still needs adjustments to work well in practice. Existing train-free portrait tools generally suffer from the following issues: poor image quality, inadequate text-following and style-retention ability, poor controllability and richness of the generated faces, and poor compatibility with extensions such as ControlNet and style LoRAs. FaceChain attributes these issues to the fact that the fine-tuning tasks of existing train-free AI portrait tools are coupled with too much information beyond the character ID, and proposes FaceChain Face Adapter with Decoupled Training (FaceChain FACT) to solve them. By fine-tuning the Stable Diffusion model on millions of portrait images, FaceChain FACT achieves high-quality portrait generation for a specified character ID. The overall framework of FaceChain FACT is shown in the figure below.

The decoupled training of FaceChain FACT consists of two parts: decoupling the face from the image, and decoupling the ID from the face. Existing methods often use denoising of whole portrait images as the fine-tuning task, which makes it hard for the model to focus accurately on the face area and thereby harms the text-to-image ability of the base Stable Diffusion model. FaceChain FACT draws on the sequential processing and regional control of face-swapping algorithms and implements the face-from-image decoupling in both the structure and the training strategy. Structurally, unlike existing methods that use a parallel cross-attention mechanism to process face and text information, FaceChain FACT adopts a sequential approach: an independent adapter layer inserted into the original Stable Diffusion blocks. Face adaptation thus acts as an independent, face-swapping-like step in the denoising process, avoiding interference between the face and text conditions. In terms of training strategy, besides the original MSE loss, FaceChain FACT introduces the Face Adapting Incremental Regularization (FAIR) loss, which constrains the feature increment produced by the face adaptation step in the adapter layer so that it concentrates on the face region. During inference, users can flexibly adjust the generated results by changing the weight of the face adapter, balancing face fidelity and generalization while maintaining the text-to-image ability of Stable Diffusion.
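The exact FAIR formulation is given in the original article's figure, which is not reproduced here. The snippet below is only a rough sketch of the idea under stated assumptions: the adapter's residual increment (the `scale * attn_out` term in the sketch above) is regularized, here by penalizing it outside a face-region mask, on top of the usual diffusion MSE loss. Names such as `fair_weight` and `face_mask` are illustrative, not FaceChain FACT's actual formulation.

```python
import torch
import torch.nn.functional as F

def training_loss(noise_pred, noise_target, adapter_increment, face_mask, fair_weight=0.1):
    """Illustrative training objective: standard diffusion MSE loss plus a
    FAIR-style regularizer on the face adapter's feature increment.

    adapter_increment: residual added by the adapter layer.
    face_mask: mask that is 1 inside the face region and 0 outside, assumed to be
               resized/broadcast to the increment's layout (an illustrative assumption).
    """
    # Standard denoising loss used when fine-tuning Stable Diffusion.
    mse_loss = F.mse_loss(noise_pred, noise_target)

    # FAIR-style regularization (illustrative): keep the adapter's increment small
    # outside the face region so the adaptation step focuses on the face.
    off_face_increment = adapter_increment * (1.0 - face_mask)
    fair_loss = off_face_increment.pow(2).mean()

    return mse_loss + fair_weight * fair_loss
```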

Furthermore, to address the poor controllability and richness of generated faces, FaceChain FACT proposes a training method for decoupling the ID from the face, so that the portrait process preserves only the character ID rather than the entire face. First, to better extract ID information from the face while retaining certain key facial details, and to better fit the structure of Stable Diffusion, FaceChain FACT employs a Transformer-based face feature extractor pre-trained on a large-scale face dataset. All tokens from its penultimate layer are then fed into a simple attention-query model for feature projection, ensuring that the extracted ID features meet the above requirements. In addition, during training FaceChain FACT uses the Classifier-Free Guidance (CFG) method to randomly shuffle and drop different portrait images of the same ID, so that the input face image and the denoising target image may show different faces with the same ID, which further prevents the model from overfitting to non-ID information of the face.
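As a rough illustration of this sampling strategy, the sketch below pairs each training target with a randomly chosen face image of the same ID and occasionally drops the face condition entirely for classifier-free guidance. The dataset layout and the drop probability are assumptions for illustration, not FaceChain FACT's actual settings.

```python
import random

def sample_training_pair(images_by_id, drop_prob=0.1):
    """Illustrative data sampling for decoupling ID from face.

    images_by_id: dict mapping a character ID to a list of its portrait images
                  (an assumed layout). Returns (face_condition_image, target_image).
    """
    person_id = random.choice(list(images_by_id.keys()))
    images = images_by_id[person_id]

    # Shuffle within the same ID: the denoising target and the face condition are
    # drawn independently, so the model cannot copy non-ID details of one photo.
    target = random.choice(images)
    face_condition = random.choice(images)

    # Classifier-free guidance: randomly drop the face condition so the model
    # also learns the unconditional branch.
    if random.random() < drop_prob:
        face_condition = None

    return face_condition, target
```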

Expansion & Co-construction

• Whole-body portrait generation
• Adapter for SDXL
• Acceleration for inference
• Various styles construction
• Human-centric video generation