Multi-concept customization with UniVerse. Given a set of reference images and their corresponding text descriptions, our method extracts the relevant visual concepts and synthesizes new images by composing them, without requiring expensive model finetuning or segmentation. Our approach extracts concepts even from objects with partial occlusion or abstract styles, and preserves the distinct identities present in the reference images.
Our proposed UniVerse framework for generating personalized images from in-the-wild reference images. (a) Inference: The Reference Condition Extractor (RCE) extracts both visual and textual references. The two features are extracted by CLIP and T5 encoders, with additional modules that adapt them to the DiT blocks. The textual reference comprises a shared vector $\tilde{\Delta}_s$ that modulates all DiT blocks and a set of block-wise vectors $\{\tilde{\Delta}_j\}_{j=1}^{N}$. The visual condition $z_{\text{ref}}$ is used as an additional latent to deeply control the generated images. (b) Stage 1 - RCE Pretraining: A segmentation head is added to facilitate training on a large-scale dataset, with binary cross-entropy (BCE) as the segmentation loss. (c) Stage 2 - Finetuning: FiLM is further finetuned together with the other blocks on a multi-concept dataset. Here, LoRA is added to the DiT, and training uses the diffusion loss $L_{\text{diff}}$.
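To make the Stage 1 segmentation objective concrete, the following is a minimal sketch of a per-pixel binary cross-entropy loss averaged over a mask. It illustrates the standard BCE definition only; the function name `bce_segmentation_loss`, the clamping constant `eps`, and the toy values are assumptions for illustration and are not taken from the UniVerse implementation.

```python
import math


def bce_segmentation_loss(pred_probs, target_mask, eps=1e-7):
    """Mean binary cross-entropy over all pixels of a mask.

    pred_probs: 2D list of predicted foreground probabilities in [0, 1].
    target_mask: 2D list of ground-truth labels (0 or 1), same shape.
    eps: hypothetical clamping constant that keeps log() finite.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred_probs, target_mask):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2x2 example with made-up values (not from the paper).
pred = [[0.9, 0.2], [0.6, 0.1]]
mask = [[1, 0], [1, 0]]
loss = bce_segmentation_loss(pred, mask)
print(f"BCE loss: {loss:.4f}")
```

In practice the segmentation head would output logits on a tensor, and a framework-provided, numerically stable BCE would typically replace this explicit loop.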