UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

1 University of Maryland, College Park; 2 Qualcomm Technologies, Inc.
CVPR 2026

Teaser

Multi-concept customization with UniVerse. Given a set of reference images and their corresponding text descriptions, our method extracts the relevant visual concepts and synthesizes new images by composing them, without requiring expensive model finetuning or segmentation. Our approach effectively extracts concepts from objects with partial occlusion or abstract styles, and reliably preserves the distinct identities present in the reference images.

Compose and Decompose Objects

Abstract Concept Composition

Multiple subjects

Framework

Our proposed UniVerse framework generates personalized images from in-the-wild reference images. (a) Inference: The Reference Condition Extractor (RCE) extracts both visual and textual references. The two features are extracted with CLIP and T5 encoders, with additional modules that adapt them to the DiT blocks. The textual reference includes a shared vector Δ̃_s that modulates all DiT blocks and a set of block-wise vectors {Δ̃_j}, j = 1, ..., N. The visual condition z_ref is used as an additional latent to deeply control the generated images. (b) Stage 1 - RCE Pretraining: A segmentation head is added to facilitate training on a large-scale dataset. Binary cross-entropy (BCE) is used as the segmentation loss. (c) Stage 2 - Finetuning: The FiLM module continues to be finetuned together with the other blocks on a multi-concept dataset. Here, LoRA is added to the DiT, and the whole model is trained with the diffusion loss L_diff.
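
The caption names three standard building blocks: FiLM-style modulation, a shared vector combined with block-wise vectors, and a BCE segmentation loss. The sketch below illustrates each one in plain Python on toy lists. It is not the authors' implementation. The function names (`film`, `block_modulation`, `bce_loss`) are hypothetical, and combining the shared vector with each block-wise vector by addition is an assumption made only for illustration, since the caption does not say how the two are merged.

```python
import math
from typing import List, Sequence


def film(features: Sequence[float],
         gamma: Sequence[float],
         beta: Sequence[float]) -> List[float]:
    """Generic FiLM: per-channel affine modulation, gamma * x + beta."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def block_modulation(shared: Sequence[float],
                     per_block: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build one modulation vector per block.

    Assumption (not stated in the source): block j receives the shared
    vector plus its own block-wise vector, combined by addition.
    """
    return [[s + d for s, d in zip(shared, delta_j)] for delta_j in per_block]


def bce_loss(probs: Sequence[float],
             targets: Sequence[int],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted mask probabilities
    and binary ground-truth mask values."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


if __name__ == "__main__":
    # Toy example: 2 channels, N = 2 blocks.
    shared = [1.0, 2.0]
    per_block = [[0.5, 0.5], [0.0, -1.0]]
    print(block_modulation(shared, per_block))
    print(film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
    print(bce_loss([0.9, 0.2, 0.7], [1, 0, 1]))
```

In a real model these operations would act on tensors (for example in PyTorch) with learned projections producing `gamma`, `beta`, and the modulation vectors; the list-based version only shows the arithmetic.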

Ablation Study

Comparison with other methods
