UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

Quynh Phung¹ Sandesh Ghimire² Minsi Hu¹ Chung-Chi Tsai² Jia-Bin Huang¹

¹University of Maryland, College Park ²Qualcomm Technologies, Inc.

CVPR 2026

Teaser

Multi-concept customization with UniVerse. Given a set of reference images and their corresponding text descriptions, our method extracts the relevant visual concepts and synthesizes new images by composing them, without requiring expensive model finetuning or segmentation. Our approach effectively extracts concepts from objects with partial occlusion or abstract styles, and reliably preserves the distinct identities present in the reference images.

Compose and Decompose Objects

Abstract Concept Composition

Multiple subjects

Framework

Our proposed UniVerse framework generates personalized images from in-the-wild reference images. (a) Inference: The Reference Condition Extractor (RCE) extracts both visual and textual references. The two features are extracted with CLIP and T5 encoders, followed by additional modules that adapt them to the DiT blocks. The textual reference consists of a shared vector Δ̃^s, which modulates all DiT blocks, and a set of block-wise vectors {Δ̃^j}_{j=1}^{N}. The visual condition z_ref is used as an additional latent to exert deeper control over the generated images. (b) Stage 1 - RCE Pretraining: A segmentation head is added to facilitate training on a large-scale dataset. Binary cross-entropy (BCE) is used as the segmentation loss. (c) Stage 2 - Finetuning: The FiLM module continues to be finetuned together with the other blocks on a multi-concept dataset. Here, LoRA is added to the DiT, and the whole pipeline is trained with the diffusion loss L_diff.
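For readers unfamiliar with two of the building blocks named in the caption, the following is a minimal, illustrative sketch of generic FiLM (feature-wise scale and shift) and of a mean binary cross-entropy loss over a predicted mask. It is not the UniVerse implementation: the function names `film` and `bce_loss`, the toy inputs, and the use of plain Python lists in place of tensors are all assumptions made for illustration, and how the paper derives its modulation parameters from Δ̃^s and Δ̃^j is not shown here.

```python
import math


def film(features, gamma, beta):
    """Generic FiLM: scale each feature by gamma and shift it by beta.

    Hypothetical helper; a flat list stands in for one feature vector.
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def bce_loss(probs, mask, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask.

    Hypothetical helper; a flat list stands in for a segmentation map.
    """
    total = 0.0
    for p, y in zip(probs, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy inputs (assumed values, not from the paper).
modulated = film([1.0, 2.0, 3.0], gamma=[0.5, 1.0, 2.0], beta=[0.0, -1.0, 1.0])
loss = bce_loss([0.9, 0.2, 0.7], [1, 0, 1])
print(modulated)
print(round(loss, 4))
```

In practice both operations would run on batched tensors in a deep learning framework; the list-based version is only meant to make the arithmetic explicit.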

Ablation Study

Comparison with other methods

References