CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

1Fudan University 2ByteDance Intelligent Creation
*Equal Contribution Project Lead Corresponding Author

Use Diffusion Transformers to create graphic designs, covering movie posters, brand promotions, product advertisements, and social media content.

Abstract

Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurate adherence to the design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which poses multi-condition control challenges for existing methods. Specifically, previous single-condition control models are effective only within their specialized domains and fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.

Graphic design is a multi-condition driven generation task that requires the precise and harmonious arrangement of heterogeneous elements, including primary visual elements (provided as images with positions), as well as secondary visual and textual elements (both specified by semantic descriptions and positions). Previous methods either support only a single type of condition (e.g., image-driven or layout-driven models) or lack accurate control over each sub-condition (e.g., multi-condition driven models). As a result, they fail to adhere strictly to user design intent, as highlighted by the red and purple masks.
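To make the three kinds of input concrete, the following is a minimal sketch of how such a multi-condition design request could be represented. It is not the CreatiDesign input format: the class names, the `prompt` field, the file name in the usage note, and the normalized `(x0, y0, x1, y1)` box convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: normalized (x0, y0, x1, y1) with 0 <= x0 < x1 <= 1.
Box = Tuple[float, float, float, float]


@dataclass
class PrimaryVisual:
    """A primary visual element: a user-provided image plus its position."""
    image_path: str
    box: Box


@dataclass
class SemanticElement:
    """A secondary visual or textual element: a description plus its position."""
    description: str
    box: Box
    kind: str = "visual"  # hypothetical tag: "visual" or "text"


@dataclass
class DesignSpec:
    """One design request combining all condition types (hypothetical)."""
    prompt: str  # assumed global description of the design
    primary: List[PrimaryVisual] = field(default_factory=list)
    secondary: List[SemanticElement] = field(default_factory=list)


def validate(spec: DesignSpec) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    elements = list(spec.primary) + list(spec.secondary)
    for index, element in enumerate(elements):
        x0, y0, x1, y1 = element.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            problems.append(f"element {index}: box {element.box} is out of range")
    for element in spec.secondary:
        if element.kind not in ("visual", "text"):
            problems.append(f"unknown kind: {element.kind!r}")
    return problems
```

A request with one subject image and one text element would be built as `DesignSpec(prompt="movie poster", primary=[PrimaryVisual("hero.png", (0.2, 0.1, 0.8, 0.7))], secondary=[SemanticElement("title text", (0.1, 0.75, 0.9, 0.95), kind="text")])`, and `validate` returns an empty list for it.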

Unified Multi-Condition Driven Architecture

CreatiDesign integrates image and semantic layout conditions through native multimodal attention. A multimodal attention mask ensures that each condition precisely controls its designated image regions while preventing leakage between conditions.
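The idea of a region-restricted attention mask can be illustrated with a small sketch. This is not the authors' implementation: the token ordering (image tokens first, then a fixed number of tokens per condition), the rectangular cell regions, and the specific attend rules below are assumptions chosen to show how a mask can tie each condition to its own region and block condition-to-condition attention.

```python
def build_attention_mask(grid_h, grid_w, regions, tokens_per_condition):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed token order: grid_h * grid_w image tokens (row-major), followed by
    tokens_per_condition tokens for each condition.
    Each region is (top, left, bottom, right) in grid cells, end-exclusive.
    """
    n_img = grid_h * grid_w
    n_total = n_img + len(regions) * tokens_per_condition
    mask = [[False] * n_total for _ in range(n_total)]

    # Image tokens attend to every image token, keeping the canvas globally coherent.
    for q in range(n_img):
        for k in range(n_img):
            mask[q][k] = True

    for idx, (top, left, bottom, right) in enumerate(regions):
        start = n_img + idx * tokens_per_condition
        cond_tokens = range(start, start + tokens_per_condition)
        inside = [
            row * grid_w + col
            for row in range(top, bottom)
            for col in range(left, right)
        ]
        for q in cond_tokens:
            # A condition attends to its own tokens ...
            for k in cond_tokens:
                mask[q][k] = True
            # ... and exchanges attention only with image tokens in its region.
            for k in inside:
                mask[q][k] = True
                mask[k][q] = True
        # Entries between different conditions stay False, so no leakage occurs.
    return mask
```

For a 2x2 grid with `regions=[(0, 0, 1, 2), (1, 0, 2, 2)]` and one token per condition, the first condition (token 4) is connected to the top-row image tokens 0 and 1 only, the second condition (token 5) to the bottom-row tokens 2 and 3 only, and the two condition tokens never attend to each other. In a real model such a boolean mask would typically be converted to additive form (0 for allowed, negative infinity for blocked) before the softmax.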

Graphic Design Datasets

We develop an automated pipeline for graphic design dataset construction. Building on this pipeline, we synthesize graphic design samples at scale, alleviating the data bottleneck for model training. As a result, we construct a new dataset of 400K samples with annotations for various conditions.

Quantitative Comparison with Relevant State-of-the-Art Methods

We compare CreatiDesign with three types of previous SOTA models: multi-subject image-driven models, semantic layout-driven models, and multi-condition driven models. The best results are shown in bold, and the top-3 results are highlighted. Our proposed method significantly enhances the graphic design capabilities of the baseline, achieves top-tier performance across all metrics, and shows a clear lead in average score.

Qualitative Comparison with Relevant State-of-the-Art Methods

Compared with previous multi-condition or single-condition models, CreatiDesign demonstrates stricter adherence to user intent, including high subject preservation and precise layout alignment. Purple masks: inconsistent or mispositioned subjects. Red masks: entities with incorrect semantics or locations. Gray masks: disharmonious background or foreground regions.

Free Lunch: Expanding to Editing Tasks

CreatiDesign naturally extends beyond graphic design to a wide range of editing tasks without extra retraining. We demonstrate this capability by editing a series of movie posters. CreatiDesign consistently maintains subject identity and achieves accurate layout control and overall visual harmony. In contrast, strong baselines such as Gemini2.0 frequently fail to preserve non-edited regions during sequential edits, often introducing unwanted attribute changes to subjects or text, which highlights a lack of strict adherence to user intent.

Citation BibTeX

If you find CreatiDesign useful for your research or applications, please cite our paper:

@article{zhang2025creatidesign,
      title={CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design},
      author={Zhang, Hui and Hong, Dexiang and Yang, Maoke and Chen, Yutao and Zhang, Zhao and Shao, Jie and Wu, Xinglong and Wu, Zuxuan and Jiang, Yu-Gang},
      journal={arXiv preprint arXiv:2505.19114},
      year={2025}
    }