Tag: paper/multimodal-generation/pretraining-architecture

May 2026

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

May 2026

Improved Baselines with Representation Autoencoders

May 2026

Lance: Unified Multimodal Modeling by Multi-Task Synergy

May 2026

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

May 2026

Qwen-Image-2.0 Technical Report

May 2026

Qwen-Image-VAE-2.0 Technical Report

May 2026

Rethinking Cross-Layer Information Routing in Diffusion Transformers

May 2026

Semantic Generative Tuning for Unified Multimodal Models

May 2026

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

May 2026

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

April 2026

Context Unrolling in Omni Models

April 2026

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

April 2026

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

March 2026

Beyond Language Modeling: An Exploration of Multimodal Pretraining

March 2026

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

March 2026

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

March 2026

Utonia: Toward One Encoder for All Point Clouds

February 2026

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

February 2026

Unified Latents (UL): How to train your latents

January 2026

Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

January 2026

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

January 2026

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

December 2025

Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models

December 2025

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

November 2025

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

October 2025

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

July 2025

UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

June 2025

OmniGen2: Towards Instruction-Aligned Multimodal Generation

June 2025

Ovis-U1 Technical Report

June 2025

Show-o2: Improved Native Unified Multimodal Models

May 2025

Emerging Properties in Unified Multimodal Pretraining (BAGEL)

May 2025

MMaDA: Multimodal Large Diffusion Language Models

May 2025

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation