NativeTok: Native Visual Tokenization for Improved Image Generation
Abstract
NativeTok introduces a novel visual tokenization approach that enforces causal dependencies during image encoding, using a Meta Image Transformer and a Mixture of Causal Expert Transformer for efficient and coherent image generation.
VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
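The abstract only sketches the architecture, but the core MoCET mechanism it describes (each lightweight expert block emitting one token conditioned on all previously generated tokens and on latent features from the MIT) can be illustrated with a minimal, hypothetical PyTorch sketch. The module sizes, the use of nn.TransformerDecoderLayer, the concatenation-based conditioning, and the nearest-codebook quantization step are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions throughout) of the MoCET idea from the abstract:
# a stack of lightweight causal expert blocks, each producing one token
# conditioned on prior tokens and on latent features from an image encoder
# (the MIT in the paper). Not the authors' code.
import torch
import torch.nn as nn


class CausalExpertBlock(nn.Module):
    """One lightweight expert: attends over prior tokens plus latent features
    and emits a single new token embedding."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # Learned query standing in for "the next token to be produced".
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.block = nn.TransformerDecoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=2 * dim, batch_first=True
        )

    def forward(self, prior_tokens: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
        # Causal conditioning: the memory holds only the latent features and the
        # tokens generated so far (concatenation is an assumed conditioning scheme).
        memory = torch.cat([latents, prior_tokens], dim=1)
        q = self.query.expand(latents.size(0), -1, -1)
        return self.block(tgt=q, memory=memory)  # (B, 1, dim)


class MoCETSketch(nn.Module):
    """Stack of causal experts; expert i generates token i of the sequence."""

    def __init__(self, num_tokens: int = 16, dim: int = 64, codebook_size: int = 512):
        super().__init__()
        self.experts = nn.ModuleList(CausalExpertBlock(dim) for _ in range(num_tokens))
        self.codebook = nn.Embedding(codebook_size, dim)  # assumed VQ codebook

    def forward(self, latents: torch.Tensor):
        B, _, dim = latents.shape
        tokens = latents.new_zeros(B, 0, dim)  # no tokens generated yet
        indices = []
        for expert in self.experts:
            t = expert(tokens, latents)                     # one token, causally conditioned
            dists = torch.cdist(t, self.codebook.weight.expand(B, -1, -1))
            idx = dists.argmin(dim=-1)                      # nearest codebook entry, (B, 1)
            quantized = self.codebook(idx)                  # (B, 1, dim)
            tokens = torch.cat([tokens, quantized], dim=1)  # context for the next expert
            indices.append(idx)
        return tokens, torch.cat(indices, dim=1)            # (B, N, dim), (B, N)


if __name__ == "__main__":
    mit_latents = torch.randn(2, 49, 64)  # stand-in for MIT latent features (e.g. a 7x7 grid)
    tokens, ids = MoCETSketch()(mit_latents)
    print(tokens.shape, ids.shape)        # torch.Size([2, 16, 64]) torch.Size([2, 16])
```

Under the Hierarchical Native Training strategy described in the abstract, only newly added expert blocks are updated; in a sketch like this, that would amount to freezing the earlier entries of `self.experts` (e.g. via `requires_grad_(False)`) before training each newly appended expert.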
Community
Introduces native visual tokenization with causal dependencies via NativeTok (MIT and MoCET) and a hierarchical training strategy, enabling efficient, coherent image reconstruction with relational token constraints.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Improving Flexible Image Tokenizers for Autoregressive Image Generation (2026)
- DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation (2025)
- Soft Tail-dropping for Adaptive Visual Tokenization (2026)
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation (2026)
- SFTok: Bridging the Performance Gap in Discrete Tokenizers (2025)
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)