Papers - Image - Encoders
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Paper
• 2107.00652
• Published
• 2
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Paper
• 2403.09622
• Published
• 17
Veagle: Advancements in Multimodal Representation Learning
Paper
• 2403.08773
• Published
• 10
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Paper
• 2304.14178
• Published
• 3
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published
• 55
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
• 2403.18978
• Published
• 15
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper
• 2404.01331
• Published
• 27
PointInfinity: Resolution-Invariant Point Diffusion Models
Paper
• 2404.03566
• Published
• 16
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Paper
• 2109.10282
• Published
• 12
Text Role Classification in Scientific Charts Using Multimodal Transformers
Paper
• 2402.14579
• Published
• 1