BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 14
Chitrarth: Bridging Vision and Language for a Billion People Paper • 2502.15392 • Published Feb 21, 2025
LitLLMs, LLMs for Literature Review: Are we there yet? Paper • 2412.15249 • Published Dec 15, 2024 • 2
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs Paper • 2511.04727 • Published Nov 6, 2025
VoiceAgentBench: Are Voice Assistants ready for agentic tasks? Paper • 2510.07978 • Published Oct 9, 2025
Seeing Straight: Document Orientation Detection for Efficient OCR Paper • 2511.04161 • Published Nov 6, 2025
Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems Paper • 2602.16430 • Published Feb 18
Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation Paper • 2502.20420 • Published Feb 27, 2025
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents Paper • 2603.24440 • Published Mar 25 • 98
EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings Paper • 2603.13594 • Published Mar 13 • 148
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages Paper • 2511.10338 • Published Nov 13, 2025
Grounding Computer Use Agents on Human Demonstrations Paper • 2511.07332 • Published Nov 10, 2025 • 107
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Paper • 2502.01341 • Published Feb 3, 2025 • 39
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation Paper • 2407.06423 • Published Jul 8, 2024
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction Paper • 2503.15661 • Published Mar 19, 2025 • 3