WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models Paper • 2401.13919 • Published Jan 25, 2024 • 32
Reward Bench 2 Collection Datasets, spaces, and models for Reward Bench 2 benchmark and paper! • 11 items • Updated 9 days ago • 16
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs Paper • 2505.19457 • Published May 26, 2025 • 64
Article TinyAgents: A Minimal Experiment with Code Agents and MCP Tools May 16, 2025 • 30
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? Paper • 2503.12349 • Published Mar 16, 2025 • 44
API Agents vs. GUI Agents: Divergence and Convergence Paper • 2503.11069 • Published Mar 14, 2025 • 36
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools Paper • 2503.10970 • Published Mar 14, 2025 • 18
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving Paper • 2502.16111 • Published Feb 22, 2025 • 9
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning Paper • 2502.14768 • Published Feb 20, 2025 • 47
Article NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates Feb 2, 2024 • 4
Article Agent Leaderboard: Evaluating AI Agents in Multi-Domain Scenarios Feb 12, 2025 • 27
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling Paper • 2502.06703 • Published Feb 10, 2025 • 153
Efficient Tool Use with Chain-of-Abstraction Reasoning Paper • 2401.17464 • Published Jan 30, 2024 • 21
Training Language Model Agents without Modifying Language Models Paper • 2402.11359 • Published Feb 17, 2024 • 2
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments Paper • 2402.14672 • Published Feb 22, 2024 • 1
Article π0 and π0-FAST: Vision-Language-Action Models for General Robot Control Feb 4, 2025 • 186
The Lessons of Developing Process Reward Models in Mathematical Reasoning Paper • 2501.07301 • Published Jan 13, 2025 • 99
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10, 2025 • 65
DynaSaur: Large Language Agents Beyond Predefined Actions Paper • 2411.01747 • Published Nov 4, 2024 • 37