suryakiran786 (Kunal Suri)

upvoted a paper 5 months ago

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Paper • 2401.13919 • Published Jan 25, 2024 • 32

upvoted a collection 7 months ago

Reward Bench 2

Collection

Datasets, spaces, and models for Reward Bench 2 benchmark and paper! • 11 items • Updated 9 days ago • 16

upvoted a paper 7 months ago

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Paper • 2505.19457 • Published May 26, 2025 • 64

upvoted an article 8 months ago

TinyAgents: A Minimal Experiment with Code Agents and MCP Tools

Article • May 16, 2025 • 30

upvoted 5 papers 10 months ago

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

Paper • 2502.16111 • Published Feb 22, 2025 • 9

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Paper • 2502.14768 • Published Feb 20, 2025 • 47

upvoted 2 articles 11 months ago

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Article • Feb 2, 2024 • 4

Agent Leaderboard: Evaluating AI Agents in Multi-Domain Scenarios

Article • Feb 12, 2025 • 27

upvoted 4 papers 11 months ago

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Paper • 2502.06703 • Published Feb 10, 2025 • 153

Efficient Tool Use with Chain-of-Abstraction Reasoning

Paper • 2401.17464 • Published Jan 30, 2024 • 21

Training Language Model Agents without Modifying Language Models

Paper • 2402.11359 • Published Feb 17, 2024 • 2

Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments

Paper • 2402.14672 • Published Feb 22, 2024 • 1

upvoted an article 11 months ago

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control

Article • Feb 4, 2025 • 186

upvoted a paper 11 months ago

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Paper • 2501.07301 • Published Jan 13, 2025 • 99

upvoted 3 papers 12 months ago

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10, 2025 • 65

DynaSaur: Large Language Agents Beyond Predefined Actions

Paper • 2411.01747 • Published Nov 4, 2024 • 37

Executable Code Actions Elicit Better LLM Agents

Paper • 2402.01030 • Published Feb 1, 2024 • 184

also upvoted (titles only)

SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

API Agents vs. GUI Agents: Divergence and Convergence

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools