SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Paper • 2310.06770 • Published Oct 10, 2023 • 9
Tree of Thoughts: Deliberate Problem Solving with Large Language Models Paper • 2305.10601 • Published May 17, 2023 • 14
COLLIE: Systematic Construction of Constrained Text Generation Tasks Paper • 2307.08689 • Published Jul 17, 2023
SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark Paper • 2110.10661 • Published Oct 20, 2021
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering Paper • 2405.15793 • Published May 6, 2024 • 7
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Paper • 2410.03859 • Published Oct 4, 2024 • 1
VideoGameBench: Can Vision-Language Models complete popular video games? Paper • 2505.18134 • Published May 23, 2025 • 6
Contextual Experience Replay for Self-Improvement of Language Agents Paper • 2506.06698 • Published Jun 7, 2025
ShieldGemma: Generative AI Content Moderation Based on Gemma Paper • 2407.21772 • Published Jul 31, 2024 • 14