weitianxin/Awesome-Agentic-Reasoning

Awesome Agentic Reasoning Papers

Hugging Face #1 Paper of the Day

License: MIT · Contributions Welcome

This repository organizes research by thematic areas that integrate reasoning with action, including planning, tool use, search, self-evolution through memory and feedback, multi-agent systems, and real-world applications and benchmarks.

📄 Based on the survey: Agentic Reasoning for Large Language Models: A Survey

[Figure: Framework overview]

🔔 News

[01/21/26] 🚀 We have released a comprehensive survey on Agentic Reasoning for Large Language Models! The paper is now available on arXiv and Hugging Face. We welcome contributions from the community to help expand and improve our survey 🤗!


🌟 Introduction

Agentic reasoning bridges thought and action: autonomous agents reason, act, and learn through continual interaction with their environments. The goal is to enhance agent capabilities by grounding reasoning in action.

We organize agentic reasoning into three layers, each corresponding to a distinct reasoning paradigm under different environmental dynamics:

🔹 Foundational Reasoning. Core single-agent abilities (planning, tool-use, search) in stable environments

🔹 Self-Evolving Reasoning. Adaptation through feedback, memory, and learning in dynamic settings

🔹 Collective Reasoning. Multi-agent coordination, role specialization, and collaborative intelligence

Across these layers, we further identify complementary reasoning paradigms defined by their optimization settings.

🔸 In-Context Reasoning. Test-time scaling through structured orchestration and adaptive workflows

🔸 Post-Training Reasoning. Behavior optimization via RL and supervised fine-tuning
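The two axes above (three layers, two optimization paradigms) can be read as a small tagging scheme for the papers in this list. The following is a minimal sketch of that idea, not part of the survey or the repository: the names `Layer`, `Paradigm`, `Entry`, `ENTRIES`, and `select` are hypothetical, and the four sample entries are tagged according to the sections in which they appear below.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the introduction."""
    FOUNDATIONAL = "Foundational Reasoning"
    SELF_EVOLVING = "Self-Evolving Reasoning"
    COLLECTIVE = "Collective Reasoning"


class Paradigm(Enum):
    """The two optimization settings named in the introduction."""
    IN_CONTEXT = "In-Context Reasoning"
    POST_TRAINING = "Post-Training Reasoning"


@dataclass(frozen=True)
class Entry:
    """One paper tagged with a layer and a paradigm (hypothetical schema)."""
    title: str
    layer: Layer
    paradigm: Paradigm


# Sample tags, taken from where each paper is listed in this README.
ENTRIES = [
    Entry("ReAct", Layer.FOUNDATIONAL, Paradigm.IN_CONTEXT),
    Entry("Search-R1", Layer.FOUNDATIONAL, Paradigm.POST_TRAINING),
    Entry("MemAgent", Layer.SELF_EVOLVING, Paradigm.POST_TRAINING),
    Entry("MetaGPT", Layer.COLLECTIVE, Paradigm.IN_CONTEXT),
]


def select(entries, layer=None, paradigm=None):
    """Return entries matching the given layer and/or paradigm (None = any)."""
    return [
        e for e in entries
        if (layer is None or e.layer is layer)
        and (paradigm is None or e.paradigm is paradigm)
    ]


if __name__ == "__main__":
    for entry in select(ENTRIES, paradigm=Paradigm.POST_TRAINING):
        print(f"{entry.title}: {entry.layer.value}")
```

A paper can appear under more than one section of the list, so a fuller version of this sketch would probably allow several tags per entry rather than exactly one of each.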

🤝 Contributing

This collection is an ongoing effort. We are actively expanding and refining its coverage, and welcome contributions from the community.

We regularly update the repository to include new research.

📝 Citation

If you find this repository or paper useful, please consider citing the survey paper:

@article{wei2026agentic,
  title={Agentic Reasoning for Large Language Models},
  author={Wei, Tianxin and Li, Ting-Wei and Liu, Zhining and Ning, Xuying and Yang, Ze and Zou, Jiaru and Zeng, Zhichen and Qiu, Ruizhong and Lin, Xiao and Fu, Dongqi and others},
  journal={arXiv preprint arXiv:2601.12538},
  year={2026}
}

🏗️ Foundational Agentic Reasoning

🗺️ Planning Reasoning

In-context Planning

Workflow Design
Paper Year
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency 2023
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change NeurIPS 2023 DB Track
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models 2023
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models 2024
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models ICLR 2023
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models ACL 2023
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models ICML 2024
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face 2023
Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents 2023
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation 2024
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks 2025
CodePlan: Repository-level Coding using LLMs and Planning FSE 2024
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
Mind2Web: Towards a Generalist Agent for the Web NeurIPS 2023
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents 2024
Executable Code Actions Elicit Better LLM Agents ICML 2024
Gorilla: Large Language Model Connected with Massive APIs 2023
Reflexion: Language Agents with Verbal Reinforcement Learning 2023
CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents ACL 2024
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing 2025
Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection Agents 2025
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents 2025
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent 2023
Self-Planning Code Generation with Large Language Models TOSEM 2023
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action CoRL 2022
Tree Search / Algorithm Simulation
Paper Year
Tree of Thoughts: Deliberate Problem Solving with Large Language Models NeurIPS 2023
Tree Search for Language Model Agents 2024
Tree-Planner: Efficient Planning with Large Language Models ICLR 2024
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning 2024
LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning 2024
Multimodal Chain-of-Thought Reasoning in Language Models 2023
Reasoning with Language Model is Planning with World Model NeurIPS 2023
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents 2024
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning 2024
Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning 2023
Large Language Models as Tool Makers ICLR 2024
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation 2023
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training 2023
Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space 2025
Self-Evaluation Guided Beam Search for Reasoning NeurIPS 2023
PathFinder: Multimodal Multi-Agent Medical Diagnosis Framework 2025
Discriminator-Guided Embodied Planning for LLM Agent ICLR 2025
Stream of Search (SoS): Learning to Search in Language 2024
System-1.x: Learning to Balance Fast and Slow Planning with Language Models 2024
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems 2024
Intelligent Virtual Assistants with LLM-based Process Automation 2023
Agent S: An Open Agentic Framework that Uses Computers Like a Human 2024
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking 2025
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling ACL 2025
Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution 2025
Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning 2025
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement ICLR 2025
BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs 2024
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances CoRL 2022
Inner Monologue: Embodied Reasoning through Planning with Language Models CoRL 2022
Process Formalization
Paper Year
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning NeurIPS 2023
Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models NeurIPS 2024
Thought of Search: Planning with Language Models Through The Lens of Efficiency NeurIPS 2024
CodePlan: Repository-level Coding using LLMs and Planning FSE 2024
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming 2024
From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle 2024
Decoupling / Decomposition
Paper Year
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models NeurIPS 2023
DiffuserLite: Towards Real-time Diffusion Planning 2024
Goal-Space Planning with Subgoal Models JMLR 2024
Agent-Oriented Planning in Multi-Agent Systems 2024
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models 2023
RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning ICLR 2025
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking 2025
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning 2024
Beyond Autoregression: Discrete Diffusion for Complex Reasoning 2024
PlanAgent: A Multi-modal Large Language Agent for Vehicle Motion Planning 2024
LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments 2024
External Aid / Tool Use
Paper Year
Plan-on-Graph: Self-Correcting Adaptive Planning on Knowledge Graphs NeurIPS 2024
Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification 2025
TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching 2025
FlexPlanner: Flexible 3D Floorplanning via Deep Reinforcement Learning in Hybrid Action Space with Multi-Modality Representation NeurIPS 2024
Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following NeurIPS 2024
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent 2024
RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking 2025
Reasoning with Language Model is Planning with World Model NeurIPS 2023
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning NeurIPS 2023
Agent Planning with World Knowledge Model NeurIPS 2024
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with Next-Patch Prediction NeurIPS 2024
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning 2024
FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model 2024
Continual Reinforcement Learning by Planning with Online World Models 2025
AdaWM: Adaptive World Model based Planning for Autonomous Driving 2025
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face 2023
Tool-Planner: Task Planning with Clusters across Multiple Tools 2024
RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning ICLR 2025

Post-training Planning

Paper Year
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens 2025
Rational Decision-Making Agent with Internalized Utility Judgment 2023
Scaling Autonomous Agents via Automatic Reward Modeling 2025
Strategic Planning: A Top-Down Approach to Option Generation 2025
Non-myopic Generation of Language Models for Reasoning and Planning 2024
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning 2025
Generalizable Motion Planning via Operator Learning 2024
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration 2025
Latent Diffusion Planning for Imitation Learning 2025
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models ICLR 2023
ContraDiff: Planning Towards High Return States via Contrastive Learning ICLR 2025
Amortized Planning with Large-Scale Transformers: A Case Study on Chess NeurIPS 2024
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models 2023
A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks 2025

🛠️ Tool-Use Optimization

In-Context Tool-Integration

Interleaving Reasoning and Tool Use
Paper Year
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models NeurIPS 2022
ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models EMNLP 2023
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting ACL 2023
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions ACL 2023
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models 2023
Optimizing Context for Tool Interaction
Paper Year
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models 2023
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction NAACL 2025
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution EACL 2024
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning NeurIPS 2024

Post-training Tool-Integration

Bootstrapping of Tool Use via SFT
Paper Year
Toolformer: Language Models Can Teach Themselves to Use Tools NeurIPS 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs ICLR 2024
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases 2023
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models NeurIPS 2023
RestGPT: Connecting Large Language Models with Real-World RESTful APIs 2023
ADaPT: As-Needed Decomposition and Planning with Language Models 2023
Agent Lumos: Unified and Modular Training for Open-Source Language Agents 2023
Learning to Use Tools via Cooperative and Interactive Agents 2024
Understanding the Effects of RLHF on LLM Generalisation and Diversity 2023
Preserving Diversity in Supervised Fine-Tuning of Large Language Models 2024
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification EMNLP 2024
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning 2025
iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use 2025
START: Self-taught Reasoner with Tools 2025
Mastery of Tool Use via RL
Paper Year
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution 2025
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement 2024
ToolRL: Reward is All Tool Learning Needs 2025
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents 2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning 2025
AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning 2025
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning 2025
Agentic Reinforced Policy Optimization 2025
Agentic Entropy-Balanced Policy Optimization 2025
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning 2025
DeepAgent: A General Reasoning Agent with Scalable Toolsets 2025
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning 2025
Demystifying Reinforcement Learning in Agentic Reasoning 2025
Reinforcement Pre-Training 2025
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs 2025
ZeroSearch: Incentivize the Search Capability of LLMs Without Searching 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning and Next Generation Agentic Capabilities 2025
Kimi k2: Open Agentic Intelligence 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models 2025
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning 2025

Orchestration-based Tool-Integration

Agentic Pipelines for Tool Orchestration
Paper Year
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback 2025
Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning KDD 2025
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning 2025
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models 2025
PyVision: Agentic Vision with Dynamic Tooling 2025
Learning to Use Tools via Cooperative and Interactive Agents 2024
El Agente: An Autonomous Agent for Quantum Chemistry 2025
Tool Representations for Orchestration
Paper Year
ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks ACL (Findings) 2025
T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search 2025
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search 2023
ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval COLING 2024

🔍 Agentic Search

In-Context Search

Interleaving Reasoning and Search
Paper Year
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
Measuring and Narrowing the Compositionality Gap in Language Models 2022
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions 2022
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection NeurIPS Workshop 2023
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-Adaptive Planning Agent 2024
DeepRAG: Thinking to Retrieve Step by Step for Large Language Models 2025
MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains NeurIPS Workshop 2025
Structure-Enhanced Search
Paper Year
Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation 2025
MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains NeurIPS Workshop 2025
GeAR: Graph-Enhanced Agent for Retrieval-Augmented Generation 2024
Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection 2025

Post-Training Search

SFT-Based Agentic Search
Paper Year
Toolformer: Language Models Can Teach Themselves to Use Tools NeurIPS 2023
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning 2024
RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation through Self-Alignment EMNLP (Findings) 2024
RAFT: Adapting Language Model to Domain Specific RAG 2024
Search-o1: Agentic Search-Enhanced Large Reasoning Models 2025
RA-DIT: Retrieval-Augmented Dual Instruction Tuning ICLR 2024
SFR-RAG: Towards Contextually Faithful LLMs 2024
RL-Based Agentic Search
Paper Year
WebGPT: Browser-assisted Question-Answering with Human Feedback 2021
RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning 2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning 2025
KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering 2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-World Environments 2025
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning 2025
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding 2025

🧬 Self-evolving Agentic Reasoning

🔄 Agentic Feedback Mechanisms

Reflective Feedback

Paper Year
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback NeurIPS 2023
Enable Language Models to Implicitly Learn Self-Improvement From Data ICLR 2024
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve TMLR 2025
Tree of Thoughts: Deliberate Problem Solving with Large Language Models NeurIPS 2023
Graph of Thoughts: Solving Elaborate Problems with Large Language Models AAAI 2024
Zero-Shot Verification-Guided Chain of Thoughts 2025
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
WebGPT: Browser-assisted Question-Answering with Human Feedback 2021
MemGPT: Towards LLMs as Operating Systems 2023
Voyager: An Open-Ended Embodied Agent with Large Language Models 2023

Parametric Adaptation

Paper Year
AgentTuning: Enabling Generalized Agent Abilities for LLMs 2023
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent 2023
Re-ReST: Reflection-Reinforced Self-Training for Language Agents 2024
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes 2023
Deep Reinforcement Learning from Human Preferences NeurIPS 2017
Direct Preference Optimization: Your Language Model is Secretly a Reward Model NeurIPS 2023
Constitutional AI: Harmlessness from AI Feedback 2022
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection ACL (Findings) 2025

Validator-Driven Feedback

Paper Year
ReZero: Enhancing LLM Search Ability by Trying One-More-Time 2025
Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback 2025
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning 2022
LEVER: Learning to Verify Language-to-Code Generation with Execution ICML 2023
SWE-bench: Can Language Models Resolve Real-world Github Issues? ICLR 2024
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances CoRL 2022
PaLM-E: An Embodied Multimodal Language Model ICML 2023
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning 2025

🧠 Agentic Memory

Agentic Use of Flat Memory

Factual Memory
Paper Year
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks NeurIPS 2020
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection ICLR 2024
MemoryBank: Enhancing Large Language Models with Long-Term Memory 2023
LlamaIndex 2022
MemGPT: Towards LLMs as Operating Systems 2023
RET-LLM: Towards a General Read-Write Memory for Large Language Models 2023
SCM: Enhancing Large Language Model with Self-Controlled Memory Framework 2023
Evaluating Very Long-Term Conversational Memory of LLM Agents 2024
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory 2024
SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals NAACL 2025
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design 2023
A-MEM: Agentic Memory for LLM Agents 2025
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents 2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory 2025
MIRIX: Multi-Agent Memory System for LLM-Based Agents 2025
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models 2025
LightMem: Lightweight and Efficient Memory-Augmented Generation 2025
Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science 2025
Experience Memory
Paper Year
Agent Workflow Memory 2024
Sleep-time Compute: Beyond Inference Scaling at Test-time 2025
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory 2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models 2025
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory 2025
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory 2025

Structured Use of Memory

Paper Year
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph 2024
From Local to Global: A Graph RAG Approach to Query-Focused Summarization 2024
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory 2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory 2025
From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs 2024
AutoFlow: Automated Workflow Generation for Large Language Model Agents 2024
AFlow: Automating Agentic Workflow Generation ICLR 2025
FlowMind: Automatic Workflow Generation with LLMs 2024
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (M3-Agent) 2025
Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations 2025
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks NeurIPS 2024
RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents 2024

Post-training Memory Control

Paper Year
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent 2025
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents 2025
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning 2025
Mem-alpha: Learning Memory Construction via Reinforcement Learning 2025
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks 2025
Agent Learning via Early Experience 2025
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents 2026
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory 2026

🚀 Evolving Foundational Agentic Capabilities

Self-evolving Planning

Paper Year
Self-Challenging Language Model Agents 2025
Self-Rewarding Language Models ICML 2024
RLSR: Reinforcement Learning from Self Reward 2025
Self: Self-Evolution with Language Feedback 2023
Training Language Models to Self-Correct via Reinforcement Learning 2024
TextGrad: Automatic "Differentiation" via Text 2024
AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning 2025
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation 2024
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
AdaPlanner: Adaptive Planning from Feedback with Language Models NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback NeurIPS 2023
A Self-Improving Coding Agent 2025
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning 2025
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning 2025

Self-evolving Tool-use

Paper Year
Large Language Models as Tool Makers ICLR 2024
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets ICLR 2024
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models EMNLP 2023
LLM Agents Making Agent Tools 2025

Self-evolving Search for Memory Retrieval

Paper Year
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks NeurIPS 2020
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection ICLR 2024
MemoryBank: Enhancing Large Language Models with Long-Term Memory 2023
MemGPT: Towards LLMs as Operating Systems 2023
Agent Workflow Memory 2024
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory 2025
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory 2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models 2025
AutoFlow: Automated Workflow Generation for Large Language Model Agents 2024
AFlow: Automating Agentic Workflow Generation ICLR 2025
FlowMind: Automatic Workflow Generation with LLMs 2024
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph 2024
From Local to Global: A Graph RAG Approach to Query-Focused Summarization 2024
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory 2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory 2025
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models 2025
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks 2025

👥 Collective Multi-agent Reasoning

🤝 Collaboration and Division of Labor

In-context Collaboration

Manually Crafted Pipelines
Paper Year
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving 2025
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework ICLR 2024
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence 2025
Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration 2025
MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning 2025
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks NeurIPS 2024
AutoAgents: A Framework for Automatic Agent Generation IJCAI 2024
RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning 2025
SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents 2024
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding 2025
LLM-Driven Pipelines
Paper Year
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML 2024
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks 2024
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems 2025
MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines 2025
Agent-Oriented Planning in Multi-Agent Systems 2024
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering 2025
Talk to Right Specialists: Routing and Planning in Multi-Agent System for Question Answering 2025
Theory-of-Mind-Augmented Collaboration
Paper Year
Theory of Mind for Multi-Agent Collaboration via Large Language Models 2023
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models 2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning 2024
How Large Language Models Encode Theory-of-Mind: A Study on Sparse Parameter Patterns npj Artificial Intelligence 2025
Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection 2025
BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind 2025

Post-training Collaboration

Multi-agent Prompt Optimization
Paper Year
AutoAgents: A Framework for Automatic Agent Generation IJCAI 2024
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration NAACL 2024
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines 2023
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies 2025
Automatic Prompt Optimization with "Gradient Descent" and Beam Search 2023
Graph-based Topology Generation
Paper Year
Learning Multi-Agent Communication from Graph Modeling Perspective 2024
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks 2024
Graph Diffusion for Robust Multi-Agent Coordination ICML 2025
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems 2024
Adaptive Graph Pruning for Multi-Agent Communication 2025
G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-Agent Systems 2025
AFlow: Automating Agentic Workflow Generation ICLR 2025
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies 2025
Multi-Agent Architecture Search via Agentic Supernet 2025
DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-Agent System 2025
GPTSwarm: Language Agents as Optimizable Graphs ICML 2024
Policy-based Topology Generation
Paper Year
MASRouter: Learning to Route LLMs for Multi-Agent Systems 2025
RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory 2025
xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning 2025
Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration 2025
LLM Collaboration with Multi-Agent Reinforcement Learning 2025
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems 2025
Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy 2025
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation IEEE RA-L 2025
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning 2025
Reflective Multi-Agent Collaboration Based on Large Language Models NeurIPS 2024
Sirius: Self-Improving Multi-Agent Systems via Bootstrapped Reasoning 2025
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains 2025
M3HF: Multi-Agent Reinforcement Learning from Multi-Phase Human Feedback of Mixed Quality 2025
O-MAPL: Offline Multi-Agent Preference Learning 2025

🌱 Multi-Agent Memory and Evolution

From Single-Agent Evolution to Multi-Agent Evolution

Intra-test-time Evolution
Paper Year
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback NeurIPS 2023
AdaPlanner: Adaptive Planning from Feedback with Language Models NeurIPS 2023
TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution TiFA 2024
Self-Adapting Language Models 2025
TTRL: Test-Time Reinforcement Learning 2025
Ladder: Self-Improving LLMs through Recursive Problem Decomposition 2025
Inter-test-time Evolution
Paper Year
Self: Self-Evolution with Language Feedback 2023
STaR: Bootstrapping Reasoning with Reasoning NeurIPS 2022
Reasoning Beyond Limits: Advances and Open Problems for LLMs 2025
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning 2025
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning 2025
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning 2024
Why do animals need shaping? A theory of task composition and curriculum learning 2024
SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities Neurocomputing 2025
MemInsight: Autonomous Memory Augmentation for LLM Agents 2025
Agent Workflow Memory 2024
Multi-agent Evolution
Paper Year
Self: Self-Evolution with Language Feedback 2023
Training Language Models to Self-Correct via Reinforcement Learning 2024
TextGrad: Automatic "Differentiation" via Text 2024
REMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning 2025
Group-in-Group Policy Optimization for LLM Agent Training 2025
Agent Workflow Memory 2024
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models 2025
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies 2025
AFlow: Automating Agentic Workflow Generation ICLR 2025
Testing Advanced Driver Assistance Systems Using Multi-Objective Search and Neural Networks ASE 2016
Latent Collaboration in Multi-Agent Systems 2025

Multi-agent Memory Management for Evolution

Paper Year
G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems 2025
Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory 2025
LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning 2025
SEDM: Scalable Self-Evolving Distributed Memory for Agents 2025
Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control 2025
Memory Sharing for Large Language Model based Agents 2024
MIRIX: Multi-Agent Memory System for LLM-Based Agents 2025
LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation 2025
MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning ALTA 2025
Lyfe Agents: Generative agents for low-cost real-time social interactions 2023
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving 2025

Training Multi-agent to Evolve

Paper Year
Multi-Agent Evolve: LLM Self-Improve through Co-evolution 2025
CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards 2025
MARFT: Multi-Agent Reinforcement Fine-Tuning 2025
Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs 2025
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning 2025
MALT: Multi-Agent Learning from Trajectories 2025
MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning 2025
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques 2024
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety 2025

🎨 Applications

💻 Math Exploration & Vibe Coding Agents

Foundational Agentic Reasoning

Paper Year
Advancing mathematics by guiding human intuition with AI Nature 2021
Solving olympiad geometry without human demonstrations Nature 2024
Mathematical discoveries from program search with large language models Nature 2024
Mathematical Exploration and Discovery at Scale 2025
Advancing geometry with AI: Multi-agent generation of polytopes 2025
Towards Robust Mathematical Reasoning EMNLP 2025
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules ICLR 2024
Executable Code Actions Elicit Better LLM Agents ICML 2024
Knowledge-Aware Code Generation with Large Language Models ICPC 2024
CodePlan: Repository-level Coding using LLMs and Planning FSE 2024
Multi-stage guided code generation for Large Language Models Eng. App. AI 2025
CodeTree: Agent-Guided Tree Search for Code Generation with Large Language Models 2024
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning 2024
Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks ACL 2025
CoRT: Code-integrated Reasoning within Thinking 2025
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal 2025
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search NeurIPS 2024
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning AAAI 2025
Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents ICML 2025
An In-Context Learning Agent for Formal Theorem-Proving COLM 2024
Formal Mathematical Reasoning: A New Frontier in AI 2024
Generative Modelling for Mathematical Discovery 2025
Toolformer: Language Models Can Teach Themselves to Use Tools NeurIPS 2023
ToolCoder: Teach Code Generation Models to use API search tools 2023
ToolGen: Unified Tool Retrieval and Calling via Generation ICLR 2025
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges ACL 2024
ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation ICSE 2025
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision 2025
RepoHyper: Better Context Retrieval is All You Need for Repository-Level Code Completion 2024
CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents ICLR 2024
Optimizing Code Runtime Performance Through Context-Aware Retrieval-Augmented Generation ICPC 2025
Knowledge Graph Based Repository-Level Code Generation LLM4Code 2025
cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree 2025

Self-evolving Agentic Reasoning

Paper Year
Evaluating Language Models for Mathematics through Interactions PNAS 2024
CLCL: Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning ACL 2023
Is Self-Repair a Silver Bullet for Code Generation? 2024
LeDeX: Learning to Debug with Execution Feedback NeurIPS 2024
Self-Refine: Iterative Refinement with Self-Feedback NeurIPS 2023
A Self-Iteration Code Generation Method Based on Large Language Models ICPADS 2023
Teaching Large Language Models to Self-Debug ICLR 2024
Self-Collaboration Code Generation via ChatGPT TOSEM 2024
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation 2023
Cogito, Ergo Sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation 2025

Collective Multi-agent Reasoning

Paper Year
AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation 2023
A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement ASE 2024
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents ICSE 2025
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization 2024
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving 2024
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing 2024
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks 2025
SEW: Self-Evolving Agentic Workflows for Automated Code Generation 2025
Self-Evolving Multi-Agent Collaboration Networks for Software Development 2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement 2024
CodeCoR: An LLM-based Self-Reflective Multi-Agent Framework for Code Generation 2025
SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering ICML 2025
Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation 2025

🔬 Scientific Discovery Agents

Foundational Agentic Reasoning

Paper Year
ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning Digital Discovery 2024
Agent-based learning of materials datasets from the scientific literature Digital Discovery 2024
ReAct: Synergizing Reasoning and Acting in Language Models ICLR 2023
Biomni: A General-Purpose Biomedical AI Agent bioRxiv 2025
SciAgent: Tool-augmented Language Models for Scientific Reasoning 2024
ChemCrow: Augmenting large-language models with chemistry tools 2023
CACTUS: Chemistry Agent Connecting Tool-Usage to Science ACS Omega 2024
ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving 2024
CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning 2025
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools 2025
AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning Nature Communications 2025
LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation 2024
HoneyComb: A Flexible LLM-Based Agent System for Materials Science 2024
CRISPR-GPT for Agentic Automation of Gene-editing Experiments 2024
PharmAgents: Building a Virtual Pharma with Large Language Model Agents 2025
ORGANA: A robotic assistant for automated chemistry experimentation and characterization Matter 2025
AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence 2024
Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis 2024
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery 2024
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis bioRxiv 2024
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments 2024
DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction 2024
Accelerating Scientific Research Through a Multi-LLM Framework 2025
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search 2025
Large Language Models are Zero Shot Hypothesis Proposers 2023
PaperQA: Retrieval-Augmented Generative Agent for Scientific Research 2023
Language agents achieve superhuman synthesis of scientific knowledge 2024

Self-evolving Agentic Reasoning

Paper Year
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning 2025
Accelerated Inorganic Materials Design with Generative AI Agents 2025
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery 2024
ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback 2024
LLMatDesign: Autonomous Materials Discovery with Large Language Models 2024
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents 2025

Collective Multi-agent Reasoning

Paper Year
ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning Digital Discovery 2024
PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration 2025
AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence 2024
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis bioRxiv 2024
Accelerating Scientific Research Through a Multi-LLM Framework 2025
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data 2024
The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation bioRxiv 2024

🤖 Embodied Agents

Foundational Agentic Reasoning

Paper Year
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances 2022
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought NeurIPS 2023
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents ECCV 2024
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents NeurIPS 2023
Robotic Control via Embodied Chain-of-Thought Reasoning 2024
Fast ECoT: Fast Embodied Chain-of-Thought for Vision-Language-Action Models 2025
Cosmos-Reason1: Physical Commonsense with Multimodal Chain of Thought Reasoning 2025
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models 2025
Emma-X: An Embodied Multimodal Action Model with Chain of Thought Reasoning 2024
Robot-R1: Reinforcement Learning Enhanced Large Vision-Language Models for Robotic Manipulation 2025
ManipLVM-R1: Learning to Reason for Robotic Manipulation via Reinforcement Learning 2025
Embodied-R: Emergent Spatial Reasoning in Robotics via Multi-Agent Reinforcement Learning 2025
VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation 2025
GSCE: A Prompt Framework for Enhanced Logical Reasoning in LLM-Based Drone Control 2025
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge NeurIPS 2022
Physical AI Agents: Integrating Generative AI, Symbolic AI and Robotics 2025
Chat with the Environment: Interactive Multimodal Perception using Large Language Models IROS 2023
An Embodied Generalist Agent in 3D World ICML 2024
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models 2025
Gemini Robotics: Bringing AI to the Physical World 2025
Octopus: Embodied Vision-Language Programmer from Environmental Feedback ECCV 2024
CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration 2024
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models ICRA 2025
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception CVPR 2024
LLM-Planner: Few-Shot Grounded High-Level Planning for Embodied Agents with Large Language Models ICCV 2023
L3MVN: Leveraging Large Language Models for Visual Target Navigation 2023
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments ICAPS 2023
SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning CoRL 2023
ReMEmbR: Building and Reasoning with Long-Horizon Spatio-Temporal Memory for Embodied Agents 2025
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval-Augmented Generation NeurIPS Workshop AFM 2024
Retrieval-Augmented Embodied Agents 2024
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents 2024

Self-evolving Agentic Reasoning

Paper Year
LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics 2024
Optimus-1: Hybrid Multimodal Memory Empowered Agents for Long-Horizon Tasks in Minecraft 2024
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models EMNLP 2023
Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation 2025
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents 2024
Ella: Embodied Social Agents with Lifelong Memory 2025
Chat with the Environment: Interactive Multimodal Perception using Large Language Models IROS 2023
From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation 2025
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners CoRL 2023
Octopus: Embodied Vision-Language Programmer from Environmental Feedback ECCV 2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning 2024
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration 2024
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM 2025
Voyager: An Open-Ended Embodied Agent with Large Language Models 2023

Collective Multi-agent Reasoning

Paper Year
Smart-LLM: Smart Multi-Agent Robot Task Planning with Large Language Models 2024
CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration 2024
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models ICRA 2025
Theory of mind for multi-agent collaboration via large language models 2023
How large language models encode theory-of-mind: a study on sparse parameter patterns npj Artificial Intelligence 2025
Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models 2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning 2024
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM 2025
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation 2025
VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation 2025
RoCo: Dialectic Multi-Robot Collaboration with Large Language Models 2024

🏥 Healthcare & Medicine Agents

Foundational Agentic Reasoning

Paper Year
Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology Nature Medicine 2024
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records 2024
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology 2025
MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow 2025
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility 2025
ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning 2024
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making 2025
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools 2025
AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning Nature Communications 2025
Large language model agents can use tools to perform clinical calculations npj Digital Medicine 2025
MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling 2024
MMedAgent: Learning to Use Medical Tools with Multi-modal Agents 2024
VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis 2024
Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning 2024
Adaptive Reasoning and Acting in Medical Language Agents 2024
MedRAX: Medical Reasoning Agent for Chest X-ray 2025
Conversational Health Agents: A Personalized LLM-Powered Agent Framework 2023
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science 2025
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education 2024
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions MICCAI 2025
RAG-Enhanced Collaborative LLM Agents for Drug Discovery 2025
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs 2025

Self-evolving Agentic Reasoning

Paper Year
Epidemic Modeling with Generative Agents 2023
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions MICCAI 2025
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records 2024
LLMs Can Simulate Standardized Patients via Agent Coevolution 2024
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education 2024
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility 2025
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making 2025
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science 2025
MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling 2025
Large language model agents can use tools to perform clinical calculations npj Digital Medicine 2025

Collective Multi-agent Reasoning

Paper Year
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making 2024
DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue 2025
Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis 2024
ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning 2024
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology 2025
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions MICCAI 2025
LLMs Can Simulate Standardized Patients via Agent Coevolution 2024
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making 2025
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning 2024
RAG-Enhanced Collaborative LLM Agents for Drug Discovery 2025
GMAI-VL-R1: Harnessing Reinforcement Learning for Multi-Modal Medical Reasoning 2025

🌐 Autonomous Web Exploration & Research Agents

Foundational Agentic Reasoning

Paper Year
Agent Laboratory: Using LLM Agents as Research Assistants 2025
GPT Researcher 2023
Accelerating Scientific Research Through a Multi-LLM Framework 2025
InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification 2025
WebGPT: Browser-assisted question-answering with human feedback 2021
Language Models are Few-Shot Learners NeurIPS 2020
GPT-4V(ision) is a Generalist Web Agent, if Grounded ICML 2024
AutoWebGLM: A Large Language Model-based Web Navigating Agent 2024
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents 2024
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning 2024
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning 2025
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning 2024
DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning 2025
EvolveSearch: An Iterative Self-Evolving Search Agent 2025
WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model 2025
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL ICLR 2025
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents 2024
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection 2025
ZeroSearch: Incentivize the Search Capability of LLMs Without Searching 2025
StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization 2025
How to Train Your LLM Web Agent: A Statistical Diagnosis 2025
Agent S: An Open Agentic Framework that Uses Computers Like a Human 2024
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2025
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation 2024
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC 2025
UItron: Foundational GUI Agent with Advanced Perception and Planning 2025
ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay 2025
ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents 2025
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning 2025
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents 2025
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners 2025
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning 2025
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration EMNLP 2025
Learning GUI Grounding with Spatial Reasoning from Visual Feedback 2025
GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning 2025
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost 2025
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning 2025
AutoGLM: Autonomous Foundation Agents for GUIs 2024
Mobile-Agent-v3: Fundamental Agents for GUI Automation 2025
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models ACL 2024
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions 2025
WALT: Web Agents that Learn Tools 2025
WebDancer: Towards Autonomous Information Seeking Agency 2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization 2025
AutoDroid: LLM-powered Task Automation in Android MobiCom 2024
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices 2024
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant 2024
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement 2024
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning 2024
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents 2024
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents 2024
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools 2025
MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents 2024
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback 2025
The AI Scientist: Fully Automated Open-Ended Scientific Discovery 2024
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search 2025
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents 2025
WebSailor: Navigating Super-human Reasoning for Web Agent 2025
RaDA: Retrieval-augmented Web Agent Planning with LLMs 2024
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control ICLR 2024
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark 2025
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation 2023
Retrieval-augmented GUI Agents with Generative Guidelines 2025
WebThinker: Empowering Large Reasoning Models with Deep Research Capability 2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments 2025
PaperQA: Retrieval-Augmented Generative Agent for Scientific Research 2023
Language agents achieve superhuman synthesis of scientific knowledge 2024
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents 2024
Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination 2024

Self-evolving Agentic Reasoning

Paper Year
Agent Workflow Memory 2024
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought 2024
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions 2025
AutoWebGLM: A Large Language Model-based Web Navigating Agent 2024
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents 2024
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications 2025
WebDancer: Towards Autonomous Information Seeking Agency 2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization 2025
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation 2023
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation 2024
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks 2025
Agent Laboratory: Using LLM Agents as Research Assistants 2025
GPT Researcher 2023
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents 2024
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search 2025
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents 2024
Reflection-Based Memory For Web navigation Agents 2025
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems 2024
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution 2025
WINELL: Wikipedia Never-Ending Updating with LLM Agents 2025
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection 2025
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior 2025
History-Aware Reasoning for GUI Agents 2025
MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation 2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2025
CycleResearcher: Improving Automated Research via Automated Review 2024
MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents 2024
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback 2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments 2025

Collective Multi-agent Reasoning

Paper Year
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration 2024
WINELL: Wikipedia Never-Ending Updating with LLM Agents 2025
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution 2025
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents 2024
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems 2024
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks 2025
Agentic Web: Weaving the Next Web with AI Agents 2025
CoLA: Collaborative Low-Rank Adaptation 2025
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration ACL 2024
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks 2025
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation 2025
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices 2024
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use 2025
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC 2025
AgentRxiv: Towards Collaborative Autonomous Research 2025
Accelerating Scientific Research Through a Multi-LLM Framework 2025
Large Language Models are Zero-Shot Reasoners NeurIPS 2022
Emergent autonomous scientific research capabilities of large language models Nature 2023
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data 2024

📊 Benchmarks

⚙️ Core Mechanisms of Agentic Reasoning

Tool Use

Single-Turn Tool Use

Paper Year
ToolQA: A Dataset for LLM Question Answering with External Tools NeurIPS 2023
Gorilla: Large Language Model Connected with Massive APIs 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs ICLR 2024
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ICLR 2024
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step ACL 2024
GTA: A Benchmark for General Tool Agents NeurIPS 2024
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models 2025

Multi-Turn Tool Use

Paper Year
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases 2023
On the Tool Manipulation Capability of Open-source Large Language Models 2023
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs EMNLP 2023
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ACL 2024
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models ICLR 2025

Search

Memory and Planning

Long-Horizon Episodic Memory

Paper Year
PerLTQA: A Persona-based Long-term Memory Benchmark for RAG 2024
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context LLMs 2024
Multi-IF: A Benchmark for Multi-turn Instruction Following 2024
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs 2025
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models 2025
StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns 2025
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation 2025

Multi-session Recall

Paper Year
Evaluating Very Long-Term Conversational Memory of LLM Agents 2024
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants 2024
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory 2024
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation 2025
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions 2025
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents 2026
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory 2025

Planning and Feedback

Paper Year
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning ICLR 2021
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change NeurIPS 2022
ACPBench: Reasoning about Action, Change, and Planning 2024
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation ACL 2025
REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks 2025
TravelPlanner: A Benchmark for Real-World Planning with Language Agents ICML 2024
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents 2024
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models 2025

Multi-Agent System

Game-based reinforcement learning evaluation

Paper Year
MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence AAAI 2018
Pommerman: A Multi-Agent Playground 2018
The StarCraft Multi-Agent Challenge NeurIPS 2019
MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs 2024
TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft 2024
Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot ICML 2021
BenchMARL: Benchmarking Multi-Agent Reinforcement Learning 2023
Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence AAAI 2020

Simulation-centric real-world assessment

Paper Year
SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving CoRL 2020
Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world NeurIPS 2022
A Versatile Multi-Agent Reinforcement Learning Benchmark for Inventory Management 2023
IMP-MARL: a Suite of Environments for Infrastructure Management Planning with Multi-Agent Reinforcement Learning NeurIPS 2023
POGEMA: Partially Observable Grid Environment for Multiple Agents arXiv 2022
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning NeurIPS 2024
REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks 2025

Language, Communication, and Social Reasoning

Paper Year
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models 2023
AvalonBench: Evaluating LLMs Playing the Game of Avalon 2023
Welfare Diplomacy: Benchmarking Language Model Cooperation 2023
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration EMNLP 2024
BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems 2024
COMMA: A Benchmark for Inter-Agent Communication in Multi-Agent Systems 2024
IntellAgent: A Benchmark for Evaluating Conversational Agents in Realistic Scenarios 2025
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents 2025

🎯 Applications of Agentic Reasoning

Embodied Agents

Paper Year
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks 2025
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games NeurIPS 2024
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning ICLR 2021
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment 2024
MindAgent: Emergent Gaming Interaction 2023
Playing repeated games with Large Language Models 2023
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments NeurIPS 2024

Scientific Discovery Agents

Paper Year
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents NeurIPS 2024
ScienceWorld: Is your Agent Smarter than a 5th Grader? EMNLP 2022
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery NeurIPS 2024
The AI Scientist: Fully Automated Open-Ended Scientific Discovery 2024
LAB-Bench: Measuring Capabilities of Language Models for Biology Research 2024
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation 2023

Autonomous Research Agents

Paper Year
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? ICML 2024
WorkArena++: Towards Agents that Act Like Employees 2024
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation 2024
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change NeurIPS 2022
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents 2024
ACPBench: Reasoning about Action, Change, and Planning 2024
TRAIL: Trace Reasoning and Agentic Issue Localization 2025
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization NeurIPS 2023
Agent-as-a-Judge: Evaluate Agents with Agents 2024
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation 2025

Medical and Clinical Agents

Paper Year
AgentClinic: a multimodal agent benchmark for clinical environments NeurIPS 2024
MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents NEJM AI 2025
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records 2024
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning 2023
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning 2024

Web Agents

Paper Year
WebArena: A Realistic Web Environment for Building Autonomous Agents ICLR 2024
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks NeurIPS 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models ACL 2024
Mind2Web: Towards a Generalist Agent for the Web NeurIPS 2023
WebCanvas: Benchmarking Web Agents in Online Canvas NeurIPS 2024
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue ICML 2024
LASER: LLM Agent with State-Space Exploration for Web Navigation NeurIPS 2023
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Agent for Automated Web Navigation 2024

General Tool-Use Agents

Paper Year
GTA: A Benchmark for General Tool Agents NeurIPS 2024
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls 2024
Executable Code Actions Elicit Better LLM Agents ICML 2024
RestGPT: Connecting Large Language Models with Real-World RESTful APIs 2023
Search-o1: Agentic Search-Enhanced Large Reasoning Models 2025
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning 2025
ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints 2024
R-Judge: Benchmarking Safety-Critical Decision Making for LLM Agents 2024

License

This repository is licensed under the MIT License.


Star History

Star History Chart
