Awesome Agentic Reasoning Papers

This repository organizes research by thematic areas that integrate reasoning with action, including planning, tool use, search, self-evolution through memory and feedback, multi-agent systems, and real-world applications and benchmarks.

📄 Based on the survey: Agentic Reasoning for Large Language Models: A Survey

🔔 News

[01/21/26] 🚀 We have released a comprehensive survey on Agentic Reasoning for Large Language Models! The paper is now available on arXiv and HuggingFace. We welcome contributions from the community to help expand and improve our survey 🤗!

📋 Table of Contents

🔔 News
📋 Table of Contents
🌟 Introduction
🤝 Contributing
📝 Citation
🏗️ Foundational Agentic Reasoning
🧬 Self-evolving Agentic Reasoning
👥 Collective Multi-agent Reasoning
🎨 Applications
📊 Benchmarks
- ⚙️ Core Mechanisms of Agentic Reasoning
- 🎯 Applications of Agentic Reasoning

🌟 Introduction

Agentic reasoning bridges thought and action through autonomous agents that reason, act, and learn via continual interaction with their environments. The goal is to enhance agent capabilities by grounding reasoning in action.

We organize agentic reasoning into three layers, each corresponding to a distinct reasoning paradigm under different environmental dynamics:

🔹 Foundational Reasoning. Core single-agent abilities (planning, tool use, search) in stable environments

🔹 Self-Evolving Reasoning. Adaptation through feedback, memory, and learning in dynamic settings

🔹 Collective Reasoning. Multi-agent coordination, role specialization, and collaborative intelligence

Across these layers, we further identify complementary reasoning paradigms defined by their optimization settings.

🔸 In-Context Reasoning. Test-time scaling through structured orchestration and adaptive workflows

🔸 Post-Training Reasoning. Behavior optimization via RL and supervised fine-tuning
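The two axes above (layer and optimization setting) can be pictured as a small data model. The sketch below is only an illustration of how the lists in this repository are organized: the `Layer`, `Setting`, `Entry`, and `group_by_cell` names are hypothetical and do not come from the survey or from any tooling in this repository. The three example placements follow the sections in which those papers are listed below.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the introduction."""
    FOUNDATIONAL = "Foundational Reasoning"
    SELF_EVOLVING = "Self-Evolving Reasoning"
    COLLECTIVE = "Collective Reasoning"


class Setting(Enum):
    """The two optimization settings that cut across the layers."""
    IN_CONTEXT = "In-Context Reasoning"
    POST_TRAINING = "Post-Training Reasoning"


@dataclass(frozen=True)
class Entry:
    """One listed paper, tagged along both axes (hypothetical schema)."""
    title: str
    layer: Layer
    setting: Setting


def group_by_cell(entries):
    """Group paper titles by their (layer, setting) cell."""
    cells = {}
    for entry in entries:
        cells.setdefault((entry.layer, entry.setting), []).append(entry.title)
    return cells


# Example placements, taken from the sections where these papers appear below.
entries = [
    Entry("ReAct: Synergizing Reasoning and Acting in Language Models",
          Layer.FOUNDATIONAL, Setting.IN_CONTEXT),
    Entry("Search-R1: Training LLMs to Reason and Leverage Search Engines "
          "with Reinforcement Learning",
          Layer.FOUNDATIONAL, Setting.POST_TRAINING),
    Entry("MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework",
          Layer.COLLECTIVE, Setting.IN_CONTEXT),
]

cells = group_by_cell(entries)
for (layer, setting), titles in cells.items():
    print(f"{layer.value} / {setting.value}: {len(titles)} paper(s)")
```

A paper can legitimately appear in more than one cell (several entries below are listed under multiple headings), so a fuller model would allow multiple tags per paper.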

🤝 Contributing

This collection is an ongoing effort. We are actively expanding and refining its coverage, and welcome contributions from the community. You can:

Submit a pull request to add papers or resources
Open an issue to suggest additional papers or resources
Email us at twei10@illinois.edu, twli@illinois.edu, liu326@illinois.edu

We regularly update the repository to include new research.

📝 Citation

If you find this repository or paper useful, please consider citing the survey paper:

@article{wei2026agentic,
  title={Agentic Reasoning for Large Language Models},
  author={Wei, Tianxin and Li, Ting-Wei and Liu, Zhining and Ning, Xuying and Yang, Ze and Zou, Jiaru and Zeng, Zhichen and Qiu, Ruizhong and Lin, Xiao and Fu, Dongqi and others},
  journal={arXiv preprint arXiv:2601.12538},
  year={2026}
}

🏗️ Foundational Agentic Reasoning

🗺️ Planning Reasoning

In-context Planning

Workflow Design

Paper	Year
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency	2023
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change	NeurIPS 2023 DB Track
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models	2023
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models	2024
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models	ICLR 2023
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models	ACL 2023
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models	ICML 2024
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face	2023
Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents	2023
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation	2024
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks	2025
CodePlan: Repository-level Coding using LLMs and Planning	FSE 2024
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
Mind2Web: Towards a Generalist Agent for the Web	NeurIPS 2023
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents	2024
Executable Code Actions Elicit Better LLM Agents	ICML 2024
Gorilla: Large Language Model Connected with Massive APIs	2023
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents	ACL 2024
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing	2025
Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection Agents	2025
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents	2025
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent	2023
Self-Planning Code Generation with Large Language Models	TOSEM 2023
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action	CoRL 2022

Tree Search / Algorithm Simulation

Paper	Year
Tree of Thoughts: Deliberate Problem Solving with Large Language Models	NeurIPS 2023
Tree Search for Language Model Agents	2024
Tree-Planner: Efficient Planning with Large Language Models	ICLR 2024
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning	2024
LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning	2024
Multimodal Chain-of-Thought Reasoning in Language Models	2023
Reasoning with Language Model is Planning with World Model	NeurIPS 2023
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents	2024
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning	2024
Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning	2023
Large Language Models as Tool Makers	ICLR 2024
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation	2023
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training	2023
Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space	2025
Self-Evaluation Guided Beam Search for Reasoning	NeurIPS 2023
PathFinder: Multimodal Multi-Agent Medical Diagnosis Framework	2025
Discriminator-Guided Embodied Planning for LLM Agent	ICLR 2025
Stream of Search (SoS): Learning to Search in Language	2024
System-1.x: Learning to Balance Fast and Slow Planning with Language Models	2024
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems	2024
Intelligent Virtual Assistants with LLM-based Process Automation	2023
Agent S: An Open Agentic Framework that Uses Computers Like a Human	2024
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking	2025
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling	ACL 2025
Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution	2025
Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning	2025
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement	ICLR 2025
BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs	2024
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	CoRL 2022
Inner Monologue: Embodied Reasoning through Planning with Language Models	CoRL 2022

Process Formalization

Paper	Year
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning	NeurIPS 2023
Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models	NeurIPS 2024
Thought of Search: Planning with Language Models Through The Lens of Efficiency	NeurIPS 2024
CodePlan: Repository-level Coding using LLMs and Planning	FSE 2024
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming	2024
From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle	2024

Decoupling / Decomposition

Paper	Year
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models	NeurIPS 2023
DiffuserLite: Towards Real-time Diffusion Planning	2024
Goal-Space Planning with Subgoal Models	JMLR 2024
Agent-Oriented Planning in Multi-Agent Systems	2024
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models	2023
RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning	ICLR 2025
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking	2025
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning	2024
Beyond Autoregression: Discrete Diffusion for Complex Reasoning	2024
PlanAgent: A Multi-modal Large Language Agent for Vehicle Motion Planning	2024
LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments	2024

External Aid / Tool Use

Paper	Year
Plan-on-Graph: Self-Correcting Adaptive Planning on Knowledge Graphs	NeurIPS 2024
Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification	2025
TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching	2025
FlexPlanner: Flexible 3D Floorplanning via Deep Reinforcement Learning in Hybrid Action Space with Multi-Modality Representation	NeurIPS 2024
Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following	NeurIPS 2024
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent	2024
RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking	2025
Reasoning with Language Model is Planning with World Model	NeurIPS 2023
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning	NeurIPS 2023
Agent Planning with World Knowledge Model	NeurIPS 2024
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with Next-Patch Prediction	NeurIPS 2024
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning	2024
FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model	2024
Continual Reinforcement Learning by Planning with Online World Models	2025
AdaWM: Adaptive World Model based Planning for Autonomous Driving	2025
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face	2023
Tool-Planner: Task Planning with Clusters across Multiple Tools	2024
RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning	ICLR 2025

Post-training Planning

Paper	Year
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens	2025
Rational Decision-Making Agent with Internalized Utility Judgment	2023
Scaling Autonomous Agents via Automatic Reward Modeling	2025
Strategic Planning: A Top-Down Approach to Option Generation	2025
Non-myopic Generation of Language Models for Reasoning and Planning	2024
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning	2025
Generalizable Motion Planning via Operator Learning	2024
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration	2025
Latent Diffusion Planning for Imitation Learning	2025
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models	ICLR 2023
ContraDiff: Planning Towards High Return States via Contrastive Learning	ICLR 2025
Amortized Planning with Large-Scale Transformers: A Case Study on Chess	NeurIPS 2024
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models	2023
A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks	2025

🛠️ Tool-Use Optimization

In-Context Tool-Integration

Interleaving Reasoning and Tool Use

Paper	Year
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	NeurIPS 2022
ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models	EMNLP 2023
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting	ACL 2023
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions	ACL 2023
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models	2023

Optimizing Context for Tool Interaction

Paper	Year
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models	2023
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction	NAACL 2025
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution	EACL 2024
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning	NeurIPS 2024

Post-training Tool-Integration

Bootstrapping of Tool Use via SFT

Paper	Year
Toolformer: Language Models Can Teach Themselves to Use Tools	NeurIPS 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs	ICLR 2024
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases	2023
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	NeurIPS 2023
RestGPT: Connecting Large Language Models with Real-World RESTful APIs	2023
ADaPT: As-Needed Decomposition and Planning with Language Models	2023
Agent Lumos: Unified and Modular Training for Open-Source Language Agents	2023
Learning to Use Tools via Cooperative and Interactive Agents	2024
Understanding the Effects of RLHF on LLM Generalisation and Diversity	2023
Preserving Diversity in Supervised Fine-Tuning of Large Language Models	2024
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification	EMNLP 2024
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning	2025
iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use	2025
START: Self-taught Reasoner with Tools	2025

Mastery of Tool Use via RL

Paper	Year
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution	2025
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement	ICLR 2025
ToolRL: Reward is All Tool Learning Needs	2025
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents	2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning	2025
AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning	2025
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning	2025
Agentic Reinforced Policy Optimization	2025
Agentic Entropy-Balanced Policy Optimization	2025
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning	2025
DeepAgent: A General Reasoning Agent with Scalable Toolsets	2025
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning	2025
Demystifying Reinforcement Learning in Agentic Reasoning	2025
Reinforcement Pre-Training	2025
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs	2025
ZeroSearch: Incentivize the Search Capability of LLMs Without Searching	2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs	2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning and Next Generation Agentic Capabilities	2025
Kimi k2: Open Agentic Intelligence	2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models	2025
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning	2025

Orchestration-based Tool-Integration

Agentic Pipelines for Tool Orchestration

Paper	Year
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback	2025
Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning	KDD 2025
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning	2025
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models	2025
PyVision: Agentic Vision with Dynamic Tooling	2025
Learning to Use Tools via Cooperative and Interactive Agents	2024
El Agente: An Autonomous Agent for Quantum Chemistry	2025

Tool Representations for Orchestration

Paper	Year
ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks	ACL (Findings) 2025
T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search	2025
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search	2023
ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval	COLING 2024

🔍 Agentic Search

In-Context Search

Interleaving Reasoning and Search

Paper	Year
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
Measuring and Narrowing the Compositionality Gap in Language Models	2022
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions	2022
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection	NeurIPS Workshop 2023
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-Adaptive Planning Agent	2024
DeepRAG: Thinking to Retrieve Step by Step for Large Language Models	2025
MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains	NeurIPS Workshop 2025

Structure-Enhanced Search

Paper	Year
Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation	2025
MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains	NeurIPS Workshop 2025
GeAR: Graph-Enhanced Agent for Retrieval-Augmented Generation	2024
Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection	2025

Post-Training Search

SFT-Based Agentic Search

Paper	Year
Toolformer: Language Models Can Teach Themselves to Use Tools	NeurIPS 2023
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning	2024
RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation through Self-Alignment	EMNLP (Findings) 2024
RAFT: Adapting Language Model to Domain Specific RAG	2024
Search-o1: Agentic Search-Enhanced Large Reasoning Models	2025
RA-DIT: Retrieval-Augmented Dual Instruction Tuning	ICLR 2024
SFR-RAG: Towards Contextually Faithful LLMs	2024

RL-Based Agentic Search

Paper	Year
WebGPT: Browser-assisted Question-Answering with Human Feedback	2021
RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning	2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning	2025
KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering	2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-World Environments	2025
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning	2025
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding	2025

🧬 Self-evolving Agentic Reasoning

🔄 Agentic Feedback Mechanisms

Reflective Feedback

Paper	Year
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback	NeurIPS 2023
Enable Language Models to Implicitly Learn Self-Improvement From Data	ICLR 2024
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve	TMLR 2025
Tree of Thoughts: Deliberate Problem Solving with Large Language Models	NeurIPS 2023
Graph of Thoughts: Solving Elaborate Problems with Large Language Models	AAAI 2024
Zero-Shot Verification-Guided Chain of Thoughts	2025
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
WebGPT: Browser-assisted Question-Answering with Human Feedback	2021
MemGPT: Towards LLMs as Operating Systems	2023
Voyager: An Open-Ended Embodied Agent with Large Language Models	2023

Parametric Adaptation

Paper	Year
AgentTuning: Enabling Generalized Agent Abilities for LLMs	2023
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent	2023
Re-ReST: Reflection-Reinforced Self-Training for Language Agents	2024
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes	2023
Deep Reinforcement Learning from Human Preferences	NeurIPS 2017
Direct Preference Optimization: Your Language Model is Secretly a Reward Model	NeurIPS 2023
Constitutional AI: Harmlessness from AI Feedback	2022
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection	ACL (Findings) 2025

Validator-Driven Feedback

Paper	Year
ReZero: Enhancing LLM search ability by trying one-more-time	2025
Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback	2025
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning	2022
LEVER: Learning to Verify Language-to-Code Generation with Execution	ICML 2023
SWE-bench: Can Language Models Resolve Real-world Github Issues?	ICLR 2024
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	CoRL 2022
PaLM-E: An Embodied Multimodal Language Model	ICML 2023
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning	2025

🧠 Agentic Memory

Agentic Use of Flat Memory

Factual Memory

Paper	Year
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks	NeurIPS 2020
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection	ICLR 2024
MemoryBank: Enhancing Large Language Models with Long-Term Memory	2023
LlamaIndex	2022
MemGPT: Towards LLMs as Operating Systems	2023
RET-LLM: Towards a General Read-Write Memory for Large Language Models	2023
SCM: Enhancing Large Language Model with Self-Controlled Memory Framework	2023
Evaluating Very Long-Term Conversational Memory of LLM Agents	2024
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory	2024
SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals	NAACL 2025
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design	2023
A-MEM: Agentic Memory for LLM Agents	2025
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents	2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory	2025
MIRIX: Multi-Agent Memory System for LLM-Based Agents	2025
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models	2025
LightMem: Lightweight and Efficient Memory-Augmented Generation	2025
Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science	2025

Experience Memory

Paper	Year
Agent Workflow Memory	2024
Sleep-time Compute: Beyond Inference Scaling at Test-time	2025
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory	2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models	2025
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory	2025
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory	2025

Structured Use of Memory

Paper	Year
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph	2024
From Local to Global: A Graph RAG Approach to Query-Focused Summarization	2024
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory	2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory	2025
From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs	2024
AutoFlow: Automated Workflow Generation for Large Language Model Agents	2024
AFlow: Automating Agentic Workflow Generation	ICLR 2025
FlowMind: Automatic Workflow Generation with LLMs	2024
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (M3-Agent)	2025
Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations	2025
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks	NeurIPS 2024
RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents	2024

Post-training Memory Control

Paper	Year
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent	2025
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents	2025
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning	2025
Mem-alpha: Learning Memory Construction via Reinforcement Learning	2025
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks	2025
Agent Learning via Early Experience	2025
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents	2026
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory	2026

🚀 Evolving Foundational Agentic Capabilities

Self-evolving Planning

Paper	Year
Self-Challenging Language Model Agents	2025
Self-Rewarding Language Models	ICML 2024
RLSR: Reinforcement Learning from Self Reward	2025
Self: Self-Evolution with Language Feedback	2023
Training Language Models to Self-Correct via Reinforcement Learning	2024
TextGrad: Automatic "Differentiation" via Text	2024
AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning	2025
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation	2024
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
AdaPlanner: Adaptive Planning from Feedback with Language Models	NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback	NeurIPS 2023
A Self-Improving Coding Agent	2025
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning	2025
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning	2025

Self-evolving Tool-use

Paper	Year
Large Language Models as Tool Makers	ICLR 2024
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets	ICLR 2024
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models	EMNLP 2023
LLM Agents Making Agent Tools	2025

Self-evolving Search for Memory Retrieval

Paper	Year
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks	NeurIPS 2020
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection	ICLR 2024
MemoryBank: Enhancing Large Language Models with Long-Term Memory	2023
MemGPT: Towards LLMs as Operating Systems	2023
Agent Workflow Memory	2024
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory	2025
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory	2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models	2025
AutoFlow: Automated Workflow Generation for Large Language Model Agents	2024
AFlow: Automating Agentic Workflow Generation	ICLR 2025
FlowMind: Automatic Workflow Generation with LLMs	2024
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph	2024
From Local to Global: A Graph RAG Approach to Query-Focused Summarization	2024
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory	2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory	2025
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models	2025
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks	2025

👥 Collective Multi-agent Reasoning

🤝 Collaboration and Division of Labor

In-context Collaboration

Manually Crafted Pipelines

Paper	Year
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving	2025
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework	ICLR 2024
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence	2025
Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration	2025
MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning	2025
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks	NeurIPS 2024
AutoAgents: A Framework for Automatic Agent Generation	IJCAI 2024
RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning	2025
SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents	2024
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding	2025

LLM-Driven Pipelines

Paper	Year
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML	2024
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks	2024
MAS-GPT: Training LLMs to build LLM-based multi-agent systems	2025
MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines	2025
Agent-Oriented Planning in Multi-Agent Systems	2024
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering	2025
Talk to Right Specialists: Routing and Planning in Multi-Agent System for Question Answering	2025

Theory-of-Mind-Augmented Collaboration

Paper	Year
Theory of Mind for Multi-Agent Collaboration via Large Language Models	2023
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models	2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning	2024
How Large Language Models Encode Theory-of-Mind: A Study on Sparse Parameter Patterns	npj Artificial Intelligence 2025
Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection	2025
BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind	2025

Post-training Collaboration

Multi-agent Prompt Optimization

Paper	Year
AutoAgents: A Framework for Automatic Agent Generation	IJCAI 2024
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration	NAACL 2024
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines	2023
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies	2025
Automatic Prompt Optimization with "Gradient Descent" and Beam Search	2023

Graph-based Topology Generation

Paper	Year
Learning Multi-Agent Communication from Graph Modeling Perspective	2024
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks	2024
Graph Diffusion for Robust Multi-Agent Coordination	ICML 2025
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems	2024
Adaptive Graph Pruning for Multi-Agent Communication	2025
G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-Agent Systems	2025
AFlow: Automating Agentic Workflow Generation	ICLR 2025
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies	2025
Multi-Agent Architecture Search via Agentic Supernet	2025
DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-Agent System	2025
GPTSwarm: Language Agents as Optimizable Graphs	ICML 2024

Policy-based Topology Generation

Paper	Year
MASRouter: Learning to Route LLMs for Multi-Agent Systems	2025
RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory	2025
xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning	2025
Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration	2025
LLM Collaboration with Multi-Agent Reinforcement Learning	2025
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems	2025
Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy	2025
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation	IEEE RA-L 2025
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning	2025
Reflective Multi-Agent Collaboration Based on Large Language Models	NeurIPS 2024
Sirius: Self-Improving Multi-Agent Systems via Bootstrapped Reasoning	2025
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains	2025
M3HF: Multi-Agent Reinforcement Learning from Multi-Phase Human Feedback of Mixed Quality	2025
O-MAPL: Offline Multi-Agent Preference Learning	2025

🌱 Multi-Agent Memory and Evolution

From Single-Agent Evolution to Multi-Agent Evolution

Intra-test-time Evolution

Paper	Year
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback	NeurIPS 2023
AdaPlanner: Adaptive Planning from Feedback with Language Models	NeurIPS 2023
TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution	TiFA 2024
Self-Adapting Language Models	2025
TTRL: Test-Time Reinforcement Learning	2025
Ladder: Self-Improving LLMs through Recursive Problem Decomposition	2025

Inter-test-time Evolution

Paper	Year
SELF: Self-Evolution with Language Feedback	2023
STaR: Bootstrapping Reasoning with Reasoning	NeurIPS 2022
Reasoning Beyond Limits: Advances and Open Problems for LLMs	2025
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning	2025
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning	2025
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning	2024
Why do animals need shaping? A theory of task composition and curriculum learning	2024
SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities	Neurocomputing 2025
MemInsight: Autonomous Memory Augmentation for LLM Agents	2025
Agent Workflow Memory	2024

Multi-agent Evolution

Paper	Year
SELF: Self-Evolution with Language Feedback	2023
Training Language Models to Self-Correct via Reinforcement Learning	2024
TextGrad: Automatic "Differentiation" via Text	2024
REMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning	2025
Group-in-Group Policy Optimization for LLM Agent Training	2025
Agent Workflow Memory	2024
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models	2025
Multi-agent Design: Optimizing Agents with Better Prompts and Topologies	2025
AFlow: Automating Agentic Workflow Generation	ICLR 2025
Testing Advanced Driver Assistance Systems Using Multi-Objective Search and Neural Networks	ASE 2016
Latent Collaboration in Multi-Agent Systems	2025

Multi-agent Memory Management for Evolution

Paper	Year
G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems	2025
Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory	2025
LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning	2025
SEDM: Scalable Self-Evolving Distributed Memory for Agents	2025
Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control	2025
Memory Sharing for Large Language Model based Agents	2024
MIRIX: Multi-Agent Memory System for LLM-Based Agents	2025
LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation	2025
MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning	ALTA 2025
Lyfe Agents: Generative agents for low-cost real-time social interactions	2023
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving	2025

Training Multi-agent to Evolve

Paper	Year
Multi-Agent Evolve: LLM Self-Improve through Co-evolution	2025
CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards	2025
MARFT: Multi-Agent Reinforcement Fine-Tuning	2025
Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs	2025
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning	2025
MALT: Multi-Agent Learning from Trajectories	2025
MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning	2025
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques	2024
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety	2025

🎨 Applications

💻 Math Exploration & Vibe Coding Agents

Foundational Agentic Reasoning

Paper	Year
Advancing mathematics by guiding human intuition with AI	Nature 2021
Solving olympiad geometry without human demonstrations	Nature 2024
Mathematical discoveries from program search with large language models	Nature 2024
Mathematical Exploration and Discovery at Scale	2025
Advancing geometry with AI: Multi-agent generation of polytopes	2025
Towards Robust Mathematical Reasoning	EMNLP 2025
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules	ICLR 2024
Executable Code Actions Elicit Better LLM Agents	ICML 2024
Knowledge-Aware Code Generation with Large Language Models	ICPC 2024
CodePlan: Repository-level Coding using LLMs and Planning	FSE 2024
Multi-stage guided code generation for Large Language Models	Eng. App. AI 2025
CodeTree: Agent-Guided Tree Search for Code Generation with Large Language Models	2024
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning	2024
Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks	ACL 2025
CoRT: Code-integrated Reasoning within Thinking	2025
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal	2025
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search	NeurIPS 2024
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning	AAAI 2025
Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents	ICML 2025
An In-Context Learning Agent for Formal Theorem-Proving	COLM 2024
Formal Mathematical Reasoning: A New Frontier in AI	2024
Generative Modelling for Mathematical Discovery	2025
Toolformer: Language Models Can Teach Themselves to Use Tools	NeurIPS 2023
ToolCoder: Teach Code Generation Models to use API search tools	2023
ToolGen: Unified Tool Retrieval and Calling via Generation	ICLR 2025
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges	ACL 2024
ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation	ICSE 2025
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision	2025
RepoHyper: Better Context Retrieval is All You Need for Repository-Level Code Completion	2024
CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents	ICLR 2024
Optimizing Code Runtime Performance Through Context-Aware Retrieval-Augmented Generation	ICPC 2025
Knowledge Graph Based Repository-Level Code Generation	LLM4Code 2025
cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree	2025

Self-evolving Agentic Reasoning

Paper	Year
Evaluating Language Models for Mathematics through Interactions	PNAS 2024
CLCL: Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning	ACL 2023
Is Self-Repair a Silver Bullet for Code Generation?	2024
LeDeX: Learning to Debug with Execution Feedback	NeurIPS 2024
Self-Refine: Iterative Refinement with Self-Feedback	NeurIPS 2023
A Self-Iteration Code Generation Method Based on Large Language Models	ICPADS 2023
Teaching Large Language Models to Self-Debug	ICLR 2024
Self-Collaboration Code Generation via ChatGPT	TOSEM 2024
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation	2023
Cogito, Ergo Sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation	2025

Collective Multi-agent Reasoning

Paper	Year
AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation	2023
A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement	ASE 2024
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents	ICSE 2025
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization	2024
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving	2024
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing	2024
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks	2025
SEW: Self-Evolving Agentic Workflows for Automated Code Generation	2025
Self-Evolving Multi-Agent Collaboration Networks for Software Development	2024
Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement	2024
CodeCoR: An LLM-based Self-Reflective Multi-Agent Framework for Code Generation	2025
SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering	ICML 2025
Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation	2025

🔬 Scientific Discovery Agents

Foundational Agentic Reasoning

Paper	Year
ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning	Digital Discovery 2024
Agent-based learning of materials datasets from the scientific literature	Digital Discovery 2024
ReAct: Synergizing Reasoning and Acting in Language Models	ICLR 2023
Biomni: A General-Purpose Biomedical AI Agent	bioRxiv 2025
SciAgent: Tool-augmented Language Models for Scientific Reasoning	2024
ChemCrow: Augmenting large-language models with chemistry tools	2023
CACTUS: Chemistry Agent Connecting Tool-Usage to Science	ACS Omega 2024
ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving	2024
CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning	2025
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools	2025
AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning	Nature Communications 2025
LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation	2024
HoneyComb: A Flexible LLM-Based Agent System for Materials Science	2024
CRISPR-GPT for Agentic Automation of Gene-editing Experiments	2024
PharmAgents: Building a Virtual Pharma with Large Language Model Agents	2025
ORGANA: A robotic assistant for automated chemistry experimentation and characterization	Matter 2025
AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence	2024
Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis	2024
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery	2024
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis	bioRxiv 2024
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments	2024
DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction	2024
Accelerating Scientific Research Through a Multi-LLM Framework	2025
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	2025
Large Language Models are Zero Shot Hypothesis Proposers	2023
PaperQA: Retrieval-Augmented Generative Agent for Scientific Research	2023
Language agents achieve superhuman synthesis of scientific knowledge	2024

Self-evolving Agentic Reasoning

Paper	Year
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning	2025
Accelerated Inorganic Materials Design with Generative AI Agents	2025
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery	2024
ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback	2024
LLMatDesign: Autonomous Materials Discovery with Large Language Models	2024
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents	2025

Collective Multi-agent Reasoning

Paper	Year
ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning	Digital Discovery 2024
PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration	2025
AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence	2024
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis	bioRxiv 2024
Accelerating Scientific Research Through a Multi-LLM Framework	2025
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data	2024
The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation	bioRxiv 2024

🤖 Embodied Agents

Foundational Agentic Reasoning

Paper	Year
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	2022
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought	NeurIPS 2023
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents	ECCV 2024
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents	NeurIPS 2023
Robotic Control via Embodied Chain-of-Thought Reasoning	2024
Fast ECoT: Fast Embodied Chain-of-Thought for Vision-Language-Action Models	2025
Cosmos-Reason1: Physical Commonsense with Multimodal Chain of Thought Reasoning	2025
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models	2025
Emma-X: An Embodied Multimodal Action Model with Chain of Thought Reasoning	2024
Robot-R1: Reinforcement Learning Enhanced Large Vision-Language Models for Robotic Manipulation	2025
ManipLVM-R1: Learning to Reason for Robotic Manipulation via Reinforcement Learning	2025
Embodied-R: Emergent Spatial Reasoning in Robotics via Multi-Agent Reinforcement Learning	2025
VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation	2025
GSCE: A Prompt Framework for Enhanced Logical Reasoning in LLM-Based Drone Control	2025
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge	NeurIPS 2022
Physical AI Agents: Integrating Generative AI, Symbolic AI and Robotics	2025
Chat with the Environment: Interactive Multimodal Perception using Large Language Models	IROS 2023
An Embodied Generalist Agent in 3D World	ICML 2024
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models	2025
Gemini Robotics: Bringing AI to the Physical World	2025
Octopus: Embodied Vision-Language Programmer from Environmental Feedback	ECCV 2024
CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration	2024
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models	ICRA 2025
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception	CVPR 2024
LLM-Planner: Few-Shot Grounded High-Level Planning for Embodied Agents with Large Language Models	ICCV 2023
L3MVN: Leveraging Large Language Models for Visual Target Navigation	2023
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments	ICAPS 2023
SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning	CoRL 2023
ReMEmbR: Building and Reasoning with Long-Horizon Spatio-Temporal Memory for Embodied Agents	2025
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval-Augmented Generation	NeurIPS Workshop AFM 2024
Retrieval-Augmented Embodied Agents	2024
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents	2024

Self-evolving Agentic Reasoning

Paper	Year
LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics	2024
Optimus-1: Hybrid Multimodal Memory Empowered Agents for Long-Horizon Tasks in Minecraft	2024
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models	EMNLP 2023
Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation	2025
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents	2024
Ella: Embodied Social Agents with Lifelong Memory	2025
Chat with the Environment: Interactive Multimodal Perception using Large Language Models	IROS 2023
From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation	2025
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners	CoRL 2023
Octopus: Embodied Vision-Language Programmer from Environmental Feedback	ECCV 2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning	2024
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration	2024
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM	2025
Voyager: An Open-Ended Embodied Agent with Large Language Models	2023

Collective Multi-agent Reasoning

Paper	Year
Smart-LLM: Smart Multi-Agent Robot Task Planning with Large Language Models	2024
CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration	2024
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models	ICRA 2025
Theory of mind for multi-agent collaboration via large language models	2023
How large language models encode theory-of-mind: a study on sparse parameter patterns	npj Artificial Intelligence 2025
Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models	2024
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning	2024
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM	2025
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation	2025
VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation	2025
RoCo: Dialectic Multi-Robot Collaboration with Large Language Models	2024

🏥 Healthcare & Medicine Agents

Foundational Agentic Reasoning

Paper	Year
Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology	Nature Medicine 2024
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records	2024
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology	2025
MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow	2025
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility	2025
ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning	2024
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making	2025
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools	2025
AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning	Nature Communications 2025
Large language model agents can use tools to perform clinical calculations	NPJ Digital Medicine 2025
MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling	2024
MMedAgent: Learning to Use Medical Tools with Multi-modal Agents	2024
VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis	2024
Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning	2024
Adaptive Reasoning and Acting in Medical Language Agents	2024
MedRAX: Medical Reasoning Agent for Chest X-ray	2025
Conversational Health Agents: A Personalized LLM-Powered Agent Framework	2023
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science	2025
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education	2024
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions	MICCAI 2025
RAG-Enhanced Collaborative LLM Agents for Drug Discovery	2025
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs	2025

Self-evolving Agentic Reasoning

Paper	Year
Epidemic Modeling with Generative Agents	2023
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions	MICCAI 2025
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records	2024
LLMs Can Simulate Standardized Patients via Agent Coevolution	2024
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education	2024
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility	2025
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making	2025
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science	2025
MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling	2025
Large language model agents can use tools to perform clinical calculations	NPJ Digital Medicine 2025

Collective Multi-agent Reasoning

Paper	Year
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making	2024
DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue	2025
Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis	2024
ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning	2024
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology	2025
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions	MICCAI 2025
LLMs Can Simulate Standardized Patients via Agent Coevolution	2024
DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making	2025
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning	2024
RAG-Enhanced Collaborative LLM Agents for Drug Discovery	2025
GMAI-VL-R1: Harnessing Reinforcement Learning for Multi-Modal Medical Reasoning	2025

🌐 Autonomous Web Exploration & Research Agents

Foundational Agentic Reasoning

Paper	Year
Agent Laboratory: Using LLM Agents as Research Assistants	2025
GPT Researcher	2023
Accelerating Scientific Research Through a Multi-LLM Framework	2025
InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification	2025
WebGPT: Browser-assisted question-answering with human feedback	2021
Language Models are Few-Shot Learners	NeurIPS 2020
GPT-4V(ision) is a Generalist Web Agent, if Grounded	ICML 2024
AutoWebGLM: A Large Language Model-based Web Navigating Agent	2024
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents	2024
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning	2024
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning	2025
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning	2024
DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning	2025
EvolveSearch: An Iterative Self-Evolving Search Agent	2025
WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model	2025
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL	ICLR 2025
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents	2024
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection	2025
ZeroSearch: Incentivize the Search Capability of LLMs Without Searching	2025
StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization	2025
How to Train Your LLM Web Agent: A Statistical Diagnosis	2025
Agent S: An Open Agentic Framework that Uses Computers Like a Human	2024
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection	2025
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation	2024
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC	2025
UItron: Foundational GUI Agent with Advanced Perception and Planning	2025
ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay	2025
ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents	2025
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning	2025
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents	2025
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners	2025
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning	2025
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration	EMNLP 2025
Learning GUI Grounding with Spatial Reasoning from Visual Feedback	2025
GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning	2025
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding	2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost	2025
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning	2025
AutoGLM: Autonomous Foundation Agents for GUIs	2024
Mobile-Agent-v3: Fundamental Agents for GUI Automation	2025
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models	ACL 2024
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions	2025
WALT: Web Agents that Learn Tools	2025
WebDancer: Towards Autonomous Information Seeking Agency	2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization	2025
AutoDroid: LLM-powered Task Automation in Android	MobiCom 2024
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices	2024
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant	2024
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement	2024
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning	2024
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	2024
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents	2024
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools	2025
MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents	2024
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback	2025
The AI Scientist: Fully Automated Open-Ended Scientific Discovery	2024
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	2025
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents	2025
WebSailor: Navigating Super-human Reasoning for Web Agent	2025
RaDA: Retrieval-augmented Web Agent Planning with LLMs	2024
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control	ICLR 2024
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark	2025
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation	2023
Retrieval-augmented GUI Agents with Generative Guidelines	2025
WebThinker: Empowering Large Reasoning Models with Deep Research Capability	2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments	2025
PaperQA: Retrieval-Augmented Generative Agent for Scientific Research	2023
Language agents achieve superhuman synthesis of scientific knowledge	2024
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents	2024
Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination	2024

Self-evolving Agentic Reasoning

Paper	Year
Agent Workflow Memory	2024
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought	2024
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions	2025
AutoWebGLM: A Large Language Model-based Web Navigating Agent	2024
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents	2024
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications	2025
WebDancer: Towards Autonomous Information Seeking Agency	2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization	2025
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation	2023
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation	2024
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks	2025
Agent Laboratory: Using LLM Agents as Research Assistants	2025
GPT Researcher	2023
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents	2024
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	2025
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents	2024
Reflection-Based Memory for Web Navigation Agents	2025
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems	2024
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution	2025
WINELL: Wikipedia Never-Ending Updating with LLM Agents	2025
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection	2025
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior	2025
History-Aware Reasoning for GUI Agents	2025
MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation	2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection	2025
CycleResearcher: Improving Automated Research via Automated Review	2024
MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents	2024
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback	2025
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments	2025

Collective Multi-agent Reasoning

Paper	Year
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration	2024
WINELL: Wikipedia Never-Ending Updating with LLM Agents	2025
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution	2025
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents	2024
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems	2024
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks	2025
Agentic Web: Weaving the Next Web with AI Agents	2025
CoLA: Collaborative Low-Rank Adaptation	2025
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration	ACL 2024
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks	2025
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation	2025
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices	2024
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use	2025
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC	2025
AgentRxiv: Towards Collaborative Autonomous Research	2025
Accelerating Scientific Research Through a Multi-LLM Framework	2025
Large Language Models are Zero-Shot Reasoners	NeurIPS 2022
Emergent autonomous scientific research capabilities of large language models	Nature 2023
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data	2024

📊 Benchmarks

⚙️ Core Mechanisms of Agentic Reasoning

Tool Use

Single-Turn Tool Use

Paper	Year
ToolQA: A Dataset for LLM Question Answering with External Tools	NeurIPS 2023
Gorilla: Large Language Model Connected with Massive APIs	2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs	ICLR 2024
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use	ICLR 2024
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step	ACL 2024
GTA: A Benchmark for General Tool Agents	NeurIPS 2024
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models	2025

Multi-Turn Tool Use

Paper	Year
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases	2023
On the Tool Manipulation Capability of Open-source Large Language Models	2023
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs	EMNLP 2023
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios	ACL 2024
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models	ICLR 2025

Search

Memory and Planning

Long-Horizon Episodic Memory

Paper	Year
PerLTQA: A Persona-based Long-term Memory Benchmark for RAG	2024
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context LLMs	2024
Multi-IF: A Benchmark for Multi-turn Instruction Following	2024
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs	2025
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models	2025
StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns	2025
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation	2025

Multi-session Recall

Paper	Year
Evaluating Very Long-Term Conversational Memory of LLM Agents	2024
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants	2024
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory	2024
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation	2025
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions	2025
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents	2026
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory	2025

Planning and Feedback

Paper	Year
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning	ICLR 2021
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change	NeurIPS 2022
ACPBench: Reasoning about Action, Change, and Planning	2024
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation	ACL 2025
REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks	2025
TravelPlanner: A Benchmark for Real-World Planning with Language Agents	ICML 2024
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents	2024
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models	2025

Multi-Agent System

Game-based reinforcement learning evaluation

Paper	Year
MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence	AAAI 2018
Pommerman: A Multi-Agent Playground	2018
The StarCraft Multi-Agent Challenge	NeurIPS 2019
MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs	2024
TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft	2024
Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot	ICML 2021
BenchMARL: Benchmarking Multi-Agent Reinforcement Learning	2023
Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence	AAAI 2020

Simulation-centric real-world assessment

Paper	Year
SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving	CoRL 2020
Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world	NeurIPS 2022
A Versatile Multi-Agent Reinforcement Learning Benchmark for Inventory Management	2023
IMP-MARL: a Suite of Environments for Infrastructure Management Planning with Multi-Agent Reinforcement Learning	NeurIPS 2023
POGEMA: Partially Observable Grid Environment for Multiple Agents	arXiv 2022
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning	NeurIPS 2024
REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks	2025

Language, Communication, and Social Reasoning

Paper	Year
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models	2023
AvalonBench: Evaluating LLMs Playing the Game of Avalon	2023
Welfare Diplomacy: Benchmarking Language Model Cooperation	2023
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration	EMNLP 2024
BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems	2024
COMMA: A Benchmark for Inter-Agent Communication in Multi-Agent Systems	2024
IntellAgent: A Benchmark for Evaluating Conversational Agents in Realistic Scenarios	2025
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents	2025

🎯 Applications of Agentic Reasoning

Embodied Agents

Paper	Year
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks	2025
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games	NeurIPS 2024
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning	ICLR 2021
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment	2024
MindAgent: Emergent Gaming Interaction	2023
Playing repeated games with Large Language Models	2023
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	NeurIPS 2024

Scientific Discovery Agents

Paper	Year
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents	NeurIPS 2024
ScienceWorld: Is your Agent Smarter than a 5th Grader?	EMNLP 2022
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery	NeurIPS 2024
The AI Scientist: Fully Automated Open-Ended Scientific Discovery	2024
LAB-Bench: Measuring Capabilities of Language Models for Biology Research	2024
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation	2023

Autonomous Research Agents

Paper	Year
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?	ICML 2024
WorkArena++: Towards Agents that Act Like Employees	2024
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation	2024
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change	NeurIPS 2022
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents	2024
ACPBench: Reasoning about Action, Change, and Planning	2024
TRAIL: Trace Reasoning and Agentic Issue Localization	2025
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization	NeurIPS 2023
Agent-as-a-Judge: Evaluate Agents with Agents	2024
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation	2025

Medical and Clinical Agents

Paper	Year
AgentClinic: a multimodal agent benchmark for clinical environments	NeurIPS 2024
MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents	NEJM AI 2025
EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records	2024
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning	2023
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning	2024

Web Agents

Paper	Year
WebArena: A Realistic Web Environment for Building Autonomous Agents	ICLR 2024
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks	ACL 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models	ACL 2024
Mind2Web: Towards a Generalist Agent for the Web	NeurIPS 2023
WebCanvas: Benchmarking Web Agents in Online Environments	NeurIPS 2024
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue	ICML 2024
LASER: LLM Agent with State-Space Exploration for Web Navigation	NeurIPS 2023
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Agent for Automated Web Navigation	2024

General Tool-Use Agents

Paper	Year
GTA: A Benchmark for General Tool Agents	NeurIPS 2024
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls	2024
Executable Code Actions Elicit Better LLM Agents	ICML 2024
RestGPT: Connecting Large Language Models with Real-World RESTful APIs	2023
Search-o1: Agentic Search-Enhanced Large Reasoning Models	2025
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning	2025
ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints	2024
R-Judge: Benchmarking Safety-Critical Decision Making for LLM Agents	2024

License

This repository is licensed under the MIT License.
