We improved Cline, a popular open-source coding agent, by +15% accuracy on SWE-Bench without retraining the LLM, changing any tools, or modifying the architecture. How? All we did was optimize its ruleset in ./clinerules, a user-defined file where developers add custom instructions to the system prompt, just like .cursor/rules in Cursor or CLAUDE.md in Claude Code. Using our algorithm, Prompt Learning, we automatically refined these rules across a feedback loop powered by GPT-5.

What is Prompt Learning? It's an optimization algorithm that improves prompts, not models. Inspired by RL, it follows an action → evaluation → improvement loop, but instead of gradients it uses Meta Prompting: feeding a prompt into an LLM and asking it to make that prompt better. We add a key twist: LLM-generated feedback explaining why outputs were right or wrong, giving the optimizer a richer signal for refining future prompts. The result: measurable gains in accuracy, zero retraining. You can use it in Arize AX or the Prompt Learning SDK.

Here's how we brought GPT-4.1's performance on SWE-Bench Lite to near state-of-the-art levels, matching Claude Sonnet 4-5, purely through ruleset optimization. Last time, we optimized Plan Mode; this time, we optimized Act Mode, giving Cline full permissions to read, write, and edit code files, and testing its accuracy on SWE-Bench Lite.

Our optimization loop:
1️⃣ Run Cline on SWE-Bench Lite (150 train, 150 test) and record its train/test accuracy.
2️⃣ Collect the patches it produces and verify correctness via unit tests.
3️⃣ Use GPT-5 to explain why each fix succeeded or failed on the training set.
4️⃣ Feed those training evals, along with Cline's system prompt and current ruleset, into a Meta-Prompt LLM to generate an improved ruleset.
5️⃣ Update ./clinerules, re-run, and repeat.
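The five steps above can be sketched as a single optimization step. Everything here is a hypothetical stub: `run_agent_on_task`, `unit_tests_pass`, and `call_llm` stand in for Cline, the SWE-Bench test harness, and GPT-5 respectively, so only the control flow of the loop is shown:

```python
def call_llm(prompt: str) -> str:
    """Stub for the GPT-5 calls in the loop; returns canned text."""
    return "- Always run the repo's unit tests before submitting a patch."

def run_agent_on_task(task: dict, ruleset: str) -> str:
    """Stub: run the coding agent in Act Mode and return the produced patch."""
    return "patch for " + task["id"]

def unit_tests_pass(task: dict, patch: str) -> bool:
    """Stub: apply the patch and run the task's unit tests."""
    return "fix" in patch  # stand-in for real verification

def prompt_learning_step(train_tasks, ruleset, system_prompt):
    # Steps 1-3: run the agent, verify each patch, and ask an LLM
    # to explain why the fix succeeded or failed.
    evals = []
    for task in train_tasks:
        patch = run_agent_on_task(task, ruleset)
        passed = unit_tests_pass(task, patch)
        explanation = call_llm(
            f"Task: {task['id']}\nPatch: {patch}\nPassed: {passed}\n"
            "Explain why this fix succeeded or failed."
        )
        evals.append({"task": task["id"], "passed": passed, "why": explanation})

    # Step 4: meta-prompt — feed the evals plus the current prompt/ruleset
    # back into an LLM and ask for an improved ruleset.
    meta_prompt = (
        f"System prompt:\n{system_prompt}\n\nCurrent ruleset:\n{ruleset}\n\n"
        f"Evaluations:\n{evals}\n\n"
        "Write an improved ruleset that addresses the failures."
    )
    # Step 5: the returned text becomes the new ./clinerules content.
    return call_llm(meta_prompt)

new_rules = prompt_learning_step(
    [{"id": "swe-bench-001"}],
    ruleset="(initial rules)",
    system_prompt="(system prompt)",
)
print(new_rules)
```

In a real run this function would be called once per loop iteration, with train/test accuracy recorded before and after each ruleset update.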
The results: Sonnet 4-5 saw a modest +6% training and +0.7% test gain (already near saturation), while GPT-4.1 improved 14-15% on both, reaching near-Sonnet performance (34% vs 36%) through ruleset optimization alone. These results highlight how prompt optimization by itself can deliver system-level gains: no retraining, no new tools, no architecture changes. In just two optimization loops, Prompt Learning closed much of the gap between GPT-4.1 and Sonnet-level performance, showing how fast and data-efficient instruction-level optimization can be. And of course, we used Arize Phoenix to run LLM evals on Cline's code and track experiments across optimization runs. Code here: https://lnkd.in/eDejFy6N
Improving LLM Coding Accuracy with Code Intelligence
Explore top LinkedIn content from expert professionals.
Summary
Improving LLM coding accuracy with code intelligence means using advanced techniques and feedback systems to help AI models write, revise, and better understand code, much like a programmer would. This approach goes beyond simple text prediction, aiming to make language models more reliable, efficient, and capable in real-world coding tasks.
- Refine your prompts: Focus your instructions at the top of your prompt and use clear formatting to draw the model’s attention to critical rules or constraints.
- Incorporate feedback loops: Set up a system where the model reflects on its own output or receives automated feedback, allowing it to identify and fix mistakes in each coding attempt.
- Simulate code execution: Use models or workflows that evaluate code by running it or checking its behavior, which helps the AI reason more like a developer and produce more accurate results.
Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning, and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output.

Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

Here's code intended for task X: [previously generated code]
Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it.

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions.
And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
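The generate → critique → rewrite loop described above fits in a few lines of code. Here `llm` is a canned stub standing in for any real chat-model API call, so the control flow is runnable end to end:

```python
def llm(prompt: str) -> str:
    """Stub LLM: a real implementation would call a model API here."""
    if "constructive criticism" in prompt:
        return "The function does not handle the empty-list case."
    if "use the feedback" in prompt:
        return "def mean(xs):\n    return sum(xs) / len(xs) if xs else 0.0"
    return "def mean(xs):\n    return sum(xs) / len(xs)"

def reflect(task: str, rounds: int = 2) -> str:
    # First pass: generate code directly for task X.
    code = llm(f"Write code for this task: {task}")
    for _ in range(rounds):
        # Reflection prompt: ask the model to critique its own output.
        critique = llm(
            f"Here's code intended for task {task}:\n{code}\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        # Rewrite prompt: previous code + the critique, ask for a revision.
        code = llm(
            f"Task: {task}\nCode:\n{code}\nFeedback:\n{critique}\n"
            "Please use the feedback to rewrite the code."
        )
    return code

improved = reflect("compute the mean of a list")
print(improved)
```

With a real model, the critique would vary per round; the structure (criticize, then rewrite with the critique in context, then repeat) is the whole pattern.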
-
Meta just dropped a new kind of code model, and it's not just bigger. It's different. The new Code World Model (CWM), a 32B-parameter LLM for code generation, is not "just another code model."

What makes it different? CWM was trained not only on code, but on what code does at runtime. Most LLMs learn code like they learn prose: predict the next token. CWM learns code like developers do: by simulating its execution. This shift is critical because:
- When humans debug or write code, we think in terms of state changes, side effects, and what happens next.
- CWM learns from execution traces of Python functions and agentic behaviors in Dockerized Bash environments. It doesn't just guess the next line; it reasons like it's living inside the terminal.

This unlocks:
- Stronger reasoning in multi-step problems
- Simulation-based debugging
- More accurate code generation in real-world workflows
- Potential for autonomous "neural debuggers" that think in traces, not just tokens

On benchmarks, it's already competitive:
- 68.6% on LiveCodeBench v5
- 76% on AIME 2024
- 65.8% on SWE-bench Verified

And it's open weights. Meta is betting that world modeling + RL fine-tuning is the next frontier for coding LLMs, not just scale. Is this a glimpse of what post-token-prediction AI looks like? Get started with the links below:
- Tech Report: https://lnkd.in/eV7YirjC
- Model Weights: https://lnkd.in/e2CTzsxr
- On Huggingface: https://lnkd.in/e_S4R-P4
- Inference Code: https://lnkd.in/eVHeW8VV
-
I've been working on a massive prompt that extracts structured data from unstructured text. It's effectively a program, developed over the course of weeks, in plain English. Each instruction is precise. The output format is strict. The logic flows. It should Just Work™. And the model? Ignores large swaths of it. Not randomly, but consistently and stubbornly. This isn't a "program," it's a probability engine with auto-complete.

This is because LLMs don't "read" like we do, or execute prompts like a program does. They run everything through the attention mechanism, which mathematically weighs which tokens matter in relation to others. Technically speaking: each token is transformed into a query, key, and value vector. The model calculates dot products between the query vector and all key vectors to assign weights. Basically: "How relevant is this other token to what I'm doing right now?" Then it averages the values using those weights and moves on. No state. No memory. Just a rolling calculation over a sliding window of opaquely-chosen context.

It's kind of tragic, honestly. You build this beautifully precise setup, but because your detailed instructions are buried in the middle of a long prompt, or phrased too much like background noise, they get low scores. The model literally pays less attention to them. We thought we were vibe coding, but the real vibe coder was the LLM all along!

So how to fix it? Don't just write accurate instructions. Write ATTENTION-WORTHY ones.
- 🔁 Repeat key patterns. Repetition increases token relevance, especially when you're relying on specific phrasing to guide the model's output.
- 🔝 Push constraints to the top. Instructions buried deep in the prompt get lower attention scores. Front-load critical rules so they have a better chance of sticking.
- 🗂️ Use structure to force salience. Consistent headers, delimiters, and formatting cues help key sections stand out. Markdown, line breaks, and even ALL CAPS (sparingly) can help direct the model's focus to what actually matters.
- ✂️ Cut irrelevant context. The less junk in the prompt, the more likely your real instructions are to be noticed and followed.

You're not teaching a model. You're gaming a scoring function.
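The query·key scoring described above can be illustrated with a toy, dependency-free sketch. A key vector that aligns with the query captures nearly all of the attention weight, while an orthogonal key is effectively ignored; this is the mechanism behind buried instructions getting low scores:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention over toy vectors."""
    d = len(query)
    # Dot product of the query against every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors: the "rolling calculation".
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Key 0 points the same way as the query; key 1 is orthogonal to it.
out, weights = attention(query=[4.0, 0.0],
                         keys=[[4.0, 0.0], [0.0, 4.0]],
                         values=[[1.0, 0.0], [0.0, 1.0]])
print(weights)  # the aligned key takes nearly all the weight
```

Real transformers do this per head, per layer, over thousands of learned-dimension vectors, but the relevance-scoring intuition is the same.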
-
AI-generated code isn't just for weekend projects and vibe-coding. Airbnb just did an LLM-driven code migration that took just 6 weeks' worth of engineering time instead of the estimated 1.5 years.
- They kicked off the migration by breaking down the process into a series of automated validation and refactor steps. This state-machine-like approach moved each file through stages, letting the pipeline handle files while also keeping track of progress.
- They built in retry loops to improve success rates. Each time a file encountered an error, the system retried the validation and prompted the LLM with updated context and errors. This brute-force method allowed for the fixing of many simple-to-medium-complexity files.
- To handle more complex files, they significantly increased the context fed into the prompts. Each prompt drew from many related files and examples, so the LLM had the best chance of understanding the specific patterns and requirements needed for the migration.
- After reaching a 75% success rate, the team took a systematic approach to tackle the remaining 900 files. They introduced a system that commented on the migration status, allowing them to identify common pitfalls and refine their scripts accordingly.
- Using a "sample, tune, and sweep" strategy, they iteratively improved their scripts over four days, pushing the success rate from 75% to 97%. This let them significantly reduce the remaining workload while making sure that thorough testing coverage remained intact.

Link to the blog post from Airbnb: https://lnkd.in/gPmYFQAP #AI #LLMs #GenAI
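The retry-with-feedback stage described above reduces to a small loop. Airbnb's actual pipeline is not public at this level of detail, so `call_llm` and `validate` here are hypothetical stubs; the sketch only shows the control flow of re-prompting with the previous attempt plus its errors:

```python
def call_llm(prompt: str) -> str:
    """Stub migration model: produces a fixed file once it sees error feedback."""
    if "Errors from last attempt" in prompt:
        return "migrated code (v2)"
    return "migrated code (v1)"

def validate(code: str) -> list:
    """Stub validation stage (compilers, type checks, unit tests)."""
    return [] if code.endswith("(v2)") else ["TypeError on line 12"]

def migrate_file(source: str, max_retries: int = 3):
    prompt = f"Migrate this file:\n{source}"
    for _ in range(max_retries):
        attempt = call_llm(prompt)
        errors = validate(attempt)
        if not errors:
            return attempt  # passes validation: advance to the next stage
        # Retry with updated context: the previous attempt plus its errors.
        prompt = (
            f"Migrate this file:\n{source}\n"
            f"Previous attempt:\n{attempt}\n"
            "Errors from last attempt:\n" + "\n".join(errors)
        )
    return None  # give up: route to the bigger-context or manual path

print(migrate_file("legacy component source"))
```

Files that return None from this loop would be the ones the team handled by enlarging the prompt context or by manual intervention.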
-
91.3% accuracy vs 0%. Same model. Same task. The only difference: treating your prompt as code instead of text.

Recursive Language Models (RLMs) from MIT have completely changed how I think about handling long context in LLMs. Instead of cramming everything into the context window, RLMs treat your prompt as part of the environment that the model can programmatically explore.

The Core Insight
Once you hit the context limit in an LLM, you're done. But LLMs are trained for code as well, right? Why not use their coding skills for more than just coding?
1. Load your prompt as a variable in a REPL programming environment
2. Give the model tools to peek into, decompose, and recursively process parts of that variable
3. Let the model write code that calls itself on programmatic slices of the input

This enables the model to handle prompts that are literally 100x longer than its context window. The recursive element is the key insight here: the LLM can call itself (or a smaller subagent) for smaller tasks, allowing it to batch and concatenate results to answer complex questions.

Example
I tested it in Python (via DSPy): I input the full Alice in Wonderland book and asked it to give a sentiment analysis of the opening of each chapter. The LLM:
1. Explored the prompt (book) to see how the chapter headings were formatted
2. Implemented regex to split the full string into chunks before/after each chapter heading
3. Invoked the LLM sub-agent on each chunk to analyse the sentiment

Even when the full prompt can fit into the context, LLMs have notoriously suffered from context rot. This approach let each chunk be analysed separately by the sub-agent, with each call having no knowledge of the greater task.

Results
- RLMs successfully process inputs up to two orders of magnitude beyond model context windows
- On BrowseComp-Plus (6-11M tokens), RLM(GPT-5) achieved 91.3% accuracy vs 0% for the base model

RLMs aren't perfect.
The inference cost has high variance: median costs are comparable to base models, but some trajectories explode to 3x+ the cost due to long recursive chains. I also found, as the authors note in the appendix, that the models continue analysing well past the point where they have already found an answer. My hunch is that each LLM invocation always wants to do something, even if that something has already been done. It always wants to check its answer. Because of how they're trained, LLMs never just say "Okay, done!". The paper demonstrates that with better training (especially on-policy rollouts at scale), native RLMs could become far more efficient than current implementations suggest. I'll be extremely excited if this becomes a core part of model training, building custom models that excel at managing their prompt with code. Read the paper: https://lnkd.in/eq_xUJvJ
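A rough illustration of the recursive idea (not the paper's implementation): keep the large text as a variable, slice it with code, and invoke a sub-model on each slice. `sub_llm` is a canned stand-in for the real recursive LLM call, and the tiny "book" here is a placeholder for the full text:

```python
import re

def sub_llm(chunk: str) -> str:
    """Stub sub-agent: a real RLM would call the model on just this chunk."""
    return "positive" if "merrily" in chunk else "neutral"

def analyze_chapters(book: str) -> dict:
    # The root "program" explores the variable instead of stuffing the whole
    # text into context: split on chapter headings, recurse into each opening.
    parts = re.split(r"(CHAPTER [IVX]+)", book)
    results = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        opening = body.strip()[:200]          # only the chapter opening
        results[heading] = sub_llm(opening)   # recursive sub-call per slice
    return results

book = (
    "CHAPTER I Alice was beginning to get very tired...\n"
    "CHAPTER II The pool of tears...\n"
    "CHAPTER III They danced merrily in a ring...\n"
)
print(analyze_chapters(book))
```

Each sub-call sees only its own slice, so no single invocation ever needs the whole input in context; the root call just concatenates the per-slice results.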
-
A single CLAUDE.md file just hit 15K+ GitHub stars. No framework. No infra. No fine-tuning. Just… better instructions.

This idea is inspired by Andrej Karpathy, who pointed out something most people ignore: "LLMs don't fail randomly. They fail predictably."
- Overengineering simple tasks
- Making silent assumptions
- Editing things you didn't ask for
- Writing 10x more code than needed

If the mistakes are predictable → you can design against them. That's exactly what this CLAUDE.md does. It turns AI coding from "generate code" into "engineer behavior."

Here are the 4 core principles inside:
1️⃣ Think Before Coding → Force the model to state assumptions, surface ambiguity, and ask questions
2️⃣ Simplicity First → Minimum code. No speculative abstractions. No unnecessary flexibility
3️⃣ Surgical Changes → Only touch what's required. No "drive-by refactoring"
4️⃣ Goal-Driven Execution → Define success criteria (tests, checks) instead of vague instructions

This is the real shift happening right now: we're moving from "AI writes code" to "we design systems that make AI write good code." And the most powerful tools? Not always libraries. Sometimes… just well-crafted prompts.
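For illustration only, here is a hypothetical excerpt showing how four principles like these might read as concrete rules in a CLAUDE.md file (this is not the actual starred file, just a sketch of the pattern):

```markdown
# CLAUDE.md (illustrative excerpt)

## Think Before Coding
- Before writing code, state your assumptions and list any ambiguities.
- If a requirement is unclear, ask a question instead of guessing.

## Simplicity First
- Write the minimum code that satisfies the request.
- No speculative abstractions or configuration options nobody asked for.

## Surgical Changes
- Modify only the files and lines the task requires.
- Never reformat or refactor unrelated code in the same change.

## Goal-Driven Execution
- Restate the success criteria (tests to pass, behavior to verify) first.
- Stop when the criteria are met; do not keep "improving" the code.
```

The point of the pattern is that each rule targets one of the predictable failure modes listed above, rather than vaguely asking for "good code."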