The Agentic Flip: Why I Ditched Heavy Frontier Models for Gemini 3.5 Flash and Haiku
High-volume agentic coding doesn't require multi-dollar frontier models. By combining a strict local orchestrator (Plan, Implement, Test, Review) with ultra-fast, cheap models like Gemini 3.5 Flash (Medium) and Claude 4.5 Haiku, I cut AI costs while boosting iteration speed.
A few days ago, I wrote about the broken economics of Copilot Enterprise, where running high-frequency agentic loops evaporated $60 in raw API tokens in a single 8-hour workday.
It was a stark wake-up call. If agentic software development is the future, then running everyday refactoring, compilation, and test loops on heavy-reasoning, premium frontier models is an economic dead end.
I spent today completely redesigning my local agentic workflow. My goal was simple: eliminate the budget anxiety of high-context development by switching entirely to cheap, high-throughput models, without sacrificing code quality.
The results have been eye-opening. Not only did I drastically slash token costs, but my iteration speed skyrocketed. Here is how I did it, the architecture of my custom orchestrator, and why cheap models—when paired with a structured planning framework—are more than enough to handle complex engineering tasks.
⚡ The Need for Speed: High-Throughput & Fast Reasoning
When we think of LLMs for coding, we are often conditioned to believe that “bigger is always better.” We assume that to write clean, type-safe Kotlin, Astro, or TypeScript, we must query the absolute largest model available on the market.
But in day-to-day software engineering, the reality is different. Most of our work isn’t solving high-level mathematical proofs or designing novel cryptographic protocols. Instead, it consists of:
- Boilerplate generation and syntax transformation.
- Implementing unit tests for newly created classes.
- Minor refactoring, updating import paths, and modifying config schemas.
- Diagnostic loops (parsing a compiler stack trace and fixing a syntax error).
For these tasks, the bottleneck is not “superintelligence.” The bottleneck is latency and throughput.
[Frontier Models (e.g. Opus/GPT-5)] ──► Slow Reasoning (15 t/s) ──► High Cost ($15/1M) ──► High Latency
[Cheap Flash Models (Gemini 3.5)] ──► Fast Reasoning (150 t/s) ──► Low Cost ($0.075/1M) ──► Near Instant
Right now, what developers actually need is a model that is smart enough, but crucially, ultra-cheap, high-throughput (outputting a massive volume of tokens per second), and capable of fast reasoning.
On my personal machine, I’ve been enjoying Gemini 3.5 Flash (Medium) and Claude 4.5 Haiku (along with GPT-5.4 Mini). Because Gemini 3.5 Flash is highly optimized for speed, it can stream output at 150+ tokens per second. When running recursive compilation and testing loops, this high iteration speed makes the assistant feel like a real-time terminal shell rather than a slow chat prompt. You can query, compile, fail, and fix in seconds.
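To put the throughput gap in concrete terms, here is a rough back-of-the-envelope sketch. It uses the 15 t/s and 150 t/s figures quoted above; the 600-token response size is an assumption for illustration, and network and prompt-processing time are ignored.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent streaming a response, ignoring network and prompt-processing time."""
    return output_tokens / tokens_per_second


# Assumed size of one fix-loop response (patch plus a short explanation).
STEP_TOKENS = 600

slow = generation_seconds(STEP_TOKENS, 15)   # frontier throughput quoted above
fast = generation_seconds(STEP_TOKENS, 150)  # flash throughput quoted above

print(f"15 t/s:  {slow:.0f} s per step, {slow * 100 / 60:.1f} min per 100 steps")
print(f"150 t/s: {fast:.0f} s per step, {fast * 100 / 60:.1f} min per 100 steps")
```

Under these assumptions a single step drops from 40 seconds of waiting to 4, which is the difference between a chat session and something that feels like a shell.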
🚫 The Trap of “Vibe Coding”
Of course, cheaper, smaller models come with a catch: they have a smaller cognitive footprint. If you try to vibe code with them, they will fail.
[!WARNING] Vibe Coding is the habit of dumping a vague prompt into an AI, hitting compile, and crossing your fingers that it works. While it feels fast and magical for small script modifications, it is highly toxic in the long term. It leads to fragmented architectures, silent regressions, skipped edge cases, and massive technical debt.
With a heavy frontier model, vibe coding sometimes works because the model’s massive parameter count allows it to guess missing requirements. A cheap model will quickly hallucinate or fail to see side effects across files if you do not guide it.
The secret to using cheap models successfully is to replace vibe coding with structured discipline. You don’t need a super-model to write excellent code if you enforce a strict planning and verification lifecycle.
🛠️ The Local Orchestrator Architecture
To make cheap models highly successful, I built a custom local orchestrator that runs a strict execution environment. Instead of letting the model make blind modifications, the orchestrator divides the developer loop into five distinct, isolated phases. The first four form the core loop shown here, and a final code review pass follows:
┌──────────────┐      ┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ 1. PLANNING  │ ───► │ 2. REFINEMENT │ ───► │ 3. IMPLEMENT  │ ───► │ 4. TESTING    │
│ Generate     │      │ Review Types  │      │ Apply Target  │      │ Compile & Run │
│ PLAN.md      │      │ & Edge Cases  │      │ Code Chunks   │      │ Unit Tests    │
└──────────────┘      └───────────────┘      └───────────────┘      └───────────────┘
                                                     ▲                      │
                                                     │    If Tests Fail     │
                                                     └──────────────────────┘
                                                        Self-Healing Loop
1. The Planning Phase
Before a single file is modified, the orchestrator injects the codebase context and the available commands into the model’s prompt. The model is forbidden from writing code. Instead, it must generate a structured PLAN.md file that specifies:
- The exact user requirements and user-defined constraints.
- A list of all files that will be modified, created, or deleted.
- API contract changes, data model migrations, and dependency implications.
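As a rough illustration of what such a plan could look like in practice, here is a minimal sketch that holds the three kinds of information listed above and renders them to Markdown. The `Plan` class, its field names, and the section titles are all hypothetical; my actual PLAN.md layout may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical in-memory form of PLAN.md; field names are illustrative."""
    requirements: list[str]
    constraints: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)
    created: list[str] = field(default_factory=list)
    deleted: list[str] = field(default_factory=list)
    contract_changes: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        sections = [
            ("Requirements", self.requirements),
            ("Constraints", self.constraints),
            ("Files to modify", self.modified),
            ("Files to create", self.created),
            ("Files to delete", self.deleted),
            ("API, data model, and dependency changes", self.contract_changes),
        ]
        lines = ["# PLAN"]
        for title, items in sections:
            lines.append(f"\n## {title}")
            # An explicit "(none)" keeps empty sections visible during review.
            lines.extend(f"- {item}" for item in items or ["(none)"])
        return "\n".join(lines) + "\n"


# Example values are made up for the demo.
plan = Plan(
    requirements=["Rename UserDto.name to displayName"],
    modified=["src/user.ts", "src/user.test.ts"],
    contract_changes=["GET /users response field renamed"],
)
plan_md = plan.to_markdown()
print(plan_md)
```

Forcing every section to appear, even when empty, makes it harder for a small model to silently skip the "files to delete" or "contract changes" questions.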
2. The Refinement Phase
The plan is presented to the user (or reviewed by a secondary fast LLM pass). We check for edge cases, nullability, backward compatibility, and typing issues. Any adjustments are made directly to the PLAN.md before execution begins.
3. The Implementation Phase
The orchestrator reads the approved plan and feeds it to the developer agent. The agent is instructed to make edits in precise, non-adjacent chunks (rather than rewriting entire files, which is token-expensive and error-prone).
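The chunked-edit idea can be sketched as a function that replaces line ranges instead of whole files. This is an illustrative version, not my orchestrator's code: the `Chunk` type and `apply_chunks` name are made up, and the sketch only rejects overlapping ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: int               # 1-based first line to replace
    end: int                 # 1-based last line to replace (inclusive)
    replacement: list[str]   # new lines for that range


def apply_chunks(lines: list[str], chunks: list[Chunk]) -> list[str]:
    """Return a copy of lines with every chunk applied; the input is not mutated."""
    ordered = sorted(chunks, key=lambda c: c.start)
    for chunk in ordered:
        if not 1 <= chunk.start <= chunk.end <= len(lines):
            raise ValueError(f"chunk {chunk.start}-{chunk.end} is out of range")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start <= prev.end:
            raise ValueError(f"chunks overlap at line {cur.start}")
    result = list(lines)
    # Apply bottom-up so the line numbers of earlier chunks stay valid.
    for chunk in reversed(ordered):
        result[chunk.start - 1:chunk.end] = chunk.replacement
    return result


original = ["a", "b", "c", "d", "e"]
edited = apply_chunks(original, [Chunk(2, 2, ["B1", "B2"]), Chunk(4, 5, ["D"])])
print(edited)
```

Applying the chunks from the bottom of the file upward means the model can express every edit in terms of the original line numbers, even when a replacement changes the file length.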
4. The Testing & Self-Healing Phase
Once the modifications are applied, the orchestrator automatically triggers the workspace’s compilation and testing suites (e.g., npx vitest run, gradle test, or make test).
- If a test fails or the compilation breaks, the orchestrator intercepts the stack trace.
- It passes the error log and the target file back to the model.
- The model analyzes the regression and adjusts the file, and the orchestrator runs the test suite again.
- This self-healing loop runs automatically until tests pass.
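The loop described above can be sketched in a few lines. This is a simplified stand-in rather than my actual orchestrator: `self_heal` and `fake_fix` are hypothetical names, the model call is replaced by a callback, and the `max_fixes` cap is my own assumption so that a stuck loop cannot burn tokens forever.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def self_heal(test_cmd: list[str], fix: Callable[[str], None], max_fixes: int = 5) -> bool:
    """Run test_cmd; on failure hand the log to fix() and retry, up to max_fixes times."""
    fixes = 0
    while True:
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return True
        if fixes == max_fixes:
            return False  # give up and escalate to a human
        fix(proc.stdout + proc.stderr)  # a real setup would call the model here
        fixes += 1


# Demo with a stand-in "test suite" that passes only once a marker file exists.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "fixed"
    script = f"import os, sys; sys.exit(0 if os.path.exists({str(marker)!r}) else 'marker missing')"
    check = [sys.executable, "-c", script]
    logs: list[str] = []

    def fake_fix(log: str) -> None:
        logs.append(log)  # keep the failure output the "model" was shown
        marker.touch()    # pretend the model repaired the code

    healed = self_heal(check, fake_fix)

print(healed, len(logs))
```

In a real workspace, `test_cmd` would be something like `["npx", "vitest", "run"]` or `["gradle", "test"]`, and the callback would send the captured log plus the target file to the model.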
5. The Code Review Phase
Finally, a lint and structural review runs to ensure the code complies with local styling guidelines (like .cursorrules or standards.md) and doesn’t introduce architectural regressions.
📊 The Math: Massive Savings, Faster Iteration
By combining this orchestrator with cheap models, the economic difference is staggering. However, we cannot simply lump all cheap models together; Gemini 3.5 Flash (Medium), GPT-5.4 Mini, and Claude 4.5 Haiku have distinct cost and performance profiles.
Here is a breakdown comparing the heavy frontier models against our specific low-cost options over a typical 100-step refactoring workflow:
| Model / Tier | Input Cost (per 1M) | Output Cost (per 1M) | Speed (Throughput) | 100-Step Loop Cost | Core Orchestrator Role |
|---|---|---|---|---|---|
| Heavy Frontier (Opus / GPT-5) | $15.00 | $75.00 | ~15 - 20 t/s | ~$85.00 | High-level architecture & initial plan |
| Gemini 3.5 Flash (Medium) | $0.075 | $0.300 | ~180 - 250 t/s | ~$0.45 | High-frequency code compilation & fix loops |
| GPT-5.4 Mini | $0.150 | $0.600 | ~120 - 150 t/s | ~$0.90 | Implementation logic & unit tests |
| Claude 4.5 Haiku | $0.250 | $1.000 | ~100 - 130 t/s | ~$1.50 | Code review, safety reviews, and lint checks |
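For anyone who wants to sanity-check the loop-cost column, the arithmetic can be sketched as follows. The per-token prices are copied from the table; the workload of 40,000 input tokens and 5,000 output tokens per step is an assumption chosen for illustration, not a measured figure.

```python
# USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Heavy Frontier (Opus / GPT-5)": (15.00, 75.00),
    "Gemini 3.5 Flash (Medium)": (0.075, 0.300),
    "GPT-5.4 Mini": (0.150, 0.600),
    "Claude 4.5 Haiku": (0.250, 1.000),
}

# Assumed workload per step (not a measured value).
INPUT_TOKENS_PER_STEP = 40_000
OUTPUT_TOKENS_PER_STEP = 5_000
STEPS = 100


def loop_cost(input_price: float, output_price: float) -> float:
    """Total USD cost of the whole loop at the given per-1M-token prices."""
    total_in = INPUT_TOKENS_PER_STEP * STEPS
    total_out = OUTPUT_TOKENS_PER_STEP * STEPS
    return (total_in * input_price + total_out * output_price) / 1_000_000


costs = {name: loop_cost(*prices) for name, prices in PRICES.items()}
for name, cost in costs.items():
    print(f"{name:<32} ${cost:,.2f}")
```

With this assumed workload the three cheap rows come out at $0.45, $0.90, and $1.50, matching the table. The frontier row comes out at $97.50 rather than ~$85, so treat the table's frontier figure as a rough estimate; the two-orders-of-magnitude gap is the point either way.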
Because the orchestrator breaks down complex work into tiny, structured, single-step tasks (e.g., “Refactor lines 15-22 to fix compile error X”), we can route each task to the model that offers the best trade-off. Gemini 3.5 Flash handles low-level iterative diagnostic loops with minimal latency, while GPT-5.4 Mini or Haiku can be reserved for checks that need slightly more reasoning, which keeps the entire workflow’s aggregate cost very low.
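A routing layer for this can be as small as a lookup table. The sketch below mirrors the "Core Orchestrator Role" column of the table; the `Task` names and the model strings are illustrative labels I made up for the example, not real API model identifiers.

```python
from enum import Enum


class Task(Enum):
    PLAN = "plan"
    FIX_LOOP = "fix_loop"
    IMPLEMENT = "implement"
    REVIEW = "review"


# Labels are placeholders, not provider model IDs.
ROUTES = {
    Task.PLAN: "heavy-frontier",
    Task.FIX_LOOP: "gemini-3.5-flash-medium",
    Task.IMPLEMENT: "gpt-5.4-mini",
    Task.REVIEW: "claude-4.5-haiku",
}


def pick_model(task: Task, default: str = "gemini-3.5-flash-medium") -> str:
    """Return the model label for a task, falling back to the cheapest option."""
    return ROUTES.get(task, default)


for task in Task:
    print(f"{task.value:<10} -> {pick_model(task)}")
```

Defaulting to the cheapest model means an unclassified task costs almost nothing to attempt, and it can still be escalated if the cheap attempt fails.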
Conclusion: Share Your Thoughts!
The age of unchecked “vibe coding” on expensive, slow models is drawing to a close. By enforcing a Plan-First approach and utilizing high-throughput, cheap models like Gemini 3.5 Flash and Claude 4.5 Haiku, I’ve managed to get all of the efficiency of agentic coding with none of the credit burnout.
It is proof that structural discipline and execution architecture will always beat raw parameter count when it comes to shipping production-ready code.
What models are you currently using to drive your local IDE tools? Are you feeling the crunch of metered credits, or have you already transitioned to cheap, fast models for your local iteration loops?
If anyone is interested in the custom orchestrator configuration and the scripting setup I use to run this local Planning, Implementation, Testing, and Code Review loop, let me know in the comment section below and I’d be happy to share my scripts and setup!