Your Prompt Framework Isn't a Spell Anymore


There was a stretch around 2023 when prompt engineering felt like alchemy. You’d find the right incantation — the magic phrase, the perfect ordering of instructions — and a model that had been giving you mush would suddenly produce gold. Frameworks like COSTAR emerged in exactly this world. Sheila Teo’s COSTAR (Context, Objective, Style, Tone, Audience, Response) won Singapore’s first GPT-4 prompt engineering competition, and for good reason: it imposed order on a process that otherwise felt like trial and error.

Three years later, the frameworks are still here. COSTAR is still one of the most widely taught structures around. But the reason they work has quietly changed, and if you’re still using them as performance hacks, you’re solving a problem that mostly no longer exists.

What the frameworks were actually doing

The original pitch for structured prompting was capability unlocking. Early models needed you to spell things out. If you didn’t declare the audience, you got generic output. If you didn’t specify a format, you got a wall of prose. The six slots of COSTAR mapped neatly onto the six things a 2023-era model would otherwise guess wrong.

That made the structure feel load-bearing. Drop the “Audience” field and the output got noticeably worse. So people internalized a rule: more structure equals better output. That was a reasonable conclusion, given the evidence at the time.

What changed

Frontier models in 2026 infer most of that scaffolding on their own. Ask a capable model to “write a polite rejection email to a vendor,” and it already picks a sensible style, an appropriate tone, and a clean format — without you naming any of them. The slots you used to fill manually are now filled by the model’s own judgment.

This doesn’t make COSTAR wrong. It makes parts of it redundant for capable models on straightforward tasks. The marginal value of declaring “Tone: professional” has dropped close to zero when the model would have chosen professional anyway. Rigid six-part scaffolding on a simple request is now mostly ceremony — effort that doesn’t move the output.

The interesting wrinkle: this is model-dependent. A recent paper on COSTAR-A found that the original framework still improves clarity for large models but is less consistent with smaller, locally optimized ones — especially on tasks needing constrained, directive output. If you’re running an 8B-class fine-tuned model on your own hardware, the old rules still largely apply, and you may even want more directive structure than COSTAR provides. The “frameworks are obsolete” take is really a “frontier models got good” take, and it doesn’t transfer cleanly to the small-model world.

The part that still earns its keep

Here’s what survives, and it’s the part worth keeping: a framework is a checklist against forgetting.

The reason a structured prompt outperforms a lazy one in 2026 usually isn’t the structure itself. It’s that going through the slots forces you to supply the things that genuinely still move quality — and those cluster in three places:

  • Context — the single highest-leverage input. Most bad outputs trace back to missing context, not missing structure. The model can’t infer what it was never told.
  • Objective — a specific, concrete task instead of a vague gesture at one.
  • Response format — the shape of the output. Forcing a parseable, predictable structure is what makes prompts safe to run in a pipeline.

Style, Tone, and Audience still matter, but selectively. They earn their place in customer-facing and brand-voice work, where prose has to sound a particular way at scale. For code review or technical analysis, they’re largely noise — a model reviewing your auth handler doesn’t need a tone directive.

So the right mental model isn’t “fill in all six boxes every time.” It’s “use the boxes as a memory aid, then trim aggressively.” Keep what’s load-bearing for your task and drop the rest.

A template you’d actually use

Here’s COSTAR as a working checklist rather than a mandatory form. Note which slots are marked droppable, and that one is omitted outright:

/* ============================================================
 * TL;DR: COSTAR as a checklist, not an incantation.
 * On capable 2026 models, Context + Objective + Response do
 * most of the work. Style/Tone/Audience matter for prose at
 * scale, less for code. Keep what's load-bearing, cut the rest.
 * ============================================================ */

# CONTEXT  — highest leverage; never skip
You are reviewing a Hono route handler in a TypeScript monorepo
on Cloudflare Workers + Supabase.

# OBJECTIVE — be specific and concrete
Find auth-bypass risks in this handler and list each one.

# STYLE — often inferable for technical work; drop if obvious
Terse, senior-engineer register.

# TONE — usually noise for code; matters for user-facing copy
(omit)

# AUDIENCE — shapes assumed knowledge and depth
A backend dev who already knows JWT and RLS.

# RESPONSE — high leverage; forces predictable, parseable output
Markdown list: finding -> severity -> one-line fix. No preamble.
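
If you assemble prompts like this in code, the same “checklist, then trim” idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `PromptSpec`, `render`, and the choice to hard-fail on a missing Context, Objective, or Response slot are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical names throughout; nothing here comes from a real library.
REQUIRED = ("context", "objective", "response")  # the load-bearing slots
ORDER = ("context", "objective", "style", "tone", "audience", "response")


@dataclass
class PromptSpec:
    context: str
    objective: str
    response: str
    style: str = ""      # optional: leave empty to drop the section
    tone: str = ""       # optional
    audience: str = ""   # optional


def render(spec: PromptSpec) -> str:
    """Render the filled slots in COSTAR order, skipping empty ones."""
    missing = [name for name in REQUIRED if not getattr(spec, name).strip()]
    if missing:
        raise ValueError(f"missing load-bearing slots: {', '.join(missing)}")

    sections = []
    for name in ORDER:
        value = getattr(spec, name).strip()
        if value:  # trimmed slots simply never appear in the prompt
            sections.append(f"# {name.upper()}\n{value}")
    return "\n\n".join(sections)


# Usage: tone and style are left out, so they are not rendered at all.
spec = PromptSpec(
    context="You are reviewing a Hono route handler in a TypeScript monorepo.",
    objective="Find auth-bypass risks in this handler and list each one.",
    response="Markdown list: finding -> severity -> one-line fix. No preamble.",
    audience="A backend dev who already knows JWT and RLS.",
)
print(render(spec))
```

The design choice worth noticing is the asymmetry: the three high-leverage slots raise an error when empty, while the other three disappear silently. That mirrors the argument above rather than any rule the framework itself imposes.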

Where the structure still wins outright

One place the checklist discipline isn’t optional: prompts you write once and run many times. System prompts, automation flows, RAG pipelines — anywhere you don’t get to iterate per call. Interactive chat lets you toss a messy request at the model and refine when it misreads you. A pipeline doesn’t. When the prompt has to be right on the first try across thousands of inputs, the completeness a framework enforces stops being ceremony and starts being insurance.
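
In a pipeline, the Response slot only pays off if something downstream actually checks it. Here is a rough Python sketch of that check, assuming the “finding -> severity -> one-line fix” list format from the template above and assuming the model emits one `-` or `*` bullet per finding. `Finding` and `parse_findings` are hypothetical names, and a real pipeline would likely retry or log the bad reply rather than just raise.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    finding: str
    severity: str
    fix: str


# One bullet per line: "- <finding> -> <severity> -> <one-line fix>"
LINE = re.compile(r"^\s*[-*]\s+(.+?)\s*->\s*(.+?)\s*->\s*(.+?)\s*$")


def parse_findings(reply: str) -> list[Finding]:
    """Parse a model reply, rejecting anything outside the agreed format."""
    findings = []
    for line in reply.splitlines():
        if not line.strip():
            continue  # blank lines between bullets are tolerated
        match = LINE.match(line)
        if match is None:
            # Preamble, prose, or a malformed bullet: fail loudly so the
            # caller can handle the bad reply instead of passing it on.
            raise ValueError(f"unexpected line in model reply: {line!r}")
        findings.append(Finding(*match.groups()))
    return findings
```

A strict parser like this turns “No preamble” from a polite request into a contract: a reply that opens with a friendly sentence fails at the boundary instead of corrupting whatever consumes the list.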

The honest takeaway

COSTAR and its cousins didn’t get worse. They got demoted — from spell to discipline. They no longer unlock capability the model couldn’t otherwise reach. What they still do is make you write a complete prompt, and a complete prompt produces better output regardless of which acronym you used to get there.

Treat the framework as scaffolding for your thinking, not as a magic phrase for the model. Fill the slots that carry weight for the task in front of you, cut the ones that don’t, and spend the saved effort on the one input no framework can supply for you: context the model has no way of knowing.