LLM Cost + Quality Tuner — the skill that pays for itself in the first hour — Skill of the Week

LLM bills in 2026 grow faster than usage. Output tokens cost 3–5x input. Conversation history re-uploads with every turn. System prompts get longer "just in case." Cache hit rates stay at 12% when they should be 60%. The bill arrives. People notice.

This skill is the structured exercise for cutting cost without breaking the feature.

What it does

Takes an LLM-powered feature (chatbot, RAG pipeline, agent, summarizer) and profiles where the money goes. Then ranks 7 cost-reduction levers by ROI for your specific shape of traffic, with the quality risk per lever spelled out.

The levers, in rough order of typical impact:

Route by complexity — cheap model for 80% of queries, premium model for 20% that need it. Typical savings: 40–60%.
Aggressive prompt compression — most system prompts have 30%+ fat.
Cache hit rate — semantic + exact-match caching for repeat-heavy workloads.
Max-tokens cap — most outputs don't need 2000-token headroom.
Context window discipline — stop sending the entire conversation every turn.
Batch + async for non-real-time tasks — usually 50% cheaper.
Provider arbitrage — open-weights for the easy 80%, frontier for the hard 20%.

Why this skill exists separately from "use a cheaper model"

The naive answer is "switch from GPT-4 to GPT-4o-mini." The naive answer is also how features quietly degrade. This skill makes you measure quality before and after, on a real eval set, before committing to the cheaper path.

Cutting cost without a quality check is how AI features die — not in one explosion, but in a long slow drift the team doesn't notice until users start complaining.

When this skill earns its keep

LLM bill > $5K/month and growing
Switching from prototype to production scale
New feature launch where projected cost looks scary
Investigating a cost spike in one specific endpoint

When to skip

Pre-product. Quality matters more than cost at <100 users.
Cost is fine; you want quality wins instead.

Full skill — with the profiling worksheet, eval-set requirement, and the implementation sequence that minimizes the risk of breaking outputs along the way.