Lean AI: Stop Wasting Tokens and Start Building Smarter LLM Apps
A practical guide to reducing AI costs, latency, and unnecessary context, without destroying response quality.

Most AI applications have a token problem.
Not because tokens are inherently expensive, but because many applications send far more context than the model actually needs.
Massive system prompts. Entire conversation histories. Twenty tools loaded for a task that needs one. Huge documents retrieved for questions answered by a single paragraph.
It works during the prototype phase. Then traffic grows, latency increases, and the API bill becomes uncomfortable.
The solution is not simply choosing a cheaper model.
It is building Lean AI systems.
What Is Lean AI?
Lean AI applies the main idea of lean engineering to LLM applications:
Remove anything that consumes resources without improving the result.
In an LLM system, waste usually appears as:
Irrelevant context
Repeated instructions
Excessive output
Unnecessary tool definitions
Oversized retrieval results
Expensive models handling simple tasks
Failed requests and repeated retries
The goal is not to create the shortest possible prompt.
The goal is to use the minimum amount of context and computation required to produce a reliable result.
That distinction matters. Blindly removing tokens can reduce quality. Lean AI removes waste while preserving useful information.
Why Token Efficiency Matters
Every LLM request has a basic cost:
Total cost =
(input tokens × input price)
+
(output tokens × output price)
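As a rough illustration of the formula, the sketch below computes the cost of a single request. The `TokenPrices` shape and the per-million-token prices are assumptions made up for the example, not any provider's real rates.

```typescript
// Hypothetical price table: USD per one million tokens.
interface TokenPrices {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of a single request: input cost plus output cost.
function requestCost(
  inputTokens: number,
  outputTokens: number,
  prices: TokenPrices
): number {
  const inputCost = (inputTokens / 1_000_000) * prices.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * prices.outputPerMillion;
  return inputCost + outputCost;
}

// Made-up prices: $2 per million input tokens, $8 per million output tokens.
const examplePrices: TokenPrices = { inputPerMillion: 2, outputPerMillion: 8 };

// 8,400 input tokens and 900 output tokens come to roughly $0.024.
const exampleCost = requestCost(8_400, 900, examplePrices);
```

Multiply that figure by the number of calls per task and the number of tasks per day to see how quickly a bloated prompt adds up.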
However, cost is only part of the problem.
More tokens can also mean:
Higher response latency
Faster exhaustion of context windows
More irrelevant information distracting the model
Increased risk of inconsistent answers
Higher infrastructure costs at scale
A few hundred unnecessary tokens may look harmless.
But when an agent makes multiple model calls per task and serves thousands of users, small inefficiencies compound quickly.
Token optimization is not just a billing exercise. It is a system-design problem.
1. Measure Before You Optimize
Do not optimize prompts based only on intuition.
Track at least:
Input tokens per request
Output tokens per request
Cost per successful task
Number of LLM calls per task
Cache hit rate
Retry rate
Latency
Task-success score
The most useful metric is often cost per successful task, not cost per API call.
A cheap request that fails three times may cost more than one successful request sent to a stronger model.
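One way to compute cost per successful task is sketched below. The `TaskRecord` shape and the dollar amounts are hypothetical; the point is only that failed attempts stay in the numerator.

```typescript
// Hypothetical record of one end-to-end attempt at a task.
interface TaskRecord {
  costUsd: number;    // spend across every LLM call made during the attempt
  succeeded: boolean; // whether the attempt passed your success check
}

// Total spend divided by the number of successful attempts.
// Returns null when nothing succeeded, because the ratio is undefined.
function costPerSuccessfulTask(records: TaskRecord[]): number | null {
  const totalCost = records.reduce((sum, r) => sum + r.costUsd, 0);
  const successes = records.filter((r) => r.succeeded).length;
  return successes === 0 ? null : totalCost / successes;
}

// A cheap model that fails three times before succeeding...
const cheapModelRuns: TaskRecord[] = [
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: true },
];

// ...versus a stronger model that succeeds on the first attempt.
const strongModelRuns: TaskRecord[] = [{ costUsd: 0.05, succeeded: true }];

const cheapCost = costPerSuccessfulTask(cheapModelRuns);   // about 0.08
const strongCost = costPerSuccessfulTask(strongModelRuns); // 0.05
```

With these made-up numbers, the model that is cheaper per call is the more expensive one per successful task.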
2. Keep Prompts Focused
Long prompts are not automatically better prompts.
Compare these instructions:
Please carefully analyze the following content and provide a
comprehensive but concise summary containing all important information.
And:
Summarize the text in five bullets. Preserve decisions, dates, and risks.
The second version is shorter and more precise.
A good prompt should define:
The task
The relevant constraints
The expected output format
The completion condition
Avoid repeated politeness, duplicated rules, and vague instructions such as "give the best possible answer" unless they create measurable value.
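If it helps to keep prompts consistent, the four elements can be captured in a small template. This is only a sketch; `PromptSpec` and `renderPrompt` are hypothetical names, and the line labels are one possible layout.

```typescript
// Hypothetical prompt specification covering the four elements listed earlier.
interface PromptSpec {
  task: string;
  constraints: string[];
  outputFormat: string;
  completionCondition: string;
}

// Render the spec as a compact prompt with one line per element.
function renderPrompt(spec: PromptSpec): string {
  return [
    `Task: ${spec.task}`,
    `Constraints: ${spec.constraints.join("; ")}`,
    `Output format: ${spec.outputFormat}`,
    `Done when: ${spec.completionCondition}`,
  ].join("\n");
}

const summaryPrompt = renderPrompt({
  task: "Summarize the text.",
  constraints: ["Preserve decisions, dates, and risks"],
  outputFormat: "Five bullets",
  completionCondition: "All five bullets are written",
});
```

A template like this makes it obvious when a prompt is missing a completion condition or carrying a duplicated rule.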
3. Stop Sending the Entire Conversation History
Many chat applications resend every previous message with each request.
That is convenient, but rarely efficient.
A better approach is to divide memory into three layers:
Recent messages: preserve short-term conversational flow
Structured facts: store durable preferences and decisions
Conversation summary: compress older discussions
For example, an assistant does not need 40 previous messages to remember that a user prefers TypeScript.
Store the fact directly:
{
  "preferred_language": "TypeScript"
}
Structured memory is often cheaper and more reliable than repeatedly sending raw conversation history.
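The three layers can be combined into one compact context block. The sketch below assumes a hypothetical `ConversationMemory` shape and a plain-text layout; how the summary gets produced (for example, by a cheaper model) is left out.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical three-layer memory, mirroring the layers described above.
interface ConversationMemory {
  messages: ChatMessage[];       // full message log, newest last
  facts: Record<string, string>; // durable preferences and decisions
  summary: string;               // compressed older discussion
}

// Build the context for the next request: facts and summary first,
// followed by only the last `keepRecent` raw messages.
function buildContext(memory: ConversationMemory, keepRecent: number): string {
  const factLines = Object.keys(memory.facts).map(
    (key) => `${key}: ${memory.facts[key]}`
  );
  // slice(-0) would return the whole log, so handle zero explicitly.
  const recent = keepRecent > 0 ? memory.messages.slice(-keepRecent) : [];
  const recentLines = recent.map((m) => `${m.role}: ${m.content}`);
  return [
    "Known facts:",
    ...factLines,
    "Summary of earlier conversation:",
    memory.summary,
    "Recent messages:",
    ...recentLines,
  ].join("\n");
}
```

The request size now depends on `keepRecent` and the length of the summary, not on how long the conversation has been running.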
4. Retrieve Less, but Retrieve Better
A common Retrieval-Augmented Generation, or RAG, mistake is retrieving too many chunks "just to be safe."
More context can produce worse answers when irrelevant passages compete with useful evidence.
A stronger retrieval pipeline can include:
Metadata filtering
Hybrid semantic and keyword search
Reranking
Deduplication
A strict context-token budget
Instead of sending the top 20 chunks directly to the model, retrieve broadly, rerank the results, and pass only the most relevant evidence.
The goal is not maximum context.
It is maximum signal per token.
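The last two steps of that pipeline, deduplication and a strict token budget, might look like the sketch below. It assumes chunks already carry a reranker score; `RankedChunk` is a hypothetical shape, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```typescript
// Hypothetical shape of a chunk after reranking.
interface RankedChunk {
  id: string;
  text: string;
  score: number; // relevance score from the reranker, higher is better
}

// Rough estimate of about four characters per token. This is an assumption
// for the sketch; use your provider's tokenizer for real budgets.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk the chunks from most to least relevant, skip duplicate text,
// and skip any chunk that would push the total over the token budget.
function selectChunks(chunks: RankedChunk[], tokenBudget: number): RankedChunk[] {
  const sorted = chunks.slice().sort((a, b) => b.score - a.score);
  const seen = new Set<string>();
  const selected: RankedChunk[] = [];
  let used = 0;
  for (const chunk of sorted) {
    const key = chunk.text.trim().toLowerCase();
    if (seen.has(key)) continue; // duplicate of a chunk already chosen
    const cost = estimateTokens(chunk.text);
    if (used + cost > tokenBudget) continue; // a smaller chunk may still fit
    seen.add(key);
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

Exact-text matching only catches verbatim duplicates. Near-duplicate detection would need something stronger, such as comparing embeddings.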
5. Route Tasks to the Right Model
Not every task needs your most capable model.
Simple operations such as:
Classification
Intent detection
Entity extraction
Formatting
Basic summarization
Query rewriting
can often run on smaller and faster models.
Reserve expensive reasoning models for tasks involving:
Complex reasoning
Ambiguous requirements
Architecture decisions
Difficult debugging
High-risk outputs
A basic routing strategy might look like this:
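// Illustrative types; the model names are placeholders, not real model IDs.
type Model = "small-model" | "general-model" | "reasoning-model";
interface Task {
  type: string;
  complexity: "low" | "medium" | "high";
}

function chooseModel(task: Task): Model {
  // Cheap, well-defined work goes to the small model.
  if (task.type === "classification") {
    return "small-model";
  }
  // Hard problems are escalated to the reasoning model.
  if (task.complexity === "high") {
    return "reasoning-model";
  }
  return "general-model";
}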
Model routing usually delivers more value than endlessly removing a few words from a prompt.
6. Use Prompt Caching Correctly
Many AI providers can reuse repeated prompt prefixes.
To benefit from caching, organize prompts so that stable content appears first:
1. Static system instructions
2. Stable examples
3. Tool definitions
4. Retrieved context
5. Current user request
Avoid placing timestamps, random identifiers, or frequently changing values near the beginning of an otherwise stable prompt.
A small change near the start may prevent the rest of the prompt from being reused.
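The ordering can be enforced in code by assembling every prompt through one function. In the sketch below, `PromptParts` and `assemblePrompt` are hypothetical, and `sharedPrefixLength` is only a rough proxy for what a prefix cache could reuse. Whether and how caching applies depends on your provider.

```typescript
// Hypothetical prompt parts, ordered from most stable to most volatile.
interface PromptParts {
  systemInstructions: string; // static
  examples: string[];         // stable few-shot examples
  toolDefinitions: string[];  // stable while the tool set is unchanged
  retrievedContext: string[]; // changes per request
  userRequest: string;        // changes per request
}

// Assemble the prompt so the stable prefix is identical across requests.
function assemblePrompt(parts: PromptParts): string {
  return [
    parts.systemInstructions,
    ...parts.examples,
    ...parts.toolDefinitions,
    ...parts.retrievedContext,
    parts.userRequest,
  ].join("\n\n");
}

// Number of leading characters two prompts have in common.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) {
    i++;
  }
  return i;
}
```

Comparing `sharedPrefixLength` across consecutive requests in your logs is a quick way to spot a timestamp or request ID that crept into the start of the prompt.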
Caching is especially useful for:
Large system prompts
Shared documents
Repeated few-shot examples
Agent instructions
Codebase context
7. Give the Model Only the Tools It Needs
Tool-calling agents often receive every available tool definition with every request.
That creates several problems:
More input tokens
Slower processing
Greater tool-selection ambiguity
More opportunities for incorrect calls
If a user asks about the weather, the model probably does not need access to billing, CRM, deployment, database, and calendar tools.
Filter tools before calling the model:
const tools = toolRegistry.getToolsForIntent(detectedIntent);
A smaller toolset improves both efficiency and reliability.
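What `getToolsForIntent` does internally is not specified here. One minimal way to sketch it is to tag each tool with the intents it serves. The `ToolDefinition` shape below is hypothetical, and real providers have their own tool schemas.

```typescript
// Hypothetical tool definition, tagged with the intents it is relevant for.
interface ToolDefinition {
  name: string;
  description: string;
  intents: string[];
}

// A minimal registry that returns only the tools tagged for an intent.
class ToolRegistry {
  private readonly registered: ToolDefinition[] = [];

  register(tool: ToolDefinition): void {
    this.registered.push(tool);
  }

  getToolsForIntent(intent: string): ToolDefinition[] {
    return this.registered.filter((tool) =>
      tool.intents.some((tagged) => tagged === intent)
    );
  }
}

const toolRegistry = new ToolRegistry();
toolRegistry.register({
  name: "get_weather",
  description: "Current weather for a city",
  intents: ["weather"],
});
toolRegistry.register({
  name: "create_invoice",
  description: "Create a billing invoice",
  intents: ["billing"],
});

// Only the get_weather definition is sent with a weather question.
const weatherTools = toolRegistry.getToolsForIntent("weather");
```

Intent detection itself can run on a small model, so the filtering step stays cheap.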
8. Control Output Length
Output tokens are typically priced higher than input tokens, and because they are generated sequentially they also add more latency.
Do not request unlimited detail when your application needs a predictable result.
Prefer explicit boundaries:
Return:
- A maximum of five recommendations
- One sentence per recommendation
- No introduction
- Valid JSON only
Structured outputs can also reduce parsing errors and expensive retry loops.
However, avoid requesting large schemas containing fields your application never uses.
Structured waste is still waste.
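Boundaries stated in the prompt are worth checking in code as well. The sketch below assumes the model was asked for a JSON array of at most five strings; that exact shape is an assumption made for the example.

```typescript
const MAX_RECOMMENDATIONS = 5;

// Parse a model reply that should be a JSON array of short strings.
// Returns null when the reply breaks the contract, so the caller can
// decide whether a retry is worth the cost.
function parseRecommendations(raw: string): string[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!Array.isArray(parsed) || parsed.length > MAX_RECOMMENDATIONS) {
    return null; // wrong shape or too many items
  }
  const items: unknown[] = parsed;
  const allStrings = items.every((item) => typeof item === "string");
  return allStrings ? (items as string[]) : null;
}
```

A cheap local check like this catches contract violations before downstream code fails on them.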
9. Prevent Expensive Retry Loops
Retries are one of the least visible sources of token waste.
A poorly designed agent may repeatedly call the model because:
Tool errors are vague
Validation rules are unclear
The model cannot detect completion
Maximum iteration limits are missing
Bad tool response:
{
  "error": "Request failed"
}
Better tool response:
{
  "error": "INVALID_DATE_RANGE",
  "message": "endDate must be later than startDate",
  "retryable": true
}
Give the model actionable error messages and enforce hard limits:
const MAX_AGENT_STEPS = 8;
An agent without a step limit is not autonomous.
It is an unbounded invoice.
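A hard limit is easiest to enforce in the loop that drives the agent. The sketch below is deliberately simplified: `AgentStep` is a hypothetical synchronous callback standing in for one model call plus any tool call, whereas a real agent would be asynchronous.

```typescript
const MAX_AGENT_STEPS = 8;

// Hypothetical result of one agent step.
interface StepResult {
  done: boolean;
  output: string;
}

// Stand-in for one model call plus any tool call it triggers.
type AgentStep = (stepNumber: number) => StepResult;

interface AgentRun {
  completed: boolean;
  steps: number;
  output: string;
}

// Run the agent until it reports completion or the step limit is reached.
function runAgent(step: AgentStep, maxSteps: number = MAX_AGENT_STEPS): AgentRun {
  let output = "";
  for (let i = 1; i <= maxSteps; i++) {
    const result = step(i);
    output = result.output;
    if (result.done) {
      return { completed: true, steps: i, output };
    }
  }
  // Limit reached: stop and surface the failure instead of looping forever.
  return { completed: false, steps: maxSteps, output };
}
```

Logging `steps` and `completed` for every run also makes runaway loops visible in your metrics instead of only on the invoice.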
10. Optimize Quality and Cost Together
Token reduction should never be evaluated in isolation.
Use a scorecard such as:
| Metric | Before | After |
|---|---|---|
| Average input tokens | 8,400 | 3,100 |
| Average output tokens | 900 | 420 |
| Successful-task rate | 91% | 92% |
| P95 latency | 8.2s | 4.7s |
| Cost per successful task | $0.18 | $0.07 |
A change is useful when it reduces cost or latency without causing an unacceptable quality regression.
Create a representative evaluation dataset before rewriting prompts, changing retrieval limits, or routing requests to smaller models.
Otherwise, you are not optimizing.
You are guessing.
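The acceptance rule can be written down as a small gate that runs after each evaluation. The sketch below reuses the scorecard numbers from the table above; `EvalSummary` and the one-point quality tolerance are assumptions for illustration.

```typescript
// Hypothetical evaluation summary for one configuration.
interface EvalSummary {
  successRate: number;           // 0 to 1, measured on the evaluation dataset
  costPerSuccessfulTask: number; // USD
  p95LatencySeconds: number;
}

// Accept a change only if quality stays within the tolerated regression
// and at least one of cost or latency improves.
function isWorthShipping(
  before: EvalSummary,
  after: EvalSummary,
  maxQualityDrop: number
): boolean {
  const qualityOk = after.successRate >= before.successRate - maxQualityDrop;
  const cheaper = after.costPerSuccessfulTask < before.costPerSuccessfulTask;
  const faster = after.p95LatencySeconds < before.p95LatencySeconds;
  return qualityOk && (cheaper || faster);
}

const baseline: EvalSummary = {
  successRate: 0.91,
  costPerSuccessfulTask: 0.18,
  p95LatencySeconds: 8.2,
};
const candidate: EvalSummary = {
  successRate: 0.92,
  costPerSuccessfulTask: 0.07,
  p95LatencySeconds: 4.7,
};

// Tolerate at most a one-point drop in success rate.
const shouldShip = isWorthShipping(baseline, candidate, 0.01); // true
```

What counts as an "unacceptable" regression is a product decision, which is why the tolerance is a parameter.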
A Practical Lean AI Checklist
Before sending a request to an LLM, ask:
Does the model need every part of this context?
Can older messages be summarized?
Can structured data replace raw text?
Are retrieved chunks relevant and deduplicated?
Does this task require the most expensive model?
Can stable prompt content be cached?
Does the model need every available tool?
Is the output length bounded?
Are retries limited and observable?
Are cost and quality evaluated together?
Final Thoughts
The future of AI engineering is not just about using more capable models.
It is about building systems that use intelligence efficiently.
Lean AI does not mean stripping prompts until responses become unreliable. It means understanding which tokens create value and removing the ones that do not.
Start with observability. Improve retrieval. Route models intelligently. Compress memory. Limit tools and retries. Then measure the result.
Because the best LLM system is not the one that uses the most context.
It is the one that knows exactly how much context it needs.



