Lean AI: Stop Wasting Tokens and Start Building Smarter LLM Apps 🚀

A practical guide to reducing AI costs, latency, and unnecessary context, without destroying response quality.

I'm Abbas Afsharfarnia, an Engineering Manager and hands-on Technical Lead based in Germany. I write about backend architecture, engineering leadership, developer experience, and AI-assisted software delivery. My focus is practical: scaling systems, improving code quality, reducing legacy complexity, mentoring engineers, and building teams that deliver with ownership.

Most AI applications have a token problem.

Not because tokens are inherently expensive, but because many applications send far more context than the model actually needs.

Massive system prompts. Entire conversation histories. Twenty tools loaded for a task that needs one. Huge documents retrieved for questions answered by a single paragraph.

It works during the prototype phase. Then traffic grows, latency increases, and the API bill becomes uncomfortable.

The solution is not simply choosing a cheaper model.

It is building Lean AI systems.

What Is Lean AI? 🧠

Lean AI applies the main idea of lean engineering to LLM applications:

Remove anything that consumes resources without improving the result.

In an LLM system, waste usually appears as:

  • Irrelevant context

  • Repeated instructions

  • Excessive output

  • Unnecessary tool definitions

  • Oversized retrieval results

  • Expensive models handling simple tasks

  • Failed requests and repeated retries

The goal is not to create the shortest possible prompt.

The goal is to use the minimum amount of context and computation required to produce a reliable result.

That distinction matters. Blindly removing tokens can reduce quality. Lean AI removes waste while preserving useful information.

Why Token Efficiency Matters 💸

Every LLM request has a basic cost:

Total cost = (input tokens × input price) + (output tokens × output price)
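
As a worked example of the formula above, here is a minimal sketch. The prices are made-up placeholders, not any provider's real rates, and `Pricing` and `requestCost` are names introduced only for illustration.

```typescript
// Hypothetical per-million-token prices; replace with your provider's rates.
interface Pricing {
  inputPerMillion: number;  // USD per 1,000,000 input tokens
  outputPerMillion: number; // USD per 1,000,000 output tokens
}

// Total cost = (input tokens × input price) + (output tokens × output price)
function requestCost(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

// Assumed prices: $2 per million input tokens, $8 per million output tokens.
const examplePricing: Pricing = { inputPerMillion: 2, outputPerMillion: 8 };

// 8,400 input tokens and 900 output tokens: about $0.0168 + $0.0072 = $0.024.
const exampleCost = requestCost(8_400, 900, examplePricing);
```

At these assumed prices a single request costs roughly two and a half cents, which is why the number only looks alarming once it is multiplied by calls per task and tasks per day.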

However, cost is only part of the problem.

More tokens can also mean:

  • Higher response latency

  • Faster exhaustion of context windows

  • More irrelevant information distracting the model

  • Increased risk of inconsistent answers

  • Higher infrastructure costs at scale

A few hundred unnecessary tokens may look harmless.

But when an agent makes multiple model calls per task and serves thousands of users, small inefficiencies compound quickly.

Token optimization is not just a billing exercise. It is a system-design problem.

1. Measure Before You Optimize 📊

Do not optimize prompts based only on intuition.

Track at least:

  • Input tokens per request

  • Output tokens per request

  • Cost per successful task

  • Number of LLM calls per task

  • Cache hit rate

  • Retry rate

  • Latency

  • Task-success score

The most useful metric is often cost per successful task, not cost per API call.

A cheap request that fails three times may cost more than one successful request sent to a stronger model.
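
To make that comparison concrete, here is a small sketch of the metric. The `Attempt` shape and the dollar amounts are assumptions for illustration; in practice the numbers would come from your own request logs.

```typescript
interface Attempt {
  costUsd: number;    // spend for this single attempt
  succeeded: boolean; // whether the attempt produced an accepted result
}

// Total spend (including failed attempts) divided by the number of successes.
function costPerSuccessfulTask(attempts: Attempt[]): number {
  const totalCost = attempts.reduce((sum, attempt) => sum + attempt.costUsd, 0);
  const successes = attempts.filter((attempt) => attempt.succeeded).length;
  // With no successes the metric is undefined; surface that as Infinity.
  return successes === 0 ? Infinity : totalCost / successes;
}

// A cheap model that fails three times before succeeding...
const cheapModelAttempts: Attempt[] = [
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: false },
  { costUsd: 0.02, succeeded: true },
];

// ...versus a stronger model that succeeds on the first attempt.
const strongModelAttempts: Attempt[] = [{ costUsd: 0.05, succeeded: true }];

const cheapModelCost = costPerSuccessfulTask(cheapModelAttempts);   // about 0.08
const strongModelCost = costPerSuccessfulTask(strongModelAttempts); // 0.05
```

Per call the cheap model looks better, but per successful task it ends up more expensive in this made-up scenario.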

2. Keep Prompts Focused ✂️

Long prompts are not automatically better prompts.

Compare these instructions:

Please carefully analyze the following content and provide a
comprehensive but concise summary containing all important information.

And:

Summarize the text in five bullets. Preserve decisions, dates, and risks.

The second version is shorter and more precise.

A good prompt should define:

  • The task

  • The relevant constraints

  • The expected output format

  • The completion condition

Avoid repeated politeness, duplicated rules, and vague instructions such as "give the best possible answer" unless they create measurable value.

3. Stop Sending the Entire Conversation History 🧹

Many chat applications resend every previous message with each request.

That is convenient, but rarely efficient.

A better approach is to divide memory into three layers:

Recent messages → Preserve short-term conversational flow

Structured facts → Store durable preferences and decisions

Conversation summary → Compress older discussions

For example, an assistant does not need 40 previous messages to remember that a user prefers TypeScript.

Store the fact directly:

{
  "preferred_language": "TypeScript"
}

Structured memory is often cheaper and more reliable than repeatedly sending raw conversation history.
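
The three layers can be combined into a compact context along these lines. This is a sketch under assumptions: the `Memory` shape and `buildContext` are hypothetical names, and how the summary gets produced (for example, by a periodic summarization call) is left out.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface Memory {
  facts: Record<string, string>; // durable preferences and decisions
  summary: string;               // compressed older discussion
  messages: ChatMessage[];       // full raw history, oldest first
}

// Builds the context from facts, the summary, and only the last few messages.
function buildContext(memory: Memory, recentCount: number): string {
  const factLines = Object.entries(memory.facts).map(([key, value]) => `- ${key}: ${value}`);
  // slice(-0) would return the whole array, so handle zero explicitly.
  const recent = recentCount > 0 ? memory.messages.slice(-recentCount) : [];
  const recentLines = recent.map((message) => `${message.role}: ${message.content}`);
  return [
    "Known facts:",
    ...factLines,
    "Summary of earlier conversation:",
    memory.summary,
    "Recent messages:",
    ...recentLines,
  ].join("\n");
}
```

With 40 stored messages and `recentCount` set to 4, the model sees the TypeScript preference as a single fact line instead of having to rediscover it from the raw history.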

4. Retrieve Less, but Retrieve Better 🔎

A common Retrieval-Augmented Generation, or RAG, mistake is retrieving too many chunks "just to be safe."

More context can produce worse answers when irrelevant passages compete with useful evidence.

A stronger retrieval pipeline can include:

  1. Metadata filtering

  2. Hybrid semantic and keyword search

  3. Reranking

  4. Deduplication

  5. A strict context-token budget

Instead of sending the top 20 chunks directly to the model, retrieve broadly, rerank the results, and pass only the most relevant evidence.

The goal is not maximum context.

It is maximum signal per token.
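
The last two pipeline steps, deduplication and a strict token budget, might be sketched as follows. The scores are assumed to come from a reranker that is not shown, `estimateTokens` is a rough characters-per-token heuristic rather than a real tokenizer, and the duplicate check only catches exact matches after normalization.

```typescript
interface Chunk {
  text: string;
  score: number; // relevance score from a reranker; higher is better
}

// Rough heuristic (about four characters per token); not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Picks the highest-scoring unique chunks that fit within the token budget.
function selectContext(chunks: Chunk[], tokenBudget: number): Chunk[] {
  const seen = new Set<string>();
  const selected: Chunk[] = [];
  let usedTokens = 0;
  // Copy before sorting so the caller's array is left untouched.
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  for (const chunk of ranked) {
    const key = chunk.text.trim().toLowerCase();
    if (seen.has(key)) continue; // drop exact duplicates
    const tokens = estimateTokens(chunk.text);
    if (usedTokens + tokens > tokenBudget) continue; // would exceed the budget
    seen.add(key);
    selected.push(chunk);
    usedTokens += tokens;
  }
  return selected;
}
```

Skipping an oversized chunk instead of stopping at it lets smaller, lower-ranked chunks still use the remaining budget; whether that is the right trade-off depends on how much you trust the ranking.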

5. Route Tasks to the Right Model 🚦

Not every task needs your most capable model.

Simple operations such as:

  • Classification

  • Intent detection

  • Entity extraction

  • Formatting

  • Basic summarization

  • Query rewriting

can often run on smaller and faster models.

Reserve expensive reasoning models for tasks involving:

  • Complex reasoning

  • Ambiguous requirements

  • Architecture decisions

  • Difficult debugging

  • High-risk outputs

A basic routing strategy might look like this:

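// Illustrative types; the model names are placeholders, not real model IDs.
type Model = "small-model" | "reasoning-model" | "general-model";
interface Task { type: string; complexity: "low" | "medium" | "high"; }

function chooseModel(task: Task): Model {
  // Cheap, well-defined tasks go to the smallest model.
  if (task.type === "classification") {
    return "small-model";
  }

  // Hard problems justify the expensive reasoning model.
  if (task.complexity === "high") {
    return "reasoning-model";
  }

  // Everything else runs on the general-purpose default.
  return "general-model";
}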

Model routing usually delivers more value than endlessly removing a few words from a prompt.

6. Use Prompt Caching Correctly ⚡

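Many AI providers can cache repeated prompt prefixes, so requests that begin with identical content can be processed faster and, with some providers, billed at a reduced rate.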

To benefit from caching, organize prompts so that stable content appears first:

1. Static system instructions
2. Stable examples
3. Tool definitions
4. Retrieved context
5. Current user request

Avoid placing timestamps, random identifiers, or frequently changing values near the beginning of an otherwise stable prompt.

A small change near the start may prevent the rest of the prompt from being reused.

Caching is especially useful for:

  • Large system prompts

  • Shared documents

  • Repeated few-shot examples

  • Agent instructions

  • Codebase context
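
The ordering described above can be sketched like this. `PromptParts`, `stablePrefix`, and `assemblePrompt` are hypothetical names, and whether a prefix is actually cached (and at what granularity) depends on your provider; the sketch only shows how to keep the stable part identical across requests.

```typescript
interface PromptParts {
  systemInstructions: string; // static
  examples: string[];         // stable few-shot examples
  toolDefinitions: string[];  // stable while the toolset does not change
  retrievedContext: string[]; // varies per request
  userRequest: string;        // varies per request
}

// Stable content only: identical across requests so a prefix cache can match it.
function stablePrefix(parts: PromptParts): string {
  return [parts.systemInstructions, ...parts.examples, ...parts.toolDefinitions].join("\n\n");
}

// Stable prefix first, per-request content last.
function assemblePrompt(parts: PromptParts): string {
  const dynamicSuffix = [...parts.retrievedContext, parts.userRequest].join("\n\n");
  return `${stablePrefix(parts)}\n\n${dynamicSuffix}`;
}
```

Two requests with different retrieved context and user questions then share exactly the same leading text. A timestamp inside `systemInstructions` would break that property for every request.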

7. Give the Model Only the Tools It Needs 🛠️

Tool-calling agents often receive every available tool definition with every request.

That creates several problems:

  • More input tokens

  • Slower processing

  • Greater tool-selection ambiguity

  • More opportunities for incorrect calls

If a user asks about the weather, the model probably does not need access to billing, CRM, deployment, database, and calendar tools.

Filter tools before calling the model:

const tools = toolRegistry.getToolsForIntent(detectedIntent);

A smaller toolset improves both efficiency and reliability.
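
One possible shape for such a registry is sketched below. The `ToolRegistry` class, the `intents` field, and the example tools are all assumptions for illustration; intent detection itself is assumed to happen earlier, for instance with a small classification model.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  intents: string[]; // intents for which this tool is relevant
}

class ToolRegistry {
  private readonly tools: ToolDefinition[] = [];

  register(tool: ToolDefinition): void {
    this.tools.push(tool);
  }

  // Returns only the tools tagged with the detected intent.
  getToolsForIntent(intent: string): ToolDefinition[] {
    return this.tools.filter((tool) => tool.intents.includes(intent));
  }
}

const toolRegistry = new ToolRegistry();
toolRegistry.register({ name: "get_weather", description: "Current weather for a city", intents: ["weather"] });
toolRegistry.register({ name: "create_invoice", description: "Create a billing invoice", intents: ["billing"] });
toolRegistry.register({ name: "list_events", description: "List calendar events", intents: ["calendar"] });

// A weather question only exposes the weather tool to the model.
const weatherTools = toolRegistry.getToolsForIntent("weather");
```

An unknown intent returns an empty list here; a real system would probably fall back to a small default toolset instead.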

8. Control Output Length 📏

Output tokens are usually priced higher than input tokens, and because they are generated one at a time, they tend to dominate response latency.

Do not request unlimited detail when your application needs a predictable result.

Prefer explicit boundaries:

Return:
- A maximum of five recommendations
- One sentence per recommendation
- No introduction
- Valid JSON only

Structured outputs can also reduce parsing errors and expensive retry loops.

However, avoid requesting large schemas containing fields your application never uses.

Structured waste is still waste.
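
On the application side, the same boundaries can be enforced when parsing the response. The sketch below assumes the model was asked for a JSON array of objects with a single `text` field; that schema and the `parseRecommendations` name are illustrative, not a fixed format.

```typescript
interface Recommendation {
  text: string;
}

const MAX_RECOMMENDATIONS = 5;

// Returns the parsed recommendations, or null if the output breaks the contract.
function parseRecommendations(raw: string): Recommendation[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!Array.isArray(parsed) || parsed.length > MAX_RECOMMENDATIONS) {
    return null; // wrong shape or too many items
  }
  const items: Recommendation[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) return null;
    const text = (item as { text?: unknown }).text;
    if (typeof text !== "string") return null;
    items.push({ text }); // keep only the field the application uses
  }
  return items;
}
```

Returning `null` gives the caller one clear signal to act on, for example a single bounded retry with a specific error message, instead of letting malformed output flow further into the system.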

9. Prevent Expensive Retry Loops 🔁

Retries are one of the least visible sources of token waste.

A poorly designed agent may repeatedly call the model because:

  • Tool errors are vague

  • Validation rules are unclear

  • The model cannot detect completion

  • Maximum iteration limits are missing

Bad tool response:

{
  "error": "Request failed"
}

Better tool response:

{
  "error": "INVALID_DATE_RANGE",
  "message": "endDate must be later than startDate",
  "retryable": true
}

Give the model actionable error messages and enforce hard limits:

const MAX_AGENT_STEPS = 8;

An agent without a step limit is not autonomous.

It is an unbounded invoice.
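
A hard limit like that might be enforced as in the sketch below. `runStep` is a hypothetical callback standing in for one model call plus any tool call, and it is synchronous only to keep the example short; a real agent loop would be asynchronous.

```typescript
const MAX_AGENT_STEPS = 8;

type StepResult =
  | { done: true; answer: string }
  | { done: false };

interface AgentOutcome {
  answer: string | null; // null means the step limit was reached
  steps: number;         // number of steps actually executed
}

// Runs the agent until it reports completion or the step limit is hit.
function runAgent(runStep: (step: number) => StepResult): AgentOutcome {
  for (let step = 1; step <= MAX_AGENT_STEPS; step++) {
    const result = runStep(step);
    if (result.done) {
      return { answer: result.answer, steps: step };
    }
  }
  // Hard stop: report failure instead of looping (and billing) forever.
  return { answer: null, steps: MAX_AGENT_STEPS };
}
```

Logging the `steps` value per task also makes runaway loops visible in the metrics from section 1.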

10. Optimize Quality and Cost Together ⚖️

Token reduction should never be evaluated in isolation.

Use a scorecard such as:

| Metric | Before | After |
|---|---|---|
| Average input tokens | 8,400 | 3,100 |
| Average output tokens | 900 | 420 |
| Successful-task rate | 91% | 92% |
| P95 latency | 8.2s | 4.7s |
| Cost per successful task | $0.18 | $0.07 |

A change is useful when it reduces cost or latency without causing an unacceptable quality regression.

Create a representative evaluation dataset before rewriting prompts, changing retrieval limits, or routing requests to smaller models.

Otherwise, you are not optimizing.

You are guessing.
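
The acceptance rule can be written down as a simple check that runs against your evaluation results. This is a sketch: the `Scorecard` shape is an assumption, and the tolerated quality drop (one percentage point here) is an arbitrary placeholder you would choose per product.

```typescript
interface Scorecard {
  successRate: number;           // fraction of evaluation tasks solved, 0..1
  p95LatencySeconds: number;
  costPerSuccessfulTask: number; // USD
}

// Accept a change if quality holds and either cost or latency improves.
function isImprovement(before: Scorecard, after: Scorecard, maxQualityDrop: number): boolean {
  const qualityHeld = after.successRate >= before.successRate - maxQualityDrop;
  const cheaper = after.costPerSuccessfulTask < before.costPerSuccessfulTask;
  const faster = after.p95LatencySeconds < before.p95LatencySeconds;
  return qualityHeld && (cheaper || faster);
}

// Figures taken from the example scorecard above.
const beforeChange: Scorecard = { successRate: 0.91, p95LatencySeconds: 8.2, costPerSuccessfulTask: 0.18 };
const afterChange: Scorecard = { successRate: 0.92, p95LatencySeconds: 4.7, costPerSuccessfulTask: 0.07 };

const acceptChange = isImprovement(beforeChange, afterChange, 0.01); // true
```

The same check rejects a change that halves the cost but drops the success rate from 91% to 80%.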

A Practical Lean AI Checklist ✅

Before sending a request to an LLM, ask:

  • Does the model need every part of this context?

  • Can older messages be summarized?

  • Can structured data replace raw text?

  • Are retrieved chunks relevant and deduplicated?

  • Does this task require the most expensive model?

  • Can stable prompt content be cached?

  • Does the model need every available tool?

  • Is the output length bounded?

  • Are retries limited and observable?

  • Are cost and quality evaluated together?

Final Thoughts 🎯

The future of AI engineering is not just about using more capable models.

It is about building systems that use intelligence efficiently.

Lean AI does not mean stripping prompts until responses become unreliable. It means understanding which tokens create value and removing the ones that do not.

Start with observability. Improve retrieval. Route models intelligently. Compress memory. Limit tools and retries. Then measure the result.

Because the best LLM system is not the one that uses the most context.

It is the one that knows exactly how much context it needs. 🚀

Practical AI Engineering

Part 1 of 1

Practical guides for building efficient, reliable, and production-ready AI applications, covering LLM optimization, prompt design, RAG, agents, evaluation, cost control, and system architecture.
