The 11-Step System Design Process I Use Before Designing Any System 🚀

I'm Abbas Afsharfarnia, an Engineering Manager and hands-on Technical Lead based in Germany. I write about backend architecture, engineering leadership, developer experience, and AI-assisted software delivery. My focus is practical: scaling systems, improving code quality, reducing legacy complexity, mentoring engineers, and building teams that deliver with ownership.

System design can feel messy when you jump straight into databases, queues, caches, cloud services, and architecture diagrams.

The better approach?

Use a repeatable process.

Before choosing Kafka, Redis, Kubernetes, DynamoDB, or any shiny tool, I like to walk through these 11 questions.

They force clarity.
They expose trade-offs.
They make the design easier to defend in interviews, design reviews, and real production work.


1. What are we building? 🎯

Before architecture, clarify the problem.

This is where I define the functional requirements and non-functional requirements.

Functional requirements answer:

What should the system do?

Examples:

  • Users can upload videos

  • Drivers can accept ride requests

  • Customers can place orders

  • Admins can view analytics

Non-functional requirements answer:

How well should the system do it?

Examples:

  • Low latency

  • High availability

  • Strong consistency

  • Security

  • Scalability

  • Fault tolerance

Bad system design usually starts with vague requirements.

Good system design starts by reducing ambiguity.


2. How big is it? 📏

Once the requirements are clear, estimate the scale.

Not perfectly. Roughly.

This includes:

  • Daily active users

  • Requests per second

  • Read/write ratio

  • Storage growth

  • Bandwidth usage

  • Latency expectations

This is where back-of-the-envelope calculations matter.

Numbers turn vague architecture into engineering.
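
A back-of-the-envelope estimate can be as small as the sketch below. Every input is an assumed, illustrative number (1 million daily active users, 20 requests per user per day, 10% writes, 1 KB per write, a 3x peak factor), and the function name `estimate` is hypothetical. The point is the shape of the calculation, not the values.

```python
# Back-of-the-envelope sizing. All inputs are assumptions for illustration.
SECONDS_PER_DAY = 86_400


def estimate(dau, requests_per_user_per_day, write_fraction,
             bytes_per_write, peak_factor=3.0):
    """Return rough average/peak request rates and yearly storage growth."""
    total_requests = dau * requests_per_user_per_day
    avg_rps = total_requests / SECONDS_PER_DAY
    writes_per_day = total_requests * write_fraction
    storage_per_year_gb = writes_per_day * bytes_per_write * 365 / 1e9
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "write_rps": avg_rps * write_fraction,
        "storage_per_year_gb": storage_per_year_gb,
    }


result = estimate(dau=1_000_000, requests_per_user_per_day=20,
                  write_fraction=0.1, bytes_per_write=1_000)
# Roughly 231 average RPS, about 694 RPS at peak,
# and about 730 GB of new data per year.
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

With these assumed inputs, a single well-sized relational database is probably enough. Change the inputs by two orders of magnitude and the answer changes with them.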


3. What is the system made of? 🧱

Now we define the core architecture.

This is where I decide:

  • Monolith or microservices?

  • Which services exist?

  • What are the domain boundaries?

  • What should be synchronous?

  • What should be asynchronous?

  • What belongs together?

  • What should be isolated?

For many systems, a modular monolith is a better starting point than microservices.

Microservices are not a personality trait.

They add operational cost, distributed transactions, network failures, observability complexity, and deployment coordination.

Use them when the boundaries are real.

Important topics here:

  • Tech stack selection

  • Domain-Driven Design

  • Multi-tenant SaaS architecture

  • Serverless architecture

  • Containers and orchestration

  • Cloud infrastructure, especially AWS

The key question is:

What architecture gives us enough scalability without unnecessary complexity?


4. How do clients talk to the system? 🔌

Next, design the client communication layer.

This includes:

  • REST APIs

  • GraphQL

  • WebSockets

  • API Gateway

  • Authentication

  • Authorization

  • Rate limits

  • Sessions or JWTs

REST is usually the safest default.

GraphQL is useful when clients need flexible querying.

WebSockets are useful when the server must push updates in real time.

Examples:

  • Chat app → WebSockets

  • Internal admin panel → REST or GraphQL

  • Public API → REST with strong versioning

  • Real-time dashboard → WebSockets or Server-Sent Events

Also define:

  • API versioning

  • Request validation

  • Error format

  • Pagination

  • Idempotency keys

  • Rate limiting

  • Auth boundaries

Your API is the contract. Treat it like a product.
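
Idempotency keys are the item on that list people most often skip, so here is a minimal sketch. It assumes the client sends a unique key with each logical request and that responses are kept in an in-memory dict. The names `IdempotentHandler` and `create_order` are hypothetical, and a real service would more likely use a shared store with an expiry.

```python
class IdempotentHandler:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self, handler):
        self._handler = handler
        # key -> response. In-memory only; an assumption for this sketch.
        self._responses = {}

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._responses:
            # Retry of a request we already completed: do not run it again.
            return self._responses[idempotency_key]
        response = self._handler(payload)
        self._responses[idempotency_key] = response
        return response


orders = []


def create_order(payload):
    """Hypothetical handler with a side effect we must not repeat."""
    orders.append(payload)
    return {"order_id": len(orders), "status": "created"}


api = IdempotentHandler(create_order)
first = api.handle("key-123", {"item": "book"})
retry = api.handle("key-123", {"item": "book"})  # e.g. client timed out and retried
print(first == retry, len(orders))
```

The retry gets the same response and only one order exists. This sketch does not handle two concurrent requests with the same key; that needs an atomic "insert if absent" in the store.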


5. How does traffic reach and move through the system? 🌍

Now trace the request path:

Client → DNS → CDN → Load Balancer → API Gateway → Service → Cache/Database

Important components:

  • DNS

  • TCP vs UDP

  • CDN

  • Load balancers

  • Reverse proxies

  • Edge caching

  • Application caching

For example:

  • Static assets should go through CDN

  • Hot reads can use Redis or Memcached

  • Expensive API responses can be cached

  • Global users may need regional routing

Caching is powerful, but dangerous when used lazily.

Always ask:

What can become stale, and how bad is it if it does?
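
To make the staleness question concrete, here is a small cache-aside sketch with a time-to-live. The class name `TTLCache`, the 30-second TTL, and the fake clock are assumptions for illustration; Redis or Memcached would play the role of the dict in a real deployment.

```python
import time


class TTLCache:
    """Cache-aside with a time-to-live: serve cached values until they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # hit: may be stale by up to ttl_seconds
        value = loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call on writes when staleness is not acceptable for this key."""
        self._entries.pop(key, None)


# A fake clock keeps the example deterministic.
fake_now = [0.0]
prices = {"p1": 10}  # stands in for the database
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

before = cache.get_or_load("p1", prices.get)
prices["p1"] = 12                              # the database changes
stale = cache.get_or_load("p1", prices.get)    # still the old value
fake_now[0] = 31.0                             # TTL has passed
fresh = cache.get_or_load("p1", prices.get)
print(before, stale, fresh)
```

For up to 30 seconds the cache returns the old price. Whether that is fine or a bug depends entirely on what the value is used for.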


6. Where does data live? 🗄️

This is the heart of most system design problems.

Choose storage based on access patterns, not hype.

Ask:

  • Is the data relational?

  • Do we need transactions?

  • Do we need strong consistency?

  • Is the workload read-heavy or write-heavy?

  • Do we need full-text search?

  • Do we need analytics?

  • Do we need time-series storage?

Common choices:

  • PostgreSQL/MySQL for relational data

  • DynamoDB/Cassandra for massive key-value workloads

  • Elasticsearch/OpenSearch for search

  • Redis for cache and ephemeral data

  • S3/object storage for files

  • ClickHouse/BigQuery/Snowflake for analytics

Also consider:

  • Indexing

  • Sharding

  • Replication

  • Partitioning

  • Transaction isolation

  • CAP theorem

  • PACELC theorem

  • CQRS

The real skill is not picking SQL or NoSQL.

The real skill is understanding the trade-off you just accepted.
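
Sharding is a good example of an accepted trade-off. The sketch below assumes 4 shards and `user_id` as the shard key, and uses plain dicts in place of real database nodes; all names are hypothetical. Lookups by key touch one shard, while any query on another attribute has to visit all of them.

```python
import hashlib

NUM_SHARDS = 4  # assumed for illustration


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard with a stable hash.

    Python's built-in hash() for strings is salted per process,
    so it is not suitable for routing.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for database nodes


def put_user(user_id, record):
    shards[shard_for(user_id)][user_id] = record


def get_user(user_id):
    # Lookup by shard key: exactly one shard is touched.
    return shards[shard_for(user_id)].get(user_id)


def find_by_country(country):
    # Query on a non-key attribute: every shard must be scanned.
    return [record
            for shard in shards
            for record in shard.values()
            if record["country"] == country]


put_user("u1", {"name": "Ada", "country": "DE"})
put_user("u2", {"name": "Grace", "country": "US"})
put_user("u3", {"name": "Linus", "country": "DE"})
print(get_user("u2"), len(find_by_country("DE")))
```

Simple modulo routing also means that changing the number of shards moves most keys. That is another trade-off to accept knowingly, or to avoid with a scheme such as consistent hashing.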


7. What happens asynchronously? 📨

Not everything should happen inside the request-response cycle.

Slow, unreliable, or retryable work should often be async.

Examples:

  • Sending emails

  • Processing payments

  • Generating reports

  • Video transcoding

  • Updating search indexes

  • Sending notifications

  • Syncing third-party systems

This is where queues and streams come in.

Common tools:

  • RabbitMQ

  • Kafka

  • SQS

  • SNS

  • Redis Streams

  • Google Pub/Sub

Important patterns:

  • Retries

  • Dead-letter queues

  • Idempotency

  • Eventual consistency

  • Saga pattern

  • Durable execution

  • Workflow orchestration

Async design is not just “throw it into Kafka.”

You need to answer:

What happens if this message is processed twice, late, or never?
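
A minimal consumer sketch that answers part of that question, assuming at-least-once delivery, a unique `id` on every message, and an in-memory set and list standing in for a deduplication store and a dead-letter queue. The names `consume` and `send_email` and the retry budget of 3 are hypothetical.

```python
MAX_ATTEMPTS = 3  # assumed retry budget


def consume(messages, handler, processed_ids, dead_letters):
    """Process messages at least once, skip duplicates, park poison messages."""
    for message in messages:
        if message["id"] in processed_ids:
            continue  # processed twice? No: the duplicate delivery is ignored.
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(message)
                # Note: handling and marking are not atomic in this sketch.
                processed_ids.add(message["id"])
                break
            except Exception as exc:
                if attempt == MAX_ATTEMPTS:
                    # Out of retries: keep the message for inspection
                    # instead of blocking the queue or dropping it silently.
                    dead_letters.append({"message": message, "error": str(exc)})


sent = []


def send_email(message):
    """Hypothetical handler that fails on bad input."""
    if message["to"] is None:
        raise ValueError("missing recipient")
    sent.append(message["to"])


processed, dead_letter_queue = set(), []
batch = [
    {"id": "m1", "to": "a@example.com"},
    {"id": "m1", "to": "a@example.com"},  # duplicate delivery
    {"id": "m2", "to": None},             # will never succeed
]
consume(batch, send_email, processed, dead_letter_queue)
print(sent, len(dead_letter_queue))
```

This covers "twice" and "fails repeatedly". "Late" and "never" are not handled here; they usually need monitoring on queue depth and message age rather than more consumer code.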


8. What fails and how do we recover? 🔥

Everything fails.

Servers fail.
Databases fail.
Networks fail.
Deployments fail.
Humans fail.

So design for failure from the beginning.

Think about:

  • High availability

  • Disaster recovery

  • Graceful degradation

  • Failover

  • Backups

  • Restore testing

  • RPO and RTO

  • Circuit breakers

  • Timeouts

  • Bulkheads

Example:

If the recommendation service is down, the homepage should still load.

Maybe with popular items.
Maybe with cached results.
Maybe without recommendations.

But the whole system should not collapse because one dependency failed.

A serious design explains failure modes clearly.
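
The recommendation example can be sketched as a small circuit breaker with a fallback. This is a simplified illustration, not a production implementation: the class, the threshold of 2 failures, the 30-second reset, and the `POPULAR_ITEMS` fallback are all assumptions.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry after a pause."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: do not touch the failing dependency
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # success closes the circuit
        return result


POPULAR_ITEMS = ["item-1", "item-2"]  # hypothetical precomputed fallback


def fetch_recommendations():
    raise ConnectionError("recommendation service is down")


breaker = CircuitBreaker(failure_threshold=2)


def homepage():
    recommendations = breaker.call(fetch_recommendations,
                                   fallback=lambda: POPULAR_ITEMS)
    return {"banner": "Welcome", "recommendations": recommendations}


page = homepage()
print(page)
```

The homepage still renders, and after two failures the breaker stops calling the broken service for a while. A real call would also need a timeout, otherwise a slow dependency hurts as much as a dead one.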


9. How do we observe and respond? 📊

If you cannot observe it, you cannot operate it.

A production system needs:

  • Logs

  • Metrics

  • Traces

  • Alerts

  • Dashboards

  • SLOs

  • Error budgets

  • Incident response process

Useful questions:

  • How do we know the system is healthy?

  • What alerts matter?

  • What should wake someone up?

  • How do we debug a slow request?

  • How do we trace a user action across services?

  • What is the rollback plan during an incident?

Observability is not adding logs everywhere.

It is designing the system so failures become diagnosable.
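
SLOs and error budgets are easier to discuss with numbers. A small sketch, assuming a request-based availability SLO and a 30-day window; the function names and the sample traffic figures are hypothetical.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of full downtime the SLO tolerates within the window."""
    return window_days * 24 * 60 * (1 - slo_target)


def error_budget(slo_target, total_requests, failed_requests):
    """How much of a request-based error budget has been used."""
    allowed_failures = total_requests * (1 - slo_target)
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# A 99.9% target over 30 days leaves about 43.2 minutes of downtime.
print(f"{allowed_downtime_minutes(0.999):.1f} minutes")

# Assumed traffic: 1,000,000 requests so far, 400 of them failed.
budget = error_budget(0.999, total_requests=1_000_000, failed_requests=400)
print(f"{budget['budget_consumed']:.0%} of the error budget is used")
```

A number like "40% of the budget used" gives a team a shared basis for deciding whether to keep shipping or to slow down and stabilize.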


10. How do we ship it safely? 🚢

A good design also covers delivery.

Many outages are not caused by architecture.

They are caused by deployment.

Safe delivery includes:

  • CI/CD pipelines

  • Automated tests

  • Feature flags

  • Canary releases

  • Blue-green deployments

  • Database migration strategy

  • Rollbacks

  • Backward-compatible changes

For database changes, be especially careful.

A safe migration often has multiple steps:

  1. Add new column

  2. Deploy code that writes both old and new fields

  3. Backfill data

  4. Read from the new field

  5. Remove old field later
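
The steps above can be sketched with an in-memory SQLite database. The `users` table and the rename from `name` to `full_name` are hypothetical, and in a real rollout each step would ship as its own deployment, with the backfill done in batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")  # pre-migration row

# Step 1: add the new column as nullable, so existing code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


# Step 2: deployed code writes both the old and the new field.
def create_user(name):
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))


create_user("Grace Hopper")

# Step 3: backfill rows that were written before the dual-write deploy.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


# Step 4: reads switch to the new field.
def list_full_names():
    rows = conn.execute("SELECT full_name FROM users ORDER BY id")
    return [row[0] for row in rows]


print(list_full_names())

# Step 5 (dropping the old column) is deliberately left out:
# it belongs in a later release, once nothing reads `name` anymore.
```

Every intermediate state works with both the previous and the next version of the code, which is what makes a rollback possible at each step.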

Fast teams do not move fast because they skip safety.

They move fast because safety is automated.


11. How do we secure it? 🔐

Security should not be a final checkbox, but it deserves explicit attention.

Key areas:

  • Identity and Access Management

  • RBAC / ABAC

  • Service-to-service permissions

  • Least privilege

  • Secrets management

  • Network security

  • Data encryption

  • Audit logs

  • Tenant isolation

  • Secure defaults

For multi-tenant SaaS systems, tenant isolation is critical.

Ask:

  • Can one tenant access another tenant's data?

  • Is tenant ID enforced at the database/query layer?

  • Are background jobs tenant-aware?

  • Are logs leaking sensitive data?

  • Are backups protected?

  • Are admin tools properly restricted?

Security is not only about attackers.

It is also about preventing accidental damage by your own system.
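
One way to enforce the tenant ID at the query layer is to make it impossible to build a query without it. A minimal sketch, assuming a shared `orders` table with a `tenant_id` column and an in-memory SQLite database; the class and tenant names are hypothetical.

```python
import sqlite3


class TenantScopedOrders:
    """Data access bound to one tenant; callers cannot omit the tenant filter."""

    def __init__(self, conn, tenant_id):
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, item):
        self._conn.execute(
            "INSERT INTO orders (tenant_id, item) VALUES (?, ?)",
            (self._tenant_id, item),
        )

    def list_items(self):
        rows = self._conn.execute(
            "SELECT item FROM orders WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [row[0] for row in rows]

    def get(self, order_id):
        # Filtering on id alone would let a guessed id cross tenants.
        row = self._conn.execute(
            "SELECT item FROM orders WHERE id = ? AND tenant_id = ?",
            (order_id, self._tenant_id),
        ).fetchone()
        return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, item TEXT)"
)

acme = TenantScopedOrders(conn, "acme")
globex = TenantScopedOrders(conn, "globex")
acme.add("laptop")     # becomes order id 1
globex.add("monitor")  # becomes order id 2

print(acme.list_items(), globex.get(1))
```

Here `globex.get(1)` returns nothing even though order 1 exists, because it belongs to another tenant. The same scoped object can be handed to background jobs, which covers the "tenant-aware jobs" question as well.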


Final Thought 🧠

System design is not about memorizing architecture diagrams.

It is about asking better questions.

My 11-question process is:

  1. What are we building?

  2. How big is it?

  3. What is the system made of?

  4. How do clients talk to the system?

  5. How does traffic reach and move through the system?

  6. Where does data live?

  7. What happens asynchronously?

  8. What fails and how do we recover?

  9. How do we observe and respond?

  10. How do we ship it safely?

  11. How do we secure it?

The best engineers are not the ones who always choose the most advanced technology.

They are the ones who understand the trade-offs, failure modes, and constraints.

Design the system.

Then challenge the design.

That is where real engineering starts. ⚙️

System Design Fundamentals

Part 1 of 1

System Design Fundamentals is a practical series for engineers who want to design scalable, reliable, secure, and production-ready systems. Each post breaks down one core topic with clear trade-offs, real-world patterns, and decision-making frameworks.
