11-Step System Design Process for Engineers

System design can feel messy when you jump straight into databases, queues, caches, cloud services, and architecture diagrams.

The better approach?

Use a repeatable process.

Before choosing Kafka, Redis, Kubernetes, DynamoDB, or any shiny tool, I like to walk through these 11 questions.

They force clarity.
They expose trade-offs.
They make the design easier to defend in interviews, design reviews, and real production work.

1. What are we building? 🎯

Before architecture, clarify the problem.

This is where I define the functional requirements and non-functional requirements.

Functional requirements answer:

What should the system do?

Examples:

Users can upload videos
Drivers can accept ride requests
Customers can place orders
Admins can view analytics

Non-functional requirements answer:

How well should the system do it?

Examples:

Low latency
High availability
Strong consistency
Security
Scalability
Fault tolerance

Bad system design usually starts with vague requirements.

Good system design starts by reducing ambiguity.

2. How big is it? 📏

Once the requirements are clear, estimate the scale.

Not perfectly. Roughly.

This includes:

Daily active users
Requests per second
Read/write ratio
Storage growth
Bandwidth usage
Latency expectations

This is where back-of-the-envelope calculations matter.

Numbers turn vague architecture into engineering.
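
To make the back-of-the-envelope step concrete, here is a minimal sketch of the arithmetic. Every input is hypothetical (10 million daily users, 20 requests per user, 10% writes, 2 KB per write, a 3x peak factor); the point is the shape of the calculation, not the numbers.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau, requests_per_user, write_fraction, bytes_per_write, peak_factor=3.0):
    """Turn rough product assumptions into rough load and storage numbers."""
    total_requests = dau * requests_per_user
    avg_rps = total_requests / SECONDS_PER_DAY
    writes_per_day = total_requests * write_fraction
    return {
        "avg_rps": avg_rps,
        # Peak traffic is assumed to be a simple multiple of the average.
        "peak_rps": avg_rps * peak_factor,
        "write_rps": writes_per_day / SECONDS_PER_DAY,
        "storage_gb_per_day": writes_per_day * bytes_per_write / 1e9,
    }


# Hypothetical inputs, chosen only for illustration.
result = estimate(
    dau=10_000_000,
    requests_per_user=20,
    write_fraction=0.1,
    bytes_per_write=2_000,
)

for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

With these assumed inputs the estimate comes out to roughly 2,315 requests per second on average and about 40 GB of new data per day, which is already enough to rule some designs in or out.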

3. What is the system made of? 🧱

Now we define the core architecture.

This is where I decide:

Monolith or microservices?
Which services exist?
What are the domain boundaries?
What should be synchronous?
What should be asynchronous?
What belongs together?
What should be isolated?

For many systems, a modular monolith is a better starting point than microservices.

Microservices are not a personality trait.

They add operational cost, distributed transactions, network failure modes, observability complexity, and deployment coordination.

Use them when the boundaries are real.

Important topics here:

Tech stack selection
Domain-Driven Design
Multi-tenant SaaS architecture
Serverless architecture
Containers and orchestration
Cloud infrastructure, especially AWS

The key question is:

What architecture gives us enough scalability without unnecessary complexity?

4. How do clients talk to the system? 🔌

Next, design the client communication layer.

This includes:

REST APIs
GraphQL
WebSockets
API Gateway
Authentication
Authorization
Rate limits
Sessions or JWTs

REST is usually the safest default.

GraphQL is useful when clients need flexible querying.

WebSockets are useful when the server must push updates in real time.

Examples:

Chat app → WebSockets
Internal admin panel → REST or GraphQL
Public API → REST with strong versioning
Real-time dashboard → WebSockets or Server-Sent Events

Also define:

API versioning
Request validation
Error format
Pagination
Idempotency keys
Rate limiting
Auth boundaries

Your API is the contract. Treat it like a product.
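
One item from that list, idempotency keys, is easy to sketch. The example below assumes an in-memory store and a hypothetical `OrderService`; a real service would keep the key-to-response mapping in a durable store and expire old keys.

```python
import uuid


class OrderService:
    """Hypothetical service showing idempotency-key handling for a create call."""

    def __init__(self):
        self._orders = {}
        # Maps an idempotency key to the response returned the first time.
        self._responses_by_key = {}

    def create_order(self, idempotency_key, payload):
        # A retried request with the same key gets the original response back
        # instead of creating a second order.
        if idempotency_key in self._responses_by_key:
            return self._responses_by_key[idempotency_key]

        order_id = str(uuid.uuid4())
        self._orders[order_id] = payload
        response = {"order_id": order_id, "status": "created"}
        self._responses_by_key[idempotency_key] = response
        return response

    def order_count(self):
        return len(self._orders)


service = OrderService()
first = service.create_order("key-123", {"item": "book", "qty": 1})
retry = service.create_order("key-123", {"item": "book", "qty": 1})
print(first == retry, service.order_count())
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot place the same order twice.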

5. How does traffic reach and move through the system? 🌍

Now trace the request path:

Client → DNS → CDN → Load Balancer → API Gateway → Service → Cache/Database

Important components:

DNS
TCP vs UDP
CDN
Load balancers
Reverse proxies
Edge caching
Application caching

For example:

Static assets should go through CDN
Hot reads can use Redis or Memcached
Expensive API responses can be cached
Global users may need regional routing

Caching is powerful, but dangerous when used lazily.

Always ask:

What can become stale, and how bad is it if it does?
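
The staleness question becomes visible once you write the cache down. This is a minimal cache-aside sketch with a time-to-live, using a hypothetical `TTLCache` class and an injectable clock so the behavior is easy to test; it is not a replacement for Redis or Memcached.

```python
import time


class TTLCache:
    """Cache-aside helper: serve from memory until the entry expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            # Cache hit: this value may be up to ttl_seconds out of date.
            return entry[0]
        # Cache miss or expired entry: go to the source of truth.
        value = loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call this after a write when staleness is not acceptable.
        self._entries.pop(key, None)
```

The TTL is the answer to "how stale can this be?", and `invalidate` is the escape hatch for data where the answer is "not at all".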

6. Where does data live? 🗄️

This is the heart of most system design problems.

Choose storage based on access patterns, not hype.

Ask:

Is the data relational?
Do we need transactions?
Do we need strong consistency?
Is the workload read-heavy or write-heavy?
Do we need full-text search?
Do we need analytics?
Do we need time-series storage?

Common choices:

PostgreSQL/MySQL for relational data
DynamoDB/Cassandra for massive key-value and wide-column workloads
Elasticsearch/OpenSearch for search
Redis for cache and ephemeral data
S3/object storage for files
ClickHouse/BigQuery/Snowflake for analytics

Also consider:

Indexing
Sharding
Replication
Partitioning
Transaction isolation
CAP theorem
PACELC theorem
CQRS

The real skill is not picking SQL or NoSQL.

The real skill is understanding the trade-off you just accepted.
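
Sharding is a good place to see a trade-off in a few lines. The sketch below assumes simple hash-modulo routing with a hypothetical `shard_for` helper; it is one possible scheme, not a recommendation.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process for strings, so it is not stable across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_keys(keys, old_shards, new_shards):
    """Count keys whose shard changes when the shard count changes."""
    return sum(1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards))


keys = [f"user-{i}" for i in range(1000)]
print(shard_for("user-42", 8))
print(moved_keys(keys, 8, 9), "of", len(keys), "keys move when going from 8 to 9 shards")
```

The trade-off you just accepted: routing is trivial and evenly spread, but changing the shard count remaps most keys. Schemes such as consistent hashing exist to reduce that movement, at the cost of more machinery.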

7. What happens asynchronously? 📨

Not everything should happen inside the request-response cycle.

Slow, unreliable, or retryable work should often be async.

Examples:

Sending emails
Processing payments
Generating reports
Video transcoding
Updating search indexes
Sending notifications
Syncing third-party systems

This is where queues and streams come in.

Common tools:

RabbitMQ
Kafka
SQS
SNS
Redis Streams
Google Pub/Sub

Important patterns:

Retries
Dead-letter queues
Idempotency
Eventual consistency
Saga pattern
Durable execution
Workflow orchestration

Async design is not just “throw it into Kafka.”

You need to answer:

What happens if this message is processed twice, late, or never?
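
Here is a small sketch of how a consumer might answer the "twice" and "never" parts of that question. It assumes messages are dicts with a unique `id`, keeps the processed-ID set in memory, and uses a plain list as a stand-in for a dead-letter queue; all names are hypothetical.

```python
def consume(messages, handler, max_attempts=3):
    """Process messages with deduplication, bounded retries, and a dead-letter list."""
    processed_ids = set()
    dead_letters = []

    for message in messages:
        # Processed twice: skip anything we have already handled successfully.
        if message["id"] in processed_ids:
            continue

        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed_ids.add(message["id"])
                break
            except Exception:
                # Never processed: after the last attempt, park the message
                # for inspection instead of retrying forever or dropping it.
                if attempt == max_attempts:
                    dead_letters.append(message)

    return processed_ids, dead_letters


def example_handler(message):
    if message["body"] == "poison":
        raise ValueError("cannot process this message")


processed, dead = consume(
    [
        {"id": "m1", "body": "ok"},
        {"id": "m1", "body": "ok"},      # duplicate delivery
        {"id": "m2", "body": "poison"},  # always fails
    ],
    example_handler,
)
print(sorted(processed), [m["id"] for m in dead])
```

In a real system the processed-ID check and the side effect need to be recorded together (for example in one database transaction), otherwise a crash between the two brings the duplicate problem back.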

8. What fails and how do we recover? 🔥

Everything fails.

Servers fail.
Databases fail.
Networks fail.
Deployments fail.
Humans fail.

So design for failure from the beginning.

Think about:

High availability
Disaster recovery
Graceful degradation
Failover
Backups
Restore testing
RPO and RTO
Circuit breakers
Timeouts
Bulkheads

Example:

If the recommendation service is down, the homepage should still load.

Maybe with popular items.
Maybe with cached results.
Maybe without recommendations.

But the whole system should not collapse because one dependency failed.

A serious design explains failure modes clearly.
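
The recommendation example can be sketched with a very small circuit breaker. This is a simplified illustration with a hypothetical `CircuitBreaker` class and made-up thresholds; production libraries handle concurrency, half-open limits, and metrics that are left out here.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: do not touch the dependency, degrade gracefully.
                return fallback()
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_recommendations():
    # Stand-in for a call to a recommendation service that is currently down.
    raise TimeoutError("recommendation service did not respond")


breaker = CircuitBreaker()
homepage_items = breaker.call(fetch_recommendations, lambda: ["popular item A", "popular item B"])
print(homepage_items)
```

The homepage still renders with popular items, and after a few failures the breaker stops sending traffic to the broken dependency until the reset period has passed.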

9. How do we observe and respond? 📊

If you cannot observe it, you cannot operate it.

A production system needs:

Logs
Metrics
Traces
Alerts
Dashboards
SLOs
Error budgets
Incident response process

Useful questions:

How do we know the system is healthy?
What alerts matter?
What should wake someone up?
How do we debug a slow request?
How do we trace a user action across services?
What is the rollback plan during an incident?

Observability is not adding logs everywhere.

It is designing the system so failures become diagnosable.
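
SLOs and error budgets are mostly arithmetic, so a short sketch helps. The 99.9% target and the request counts below are assumptions for illustration only.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_remaining) for a time window."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        remaining = 1.0 if failed_requests == 0 else 0.0
    else:
        remaining = 1.0 - failed_requests / allowed_failures
    return allowed_failures, remaining


def allowed_downtime_minutes(slo_target, window_days=30):
    """Translate an availability target into minutes of downtime per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


allowed, remaining = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
print(f"downtime allowed in 30 days: {allowed_downtime_minutes(0.999):.1f} minutes")
```

A 99.9% target over 30 days allows about 43 minutes of downtime. A useful rule for "what should wake someone up?" is to alert on how fast the budget is being spent rather than on every individual error.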

10. How do we ship it safely? 🚢

A good design also covers delivery.

Many outages are not caused by architecture.

They are caused by deployment.

Safe delivery includes:

CI/CD pipelines
Automated tests
Feature flags
Canary releases
Blue-green deployments
Database migration strategy
Rollbacks
Backward-compatible changes

For database changes, be especially careful.

A safe migration often has multiple steps:

Add new column
Deploy code that writes both old and new fields
Backfill data
Read from the new field
Stop writing the old field
Remove old field later

Fast teams do not move fast because they skip safety.

They move fast because safety is automated.
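
The multi-step migration described above can be walked through against an in-memory SQLite database. The `users` table and the `name` to `display_name` rename are hypothetical, and each step below would normally be its own deploy rather than one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Linus",)])

# Step 1: add the new column. It is nullable, so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: deploy code that writes both the old and the new field.
def create_user(name):
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, name),
    )


create_user("Grace")

# Step 3: backfill rows that were written before the dual-write deploy.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")


# Step 4: switch reads to the new field.
def get_display_names():
    rows = conn.execute("SELECT display_name FROM users ORDER BY id")
    return [row[0] for row in rows]


names = get_display_names()
print(names)

# Removing the old column is deliberately left out: it happens in a later
# release, once nothing reads or writes `name` any more.
```

At every point in this sequence, both the previous and the next version of the code can run against the schema, which is what makes a rollback possible.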

11. How do we secure it? 🔐

Security should not be a final checkbox, but it deserves explicit attention.

Key areas:

Identity and Access Management
RBAC / ABAC
Service-to-service permissions
Least privilege
Secrets management
Network security
Data encryption
Audit logs
Tenant isolation
Secure defaults

For multi-tenant SaaS systems, tenant isolation is critical.

Ask:

Can one tenant access another tenant’s data?
Is tenant ID enforced at the database/query layer?
Are background jobs tenant-aware?
Are logs leaking sensitive data?
Are backups protected?
Are admin tools properly restricted?

Security is not only about attackers.

It is also about preventing accidental damage by your own system.
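
One way to enforce tenant ID at the query layer is to make it impossible to build a query without it. The sketch below assumes a hypothetical `invoices` table and a `TenantScopedRepo` wrapper; some databases also offer row-level security, which is not shown here.

```python
import sqlite3


class TenantScopedRepo:
    """Data access object that is bound to one tenant for its whole lifetime."""

    def __init__(self, conn, tenant_id):
        self._conn = conn
        self._tenant_id = tenant_id

    def list_invoices(self):
        # The tenant filter lives here, not in the calling code.
        return self._conn.execute(
            "SELECT id, amount FROM invoices WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()

    def get_invoice(self, invoice_id):
        # Looking up by ID alone would let one tenant guess another's IDs.
        return self._conn.execute(
            "SELECT id, amount FROM invoices WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, invoice_id),
        ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, tenant_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO invoices (id, tenant_id, amount) VALUES (?, ?, ?)",
    [(1, "acme", 100), (2, "globex", 250), (3, "acme", 75)],
)

acme = TenantScopedRepo(conn, "acme")
print(acme.list_invoices())
print(acme.get_invoice(2))  # invoice 2 belongs to another tenant
```

Background jobs and admin tools can use the same wrapper, so "forgot the WHERE clause" stops being a possible bug rather than something code review has to catch.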

Final Thought 🧠

System design is not about memorizing architecture diagrams.

It is about asking better questions.

My 11-question process is:

What are we building?
How big is it?
What is the system made of?
How do clients talk to the system?
How does traffic reach and move through the system?
Where does data live?
What happens asynchronously?
What fails and how do we recover?
How do we observe and respond?
How do we ship it safely?
How do we secure it?

The best engineers are not the ones who always choose the most advanced technology.

They are the ones who understand the trade-offs, failure modes, and constraints.

Design the system.

Then challenge the design.

That is where real engineering starts. ⚙️
