Distributed Systems

20 deep questions — principles, patterns, real-world examples

1 of 20
Can you achieve exactly-once delivery in a distributed system?
A. Yes — Kafka supports exactly-once
B. No — use at-least-once + idempotent consumers for effectively-once
C. Yes — with two-phase commit
D. Yes — with at-most-once delivery
Hint

Two Generals Problem — mathematically proven impossible. What's the practical workaround?

Detailed explanation

The Two Generals Problem (1975) proves that two parties communicating over an unreliable channel can never be 100% certain the other side received the message. Every confirmation can itself be lost, leading to an infinite regress of confirmations-of-confirmations.

Three delivery semantics:

At-most-once: send and forget. Fast, but messages can be lost. Acceptable for analytics events — losing 0.01% of page views is fine.

At-least-once: retry until acknowledgment received. No message lost, but duplicates possible. The sender doesn't know if the original was processed and only the ack was lost, or if the message truly didn't arrive.

Exactly-once: impossible across network boundaries. Even Kafka's "exactly-once" is actually "effectively-once" within its own ecosystem — it uses idempotent producers plus transactions (consumers read only committed messages) internally.

The practical solution — effectively-once:

POST /transfer
{ "idempotency_key": "tx-abc-123", "amount": 100 }

Server:
1. INSERT INTO transfers (idempotency_key, ...) 
   -- UNIQUE constraint on idempotency_key
2. If UNIQUE violation → already processed → return stored result
3. If OK → process → return new result

Client timeout → retry with SAME key → gets original result
Result: processed exactly once, even though delivered 2+ times

Related patterns: Outbox pattern (atomic DB write + event), Idempotent consumer, Deduplication table. Real-world: Stripe, Revolut, every payment API uses idempotency keys.

2 of 20
What is the CAP theorem?
A. A system can have Caching, Authentication, and Performance simultaneously
B. You must choose between Consistency, Atomicity, and Partitioning
C. In a network partition, you must choose between Consistency and Availability — can't have both
D. All distributed systems are eventually consistent
Hint

C = Consistency, A = Availability, P = Partition tolerance. When the network splits, pick C or A.

Detailed explanation

CAP Theorem (Eric Brewer, 2000): a distributed data store can provide at most two of three guarantees simultaneously:

Consistency (C): every read receives the most recent write or an error. All nodes see the same data at the same time.

Availability (A): every request receives a response (not an error), without guarantee it's the most recent data.

Partition tolerance (P): system continues operating despite network partitions (messages between nodes lost or delayed).

The key insight: partitions WILL happen (network failures are inevitable). So you're really choosing between C and A when a partition occurs:

CP systems (choose Consistency): during partition, reject requests rather than return stale data. PostgreSQL primary, MongoDB primary, ZooKeeper. Use for: financial ledgers, payment processing — wrong balance is catastrophic.

AP systems (choose Availability): during partition, return possibly stale data rather than reject requests. Cassandra, DynamoDB, DNS. Use for: shopping carts, social feeds, display balances — stale data is acceptable.

Common misconception: CAP doesn't mean you permanently lose one property. It only applies DURING a network partition. When the network is healthy, you can have all three. The choice is: what happens when things go wrong?

In practice (banking): write path = CP (ledger, source of truth). Read path = AP (balance cache, might be milliseconds stale). This is CQRS — separate models for writes and reads.

3 of 20
What is the outbox pattern?
A. Publish events to Kafka before writing to database
B. Use a separate event store instead of a relational DB
C. Write business data + event to DB in ONE transaction. A relay publishes from outbox to Kafka
D. Store Kafka offsets in the database
Hint

Problem: DB write succeeds but Kafka publish fails → data inconsistency. How to make both atomic?

Detailed explanation

The problem: you need to write to DB AND publish an event. These are two separate systems — no shared transaction.

// Naive approach — breaks!
db.save(transfer)          // ✓ succeeds
kafka.publish(event)       // ✗ fails — network error
// DB has the transfer, but no event published
// Downstream consumers never learn about it

The outbox pattern solves this with a single DB transaction:

BEGIN TRANSACTION;
  INSERT INTO transfers (id, from_account, to_account, amount) VALUES (...);
  INSERT INTO outbox (id, event_type, payload) VALUES (...);
COMMIT;
// Both writes are atomic — either both succeed or both fail

Then a separate relay process:

// Runs continuously (or on schedule):
1. SELECT * FROM outbox WHERE published = false;
2. For each row: kafka.publish(row.payload);
3. UPDATE outbox SET published = true WHERE id = row.id;
// If relay crashes → restarts → re-reads unpublished rows
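
A minimal Kotlin sketch of such a relay, assuming hypothetical OutboxRepository and EventPublisher interfaces (they stand in for the outbox table access and kafka.publish; this is an illustration, not a specific library API):

data class OutboxRow(val id: Long, val eventType: String, val payload: String)

interface OutboxRepository {
    fun findUnpublished(limit: Int): List<OutboxRow>   // SELECT ... WHERE published = false
    fun markPublished(id: Long)                        // UPDATE ... SET published = true
}

interface EventPublisher {
    fun publish(eventType: String, payload: String)    // e.g. wraps a Kafka producer
}

class OutboxRelay(private val outbox: OutboxRepository, private val publisher: EventPublisher) {
    fun pollOnce() {
        for (row in outbox.findUnpublished(limit = 100)) {
            publisher.publish(row.eventType, row.payload)  // a crash after this line but before the next
            outbox.markPublished(row.id)                   // produces a duplicate, so consumers must be idempotent
        }
    }
}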

Trade-offs: events are eventually published (slight delay). Consumers may receive duplicates (relay crash after publish but before marking) — must be idempotent.

Alternative implementations: Debezium (CDC — captures DB changes from WAL and publishes to Kafka automatically), polling publisher, transaction log tailing.

Related patterns: Idempotent consumer (handles duplicate events), Saga (multi-step workflows using events), Event sourcing (events AS the primary store). Real-world: standard at Revolut, Starling, most fintechs for reliable event publishing.

4 of 20
What is a circuit breaker pattern?
A. Retries failed requests indefinitely until success
B. After N failures, stops calling downstream service — fails fast instead of waiting
C. Encrypts all service communication
D. Routes traffic to a backup service
Hint

Named after electrical circuit breakers — trips when overloaded to prevent fire.

Detailed explanation

The problem — cascade failure: Service A calls Service B. B is slow (5s timeout). A's threads all wait for B. A runs out of threads. Now A can't serve ANY requests — even those that don't need B. A's callers time out too. The failure cascades through the entire system.

Circuit breaker — three states:

CLOSED (normal)
  → Requests pass through to downstream service
  → Count consecutive failures
  → If failures > threshold (e.g., 5) → switch to OPEN

OPEN (tripped)
  → ALL requests immediately fail (no downstream call!)
  → Return fallback or error instantly (1ms, not 5s timeout)
  → Wait for cooldown period (e.g., 30 seconds)
  → After cooldown → switch to HALF-OPEN

HALF-OPEN (testing)
  → Allow ONE request through to test if service recovered
  → If success → switch to CLOSED (resume normal)
  → If failure → switch back to OPEN (still broken)
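
A hand-rolled sketch of this state machine in Kotlin, deliberately simplified and not thread-safe; real libraries such as Resilience4j add sliding failure-rate windows, metrics, and concurrency handling:

class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val cooldownMillis: Long = 30_000
) {
    private enum class State { CLOSED, OPEN, HALF_OPEN }
    private var state = State.CLOSED
    private var consecutiveFailures = 0
    private var openedAt = 0L

    fun <T> call(block: () -> T): T {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= cooldownMillis) {
                state = State.HALF_OPEN                  // cooldown over: let a test request through
            } else {
                error("Circuit open, failing fast")      // no downstream call, returns immediately
            }
        }
        return try {
            val result = block()
            state = State.CLOSED                         // success: resume normal operation
            consecutiveFailures = 0
            result
        } catch (e: Exception) {
            consecutiveFailures++
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN                       // trip the breaker
                openedAt = System.currentTimeMillis()
            }
            throw e
        }
    }
}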

Why this matters: without circuit breaker, 1000 requests pile up waiting for the dead service, each consuming a thread for 5 seconds. With circuit breaker: fail immediately after 5 failures, free threads for healthy requests.

Implementation: Resilience4j (current standard for JVM), Hystrix (Netflix, deprecated). Spring Cloud Circuit Breaker provides abstraction over both.

Related patterns: Bulkhead (isolate resources per dependency), Retry with exponential backoff (before circuit trips), Fallback (return cached/default value when circuit is open), Timeout (don't wait forever).

Real-world: Payment service calls AML (anti-money-laundering) check. AML is down. Without breaker: all transfers blocked for 30 seconds each. With breaker: transfers fail fast with "temporarily unavailable," queue for retry, don't block the whole system.

5 of 20
What is idempotency and why is it critical for payment APIs?
A. Ensuring requests are processed in order
B. Caching responses for faster subsequent requests
C. Rejecting all duplicate requests with 409 Conflict
D. Executing N times = same result as once. UNIQUE on idempotency_key — retries return original result
Hint

Client times out → retries → how to prevent double-charging the customer?

Detailed explanation

Idempotency = performing an operation multiple times produces the same result as performing it once. Mathematical: f(f(x)) = f(x).

Why it's critical: in distributed systems, you can never be sure a message was delivered. Client sends payment, server processes it, response is lost. Client retries. Without idempotency: customer charged twice.

Implementation pattern:

// Client generates a unique key ONCE per logical operation:
POST /transfers
Headers: Idempotency-Key: tx-abc-123
Body: { "from": "A1", "to": "A2", "amount": 100 }

// Server logic:
1. Check: SELECT * FROM transfers WHERE idempotency_key = 'tx-abc-123'
2. If found → return stored result (200 OK with original response)
3. If not found → process transfer → store with idempotency_key
   INSERT INTO transfers (idempotency_key, ...) VALUES ('tx-abc-123', ...)
   -- UNIQUE constraint prevents race condition

// Retry (same key):
→ UNIQUE constraint catches it → return original result
→ Customer charged exactly once
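
A Kotlin sketch of the server-side logic, assuming a hypothetical TransferRepository whose insert relies on the UNIQUE constraint on idempotency_key; the duplicate-key exception is the race-safe path when two retries arrive concurrently:

data class TransferResult(val transferId: String, val status: String)

class DuplicateKeyException : RuntimeException()

interface TransferRepository {
    fun findByIdempotencyKey(key: String): TransferResult?
    // insert() throws DuplicateKeyException when the UNIQUE constraint on idempotency_key fires
    fun insert(key: String, from: String, to: String, amount: Long): TransferResult
}

class TransferService(private val repo: TransferRepository) {
    fun transfer(key: String, from: String, to: String, amount: Long): TransferResult {
        repo.findByIdempotencyKey(key)?.let { return it }      // seen before: return the stored result
        return try {
            repo.insert(key, from, to, amount)                 // process and persist with the key
        } catch (e: DuplicateKeyException) {
            repo.findByIdempotencyKey(key)!!                   // a concurrent retry won the race: return its result
        }
    }
}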

Why not just reject duplicates (option C)? Because the client doesn't know if the first request succeeded. If you return 409 Conflict, client doesn't know: "Was I charged or not?" Returning the original result is better — client knows the state.

Naturally idempotent operations: GET (read), PUT (set to specific value), DELETE (remove). NOT naturally idempotent: POST (create), increment operations (balance += 100). These need explicit idempotency keys.

Related: Two Generals Problem (why retries are necessary), At-least-once delivery (retries produce duplicates), Outbox pattern (reliable event publishing). Real-world: Stripe's Idempotency-Key header, Revolut's request_id, AWS SQS deduplication ID.

6 of 20
What is the difference between orchestration and choreography in microservices?
A. Orchestration: central coordinator tells services what to do. Choreography: services react to events independently
B. Orchestration is synchronous, choreography is asynchronous
C. They are the same — just different names
D. Orchestration uses Kafka, choreography uses REST
Hint

Orchestra conductor vs jazz musicians reacting to each other.

Detailed explanation

Orchestration — central conductor:

TransferOrchestrator:
  1. Call AccountService.debit(from, amount)     → wait for response
  2. Call AccountService.credit(to, amount)      → wait for response  
  3. Call NotificationService.send(result)        → wait for response
  4. Return result to client

The orchestrator KNOWS the full workflow.
If step 2 fails → orchestrator calls compensating action (refund step 1).

Pros: easy to understand (one place shows the full flow), easy to debug (follow the orchestrator), easy to add business logic (if/else in one place). Cons: single point of failure (orchestrator down = everything down), tight coupling (orchestrator knows about all services), can become a "god service."

Choreography — event-driven reactions:

TransferService:
  → publishes "TransferInitiated" event

AccountService (listens for TransferInitiated):
  → debits the source account
  → publishes "AccountDebited" event

AccountService (listens for AccountDebited):
  → credits the destination account
  → publishes "TransferCompleted" event

NotificationService (listens for TransferCompleted):
  → sends notification

No central controller. Each service reacts to events.

Pros: no single point of failure, loose coupling (services don't know about each other), independently deployable. Cons: hard to see the full flow (distributed across services), hard to debug (events fly everywhere), hard to handle failures (who is responsible for compensation?).

In practice — hybrid: use orchestration for critical flows (payment processing — need clear error handling), choreography for non-critical side effects (notifications, analytics — fire events, whoever cares can listen).

Related patterns: Saga (orchestrated or choreographed multi-step workflow), Event-driven architecture, CQRS (events feed read models). Real-world: Uber uses orchestration for ride matching (critical), choreography for driver ratings (non-critical).

7 of 20
What is the saga pattern?
A. A distributed transaction using two-phase commit
B. A sequence of local transactions where each step has a compensating action to undo it on failure
C. A pattern for reading data from multiple services
D. A logging pattern for distributed tracing
Hint

No distributed ACID. Each service commits locally. If step 3 fails, undo steps 1 and 2.

Detailed explanation

The problem: in a monolith, one DB transaction wraps everything. In microservices, each service has its own DB — no cross-service transactions. How to ensure all-or-nothing across services?

The saga: a sequence of local transactions, each with a compensating action:

Step 1: OrderService.createOrder()
  Compensating: OrderService.cancelOrder()

Step 2: PaymentService.chargeCustomer()
  Compensating: PaymentService.refundCustomer()

Step 3: InventoryService.reserveItems()
  Compensating: InventoryService.releaseItems()

Step 4: ShippingService.scheduleDelivery()
  Compensating: ShippingService.cancelDelivery()

If Step 3 fails:
  → Run compensating for Step 2 (refund)
  → Run compensating for Step 1 (cancel order)
  → Steps 1 and 2 are "undone"

Two implementation styles:

Orchestrated saga: a central saga coordinator manages the sequence. Knows all steps, calls compensations on failure. Easier to understand, harder to scale.

Choreographed saga: each service publishes events, next service reacts. No coordinator. Harder to debug but more resilient.
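
A Kotlin sketch of the orchestrated style: each step pairs an action with its compensation, and on failure the already-completed steps are compensated in reverse order. The service calls in the usage comment are assumptions, not a real API:

class SagaStep(val name: String, val action: () -> Unit, val compensate: () -> Unit)

class SagaOrchestrator(private val steps: List<SagaStep>) {
    fun execute() {
        val completed = mutableListOf<SagaStep>()
        for (step in steps) {
            try {
                step.action()
                completed += step
            } catch (e: Exception) {
                // Step failed: compensate completed steps in reverse order.
                // Compensations themselves must be retried and idempotent in practice.
                completed.asReversed().forEach { it.compensate() }
                throw e
            }
        }
    }
}

// Usage sketch (orderService, paymentService, inventoryService are assumed to exist):
// SagaOrchestrator(listOf(
//     SagaStep("order",   { orderService.createOrder() },      { orderService.cancelOrder() }),
//     SagaStep("payment", { paymentService.chargeCustomer() }, { paymentService.refundCustomer() }),
//     SagaStep("stock",   { inventoryService.reserveItems() }, { inventoryService.releaseItems() }),
// )).execute()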

Key insight: compensating actions are NOT rollbacks. They're new forward actions that semantically undo the previous step. A refund is a new transaction, not a database rollback. The original charge still exists in the audit trail.

Challenges: compensating actions can also fail (need retry), intermediate states are visible to other queries (not isolated like ACID), designing compensations for every step is complex.

Related: Two-phase commit (alternative — but slow, blocking, not recommended for microservices), Outbox pattern (reliable event publishing for choreographed sagas), Idempotent consumers (compensations may be delivered multiple times).

8 of 20
What is event sourcing?
A. Publishing events to Kafka for real-time analytics
B. Using events instead of REST for service communication
C. Storing every state change as an immutable event. Current state = replay of all events
D. A pattern for sourcing data from external APIs
Hint

A bank statement IS event sourcing — balance is derived from the list of all transactions.

Detailed explanation

Traditional approach — store current state:

accounts table: { id: "A1", balance: 700 }
// How did it get to 700? No idea. History lost.

Event sourcing — store events:

events table:
  { type: "AccountOpened",  amount: 0,    timestamp: T1 }
  { type: "Deposited",      amount: 1000, timestamp: T2 }
  { type: "Withdrawn",      amount: 200,  timestamp: T3 }
  { type: "Transferred",    amount: 100,  timestamp: T4 }

Current balance = replay: 0 + 1000 - 200 - 100 = 700
Balance at T2 = replay up to T2: 0 + 1000 = 1000
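
The replay is just a fold over the event list. A minimal Kotlin sketch:

sealed interface AccountEvent { val amount: Long }
data class AccountOpened(override val amount: Long = 0) : AccountEvent
data class Deposited(override val amount: Long) : AccountEvent
data class Withdrawn(override val amount: Long) : AccountEvent
data class Transferred(override val amount: Long) : AccountEvent

// Current state = left fold of all events over an initial balance of 0
fun balance(events: List<AccountEvent>): Long =
    events.fold(0L) { acc, event ->
        when (event) {
            is AccountOpened -> acc
            is Deposited     -> acc + event.amount
            is Withdrawn     -> acc - event.amount
            is Transferred   -> acc - event.amount
        }
    }

// balance(listOf(AccountOpened(), Deposited(1000), Withdrawn(200), Transferred(100))) == 700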

Benefits:

Complete audit trail: every change is recorded. Required by financial regulations.

Time travel: reconstruct state at any point in time. "What was the balance on March 15?"

Debugging: replay events to reproduce bugs. "What happened between 2pm and 3pm?"

New read models: add a new projection later — replay all events to build it. No data migration needed.

Challenges:

Complexity: much harder than CRUD. Need event store, projections, snapshots.

Schema evolution: events are immutable — how to handle format changes? Upcasting (convert old event versions to the current schema when they are read).

Replay performance: million events = slow state rebuild. Fix: periodic snapshots.

Eventual consistency: projections (read models) are updated asynchronously — slightly stale.

Almost always paired with CQRS: write side = event store (append-only). Read side = projections (materialized views optimized for queries). Events flow from write to read via async subscription.

Real-world: bank ledgers (every transaction is an immutable event), Git (commits are events, working directory is derived state), blockchain (blocks are events, balances are derived).

9 of 20
What is CQRS?
A. Command Query Responsibility Segregation — separate models for writes (commands) and reads (queries)
B. A caching strategy for read-heavy workloads
C. A pattern for SQL query optimization
D. A message queue protocol
Hint

Write model = source of truth (consistent). Read model = fast queries (may be slightly stale).

Detailed explanation

Traditional CRUD: same model for reading and writing. Works fine for simple apps. Breaks down at scale — reads and writes have different optimization needs.

CQRS separates them:

WRITE SIDE (Commands):                  READ SIDE (Queries):
  Command: Transfer                       Query: GetBalance
    ↓                                       ↓
  Validation                              Read from cache
    ↓               (events) →            or read replica
  Write to ledger                           ↓
  (source of truth)                       Return fast
  PostgreSQL primary                      Redis / read replica
  CP — consistent                         AP — available, fast

Why separate?

Writes need: ACID transactions, validations, pessimistic locks, strong consistency. Optimized for correctness.

Reads need: fast response, possibly denormalized data, no locks. Optimized for speed. Can tolerate milliseconds of staleness.

Example — banking:

Write: INSERT INTO journal_entries (...) — append-only ledger, ACID, PostgreSQL primary.

Read: GET /accounts/A1/balance — Redis cache, updated asynchronously from journal events. Balance shown in app may be 100ms behind actual ledger. Acceptable for display.
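
A condensed Kotlin sketch of the split, with hypothetical Ledger, EventBus, and BalanceCache interfaces: the command handler owns correctness, the query handler only reads the asynchronously updated cache.

data class TransferCommand(val from: String, val to: String, val amount: Long)
data class BalanceChanged(val accountId: String, val newBalance: Long)
data class LedgerEntry(val fromBalance: Long, val toBalance: Long)

interface Ledger { fun append(from: String, to: String, amount: Long): LedgerEntry }
interface EventBus { fun publish(event: Any) }
interface BalanceCache { fun get(accountId: String): Long? }

// WRITE SIDE: validate, append to the ledger (source of truth), emit events
class TransferCommandHandler(private val ledger: Ledger, private val events: EventBus) {
    fun handle(cmd: TransferCommand) {
        require(cmd.amount > 0) { "amount must be positive" }
        val entry = ledger.append(cmd.from, cmd.to, cmd.amount)        // ACID, strongly consistent
        events.publish(BalanceChanged(cmd.from, entry.fromBalance))    // feeds the read model asynchronously
        events.publish(BalanceChanged(cmd.to, entry.toBalance))
    }
}

// READ SIDE: serve balances from a cache kept up to date by a subscriber to BalanceChanged
class BalanceQueryHandler(private val cache: BalanceCache) {
    fun balance(accountId: String): Long? = cache.get(accountId)       // fast, possibly milliseconds stale
}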

When NOT to use CQRS: simple CRUD apps, low traffic, when the added complexity isn't justified. CQRS adds event propagation, eventual consistency handling, two models to maintain.

Related: Event sourcing (natural fit — events feed read projections), CAP theorem (write = CP, read = AP), Materialized views (read-side optimization).

10 of 20
What is the difference between at-most-once and at-least-once delivery?
A. At-most-once is more reliable
B. At-least-once never has duplicates
C. They are the same in practice
D. At-most-once: may lose messages (fire and forget). At-least-once: no loss but may duplicate (retry until ack)
Hint

At-most-once = fire and forget. At-least-once = retry until confirmed.

Detailed explanation

At-most-once: send message, don't retry. If it's lost, it's lost. Zero or one delivery.

Producer: send(message)     → sent
         don't wait for ack
         move to next message
// Result: fast, simple, but messages can vanish
// Use for: metrics, analytics, logging — losing 0.01% is OK

At-least-once: send message, wait for acknowledgment, retry if no ack received. One or more deliveries.

Producer: send(message)     → sent
         wait for ack...    → timeout!
         retry: send(message) → sent
         wait for ack...    → received!
// But: did the first message arrive and only the ack was lost?
// Now the consumer might process it TWICE
// Result: no message lost, but duplicates possible
// Use for: payments, orders — losing a payment is worse than processing twice

The trade-off is clear:

                     Messages lost?    Duplicates?
At-most-once         YES               NO
At-least-once        NO                YES
Exactly-once         NO                NO     ← impossible!

For financial systems: always at-least-once + idempotent consumer. Losing a payment is catastrophic. Processing a duplicate is handled by idempotency key — harmless.

Kafka specifics: acks=0 = at-most-once. acks=all + retries = at-least-once. enable.idempotence=true = idempotent producer (deduplicates within Kafka, but still need consumer-side idempotency for end-to-end).
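
A Kotlin sketch of those producer settings with the standard Kafka client; the broker address, topic, and payload are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord

val props = Properties().apply {
    put("bootstrap.servers", "localhost:9092")
    put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // At-least-once: wait for all in-sync replicas and retry on failure
    put("acks", "all")
    put("retries", Int.MAX_VALUE.toString())
    // Idempotent producer: the broker deduplicates retries within Kafka itself
    put("enable.idempotence", "true")
}

val producer = KafkaProducer<String, String>(props)
producer.send(ProducerRecord("transfers", "payment-123", """{"amount":100}"""))
// End-to-end "effectively-once" still needs an idempotent consumer on the other side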

11 of 20
What is a distributed lock and when do you need one?
A. A database lock that works across tables
B. A lock shared across multiple service instances — only one processes a resource at a time (e.g. Redis Redlock)
C. A Kotlin Mutex that works across coroutines
D. An OS-level file lock
Hint

3 instances of PaymentService running. Only ONE should process payment #123 at a time.

Detailed explanation

The problem: you have 3 replicas of PaymentService behind a load balancer. A Kafka message for payment #123 might be picked up by Instance A. A retry might go to Instance B. Both process the same payment simultaneously — double charge.

Distributed lock solution:

// Using Redis:
val lockKey = "payment-lock:${payment.id}"
val acquired = redis.set(lockKey, instanceId, NX, EX, 30)
//                                            ^    ^
//                                            |    30-second expiry (safety net)
//                                            Only if Not eXists

if (acquired) {
    try {
        processPayment(payment)
    } finally {
        redis.del(lockKey)   // release lock
    }
} else {
    // Another instance holds the lock — skip or retry later
}

Redis Redlock algorithm (for multiple independent Redis nodes): acquire the lock on a majority (N/2 + 1) of the nodes. If the majority is acquired within a timeout → lock obtained. Protects against a single Redis node failure.

Dangers of distributed locks:

TTL expiry: lock expires while processing (took longer than 30s). Another instance acquires lock. Now two are processing.

Clock skew: different servers have slightly different clocks. TTL expires at different times.

Network partition: Redis master fails, replica promoted — lock lost.

Martin Kleppmann's critique: TTL-based distributed locks (including Redlock) are unsafe without fencing tokens, because a paused or slow client can still act after its lock has expired. Prefer idempotency (process duplicates safely) over locking (prevent duplicates) when possible.

Alternatives: database advisory locks (pg_advisory_lock), optimistic locking (version column), idempotency keys (often the best choice). Use distributed lock when: idempotency is truly impossible, or the operation has physical side effects (sending email, charging card).

12 of 20
What is backpressure?
A. Compressing data to reduce network load
B. Encrypting messages in transit
C. Consumer signals producer to slow down because it can't keep up
D. Caching at the load balancer level
Hint

Producer: 10,000 msg/s. Consumer: 1,000 msg/s. What happens to the other 9,000?

Detailed explanation

The problem: producer creates data faster than consumer can process. Without backpressure: buffer grows → memory exhausted → OOM crash, or messages dropped silently.

Backpressure mechanisms:

Pull-based (Kafka): consumer pulls messages at its own pace. If it's slow, messages accumulate in Kafka (which is designed for this — retention on disk). Consumer lag grows but nothing crashes. Most elegant solution.

Reactive Streams (request(N)): consumer tells producer "send me 10 items." After processing, requests 10 more. Producer never sends more than requested. Kotlin Flow uses this internally — when collector is slow, flow producer suspends.

TCP window size: TCP itself has backpressure. Receiver advertises a window size — sender can't send more than the window allows. If receiver is slow, window shrinks to zero — sender stops.

Bounded queues: in-memory queue with max size. When full, producer blocks (or drops). Java's BlockingQueue(capacity), Kotlin's Channel(capacity).

// Kotlin Channel with backpressure:
val channel = Channel<Payment>(capacity = 100)

// Producer — suspends when the channel buffer is full:
launch {
    payments.forEach { channel.send(it) }  // suspends once 100 items are buffered
    channel.close()                        // signal the consumer that no more items are coming
}

// Consumer — processes at its own pace:
launch { 
    for (payment in channel) {
        processPayment(payment)  // slow is OK — producer waits
    }
}

Without backpressure: unbounded queue → memory grows → OOM. Or fast producer overwhelms slow consumer → data loss. With backpressure: system self-regulates. Slow consumer → producer slows down → system stays healthy.

13 of 20
What is the difference between horizontal and vertical scaling?
A. Horizontal: add more machines. Vertical: add more CPU/RAM to existing machine
B. Horizontal: add more RAM. Vertical: add more machines
C. They are the same
D. Horizontal is for databases, vertical is for applications
Hint

Horizontal = more servers (scale out). Vertical = bigger server (scale up).

Detailed explanation

Vertical scaling (scale up): upgrade to a bigger machine — more CPU, more RAM, faster SSD. Simple: no code changes, no distributed systems complexity. But: has a ceiling (biggest AWS instance is finite), single point of failure, expensive at the top.

Horizontal scaling (scale out): add more machines of the same size. Theoretically unlimited capacity. But: need load balancing, state management, data consistency, distributed coordination.

What scales how:

Stateless services (API, business logic): horizontal scaling is easy. Add more pods in Kubernetes. Load balancer distributes requests. Each instance is identical.

Databases: harder to scale horizontally. Options: read replicas (easy, read-only), sharding (hard, splits data), or managed solutions (Aurora, CockroachDB, Vitess).

Caches (Redis): Redis Cluster for horizontal scaling. Consistent hashing distributes keys across nodes.

Rule of thumb: scale vertically first (simpler). When you hit the ceiling or need high availability, scale horizontally. Stateless services → horizontal from the start. Databases → vertical as long as possible, horizontal when needed.

Kubernetes auto-scaling: Horizontal Pod Autoscaler (HPA) adds/removes pods based on CPU/memory/custom metrics. This is horizontal scaling automated.

14 of 20
What is a message queue vs event stream (RabbitMQ vs Kafka)?
A. They are the same
B. Message queues are faster than event streams
C. Event streams support transactions, message queues don't
D. Queue: message consumed once, then deleted. Stream: retained, multiple consumers can read, replayable
Hint

RabbitMQ: message picked up by one consumer, gone. Kafka: message stays, anyone can read it anytime.

Detailed explanation

Message queue (RabbitMQ, SQS, ActiveMQ):

Producer → [Queue] → Consumer
             ↓
        message DELETED after acknowledgment
        
One message → one consumer (competing consumers pattern)
Message gone after processing — can't replay

Use for: task distribution (process this image, send this email), work queues where each task should be done exactly by one worker.

Event stream (Kafka, Pulsar, Kinesis):

Producer → [Topic/Partition — append-only log]
              ↑                    ↑           ↑
         Consumer Group A    Consumer Group B   Consumer Group C
         (each group gets     (independent      (can start reading
          all messages)        position)          from yesterday)

Messages RETAINED for configurable time (7 days, forever)
Multiple consumer groups read independently
Can replay from any point in time

Use for: event sourcing (store all events), multiple consumers (analytics + notifications + audit all need the same events), replayability (new service joins → replays history).

Kafka specifics: topics divided into partitions. Within a partition: strict ordering guaranteed. Consumer group: each partition consumed by exactly one consumer in the group. Scaling: add partitions + add consumers in the group.
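
A Kotlin sketch of a consumer inside a consumer group using the standard Kafka client; the topic and group names are placeholders:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = Properties().apply {
    put("bootstrap.servers", "localhost:9092")
    put("group.id", "notification-service")   // each group gets its own copy of every message
    put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    put("auto.offset.reset", "earliest")      // a brand-new group starts from the beginning (replay)
}

KafkaConsumer<String, String>(props).use { consumer ->
    consumer.subscribe(listOf("transfers"))
    while (true) {
        val records = consumer.poll(Duration.ofMillis(500))   // pull-based: the consumer sets the pace
        records.forEach { record -> println("offset=${record.offset()} value=${record.value()}") }
    }
}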

In banking: Kafka for core events (transfers, account changes — need retention, audit, multiple consumers). RabbitMQ/SQS for task queues (send notification, generate PDF — process once and delete).

15 of 20
What is eventual consistency?
A. Given enough time without new updates, all replicas converge to same state — but reads may return stale data temporarily
B. Data is always consistent across all nodes
C. Each node has its own permanent version of data
D. A database transaction isolation level
Hint

You change your profile picture. For 5 seconds, some users see the old one. Eventually, everyone sees the new one.

Detailed explanation

Strong consistency: after a write, ALL subsequent reads (on any node) return the new value. Simple for the developer, expensive for the system (need coordination between nodes).

Eventual consistency: after a write, reads MAY return stale data for a window of time. Eventually (milliseconds to seconds), all replicas sync up and return the new value.

T=0:   Client writes balance=700 to primary
T=1ms: Primary responds "OK"
T=1ms: Client reads from read replica → balance=1000 (stale!)
T=50ms: Replication catches up
T=50ms: Client reads from read replica → balance=700 (consistent)

Where it's acceptable:

• Social media feed (seeing a post 1s late is fine)

• Shopping cart (briefly stale, no money involved)

• Display balance in mobile app (shows "as of last sync")

• DNS propagation (famously eventual — up to 48 hours)

Where it's NOT acceptable:

• Ledger writes (double-spend risk if reading stale balance)

• Payment processing (must check real-time balance)

• Inventory for last-item (two people buy the last item)

In banking — use BOTH: write path is strongly consistent (CP — PostgreSQL primary, SELECT FOR UPDATE). Read path is eventually consistent (AP — Redis cache, read replica). This is CQRS.

16 of 20
What is a dead letter queue (DLQ)?
A. A queue for successfully processed messages
B. A queue that deletes messages after timeout
C. Where messages that repeatedly fail processing are sent for manual inspection, instead of blocking the main queue
D. A backup queue for when primary is down
Hint

A message fails 5 times. Do you retry forever and block everything? Or move it aside?

Detailed explanation

The problem — poison message: a message that always fails (malformed data, bug in consumer logic, missing dependency). Without DLQ, this message is retried infinitely, blocking all subsequent messages in the queue.

Main Queue: [msg1] [msg2-POISON] [msg3] [msg4]
                      ↑
                      Fails every time
                      msg3 and msg4 are BLOCKED

With DLQ:
Attempt 1: process msg2 → FAIL
Attempt 2: process msg2 → FAIL  
Attempt 3: process msg2 → FAIL → move to DLQ!

Main Queue: [msg1] [msg3] [msg4]     ← continues!
DLQ:        [msg2-POISON]             ← for investigation

Operations workflow:

1. Alert fires: "DLQ has 3 messages"

2. Engineer inspects: "Ah, field 'amount' was a string instead of number — producer bug"

3. Fix the producer or consumer

4. Replay DLQ messages back to main queue

Configuration: SQS: maxReceiveCount: 5 → after 5 failed receives, move to DLQ. Kafka: typically implemented in application code (count retries, publish to a DLQ topic after N failures). RabbitMQ: the x-dead-letter-exchange queue argument.
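
A Kotlin sketch of that application-level approach for Kafka: retry a bounded number of times, then hand the message to a DLQ publisher so the main flow keeps moving (process and publishToDlq are assumed callbacks, not a library API):

const val MAX_ATTEMPTS = 3

fun handle(message: String, process: (String) -> Unit, publishToDlq: (String) -> Unit) {
    var attempt = 0
    while (true) {
        try {
            process(message)                       // business logic
            return                                 // success: done
        } catch (e: Exception) {
            attempt++
            if (attempt >= MAX_ATTEMPTS) {
                publishToDlq(message)              // move the poison message aside for inspection
                return                             // the main consumer continues with the next message
            }
        }
    }
}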

Related patterns: Retry with exponential backoff (before DLQ), Circuit breaker (stop trying if downstream is down), Monitoring/alerting (DLQ depth > 0 → alert).

17 of 20
What is rate limiting and which algorithm is commonly used?
A. Limiting database query execution time
B. Limiting requests per user/IP per time window. Token bucket: bucket fills at fixed rate, each request takes a token
C. Limiting the number of microservices in a cluster
D. Limiting the size of HTTP responses
Hint

User makes 10,000 requests in 1 second — is this legitimate? How to protect the system?

Detailed explanation

Why rate limit: protect against abuse (DDoS), prevent resource exhaustion, enforce API usage tiers (free: 100 req/min, paid: 10,000 req/min), ensure fair usage among clients.

Token bucket algorithm (most popular):

Bucket capacity: 10 tokens
Refill rate: 1 token per second

T=0s:   bucket has 10 tokens
T=0s:   3 requests arrive → take 3 tokens → 7 remain → all served
T=0s:   7 more requests → take 7 tokens → 0 remain → all served
T=0s:   1 more request → 0 tokens → 429 Too Many Requests!
T=1s:   1 token added → bucket has 1
T=10s:  10 tokens added → bucket full again → burst allowed

Why token bucket is popular: allows bursts (full bucket = burst of requests), while limiting sustained rate (refill rate = average limit). Smooth and fair.
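
A minimal in-memory token bucket in Kotlin; a distributed limiter would keep the same two fields in Redis instead of in the instance:

class TokenBucket(
    private val capacity: Long = 10,
    private val refillPerSecond: Long = 1
) {
    private var tokens = capacity.toDouble()
    private var lastRefill = System.nanoTime()

    @Synchronized
    fun tryAcquire(): Boolean {
        val now = System.nanoTime()
        val elapsedSeconds = (now - lastRefill) / 1_000_000_000.0
        tokens = minOf(capacity.toDouble(), tokens + elapsedSeconds * refillPerSecond)  // refill at a fixed rate
        lastRefill = now
        return if (tokens >= 1.0) {
            tokens -= 1.0          // the request takes a token
            true
        } else {
            false                  // no tokens left: the caller should return 429 Too Many Requests
        }
    }
}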

Other algorithms:

Sliding window: count requests in last N seconds. More precise than fixed window. Higher memory.

Leaky bucket: requests flow out at fixed rate. No bursts — strict constant rate. Good for APIs that need smooth traffic.

Fixed window counter: count per minute/hour. Simple but has boundary problem (spike at window edge).

Implementation: API Gateway level (Kong, AWS API Gateway — no code change), in-app (Resilience4j RateLimiter, Guava RateLimiter), Redis-based (for distributed rate limiting across instances). Response: HTTP 429 Too Many Requests + Retry-After header.

18 of 20
What is the bulkhead pattern?
A. A pattern for encrypting data at rest
B. A pattern for database partitioning
C. A pattern for logging in microservices
D. Isolating resources so one failing component can't exhaust resources for others
Hint

Ship compartments (bulkheads) — if one floods, others stay dry. Same idea for thread pools.

Detailed explanation

The problem: your service calls Service A (fast) and Service B (slow today). Without bulkhead: all 100 threads waiting for slow Service B. Now requests to Service A also fail — no threads available. One unhealthy dependency kills everything.

WITHOUT bulkhead:
Shared thread pool (100 threads):
  80 threads stuck waiting for Service B (slow, 10s timeout)
  20 threads handling Service A
  → new request for Service A → NO THREADS AVAILABLE → rejected!
  
WITH bulkhead:
Service A pool: 50 threads (dedicated)
Service B pool: 50 threads (dedicated)
  50 threads stuck waiting for Service B → Service B's problem
  50 threads still serving Service A → working perfectly!
  → Service B's slowness is isolated

Types of bulkheads:

Thread pool: separate thread pool per downstream service (Resilience4j, Hystrix).

Connection pool: separate DB connection pool per service/module.

Semaphore: limit concurrent calls per dependency (lighter than thread pool). In Kotlin: Semaphore(permits = 10).

Process/container: separate Kubernetes pods for different responsibilities.

Combined with circuit breaker: bulkhead limits resource usage per dependency. Circuit breaker stops calling when it detects failure. Together: bulkhead prevents resource exhaustion + circuit breaker prevents cascade failure.
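
A coroutine-based sketch of the semaphore-style bulkhead mentioned above: at most 10 in-flight calls to the slow dependency, so it can never exhaust the caller's resources (ServiceBClient and rawCall are illustrative names).

import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

class ServiceBClient(private val rawCall: suspend (String) -> String) {
    // Bulkhead: only 10 concurrent calls to Service B at any time
    private val bulkhead = Semaphore(permits = 10)

    suspend fun call(request: String): String =
        bulkhead.withPermit {
            rawCall(request)       // if Service B is slow, at most 10 coroutines are tied up here
        }
}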

19 of 20
What is distributed tracing?
A. Tracking a request's journey across services using a trace ID propagated through all calls
B. Logging errors in each service independently
C. Monitoring CPU usage across services
D. Tracing database query performance
Hint

One HTTP request hits 5 services. "Why was it slow?" How to see the full picture?

Detailed explanation

The problem: user reports "transfer took 8 seconds." The request went through API Gateway → Auth Service → Transfer Service → Account Service → Notification Service. Which one was slow?

How distributed tracing works:

1. API Gateway generates: trace_id = "abc-123"
2. Passes it in header: X-Trace-ID: abc-123
3. Each service:
   a. Reads trace_id from incoming request
   b. Logs a "span": { trace_id, service_name, start_time, end_time }
   c. Passes trace_id to downstream calls
4. Tracing backend (Jaeger/Datadog) collects all spans
5. UI shows the full request timeline:

   API Gateway    |████|                              2ms
   Auth Service        |██████|                       5ms  
   TransferService           |████████████████|       200ms ← SLOW!
     AccountService               |████████|          100ms
     NotificationService                    |██|      10ms
   Total                                              317ms
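
A simplified Kotlin sketch of context propagation with the JDK HTTP client: reuse the incoming trace ID if present, otherwise start a new trace, and always forward the header downstream. The X-Trace-ID header name is illustrative; OpenTelemetry instrumentation does this (plus span recording) automatically.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.UUID

const val TRACE_HEADER = "X-Trace-ID"

fun callDownstream(url: String, incomingTraceId: String?): HttpResponse<String> {
    val traceId = incomingTraceId ?: UUID.randomUUID().toString()   // reuse or start a new trace
    val request = HttpRequest.newBuilder(URI.create(url))
        .header(TRACE_HEADER, traceId)                              // propagate to the next service
        .GET()
        .build()
    // A real setup also records a span (service name, start/end time) tagged with this traceId
    return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
}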

Key concepts:

Trace: the full journey of one request across all services.

Span: one operation within one service (with start/end time, metadata).

Parent-child: TransferService span is parent of AccountService span.

Context propagation: trace_id passed via HTTP headers, Kafka headers, gRPC metadata.

Standards: OpenTelemetry (current standard — merges OpenTracing + OpenCensus). Provides SDK for auto-instrumentation of HTTP clients, DB queries, message producers/consumers. Backends: Jaeger (open source), Zipkin (open source), Datadog APM, AWS X-Ray.

In Kotlin/Spring: Spring Cloud Sleuth (now Micrometer Tracing) + OpenTelemetry auto-instruments WebFlux, RestTemplate, Kafka. Your requestId from our banking tutorial serves a similar purpose — but distributed tracing adds timing, service topology, and visualization.

20 of 20
What is service mesh?
A. A framework for building microservices in Kotlin
B. Infrastructure layer handling service-to-service communication — mTLS, retries, observability — without changing application code
C. A database connection pool shared between services
D. A monitoring dashboard
Hint

Istio, Linkerd — sidecar proxies handling cross-cutting concerns transparently.

Detailed explanation

The problem: every microservice needs: mTLS encryption, retries, circuit breaking, load balancing, distributed tracing, traffic splitting (canary deployments). Implementing all of this in every service = massive duplication and inconsistency.

Service mesh — infrastructure layer:

Without mesh:
  [Service A]      HTTP     → [Service B]
  Service A must implement: TLS, retries, circuit breaker, tracing...

With mesh (sidecar pattern):
  [Service A] → [Proxy A]      mTLS      [Proxy B] → [Service B]
                 (Envoy)                  (Envoy)
  Proxies handle: encryption, retries, circuit breaking,
  load balancing, tracing, traffic splitting.
  Service A and B know NOTHING about this. Zero code changes.

Architecture:

Data plane: sidecar proxies (Envoy) deployed alongside each service pod. Intercept all network traffic. Handle the actual work.

Control plane: central management (Istio, Linkerd). Configures all proxies. Defines policies: "retry 3 times," "encrypt all traffic," "send 5% to canary."

What it provides:

mTLS everywhere: automatic encryption + authentication between all services.

Observability: distributed tracing, metrics, access logs — without code changes.

Traffic management: canary deployments (5% traffic to new version), A/B testing, blue-green deployments.

Resilience: retries, timeouts, circuit breakers — configured declaratively in YAML, not code.

Trade-offs: added latency (~1ms per hop through proxy), operational complexity (another system to manage), resource overhead (sidecar per pod). Worth it for large microservice deployments (50+ services). Overkill for 5 services.

Implementations: Istio (most popular, complex), Linkerd (simpler, fewer features), Consul Connect (HashiCorp), AWS App Mesh.