Someone is going to suggest microservices. It happens in almost every growing engineering org. A few quarters into a product getting traction, someone connects a slow deploy, a flaky integration, or an increasingly tangled codebase to the same conclusion: "We should break this into microservices."
Before you take that meeting, ask one question: "How many developers do we have?" If the honest answer is "not many", the move usually is not microservices. The move worth seriously evaluating is an event-driven architecture inside the monolith you already have.

Event-driven is a pattern. Microservices is a deployment choice.
People talk about "event-driven systems" like they are automatically microservices, but they are not. Event-driven architecture is a design pattern for decoupling. Microservices is a deployment topology. You can have domain events, async processing, webhook delivery, and decoupled business logic inside a single deployable codebase, which means you can get many of the benefits people want from microservices without paying the organizational and operational costs.
This is not the right choice for every team. But most teams never seriously evaluate it, and that is the problem. If your organization does not have dedicated platform capacity and your team still fits in a single standup cadence, the coordination overhead of distributed services may cost more than it saves. An event-driven monolith deserves a seat at the table before that decision gets made.
What people are really asking for when they ask for microservices
In practice, the microservices pitch is rarely about microservices. It is about one or more of these problems:
- Changes in one area keep breaking other areas.
- Some work needs to happen in the background, and it is making requests slow.
- Integrations and webhooks are unreliable, and support is getting tired of it.
- The codebase does not have clear boundaries.
- Deploys feel risky because too much ships together.
Microservices can address some of that, but an event-driven monolith can address a surprising amount of it while keeping a single codebase, a single deploy, and a single primary runtime to operate.
A concrete example: an event-driven pipeline inside a monolith
A full event-driven pipeline inside a monolith can behave like a small internal platform. The framework does not matter, but the shape of the pattern does. It looks like this:
- A domain action happens.
- The domain publishes an event that describes what happened.
- One or more listeners respond.
- Listeners dispatch queued jobs for async work.
- Jobs retry with backoff.
- The system records traceability so you can follow a chain end to end.
Here is a simplified version of the idea.
// Domain action: persist the record, then announce what happened.
function createRecord(input: RecordInput): Record {
  const record = store.create(input);
  publish(new RecordCreated({ recordId: record.id }));
  return record;
}
// Listener: react to the event by queueing the async work.
function onRecordCreated(event: RecordCreated): void {
  enqueue("webhooks", new DeliverWebhookJob({ recordId: event.recordId }));
}
The mechanics that made it production-grade were not fancy, they were disciplined.
- Jobs used exponential backoff across multiple retry tiers:
  - 10 seconds
  - 60 seconds
  - 10 minutes
  - 1 hour
  - 2 hours
- Each chain carried a correlation ID so you could trace a record from event to listener to job to outbound delivery.
- Events carried references, not payloads, to keep memory stable and prevent accidental data coupling.
- Handlers were written to be idempotent, because retries are not a "maybe". Retries are guaranteed in any real system.
- Failure had an explicit path. Max attempts. Dead letter strategy. Alerting.
That is the part many teams skip. They stop at "we have events," and then they discover the real work is everything around delivery.
The event-driven monolith patterns that actually matter
If you want this to hold up under real traffic, focus on the patterns that create boundaries and make failure survivable.
1. Domain events are the boundary
Events can define module boundaries as effectively as services do, without the deployment overhead. If your billing module publishes InvoicePaid, other parts of the system should not reach into billing internals. They should subscribe to the event and react. That is decoupling, not a new repository.
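To make the boundary idea concrete, here is a minimal sketch of an in-process event bus. Everything in it is an assumption for illustration: the `EventBus` class, the `DomainEvents` catalogue, and the payload shapes are hypothetical, and a real framework will have its own dispatcher.

```typescript
// Hypothetical event catalogue; names and payload shapes are illustrative.
type DomainEvents = {
  InvoicePaid: { invoiceId: string };
  RecordCreated: { recordId: string };
};

type EventName = keyof DomainEvents;
type Listener<K extends EventName> = (payload: DomainEvents[K]) => void;

class EventBus {
  // Listeners are stored untyped internally; subscribe/publish keep callers type-safe.
  private listeners = new Map<EventName, Array<(payload: unknown) => void>>();

  subscribe<K extends EventName>(eventName: K, listener: Listener<K>): void {
    const registered = this.listeners.get(eventName) ?? [];
    registered.push(listener as (payload: unknown) => void);
    this.listeners.set(eventName, registered);
  }

  publish<K extends EventName>(eventName: K, payload: DomainEvents[K]): void {
    for (const listener of this.listeners.get(eventName) ?? []) {
      listener(payload);
    }
  }
}

// The billing module only publishes; it knows nothing about who listens.
const bus = new EventBus();
const creditedInvoices: string[] = [];

// A hypothetical reporting module reacts without touching billing internals.
bus.subscribe("InvoicePaid", ({ invoiceId }) => creditedInvoices.push(invoiceId));
bus.publish("InvoicePaid", { invoiceId: "inv_42" });
```

The typed catalogue is the contract: a module that wants billing data subscribes to `InvoicePaid` instead of importing billing code, so the dependency points at the event, not at the internals.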
2. Queue-based async processing is your leverage
Queues let you take work out of the request path without losing correctness. Give jobs named queues by work type, and treat queue configuration as a product decision, not a default. For example:
- webhooks for outbound deliveries
- emails for user notifications
- imports for long-running ingestion
This is how you get responsiveness without pretending every problem requires a separate service.
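One way to treat queue configuration as a deliberate decision is to write it down as data. The sketch below assumes a hypothetical `QueuePolicy` shape and an in-memory stand-in for the queue backend; every number is made up for illustration and should come from your own traffic and dependencies.

```typescript
type QueueName = "webhooks" | "emails" | "imports";

interface QueuePolicy {
  workers: number;        // concurrent workers for this queue
  maxAttempts: number;    // attempts before the job is dead-lettered
  timeoutSeconds: number; // per-attempt execution limit
}

// Every number below is an assumption for illustration, not a recommendation.
const queuePolicies: Record<QueueName, QueuePolicy> = {
  webhooks: { workers: 8, maxAttempts: 6, timeoutSeconds: 30 },
  emails: { workers: 4, maxAttempts: 3, timeoutSeconds: 15 },
  imports: { workers: 1, maxAttempts: 2, timeoutSeconds: 3600 },
};

interface QueuedJob {
  queue: QueueName;
  jobType: string;
  recordId: string;
}

// In-memory stand-in for a real queue backend.
const pendingJobs = new Map<QueueName, QueuedJob[]>();

function enqueueJob(job: QueuedJob): number {
  const backlog = pendingJobs.get(job.queue) ?? [];
  backlog.push(job);
  pendingJobs.set(job.queue, backlog);
  return backlog.length; // current depth of that queue
}

enqueueJob({ queue: "imports", jobType: "ImportCsv", recordId: "rec_7" });
const webhookDepth = enqueueJob({ queue: "webhooks", jobType: "DeliverWebhook", recordId: "rec_7" });
```

Because a slow import sits in its own queue with its own worker count, it cannot starve webhook delivery, which is most of what people want when they ask for a separate service.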
3. Exponential backoff needs a ceiling
If a dependency is down, retries can turn a small incident into a backlog that never clears. Define max attempts, a backoff schedule with increasing intervals, and a dead letter strategy for manual review or replay. If you do not have an answer for "what happens after the final retry", you do not have a retry strategy yet.
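A retry policy with a ceiling can be small enough to read in one glance. This sketch reuses the retry tiers listed earlier in the article; the function name `decideAfterFailure` and the `RetryDecision` type are hypothetical, and what "dead-letter" means in practice depends on your queue backend.

```typescript
// Delays taken from the tiers listed earlier: 10 s, 60 s, 10 min, 1 h, 2 h.
const RETRY_DELAYS_SECONDS: readonly number[] = [10, 60, 600, 3600, 7200];

type RetryDecision =
  | { action: "retry"; delaySeconds: number }
  | { action: "dead-letter" };

// failedAttempts counts attempts that have failed so far (1 after the first failure).
function decideAfterFailure(failedAttempts: number): RetryDecision {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError("failedAttempts must be a positive integer");
  }
  const delaySeconds = RETRY_DELAYS_SECONDS[failedAttempts - 1];
  if (delaySeconds === undefined) {
    // The schedule is exhausted: stop retrying and hand the job to a human or a replay tool.
    return { action: "dead-letter" };
  }
  return { action: "retry", delaySeconds };
}
```

With five delays, a job gets six attempts in total: the original run plus five retries. The sixth failure returns `dead-letter`, which is the explicit answer to "what happens after the final retry".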
4. Traceability IDs are not optional
If you publish events, you will eventually ask "Did we deliver the webhook?" or "Why did the customer never get the email?" Without a correlation ID that flows through the whole chain, you will be guessing. Add a correlation ID at the boundary of the request or the workflow and carry it through events, logs, jobs, and outbound requests.
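Carrying the ID through the chain mostly means copying one field at every hop. The sketch below assumes hypothetical `TracedEvent` and `TracedJob` shapes; the `X-Correlation-ID` header name is a common convention, not something the article prescribes.

```typescript
interface Traced {
  correlationId: string;
}

interface TracedEvent extends Traced {
  eventName: string;
  recordId: string;
}

interface TracedJob extends Traced {
  queue: string;
  jobType: string;
  recordId: string;
}

// Listener step: the job inherits the ID from the event that caused it.
function toWebhookJob(event: TracedEvent): TracedJob {
  return {
    correlationId: event.correlationId,
    queue: "webhooks",
    jobType: "DeliverWebhook",
    recordId: event.recordId,
  };
}

// Outbound step: the header name is a common convention, assumed here.
function outboundHeaders(job: TracedJob): Record<string, string> {
  return { "X-Correlation-ID": job.correlationId };
}

// Every log line carries the same ID, so one search shows the whole chain.
function logEntry(stage: string, traced: Traced): string {
  return JSON.stringify({ stage, correlationId: traced.correlationId });
}

const tracedEvent: TracedEvent = { correlationId: "corr-123", eventName: "RecordCreated", recordId: "rec_9" };
const tracedJob = toWebhookJob(tracedEvent);
const chainLog = [logEntry("event", tracedEvent), logEntry("job", tracedJob)];
```

Searching logs for `corr-123` would then return the event, the job, and the outbound request together, which is what turns "did we deliver the webhook?" into a lookup instead of an investigation.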
5. Practice payload discipline
As a default, events should carry IDs, not blobs. If you ship a full record snapshot as event data, you create data coupling and invite privacy mistakes. Send the record ID, the event name, and minimal metadata, then let the listener fetch what it needs.
The exception is event-carried state transfer: when consumers need a guaranteed snapshot of the state at the time of the event, or when the source system may be unavailable at read time. That is a legitimate pattern, but treat it as an explicit architectural decision, not the default.
Watch out: Shipping full record snapshots in events is also a privacy risk. If events land in a log aggregator or a dead letter queue, you may be storing PII in places your compliance team did not account for.
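As a rough illustration of the default, the sketch below shows a reference-style event and a listener that looks up what it needs. The `PlanChangedEvent` shape, the `customerStore` map, and the sample customer are all hypothetical stand-ins.

```typescript
interface CustomerRow {
  id: string;
  email: string;
  plan: string;
}

// In-memory stand-in for the owning module's data store.
const customerStore = new Map<string, CustomerRow>([
  ["cus_1", { id: "cus_1", email: "ada@example.com", plan: "pro" }],
]);

// Reference-style event: an ID, the event name, and minimal metadata.
interface PlanChangedEvent {
  eventName: "PlanChanged";
  customerId: string;
  occurredAt: string; // ISO 8601 timestamp
}

// The listener fetches current state itself instead of trusting a snapshot.
function buildPlanEmail(event: PlanChangedEvent): { to: string; subject: string } {
  const customer = customerStore.get(event.customerId);
  if (customer === undefined) {
    throw new Error(`Unknown customer ${event.customerId}`);
  }
  return { to: customer.email, subject: `You are now on the ${customer.plan} plan` };
}

const planChanged: PlanChangedEvent = {
  eventName: "PlanChanged",
  customerId: "cus_1",
  occurredAt: "2026-01-15T09:30:00Z",
};
// This string is what a log aggregator or dead letter queue would end up storing.
const serializedEvent = JSON.stringify(planChanged);
const planEmail = buildPlanEmail(planChanged);
```

The serialized event contains an opaque ID and a timestamp but no email address, so anything that persists the event stores nothing sensitive. The cost is one extra read in the listener, and the listener sees current state rather than state at event time.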
6. Design for idempotency from day one
Every handler must be safe to run multiple times. Not "usually", always. Idempotency is the difference between "we can retry safely" and "we are terrified to retry." If the handler writes data, it needs a guard, and if the handler calls an external API, it needs a dedupe key.
A simple idempotency guard is a unique job ID stored in a fast lookup (Redis, a database unique constraint). Before the handler does real work, check the ID. If it exists, return early. This turns "hope it doesn't run twice" into "it does not matter if it runs twice."
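Here is a minimal sketch of that guard, with an in-memory `Set` standing in for Redis or a unique constraint. The `ProcessedJobs` class and `deliverWebhookOnce` function are hypothetical names. In a real system the claim has to be a single atomic operation in the shared store, because a separate "check, then write" leaves a window where two workers both pass the check.

```typescript
// In-memory stand-in for a shared store (Redis, a unique-constrained table).
class ProcessedJobs {
  private seen = new Set<string>();

  // Returns true only for the first caller to claim a given job ID.
  claim(jobId: string): boolean {
    if (this.seen.has(jobId)) {
      return false;
    }
    this.seen.add(jobId);
    return true;
  }
}

const processedJobs = new ProcessedJobs();
const deliveries: string[] = []; // stands in for the real side effect

function deliverWebhookOnce(jobId: string, recordId: string): "delivered" | "skipped" {
  if (!processedJobs.claim(jobId)) {
    return "skipped"; // a retry or duplicate dispatch: nothing left to do
  }
  deliveries.push(recordId);
  return "delivered";
}

const firstRun = deliverWebhookOnce("job-1", "rec_1");
const retryRun = deliverWebhookOnce("job-1", "rec_1");
```

One design caveat: this sketch claims the ID before doing the work, so a crash between the claim and the side effect would make the retry skip. Depending on the handler, you may want to release the claim on failure or record completion in the same transaction as the write.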
7. Test the chain, not just the units
This is where event-driven architecture earns its reputation for being hard. Unit testing individual handlers is straightforward. Testing the full event chain from domain action through listener, queue, job execution, and outbound delivery is not.
Async flows break in the gaps. The publisher changes the event shape and the consumer does not update. A retry policy interacts with a timeout in a way nobody anticipated. A job succeeds on the first attempt but fails idempotency on the second because the guard checks the wrong key.
These failures do not surface in unit tests. You need contract tests between publishers and consumers to catch schema drift early. You need integration tests that exercise the full chain in a test environment with a real queue, not a mocked one. And you need to accept that these tests will be slower and require more infrastructure than testing synchronous request/response code.
This is real cost. Do not minimize it. The argument is not that event-driven architecture is free of complexity. It is that this complexity is bounded: you set up the test infrastructure once, you write contract tests once per event type, and you maintain them as the system evolves. Compare that to the unbounded coordination cost of operating, deploying, monitoring, and debugging multiple independent services with independent release cycles. For most teams, the first tradeoff is more manageable. But "more manageable" is not "easy," and any team evaluating this path should go in with eyes open.
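A contract test does not need a framework to be useful. The sketch below is one simple way to do it, assuming a hypothetical `EventContract` format in which the consumer declares the fields it relies on and the publisher's sample payload is checked against that declaration.

```typescript
type FieldType = "string" | "number" | "boolean";
type EventContract = { [field: string]: FieldType };

// Declared by the consumer: the fields it relies on for a hypothetical RecordCreated event.
const recordCreatedContract: EventContract = {
  recordId: "string",
  correlationId: "string",
};

// Returns one message per field that is missing or has the wrong type.
function contractViolations(
  sample: { [field: string]: unknown },
  contract: EventContract,
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(contract)) {
    const expected = contract[field];
    const actual = typeof sample[field];
    if (actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}

// Produced by the publisher's test suite from a real event instance.
const currentSample = { recordId: "rec_1", correlationId: "corr-1", extra: true };
// A publisher refactor that renamed recordId to id would look like this.
const driftedSample = { id: "rec_1", correlationId: "corr-1" };

const currentProblems = contractViolations(currentSample, recordCreatedContract);
const driftedProblems = contractViolations(driftedSample, recordCreatedContract);
```

Extra fields pass, because adding data is backward compatible. Renaming or removing a field the consumer depends on fails in the publisher's build, before it reaches a queue.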
So when do you actually need microservices
Microservices are not a technical flex. They are an organizational scaling strategy. There are times when they are the right call, and the clearest ones look like this:
- Independent scaling requirements. One part of the system needs 100x the compute.
- Different runtime requirements. For example, ML model serving versus CRUD APIs.
- Organizational boundaries. Separate teams need separate deploys and separate on-call ownership.
- Regulatory isolation. For example, cardholder data that must be segmented from the rest of the system to limit PCI DSS scope.
If none of those are true, microservices usually introduce more surface area than they remove.
The pushback you will get, and how to address it
If you argue for an event-driven monolith, some people will assume you are defending a "big ball of mud." You can get ahead of that by naming the real risks and showing how you mitigate them.
"Monoliths always become spaghetti."
They do when boundaries are unclear. The counterargument:
- The proposal is not to keep the monolith as it is.
- Domain events enforce module boundaries without requiring separate deployments.
- Cross-module direct calls become a smell to watch for.
- Observability investment makes failures visible.
The point is not "monolith forever", it is "clean boundaries first, then evaluate whether services are the right next step."
"Microservices are the only way to move fast."
Microservices can help teams move independently, but they can also slow a small org down with coordination work. The counterargument:
- Independence comes from boundaries and ownership, not from repositories.
- Events, queues, and clear module contracts can provide independence within a single codebase.
- Microservices become worth revisiting when team structure, scaling needs, or regulatory constraints create a hard requirement that module boundaries alone cannot satisfy.
"Event-driven systems are hard to debug and hard to test."
True on both counts. Async flows are harder to observe in production and harder to verify in test environments than synchronous code. That is a real cost, not a talking point to dismiss. Correlation IDs, structured logs, explicit failure paths, contract tests between publishers and consumers, and chain-level integration tests are all part of the design, not afterthoughts. If you cannot trace a workflow end to end and you cannot test the full event chain in a non-production environment, the architecture is unfinished.
"Events can become a mess too."
Also true, which is why events need governance:
- Keep the event taxonomy small.
- Name events after domain facts, not implementation details.
- Version events carefully.
- Treat backward compatibility as a requirement.
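Versioning carefully can be as plain as an explicit version field plus a function that upgrades old events on read. The sketch below is one possible approach, not the only one; the `InvoicePaidV1` and `InvoicePaidV2` shapes and the default currency are assumptions for illustration.

```typescript
interface InvoicePaidV1 {
  version: 1;
  invoiceId: string;
}

// V2 adds a field. Old events already sitting in queues still have the V1 shape.
interface InvoicePaidV2 {
  version: 2;
  invoiceId: string;
  currency: string;
}

type AnyInvoicePaid = InvoicePaidV1 | InvoicePaidV2;

// Consumers call this first, so handlers only ever deal with the latest shape.
function upgradeInvoicePaid(event: AnyInvoicePaid): InvoicePaidV2 {
  if (event.version === 2) {
    return event;
  }
  // Assumed default: every invoice issued before V2 was billed in USD.
  return { version: 2, invoiceId: event.invoiceId, currency: "USD" };
}

const upgradedInvoice = upgradeInvoicePaid({ version: 1, invoiceId: "inv_1" });
const untouchedInvoice = upgradeInvoicePaid({ version: 2, invoiceId: "inv_2", currency: "EUR" });
```

Keeping the upgrade in one function means a job that was queued before a deploy can still be processed after it, which is what "backward compatibility as a requirement" looks like in code.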
Why this matters more in 2026 than it did a few years ago
Two trends are making the event-driven monolith even more relevant.
First, AI-assisted development works better when module boundaries are explicit. Code generation tools produce more reliable output when the context is a well-bounded module with a clear event contract than when it is a deeply coupled codebase where a change in one area cascades unpredictably. Clean boundaries are not just good architecture. They are the precondition for getting leverage from AI tooling.
Second, the operational cost of microservices has not gone down. Observability platforms, service meshes, and container orchestration are more mature, but they still require dedicated platform investment. For teams without dedicated platform capacity, that investment competes directly with product work. The calculus has not changed: if your constraint is people, not compute, a smaller deployment surface is worth serious consideration.
The teams that will move fastest in the next few years are the ones that treat architecture decisions as resource allocation decisions, not identity decisions.
The takeaway
Microservices are an organizational scaling strategy, not a technical one. An event-driven monolith is not a universal answer either. It has real costs: async flows are harder to test, harder to debug, and require discipline that many teams underestimate.
The argument is not that one is always better than the other. It is that most teams reach for microservices without seriously evaluating the alternative. An event-driven monolith can deliver decoupling, async throughput, and integration resilience while keeping operations simpler. If you outgrow it later, the boundaries you built with events will make the transition to services cleaner than if you had never built them at all.
A question to close on
Are you considering microservices because of a technical constraint you cannot solve inside the monolith, or because you heard a convincing talk? If you can name the constraint, you are ready to choose the right architecture. If you cannot, start by building boundaries.
Quick self-check: If your biggest problem is "deploys feel scary," fix test coverage, release process, and modularity first. Microservices do not solve quality problems, they distribute them.