Modern web application architecture: components, types, and best practices

Web application architecture is the set of structural decisions that determine how a web app's components are separated, connected, and deployed. Those decisions, made early, set the performance ceiling and deployment flexibility you live with at scale, long before any single line of code becomes the bottleneck.

This guide gives senior engineers and technical founders a precise map of every layer, a practical comparison of the main architecture types, the patterns worth knowing, and a decision framework for choosing the right structure before traffic forces the question.

TL;DR: Web application architecture in 30 seconds

Web application architecture is the structural blueprint that defines how a web app's components (presentation, logic, and data) are organized, separated, and connected. Get this blueprint wrong early, and no amount of refactoring fixes the performance ceiling or deployment bottleneck you've locked in.

Our engineering team has architected and scaled web platforms across fintech, marketplace, and media products, from modular monoliths to distributed systems migrated via strangler-fig to microservices. The patterns that work depend on scale, team size, and consistency requirements, not on what's trending.

The core model: a three-tier architecture separates presentation (browser/client), application logic (server), and data (database). In front of that stack sit two critical infrastructure nodes most diagrams omit: an API gateway that routes, authenticates, and rate-limits every inbound request, and a Content Delivery Network that serves static assets and cached responses from the edge, cutting time-to-first-byte before a single line of application code runs. This guide covers the core components, common architecture types, a practical decision framework, and how AI inference is reshaping the stack in 2026.

What web application architecture actually governs

Web application architecture governs the structural decisions that determine how a web app's components are separated, connected, and deployed, not how individual functions are coded. It sits one abstraction above software design: where design answers "how does this class work?", architecture answers "which process owns this concern, and how do processes communicate at scale?"

The scope is concrete. Web application architecture controls:

  • Tier separation: whether presentation, logic, and data layers run in the same process or across independently deployable services
  • Communication contracts: synchronous HTTP/REST, GraphQL, or asynchronous event buses between components
  • State ownership: where sessions, cache, and derived read models live
  • Deployment topology: single server, containerised cluster, serverless functions, or edge nodes
  • Failure boundaries: which component failures cascade and which are isolated

The classical three-tier architecture (presentation tier, application/logic tier, data tier) remains the reference model most teams start from. What has changed is the number of runtime concerns that now sit between those tiers: CDNs, API gateways, caching layers, and message queues that the original three-tier diagram never accounted for.

A modular monolith is often the right starting point before those boundaries harden. Martin Fowler: "You shouldn't start a new project with microservices, even if you're sure your application will be big enough to make it worthwhile." (Martin Fowler, "MonolithFirst", 2015) The monolith packages multiple logical modules into one deployable unit with clean internal interfaces: a working web application architecture that avoids distributed-systems complexity until independent deployment genuinely becomes necessary.

Architecture decisions made at sprint one compound over years. A team that conflates application logic with data-access code in a three-tier system will hit a refactoring ceiling at roughly the same scale every time.

Core layers and infrastructure components

A modern web application architecture separates concerns across three logical tiers (presentation, application logic, and data), plus a fourth infrastructure tier that most diagrams omit but every production system depends on.

Presentation layer (client tier)

The browser renders what users interact with. This tier handles everything from initial HTML parsing to JavaScript hydration, DOM event handling, and client-side routing. In React 19 with the Next.js App Router, this layer is no longer purely client-side: React Server Components execute on the server and stream pre-rendered HTML to the browser, removing the extra round-trip fetch that client components need for their initial data. The practical split is that components with user inputs, local state, or browser APIs stay as client components; data-heavy, read-heavy views become Server Components. Server Components can run once at build time on your CI server, or they can be run for each request using a web server (React documentation, Server Components, react.dev, 2024).

Application logic layer (server tier)

This tier owns business rules, orchestration, and protocol translation. An API gateway sits at the entry point: routing requests, enforcing auth, rate-limiting, and translating between HTTP/REST and internal gRPC or GraphQL contracts. Behind it, application servers execute domain logic: validating inputs, running workflows, calling downstream services. Node.js, Go, and JVM runtimes all appear here depending on team familiarity and throughput requirements. We discuss backend framework trade-offs in depth at backend frameworks.
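To make the gateway's rate-limiting responsibility concrete, here is a minimal token-bucket sketch in Python. It is an illustration only, not how any particular gateway product implements limits: the names `TokenBucket` and `check_rate_limit`, the burst size of 3, and the refill rate of one token per second are all assumptions chosen for the example. The clock is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Token bucket: `capacity` is the burst size, `refill_per_sec` the steady rate."""

    capacity: float
    refill_per_sec: float
    tokens: float = field(init=False)
    last_refill: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so an initial burst is allowed

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (in seconds) may pass."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per caller. In a real gateway the key would come from the
# validated token, and the counters would live in a shared store.
buckets: dict[str, TokenBucket] = {}


def check_rate_limit(caller_id: str, now: float) -> bool:
    if caller_id not in buckets:
        buckets[caller_id] = TokenBucket(capacity=3, refill_per_sec=1.0)
    return buckets[caller_id].allow(now)
```

With these assumed settings a caller can burst three requests, the fourth is rejected, and one more is admitted after a second has passed. Other callers are unaffected because each has its own bucket.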

Data layer

The data tier stores, retrieves, and manages persistence. Most production architectures run at minimum a primary relational store (PostgreSQL is the default choice for transactional data in the projects we architect) and a cache. Redis handles the caching tier: hot reads, session state, rate-limit counters, and pub/sub fan-out. Persistent object storage (S3-compatible) handles binary assets. Optional: Elasticsearch or OpenSearch for full-text and faceted search; BigQuery or Redshift for analytical workloads that would otherwise create read contention on the primary.

Infrastructure tier (the layer most diagrams omit)

The components below sit in front of or alongside your application servers but outside your application code, and they are where most production incidents originate:

| Component | Primary responsibility | Where it lives |
| --- | --- | --- |
| DNS | Resolves domain to IP; traffic steering via latency/geo policies | Managed service (Route 53, Cloudflare) |
| Content Delivery Network | Caches static assets and edge-rendered responses; reduces TTFB and LCP | Edge PoPs, globally distributed |
| Load balancer | Distributes TCP/HTTP connections across healthy application instances; terminates TLS | Cloud LB (ALB, GCP LB) or ingress controller in Kubernetes |
| API gateway | Auth, rate limiting, request routing, protocol translation | Managed (AWS API GW, Kong) or self-hosted |
| Redis | In-memory caching, session store, message broker for lightweight pub/sub | Co-located with app tier or managed (ElastiCache) |
| Message queue / event bus | Async decoupling, back-pressure management, fan-out to consumers | SQS, Kafka, or RabbitMQ |
| Object storage | Binary assets, backups, data lake landing zone | S3, GCS |

The Content Delivery Network is the component teams most often treat as an afterthought. In practice, moving static asset serving and edge caching to a CDN is often the fastest single change to reduce TTFB, the time to first byte that feeds into LCP. TTFB is not itself a Core Web Vital, but web.dev's guidance treats 0.8 seconds or less as good, 0.8 to 1.8 seconds as needing improvement, and more than 1.8 seconds as poor (web.dev, Time to First Byte, 2024).

How the tiers connect in a real request

A user submits a form. The browser sends an HTTPS POST. DNS resolved the hostname at page load; the request hits the nearest CDN PoP, which passes cache misses to the load balancer. The load balancer selects a healthy application server instance (or a Kubernetes pod) via least-connections or round-robin. The API gateway validates the JWT, rate-checks the caller, and routes to the correct service. The service reads session state from Redis (typically sub-millisecond), writes the mutation to PostgreSQL, publishes a domain event to the message queue, and returns. Downstream consumers (email worker, analytics pipeline, search-index updater) process the event asynchronously, maintaining eventual consistency without coupling response latency to their execution time.

This flow represents the three-tier architecture extended with the infrastructure tier, and it is the reference model the rest of this guide builds on.

Request lifecycle: browser to database and back

A web application request passes through up to eight components between the user pressing Enter and the first byte of content rendering in the browser. Trace each hop and you have a working mental model of web application architecture, and a checklist for where latency hides. The diagram below maps each hop visually; the prose that follows explains what happens at every stage and why it matters for your approach to system design.

Diagram: Full request path

Request:   Browser → DNS Resolver → CDN Edge Node → Load Balancer → API Gateway → App Server → Redis → Database
Response:  Browser ← CDN Edge Node ← Load Balancer ← API Gateway ← App Server ← Database

Here is the full path, in order:

1. Browser → DNS resolver. The browser checks its local cache, then the OS cache, then queries a recursive DNS resolver. A cache miss can add tens of milliseconds before the TCP handshake even starts.

2. DNS → Content Delivery Network edge node. For any properly configured production system, DNS resolves to a Content Delivery Network PoP, not the origin. Static assets (JS bundles, CSS, images) are served here. For cached HTML (SSG or ISR pages), the response ends at this layer; the origin never wakes up.

3. CDN → Load balancer. Cache misses and dynamic requests pass to the origin. The load balancer distributes traffic across application server instances using least-connections or round-robin, and terminates TLS.

4. Load balancer → API gateway. The API gateway handles authentication token validation, rate limiting, and request routing to the correct upstream service. In a modular monolith, "routing" means a path prefix; in microservices, it means a named service.

5. API gateway → Application server. The app server executes business logic: query construction, permission checks, data transformation. The interface between the gateway and app server is where request context, headers, and auth tokens are passed. In a Next.js App Router deployment, React Server Components run here, fetching data server-side before streaming HTML to the client. The App Router was introduced in Next.js 13 and became stable in version 13.4 (OneUptime - How to Use Next.js App Router, 2026).

6. Application server → Redis / database. The app server checks Redis first. A cache hit short-circuits the database round-trip entirely, critical when p99 latency budgets are tight (Managed Transactional Consistency for Web Caching (Cornell University)). A miss triggers a query to PostgreSQL or equivalent. Write paths bypass Redis and go directly to the primary database, then invalidate or update the cache.

7. Database → Application server → API gateway → CDN → Browser. The response travels back through the same layers. The CDN may cache the response for the next user. The browser's Time to First Byte (TTFB) is the cumulative latency of hops 1-7 (Google web.dev - Time to First Byte (TTFB)).
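Step 6 describes what is usually called the cache-aside pattern. The sketch below shows the read and write paths in Python, with plain dictionaries standing in for Redis and PostgreSQL so it runs without any services. The class name `CacheAsideStore` and the `db_reads` counter are hypothetical and exist only to make the behaviour visible; a real implementation would also set a TTL on cached entries.

```python
from typing import Optional


class CacheAsideStore:
    """Cache-aside: reads try the cache first, writes go to the database."""

    def __init__(self) -> None:
        self.cache: dict[str, str] = {}     # stand-in for Redis
        self.database: dict[str, str] = {}  # stand-in for PostgreSQL
        self.db_reads = 0                   # counts round-trips to the database

    def read(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]  # cache hit: no database round-trip
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for the next reader
        return value

    def write(self, key: str, value: str) -> None:
        # Write to the primary first, then invalidate so the next read
        # repopulates the cache with the fresh value.
        self.database[key] = value
        self.cache.pop(key, None)
```

Invalidating on write, rather than updating the cache in place, is one of the two options the step mentions. It is the simpler choice here because a failed cache update cannot leave a stale value behind.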

Keep that diagram in front of you whenever you evaluate a new architecture. Every architectural decision (serverless vs container, monolith vs microservices, SSR vs SSG) changes how much of this path fires per request and which components carry the load. This plays out in production: Newst.se / SF Invest, a Ruby commercial real-estate marketplace, serves 4,500 active users and 6,000 active listings with custom third-party integrations layered across exactly this stack.

Types of web application architecture (with comparison table)

Web application architecture types fall into seven main categories, each representing a different answer to the same question: how should application logic, data, and presentation be separated and deployed? The right choice depends on team size, scale targets, and how fast requirements change.

Three-tier architecture (the baseline)

Three-tier architecture separates a web application into presentation, application logic, and data tiers, running on distinct servers or processes. Most production systems, whether monolithic or distributed, trace back to this clean, layered structure. The tiers communicate over well-defined contracts: HTTP between browser and app server, SQL (or ORM queries) between app server and database.

This model works effectively for the large majority of internal tools, admin dashboards, and early-stage SaaS products. Where it breaks down is independent scaling: if your application layer is CPU-bound but your data tier is idle, you scale both anyway.

The main types

Single-page application (SPA) vs. multi-page application (MPA). SPAs load once and update the DOM through client-side JavaScript, keeping the user in a single browser context. MPAs generate a new HTML document per navigation. SPAs deliver fluid interactive experiences at the cost of a larger initial bundle and a slower first contentful render; MPAs show content sooner on first load but feel heavier on navigation. Next.js App Router now lets you mix both on a per-route basis.

Rendering mode: CSR / SSR / SSG / ISR. Client-side rendering (CSR) ships an empty shell and renders in the browser: fine for authenticated dashboards, damaging for LCP on public pages. Server-side rendering (SSR) renders HTML per request: good for personalised or dynamic content, but it costs more compute. Static site generation (SSG) pre-builds HTML at deploy time, serving pages from a Content Delivery Network with no origin hit. Incremental static regeneration (ISR) revalidates individual pages on a schedule, splitting the difference between freshness and CDN cache hit rates. In the App Router, static routes are rendered at build time, and dynamic routes are rendered at request time (Next.js App Router documentation, Rendering Fundamentals, 2024).
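The time-based revalidation behind ISR can be sketched in a few lines of Python. This is a simplified model, not the framework's implementation: `RevalidatingPageCache`, `CachedPage`, and the injected `render` function are hypothetical names, and the regeneration happens inline here, whereas a real framework would regenerate in the background. The key property it keeps is that a stale page is still served to the request that discovers it is stale.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedPage:
    html: str
    rendered_at: float  # seconds, on whatever clock the caller supplies


class RevalidatingPageCache:
    """Serve cached HTML; once it is older than the window, refresh it for later requests."""

    def __init__(self, render: Callable[[str], str], revalidate_seconds: float) -> None:
        self.render = render
        self.revalidate_seconds = revalidate_seconds
        self.pages: dict[str, CachedPage] = {}

    def get(self, path: str, now: float) -> str:
        page = self.pages.get(path)
        if page is None:
            # First request for this path: render and cache.
            page = CachedPage(self.render(path), now)
            self.pages[path] = page
            return page.html
        served_html = page.html
        if now - page.rendered_at >= self.revalidate_seconds:
            # Stale: this request still gets the old copy, and the
            # regenerated page is stored for the requests that follow.
            self.pages[path] = CachedPage(self.render(path), now)
        return served_html
```

The trade-off is visible in the sketch: no request waits on a fresh render after the first one, but one visitor per revalidation window sees content that is slightly out of date.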

Modular monolith. A modular monolith runs as a single deployable unit but enforces hard module boundaries inside the codebase: modules share nothing except explicit interfaces. Martin Fowler's 2015 "MonolithFirst" article argues, and our experience confirms, that most teams should start with a monolith rather than microservices. We've seen teams that extracted services only once a module genuinely needed independent deployment achieve the operational flexibility of microservices without paying the distributed-systems overhead from day one.

That played out at Anime Digital Network (ADN): the platform was transformed into a modern, high-performance cloud video streaming system ready to handle big traffic.

Microservices. Independent services deployed separately, each owning its data store. The coordination cost is real: you acquire event-driven architecture, API gateway routing, distributed tracing via OpenTelemetry, and eventual consistency debt all at once. Justified when two or more services have genuinely divergent scaling needs or release cadences.

Serverless. Functions-as-a-service (AWS Lambda, Vercel Edge Functions) remove server provisioning from the equation. Cold-start latency is the trade-off: p99 latency spikes on low-traffic endpoints that haven't been invoked in minutes (EdgeDelta - AWS Cold Starts article). Best suited for event-triggered workloads, background jobs, and API routes with spiky traffic.

JAMstack / edge. Pre-rendered assets served from CDN edge nodes, with dynamic behaviour handled by serverless functions or edge workers. Excellent TTFB; weak fit for applications requiring server-side session state or complex write logic at the edge.

Micro-frontends. Multiple independent browser applications composed into one UI at runtime (via module federation or iframe isolation). Scales frontend ownership across large teams but adds inter-team API contract management and bundle coordination overhead.

Comparison table

| Architecture type | Scale fit | Team size | Time-to-market | Main trade-off |
| --- | --- | --- | --- | --- |
| Three-tier monolith | Low to medium | 2-8 engineers | Fastest | Hard to scale tiers independently |
| Modular monolith | Medium | 5-20 engineers | Fast | Requires discipline on module boundaries |
| Microservices | High | 20+ engineers | Slow initially | Distributed-systems complexity from day one |
| Serverless | Spiky / event-driven | Any | Fast for functions | Cold-start latency; stateless only |
| SPA + API | Medium | 3-12 engineers | Medium | SEO / LCP risk without SSR layer |
| JAMstack / edge | High read traffic | 2-10 engineers | Fast | Limited dynamic write capability |
| Micro-frontends | Large frontend surface | 15+ frontend engineers | Slow | Bundle coordination; inter-team contracts |

The modular monolith sits in a sweet spot that most architecture diagrams undervalue: it delivers the separated concerns of three-tier architecture while deferring the operational complexity of microservices until the team genuinely needs independent deployment. It is the default starting point most experienced teams now recommend over a microservices-first build.

For teams choosing between SSR and static rendering, Google's INP guidance sets 200 ms as the "good" boundary. Treat that as an architectural constraint, not just a performance suggestion, because rendering mode determines whether the main thread blocks on server round-trips during user interaction. INP thresholds: Good ≤200 ms, Needs Improvement 200-500 ms, Poor >500 ms (web.dev, 2025).

Architecture patterns: when to use each

Each architecture pattern is a structural decision about where logic lives and how components communicate. Choose based on your team's size, the rate of change in your domain, and your consistency requirements, not on what's popular.

Layered (n-tier) architecture

Layered architecture organizes code into horizontal tiers: presentation, application logic, domain, and data, where each layer only talks to the layer directly below it. The clean separation makes it easy to reason about, and for most internal tools and admin panels, it's enough. Where it breaks down: once a single layer becomes a bottleneck (typically the domain layer under fan-out reads), the strict coupling forces you to scale the whole application rather than the hot path. This approach tends to trip up teams around the 200 requests/sec mark when read volume spikes asymmetrically.

MVC and MVVM

MVC (Model-View-Controller) separates domain data, UI rendering, and user input handling. MVVM (Model-View-ViewModel) shifts the binding logic into the ViewModel, which suits reactive client frameworks. In practice, MVC maps well to server-rendered applications (Rails, Django, Laravel), while MVVM appears naturally in Vue.js and Angular frontends. The pattern itself rarely causes architectural problems; the failure mode is when business logic migrates into controllers or ViewModels over time, eroding the clean boundary.

Event-driven architecture

Event-driven architecture routes work through asynchronous events: producers emit, consumers react, rather than making synchronous calls. The pattern decouples components effectively and handles bursty workloads well because back-pressure is managed at the queue rather than at the caller. Consider this approach when you have clear bounded contexts that need to evolve independently, or when a write action must fan out to multiple downstream services without the originating request waiting.

A practical example of where this pays off: an e-commerce platform processing flash-sale orders routed each `order.placed` event through an event bus to inventory, billing, and notification services in parallel. Before the migration, a synchronous call chain across those three services added roughly 800 ms to the critical path under peak load. After decoupling through the event bus, the originating request completed in under 120 ms, with downstream services processing independently through the queue. The team did, however, spend significant time retrofitting idempotency keys and dead-letter queue handling, work that should have been designed in from the start.

Where the pattern fails: teams that introduce it prematurely for simple CRUD applications end up with distributed debugging problems that far outweigh the decoupling benefit. Event ordering and idempotency must be designed in from day one, not added later as corrective measures.
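The producer/consumer split described above can be sketched with an in-process event bus in Python. This is a teaching model under stated assumptions, not a broker client: `EventBus`, `place_order`, and the three log lists are hypothetical, and `drain` stands in for the worker processes that a real queue would run separately. What it shows is that the originating call returns before any consumer has run.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for a broker: publish enqueues, drain delivers."""

    def __init__(self) -> None:
        self.subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.pending: list[tuple[str, dict]] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The producer only enqueues; it never waits for consumers.
        self.pending.append((event_type, payload))

    def drain(self) -> int:
        """Deliver every pending event to each subscriber; return deliveries made."""
        delivered = 0
        while self.pending:
            event_type, payload = self.pending.pop(0)
            for handler in self.subscribers[event_type]:
                handler(payload)
                delivered += 1
        return delivered


bus = EventBus()
inventory_log: list[str] = []
billing_log: list[str] = []
notification_log: list[str] = []

# Three independent consumers of the same event type.
bus.subscribe("order.placed", lambda e: inventory_log.append(e["order_id"]))
bus.subscribe("order.placed", lambda e: billing_log.append(e["order_id"]))
bus.subscribe("order.placed", lambda e: notification_log.append(e["order_id"]))


def place_order(order_id: str) -> str:
    """Critical path: emit one event and return without waiting on consumers."""
    bus.publish("order.placed", {"order_id": order_id})
    return "accepted"
```

Calling `place_order` leaves all three logs empty until `bus.drain()` runs, which is the decoupling the pattern buys. It is also the source of the ordering and idempotency concerns: nothing in this sketch prevents an event from being delivered twice if delivery is retried.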

CQRS

CQRS (Command Query Responsibility Segregation) separates write models from read models at the application layer. Commands mutate state; queries serve read-optimized projections. The motivation is write amplification: a single write that must fan out to multiple read representations becomes expensive in a unified model. CQRS solves this by letting the read side be eventually consistent and purpose-built: a PostgreSQL write model can project into a Redis cache or a search index without that projection being on the critical write path.

A concrete example of this approach in practice: a high-throughput SaaS analytics platform handling roughly 15,000 writes per minute maintained a normalized PostgreSQL write model for transactional integrity. A separate projection pipeline consumed change events from a Postgres logical replication slot and materialized denormalized read models into Redis and Elasticsearch. The team accepted a consistency lag of 200 to 400 ms on the read side in exchange for a substantial drop in p99 read latency, from several hundred milliseconds to under 90 ms. The read schema was optimized entirely for the query interface each client surface needed, with no joins on the read path. Maintaining two models and reasoning about eventual consistency added overhead, but the separation was justified given a read/write ratio well above 20:1.

The pattern carries real overhead: two models to maintain, eventual consistency to reason about, and more moving parts in deployment. This is not the right approach unless read/write ratios are skewed above roughly 10:1 or the read schema requirements diverge significantly from the write schema.
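A minimal Python sketch of the split, assuming an order domain that is purely illustrative: `OrderWriteModel` validates commands and appends to a change log, and `CustomerSpendProjection` builds a denormalized read model from that log on its own schedule. Both class names and the in-memory change log (a stand-in for a replication slot or change stream) are assumptions made for the example.

```python
class OrderWriteModel:
    """Normalized, strongly consistent side: validates and records commands."""

    def __init__(self) -> None:
        self.orders: dict[str, dict] = {}
        self.change_log: list[dict] = []  # stand-in for a change stream

    def place_order(self, order_id: str, customer_id: str, amount_cents: int) -> None:
        if order_id in self.orders:
            raise ValueError(f"duplicate order {order_id}")
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        row = {"order_id": order_id, "customer_id": customer_id, "amount_cents": amount_cents}
        self.orders[order_id] = row
        self.change_log.append(dict(row))


class CustomerSpendProjection:
    """Denormalized read model, updated off the critical write path."""

    def __init__(self) -> None:
        self.total_cents: dict[str, int] = {}
        self.position = 0  # how far into the change log this projection has read

    def catch_up(self, change_log: list[dict]) -> int:
        """Apply changes not yet seen; return how many were applied."""
        new_changes = change_log[self.position:]
        for change in new_changes:
            customer = change["customer_id"]
            self.total_cents[customer] = self.total_cents.get(customer, 0) + change["amount_cents"]
        self.position += len(new_changes)
        return len(new_changes)

    def total_for(self, customer_id: str) -> int:
        # A single lookup, no joins: the read schema is shaped for the query.
        return self.total_cents.get(customer_id, 0)
```

Until `catch_up` runs, `total_for` returns the old value. That gap is the consistency lag the pattern asks you to accept, and the `position` bookkeeping is a small example of the extra state a second model brings with it.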

Backend for frontend (BFF)

BFF places a dedicated server layer between each client type (web browser, mobile, third-party API consumer) and the underlying services. Next.js App Router's server-side data-fetching and React Server Components effectively implement a BFF within the framework: the server component fetches, transforms, and serializes only the data the browser needs, avoiding the over-fetching that a shared REST API produces. By default, layouts and pages in the Next.js App Router are Server Components (Next.js App Router documentation, Server and Client Components, 2024).

BFF is worth the added service when client requirements diverge enough that a single API is either over-serving or under-serving at least one consumer.

Hexagonal (ports and adapters) architecture

Hexagonal architecture isolates domain logic from all external concerns (databases, HTTP frameworks, message queues) behind well-defined interfaces (ports). Adapters implement those interfaces. The benefit is that the domain can be tested without infrastructure, and infrastructure can be swapped without touching core logic. It's the right structural choice when domain complexity is high and infrastructure is likely to change: migrating from a REST adapter to GraphQL, or from PostgreSQL to a vector database for an AI retrieval layer, becomes a localized change rather than a broad refactor.
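In Python, a port can be expressed as a `typing.Protocol` and an adapter as any class with matching methods. The sketch below is one possible shape under assumed names: `UserRepository`, `change_email`, and `InMemoryUserRepository` are hypothetical, and the email rule is deliberately trivial so the structure stays visible.

```python
from typing import Optional, Protocol


class UserRepository(Protocol):
    """Port: what the domain needs from storage, with no storage details."""

    def get_email(self, user_id: str) -> Optional[str]: ...

    def save_email(self, user_id: str, email: str) -> None: ...


def change_email(repo: UserRepository, user_id: str, new_email: str) -> bool:
    """Domain rule: reject malformed addresses, skip no-op changes.

    Returns True when the stored email was changed.
    """
    normalized = new_email.strip().lower()
    if "@" not in normalized:
        raise ValueError("email must contain '@'")
    if repo.get_email(user_id) == normalized:
        return False
    repo.save_email(user_id, normalized)
    return True


class InMemoryUserRepository:
    """Adapter for tests; a database adapter would expose the same two methods."""

    def __init__(self) -> None:
        self.emails: dict[str, str] = {}

    def get_email(self, user_id: str) -> Optional[str]:
        return self.emails.get(user_id)

    def save_email(self, user_id: str, email: str) -> None:
        self.emails[user_id] = email
```

`change_email` never imports a database driver, so it can be exercised with the in-memory adapter alone. Swapping storage then means writing a new adapter class, with the domain function left untouched.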

React Server Components in the presentation layer

React Server Components (RSC) change where rendering logic runs: Server Components execute on the server, stream HTML to the browser, and never ship their dependency graph to the client bundle. The original React Server Components RFC estimated 18-29% JavaScript bundle size reductions, and real-world teams migrating to the App Router typically report a 20-40% reduction in JS shipped to the client (Patterns.dev - React Server Components, 2024). This collapses the client/server boundary for data-heavy UI: a product listing page that previously needed a client-side fetch plus a loading state now renders with zero client JavaScript for that component tree. The architectural implication is that the presentation layer now participates in data access; the BFF pattern and RSC converge. The failure mode is mixing server and client components without a clear mental model, which produces subtle hydration errors and unexpected bundle size growth in the client components that remain.

Async processing: queues, event buses, and back-pressure

Event-driven architecture treats message queues and event buses as first-class architectural components, not afterthoughts. When a synchronous request chain grows beyond two or three dependent services, you need asynchronous processing to avoid cascading latency and failure.

Job queues vs. event buses. A job queue (Redis with BullMQ, AWS SQS) routes a task to exactly one worker, suitable for user-triggered operations like report generation, email dispatch, or image resizing. An event bus (Kafka, AWS SNS/SQS fan-out, RabbitMQ topic exchanges) broadcasts one event to multiple consumers in parallel. The fan-out problem emerges here: a single `order.placed` event may trigger inventory, billing, and notification services simultaneously, multiplying write load downstream.

Back-pressure is a design constraint, not an edge case. When consumers process slower than producers publish, queue depth grows. Two failure modes appear most often in practice. First, unbounded memory growth: if your broker holds unconsumed messages in memory without a size cap, a sudden traffic spike exhausts heap space and crashes the broker process itself. Second, retry storms: a consumer that fails and immediately re-queues the same message competes with fresh work, starving the queue and amplifying failure across every downstream service. Concrete mitigations include setting explicit queue-depth limits, applying exponential back-off with jitter on retries, and using dead-letter queues to isolate poison messages before they cycle indefinitely. Redis Streams support consumer-group acknowledgment, which lets you track pending, unacknowledged entries and build back-pressure on top of them without a heavier broker. Circuit breakers at the consumer interface add a second layer, shedding load before exhaustion spreads upstream.
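Two of the mitigations above, exponential back-off with jitter and a dead-letter queue, fit in a short Python sketch. The function names, the 0.5-second base delay, the 30-second cap, and the three-attempt limit are assumptions for illustration, and the dead-letter queue is a plain list. The jitter variant shown is "full jitter", where each delay is drawn uniformly between zero and the exponential bound.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def process_with_retries(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: list,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times; park the message on final failure."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt + 1 < max_attempts:
                # Wait before retrying so failed work does not compete
                # with fresh messages at full speed.
                sleep(backoff_delay(attempt))
    # Retries exhausted: isolate the poison message instead of re-queueing it.
    dead_letters.append(message)
    return False
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting. Randomizing the delay matters because consumers that all fail at the same moment would otherwise all retry at the same moment too.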

Idempotency is non-negotiable. Any worker that can receive a retry must produce the same outcome on duplicate delivery. Store a processed event ID before acting; check it on receipt. This applies to payment handlers, inventory updates, and AI inference jobs equally.
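A minimal sketch of that check in Python, using a hypothetical inventory worker. The `IdempotentInventoryWorker` class, the event fields, and the in-memory set of processed IDs are assumptions for the example. In production the processed-ID record and the side effect would need to commit atomically (for instance in one database transaction); otherwise a crash between the two steps can still lose or duplicate work.

```python
class IdempotentInventoryWorker:
    """Applies each event at most once, keyed by its event ID."""

    def __init__(self, stock: dict[str, int]) -> None:
        self.stock = dict(stock)
        self.processed_event_ids: set[str] = set()  # stand-in for a durable table

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_event_ids:
            return False  # duplicate delivery: do nothing
        # Record the ID first, then act, as described above.
        self.processed_event_ids.add(event_id)
        self.stock[event["sku"]] -= event["quantity"]
        return True
```

Delivering the same event twice leaves the stock level where a single delivery would, which is exactly the property a retrying queue needs from its consumers.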

Queue overflow and retry storms are among the most common causes of cascading failure in distributed systems, which is why back-pressure handling belongs in the design, not in the post-incident review.

How to choose an architecture: a decision framework

Start with a modular monolith. Extract services only when a specific module needs to deploy independently at a frequency or scale the monolith cannot support. That single rule eliminates the majority of premature microservices decisions we see in 50-200-person engineering teams.

The decision axes below give you the full picture.

Scale and traffic profile

A modular monolith handles far more sustained load than most teams expect before distributed services become necessary: a single well-provisioned instance comfortably serves thousands of requests per second for typical workloads. The threshold question is not "how much traffic do we have?" but "do different modules have different scaling profiles?" A reporting module with fan-out reads against a cold data warehouse and a checkout module with p99 latency requirements under 200 ms should not share a process pool. When they do, one back-pressure event degrades both.

Team size and ownership boundaries

Conway's Law is architectural fact. A team of eight engineers on one codebase deploys as a unit; coordination cost is low, so the modular monolith wins on time-to-market. Once you cross three or four independent squads with separate deployment cadences, the shared deployment artifact becomes the bottleneck. That is the signal to extract: not team headcount alone, but the need for independent deployment.

Latency and consistency requirements

Synchronous, user-facing request paths (browser to API gateway to application layer to database and back) demand strong consistency and low TTFB. Three-tier architecture with a single relational write path handles this cleanly. Where eventual consistency is acceptable (notifications, analytics, recommendation updates), event-driven architecture via a message bus reduces write amplification and decouples services without adding synchronous call depth.

The decision matrix

| Signal | Choose modular monolith | Choose microservices | Choose serverless |
| --- | --- | --- | --- |
| Team size | < 3 squads | 4+ squads, separate deploy cadences | 1-2 engineers, variable load |
| Traffic profile | Uniform, < ~5k req/s sustained | Heterogeneous scaling per domain | Spiky, unpredictable burst |
| Latency target | p99 < 300 ms, uniform | p99 varies by service | Cold-start latency is acceptable |
| Consistency | Strong consistency needed | Eventual consistency tolerable | Stateless, idempotent workloads |
| Time-to-market | Ship in weeks | Months to establish platform | Days (managed infra) |
| Data coupling | Shared schema acceptable | Domain-owned data stores | External data layer |

When the API gateway changes the equation

An API gateway is not just a routing layer: it is the enforcement point for auth, rate limiting, and request fan-out control. In a modular monolith migration, adding an API gateway in front of the existing application lets you route specific paths to extracted services while the monolith handles everything else. This is the strangler-fig pattern coined by Martin Fowler for managing risk in system modernization, applied at the network edge rather than inside the codebase, and in our experience it reduces migration risk significantly because production traffic validates each extracted service before the monolith path is retired.
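The routing rule at the heart of this migration is small enough to sketch in Python. The path prefixes and upstream names below are invented for the example, and a real gateway would express the same rule in its own configuration rather than in application code. The sketch matches on whole path segments, so `/billingx` does not accidentally route to the billing service.

```python
# Paths already carved out of the monolith (hypothetical names).
EXTRACTED_ROUTES: dict[str, str] = {
    "/billing": "billing-service",
    "/search": "search-service",
}
MONOLITH = "monolith"


def route(path: str) -> str:
    """Return the upstream for a request path, defaulting to the monolith."""
    best_prefix = ""
    for prefix in EXTRACTED_ROUTES:
        matches = path == prefix or path.startswith(prefix + "/")
        # Prefer the longest matching prefix so nested routes can be
        # extracted later without reordering the table.
        if matches and len(prefix) > len(best_prefix):
            best_prefix = prefix
    return EXTRACTED_ROUTES[best_prefix] if best_prefix else MONOLITH
```

Extracting another service then means adding one entry to the table, and rolling back means removing it. Everything not listed keeps flowing to the monolith.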

For teams adding AI features, the same gateway becomes the model gateway: a single entry point that routes inference requests, enforces token-budget limits, and abstracts provider switching between OpenAI, Anthropic, or a self-hosted model, without scattering provider credentials and retry logic across application code.

How AI is reshaping web app architecture in 2026

AI features now require a dedicated inference service tier, retrieval-augmented generation pipelines, vector databases, and a model gateway that sits between your application logic and your model providers, not inside application code.

This is the most structurally significant addition to web application architecture in a decade. Where a typical three-tier architecture separates presentation, logic, and data, AI workloads demand a fourth tier with its own latency budget, cost model, and failure modes. Teams that bolt inference calls directly into their application layer, treating a language model like a database query, consistently hit p99 latency spikes above 8 seconds within weeks of launch. The approach of treating inference as just another synchronous service call is the most common architectural mistake we see on AI-augmented platform projects.

The RAG pipeline as an architecture component

Retrieval-augmented generation changes how the data layer behaves. A standard RAG pipeline runs three sequential operations: embed the user query into a vector representation, run an approximate nearest-neighbor search against a vector database (Pinecone, pgvector on PostgreSQL, or Weaviate are the common choices), then pass the retrieved context plus the original query to an inference endpoint. Each step adds latency. The vector database read is fast, typically 20-80 ms for well-indexed collections. The inference call to a hosted model adds 800 ms to 3,000 ms depending on model size and whether the request hits a warm instance. For context, OpenAI's GPT-4o returns median first-token latency around 500-900 ms on warm instances under normal load; Anthropic Claude 3.5 Sonnet sits in a similar range, while smaller hosted models such as Mistral 7B on dedicated GPU endpoints can return completions in 300-600 ms.
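The first two steps, plus the prompt assembly that precedes the inference call, can be sketched without any external service. Everything here is a toy under stated assumptions: `embed` is a bag-of-words count over a five-word vocabulary rather than a learned embedding model, `retrieve` does an exact cosine-similarity scan rather than an approximate nearest-neighbor search, and the inference call itself is left out. The shape of the pipeline is what carries over.

```python
import math

# Toy vocabulary; a real system would call an embedding model instead.
VOCAB = ["refund", "shipping", "invoice", "password", "reset"]


def embed(text: str) -> list[float]:
    """Step 1 (toy): count vocabulary terms to get a fixed-length vector."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 2 (exact, not approximate): rank documents by similarity to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 3 input: the retrieved context plus the original question."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
```

In this sketch the documents are re-embedded on every query, which is fine for three strings and wrong for a real corpus. Embeddings are normally computed once at ingestion time and stored in the vector index.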

Write amplification is the hidden cost. Every time a user uploads a document or your system ingests new content, the pipeline must embed each chunk and the vector database must index it. On a content-heavy platform, that write load compounds quickly: embedding and indexing thousands of chunks per upload can saturate a write path that was sized only for query traffic. Size your write throughput independently of your read path. CQRS read/write separation applies here just as it does in event-driven architecture.
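A rough sizing calculation makes the write load concrete. The chunk size, overlap, document length, and upload rate below are illustrative assumptions, not measurements; substitute your own.

```python
import math

def chunks_per_document(tokens: int, chunk_size: int = 512, overlap: int = 64) -> int:
    """Number of overlapping chunks a document of `tokens` tokens produces."""
    if tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # each new chunk advances by this many tokens
    return 1 + math.ceil((tokens - chunk_size) / stride)

def index_writes_per_second(uploads_per_minute: float, avg_tokens: int) -> float:
    """Sustained embed-and-index operations the write path must absorb."""
    return uploads_per_minute * chunks_per_document(avg_tokens) / 60

# Hypothetical workload: 30 uploads per minute, 50,000 tokens per document.
chunks = chunks_per_document(50_000)              # 112 chunks per upload
writes = index_writes_per_second(30, 50_000)      # 56.0 index writes per second
```

Under these assumptions a modest upload rate already produces dozens of index writes per second, independent of query traffic.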

The model gateway: a new API gateway concern

A model gateway sits in front of your inference providers (OpenAI, Anthropic, Google Vertex, or a self-hosted endpoint) and handles routing, rate limiting, cost attribution, fallback logic, and prompt caching. Without one, every application service that calls a model does so with raw HTTP: no retry policy, no cost visibility, no circuit breaker when an upstream provider degrades.

The model gateway pattern mirrors what the API gateway does for microservices. It centralizes cross-cutting concerns so individual services stay clean and the interface between your application logic and external model providers remains well-defined. Treat it as a first-class infrastructure component from day one, not a wrapper you add after the first runaway billing incident.
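A minimal sketch of the pattern, assuming each provider is wrapped in a plain function that takes a prompt and returns a completion. The `Provider` and `ModelGateway` names, the failure threshold, and the cooldown are hypothetical; per-feature call counting stands in for real token-level cost attribution.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Provider:
    name: str
    call: Callable[[str], str]   # hypothetical client function: prompt -> completion
    failures: int = 0
    open_until: float = 0.0      # circuit is open (provider skipped) until this time

@dataclass
class ModelGateway:
    providers: list[Provider]    # ordered by preference
    max_failures: int = 3
    cooldown_s: float = 30.0
    usage: dict[str, int] = field(default_factory=dict)  # calls per feature

    def complete(self, prompt: str, feature: str) -> str:
        self.usage[feature] = self.usage.get(feature, 0) + 1
        now = time.monotonic()
        for p in self.providers:
            if now < p.open_until:
                continue                     # circuit open: skip degraded provider
            try:
                result = p.call(prompt)
                p.failures = 0               # a success resets the failure count
                return result
            except Exception:
                p.failures += 1
                if p.failures >= self.max_failures:
                    p.open_until = now + self.cooldown_s
        raise RuntimeError("all model providers unavailable")
```

Application services call `gateway.complete(prompt, feature="search")` and never see provider credentials, retry rules, or which provider actually answered.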

Cold-start latency and cost implications

Serverless inference (functions that spin up a model container per request) introduces meaningful cold-start latency. Modal reports cutting GPU container cold-starts to roughly 50 ms (down from around 300 ms), while other serverless GPU providers still report multi-second cold-starts (Modal blog, 2025). For user-facing features, synchronous inference on cold paths is not viable. The mitigation is either a persistent inference service (always-warm, predictable cost) or an async queue that returns results via webhook or WebSocket, letting the browser render a loading state rather than blocking the request.
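The async-queue option can be sketched with the standard library alone. Everything here is a simplified assumption: `slow_inference` stands in for a model call that may hit a cold container, the in-process `asyncio.Queue` stands in for a durable message queue, and the results dictionary stands in for delivery over a webhook or WebSocket.

```python
import asyncio
import uuid

async def slow_inference(prompt: str) -> str:
    """Stand-in for a model call that may hit a cold container."""
    await asyncio.sleep(0.05)
    return f"summary of: {prompt}"

class InferenceJobs:
    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.results: dict[str, str] = {}

    async def submit(self, prompt: str) -> str:
        """Request handler: enqueue the work and return a job id immediately."""
        job_id = uuid.uuid4().hex
        await self.queue.put((job_id, prompt))
        return job_id

    async def worker(self) -> None:
        """Background worker: drains the queue and stores results."""
        while True:
            job_id, prompt = await self.queue.get()
            self.results[job_id] = await slow_inference(prompt)
            self.queue.task_done()

async def demo() -> tuple[bool, str]:
    jobs = InferenceJobs()
    worker = asyncio.create_task(jobs.worker())
    job_id = await jobs.submit("quarterly report")
    pending = job_id not in jobs.results   # the client shows a loading state here
    await jobs.queue.join()                # in production: push via WebSocket or webhook
    worker.cancel()
    return pending, jobs.results[job_id]
```

The request path only pays for the enqueue; the cold-start cost lands on the worker, where it can be retried or throttled without holding a user connection open.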

Cost attribution matters architecturally. Token consumption from LLM calls can dwarf your database and compute spend. A single GPT-4o call processing 2,000 input tokens and returning 500 output tokens costs roughly $0.011 at current pricing; at 100,000 daily active users each triggering one inference call, that scales to over $1,000 per day from a single feature. Route non-critical inference work (summarization, tagging, background enrichment) through a message queue so it runs asynchronously and can be throttled. Reserve synchronous inference paths for user-blocking interactions where latency directly affects perceived quality.
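The arithmetic is easy to parameterize so it can be rerun when prices or usage change. The per-million-token prices below are assumptions for illustration; they produce a per-call cost of about one cent, in the same range as the figure quoted above, and should be replaced with your provider's current rates.

```python
# Assumed list prices in USD per million tokens; check your provider's current rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one inference call under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

def daily_cost(daily_users: int, calls_per_user: float,
               input_tokens: int, output_tokens: int) -> float:
    """USD per day for one feature, given usage and token counts per call."""
    return daily_users * calls_per_user * cost_per_call(input_tokens, output_tokens)

per_call = cost_per_call(2_000, 500)             # about $0.01 under the assumed prices
per_day = daily_cost(100_000, 1, 2_000, 500)     # about $1,000 per day
```

Tagging each call with the feature that triggered it lets the same calculation run per feature instead of per platform.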

Best practices: scalability, security, and observability

Scalable web application architecture rests on three disciplines applied together: horizontal scaling with stateless services, layered security from TLS to the application boundary, and observability wired in from day one, not bolted on after incidents.

Scalability: stateless services and distributed caching

Horizontal scaling only works when application servers carry no session state. Move session data and hot reads into Redis so every server instance is interchangeable; load balancers can then route any request to any node without sticky sessions. Front the origin with a Content Delivery Network to absorb static assets, edge-cached API responses, and geographic latency. A read-heavy application with a well-tuned cache policy can serve the large majority of requests from the edge, and a mostly-static site can reach cache hit ratios of 95% or higher, taking that load off the origin entirely. For database read amplification, separate read replicas from the write path; CQRS makes this explicit by design.
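The hot-read half of this is usually implemented as a cache-aside lookup. The sketch below uses an in-memory `TTLCache` class as a stand-in for a shared cache such as Redis; the class, the key format, and the 60-second TTL are assumptions for illustration only.

```python
import time
from typing import Callable, Optional

class TTLCache:
    """In-memory stand-in for a shared cache such as Redis."""
    def __init__(self) -> None:
        self._data: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None                      # missing or expired
        return entry[1]

    def set(self, key: str, value: str, ttl_s: float) -> None:
        self._data[key] = (time.monotonic() + ttl_s, value)

def cached_read(cache: TTLCache, key: str, load: Callable[[], str],
                ttl_s: float = 60.0) -> str:
    """Cache-aside: try the cache, fall back to the origin, then populate."""
    value = cache.get(key)
    if value is None:
        value = load()                       # e.g. a database query
        cache.set(key, value, ttl_s)
    return value
```

With the cache held outside the process, any instance can serve any request, which is what removes the need for sticky sessions.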

Security: layered boundaries

TLS termination belongs at the load balancer or CDN edge, not inside the app. A WAF in front of the API gateway blocks OWASP Top 10 vectors before malformed requests reach application logic. OAuth 2.0 (with OpenID Connect for sign-in) and short-lived JWTs (15-minute access tokens, rotating refresh tokens) are the current floor for user authentication; store tokens in HttpOnly cookies, not localStorage, so injected scripts cannot read them, which eliminates a class of XSS exfiltration.
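As one concrete illustration of the cookie guidance, the standard library can build the header value. The cookie name `access_token`, the `SameSite=Lax` choice, and the helper name are assumptions; the 15-minute lifetime follows the access-token lifetime above.

```python
from http.cookies import SimpleCookie

def access_token_cookie(token: str, max_age_s: int = 15 * 60) -> str:
    """Build a Set-Cookie header value for a short-lived access token."""
    cookie = SimpleCookie()
    cookie["access_token"] = token       # cookie name is an assumption
    morsel = cookie["access_token"]
    morsel["httponly"] = True            # not readable from JavaScript
    morsel["secure"] = True              # sent over HTTPS only
    morsel["samesite"] = "Lax"           # limits cross-site sends
    morsel["max-age"] = max_age_s        # 15 minutes by default
    morsel["path"] = "/"
    return morsel.OutputString()
```

The returned string is the value for a `Set-Cookie` response header; the browser then attaches the token to requests without exposing it to page scripts.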

Performance: INP and the Next.js App Router rendering model

INP, Interaction to Next Paint, measures the latency between a user input and the browser's next rendered frame. The Chrome team defines "good" INP as 200 ms or less, measured at the 75th percentile of page loads. Poor INP almost always traces to a long JavaScript task blocking the main thread. Next.js App Router addresses this structurally: React Server Components fetch data and render on the server, then stream HTML to the browser, leaving the client bundle lean. Partial hydration means interactive components hydrate independently rather than blocking a full-page hydration pass, a direct architectural response to INP pressure. The practical rule: push data fetching to RSCs, reserve client components for genuine interactivity (forms, real-time updates), and keep the client JavaScript budget below 150 kB parsed per route.
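When INP samples are collected from real users, the rating comes from the 75th percentile rather than the average. The sketch below assumes a list of per-page-load INP values in milliseconds from your own monitoring, uses the nearest-rank percentile method, and applies the published thresholds of 200 ms and 500 ms; the function names are illustrative.

```python
import math

def p75(samples_ms: list[float]) -> float:
    """75th percentile using the nearest-rank method (needs at least one sample)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]

def rate_inp(samples_ms: list[float]) -> str:
    """Classify a route's INP from field samples."""
    value = p75(samples_ms)
    if value <= 200:
        return "good"
    if value <= 500:
        return "needs improvement"
    return "poor"
```

A route can have a fast median and still rate poorly here, which is why a single long main-thread task on slower devices shows up in INP before it shows up in averages.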

Observability: structured logs, metrics, traces

Observability is three signals working together. Structured JSON logs (with a trace_id field) let you correlate a p99 latency spike to a specific request path. Metrics (request rate, error rate, and request duration: the RED method) give you the dashboard; add saturation for resource-level views. Distributed traces, instrumented via OpenTelemetry, stitch together the full call graph across services, from the API gateway through application logic to Redis and the database. Without OpenTelemetry spans on every service boundary, a latency regression in a downstream dependency is nearly invisible until a user reports it.
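The structured-log signal needs nothing beyond the standard library. In this sketch the field names other than trace_id (`route`, `duration_ms`), the logger name, and the sample trace id are assumptions; in a traced service the trace_id would come from the active span rather than being passed by hand.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "route": getattr(record, "route", None),
            "duration_ms": getattr(record, "duration_ms", None),
        })

def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app.requests")   # logger name is an assumption
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

buf = io.StringIO()                              # stands in for stdout or a log shipper
log = make_logger(buf)
log.info("request completed",
         extra={"trace_id": "4bf92f35", "route": "/checkout", "duration_ms": 812})
entry = json.loads(buf.getvalue())
```

Because every line carries the same trace_id as the corresponding trace, a slow request found in a dashboard can be followed straight to its log lines.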

FAQ: web application architecture

What is three-tier web application architecture?

Three-tier architecture separates a web application into three distinct layers: a presentation tier (browser or client), an application logic tier (server-side business logic), and a data tier (database). Each tier runs independently, so you can scale the application server horizontally without touching the database layer. This separation remains the default starting point for most production web applications today.

How do you design a scalable web application architecture?

Design for scalability by keeping application servers stateless, pushing session data to a distributed cache such as Redis, and placing a CDN in front of static assets and a load balancer in front of your application tier. Introduce an API gateway to enforce rate limiting and route traffic before requests reach services. Add async message queues for any workload that does not need a synchronous response.

What is the difference between microservices and a monolith?

A monolith packages all application logic into one deployable unit; microservices split that logic into independently deployable services communicating over a network. A modular monolith sits between the two: clean internal boundaries without the operational overhead of distributed systems. Martin Fowler's writing on the monolith-first approach recommends starting with a monolith and extracting services only when independent deployment becomes a concrete need.

Which layer of web application architecture includes physical devices?

Physical devices (servers, network hardware, storage arrays, and data center infrastructure) sit in the infrastructure tier, below the application and data layers in most architecture diagrams. Cloud deployments abstract this tier behind managed services, but it still exists as the substrate that DNS, CDN edge nodes, and load balancers run on. Architects working on latency budgets must account for it.

How do React Server Components change the frontend architecture?

React Server Components (RSC) move rendering of non-interactive UI to the server, reducing the JavaScript bundle sent to the browser, which can improve Largest Contentful Paint and, when the response is streamed, Time to First Byte. The Next.js App Router builds on RSC: components render on the server by default, opt into the client bundle with the "use client" directive, and can stream behind Suspense boundaries. The React documentation on Server Components details where this rendering boundary sits. This shifts frontend architecture from a single client bundle toward a layered mix of server-rendered and client-hydrated components.

Architect your next platform with Netguru

Teams that finish reading this guide typically land on the same question: which architecture decision do we make next, and who do we trust to pressure-test it? Netguru has built and scaled web applications across fintech, real estate, and marketplace verticals. Embedded in Skrill's Berlin-based team, we developed key modules of the new Skrill website, including account.skrill.com, supporting a platform that handles 9 million monthly visits with a majority-mobile audience. For Newst.se, we delivered a Ruby commercial real-estate marketplace in under a year, scaling to 4,500 active users and 6,000 listings with custom third-party integrations.

Our approach starts with a structured architecture review: we map your current system topology, identify the trade-offs that are limiting throughput or resilience, and prioritize the changes that deliver the most value. From there, we work as an extension of your engineering team, building the components that close the gap between your current state and where your growth trajectory demands you go. Whether you need a full web application architecture review, a targeted performance audit, or a dedicated team to deliver the next platform layer, we bring more than 15 years of accumulated pattern recognition to that work.
