What is the minimum bar for production-ready?

Automated tests on critical paths, monitored health endpoints, secrets kept out of code, a rollback path, backups with a tested restore, and runbooks for the top failure modes. Microservices are optional; operability is not.

How do I know if my app is production-ready?

Run a game day: simulate dependency failure, restore from backup, deploy a bad build and roll back, and page yourself. Gaps you feel during the exercise are backlog items, not shame.

Should I prioritize features or production hardening?

Harden incrementally alongside features: health checks and structured logging belong in the first sprint, not in a mythical hardening week before launch. Launching untested borrows time that future incidents will claim back.

Building Production-Ready Applications: A Practical Checklist

"Works on my machine" and "ready for production" are separated by a gap teams feel during the first real traffic spike, the first leaked API key, or the first database restore that takes twelve hours because nobody practiced.

Production-ready means operable: you can deploy safely, observe behavior, limit blast radius, recover data, and respond to incidents without guessing. This checklist is framework-agnostic; apply it whether you ship a Next.js SaaS, a Node API, or workers on Kubernetes.

Configuration and twelve-factor basics

Config in environment variables, not committed .env files
One codebase, many deploys via config only
Stateless processes; session in Redis or DB if needed
Logs as event streams to stdout for aggregation
Graceful shutdown: handle SIGTERM, drain in-flight requests, close DB pools

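// On SIGTERM: stop accepting new connections, let in-flight requests finish,
// then close the DB pool before exiting.
process.on("SIGTERM", () => {
  server.close(async () => {
    await db.end();
    process.exit(0);
  });
  // Force exit if draining takes too long (10 s is an assumed limit).
  setTimeout(() => process.exit(1), 10_000).unref();
});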

Document required env vars in .env.example with descriptions.
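Validating config at startup makes a missing variable fail the deploy instead of the first request. A minimal sketch, assuming hypothetical variable names (DATABASE_URL, REDIS_URL, PORT) and a hypothetical loadConfig helper; adapt the list to whatever your .env.example documents.

```javascript
// Hypothetical list of required variables; mirror it in .env.example.
const REQUIRED_VARS = ["DATABASE_URL", "REDIS_URL"];

// Reads config from an env object (process.env by default) and fails fast
// with a single error that lists every missing variable.
function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }

  const port = Number(env.PORT ?? 3000); // 3000 is an assumed default
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`PORT must be an integer between 1 and 65535, got "${env.PORT}"`);
  }

  // Freeze so nothing mutates config after startup.
  return Object.freeze({
    databaseUrl: env.DATABASE_URL,
    redisUrl: env.REDIS_URL,
    port,
  });
}
```

Call loadConfig() once before the server starts listening, and pass the result down rather than reading process.env throughout the codebase.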

Health checks and readiness

Expose:

/health/live — process up (for restart if hung)
/health/ready — dependencies reachable (DB, cache)

Orchestrators use readiness to decide where to route traffic and liveness to decide when to restart a process. Do not conflate them: ready should fail when Postgres is down even if the Node process still responds.
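The split can be sketched as two small functions that return a status code and a body, independent of any web framework. The check functions passed to readiness are hypothetical; in a real service they might run SELECT 1 against Postgres and a PING against the cache.

```javascript
// Liveness: the process can answer at all. No dependency checks here,
// otherwise a database outage would trigger pointless restarts.
function liveness() {
  return { status: 200, body: { status: "alive" } };
}

// Readiness: every dependency check must pass. `checks` maps a name to an
// async function that resolves on success and throws on failure.
async function readiness(checks) {
  const results = {};
  let ready = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
      results[name] = "ok";
    } catch (err) {
      results[name] = `failed: ${err.message}`;
      ready = false;
    }
  }
  // 503 tells the orchestrator to stop routing traffic to this instance.
  return { status: ready ? 200 : 503, body: { ready, checks: results } };
}
```

Wire liveness to /health/live and readiness to /health/ready in whatever router you use, and give each check its own short timeout so a hung dependency cannot hang the probe.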

Observability trio

Logs: structured JSON with level, message, requestId, userId (hashed if needed).

Metrics: request rate, latency percentiles, error rate, queue depth, DB pool usage.

Traces: OpenTelemetry across HTTP and DB calls.
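For the logging leg, a structured line can be as small as the sketch below. The helper names (hashId, formatLog, log) are illustrative, not a library API, and the unsalted SHA-256 hash is only pseudonymous; use a keyed hash if your privacy requirements demand it.

```javascript
const { createHash } = require("node:crypto");

// Hash identifiers so log lines stay joinable without storing the raw value.
function hashId(value) {
  return createHash("sha256").update(String(value)).digest("hex").slice(0, 16);
}

// Builds one JSON log line. `fields` carries request-scoped context such as
// requestId; userId is hashed before it is written.
function formatLog(level, message, fields = {}) {
  const { userId, ...rest } = fields;
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...rest,
  };
  if (userId !== undefined) entry.userId = hashId(userId);
  return JSON.stringify(entry);
}

// Write to stdout and let the platform aggregate.
function log(level, message, fields) {
  process.stdout.write(formatLog(level, message, fields) + "\n");
}
```

A call such as log("info", "order created", { requestId, userId }) then produces one line per event that any aggregator can parse.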

Define SLOs (e.g., 99.9% availability, p95 < 300ms). Alert on error budget burn, not every CPU blip.
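Error budget burn is simple arithmetic: divide the observed error ratio by the ratio the SLO allows. The sketch below assumes you can already count errors and total requests per window. The two-window check and the 14.4 threshold follow a commonly cited pattern for a one-hour window against a 30-day SLO; treat both as starting points to tune, not fixed rules.

```javascript
// Burn rate = observed error ratio / allowed error ratio.
// 1.0 means the budget runs out exactly at the end of the SLO window;
// higher values exhaust it proportionally sooner.
function burnRate(errorCount, totalCount, sloTarget) {
  if (totalCount === 0) return 0;
  const allowedErrorRatio = 1 - sloTarget;
  return errorCount / totalCount / allowedErrorRatio;
}

// Page only when both a long and a short window burn fast, so a brief
// spike that has already recovered does not wake anyone.
function shouldPage(longWindow, shortWindow, sloTarget, threshold = 14.4) {
  return (
    burnRate(longWindow.errors, longWindow.total, sloTarget) >= threshold &&
    burnRate(shortWindow.errors, shortWindow.total, sloTarget) >= threshold
  );
}
```

With a 99.9% target, 50 failures in 10,000 requests is a burn rate of roughly 5: the budget would be gone in about a fifth of the SLO window if that rate held.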

Security baseline

HTTPS terminated correctly; HSTS in production
Dependency scanning (Dependabot, Snyk) in CI
Secret managers (Vault, cloud SM)—rotate keys
Principle of least privilege IAM for services
CSP and XSS defenses for web surfaces
Rate limiting and WAF where exposed publicly

Run threat modeling lite for auth, payments, and admin paths.
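Rate limiting from the list above is often implemented as a token bucket. The class below is a minimal in-memory sketch with illustrative names; it only protects a single process and never evicts idle keys, so a shared store such as Redis and an eviction policy would be needed across replicas.

```javascript
// Token bucket: each key may burst up to `capacity` requests and is refilled
// at `refillPerSecond`. The clock is injectable so tests stay deterministic.
class TokenBucketLimiter {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.buckets = new Map(); // key -> { tokens, updatedAt }
  }

  // Returns true if the request identified by `key` (API key, IP) is allowed.
  allow(key) {
    const nowMs = this.now();
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: nowMs };

    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSeconds = (nowMs - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.updatedAt = nowMs;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(key, bucket);
    return allowed;
  }
}
```

When allow() returns false, respond with HTTP 429 so well-behaved clients know to back off.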

Data layer reliability

Migrations versioned and applied in CI/CD before app rollout
Backups automated; restore tested quarterly
Connection pooling sized to reality
Indexes reviewed for production query shapes
Idempotent consumers for queues

Document RPO/RTO targets (how much data loss and downtime are acceptable).
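Idempotent consumers from the list above usually come down to remembering which message IDs have already been handled. A minimal sketch, assuming each message carries a unique id field. The in-memory Set is a stand-in: in production the record would normally live in a uniquely keyed table written in the same transaction as the side effect, which also closes the race between two concurrent deliveries that this version does not handle.

```javascript
// Wraps a handler so a redelivered message is processed at most once.
function makeIdempotentConsumer(handler, seen = new Set()) {
  return async function consume(message) {
    if (seen.has(message.id)) {
      return { status: "duplicate" };
    }
    await handler(message);
    // Mark as processed only after the handler succeeds, so a failure
    // leaves the message eligible for retry.
    seen.add(message.id);
    return { status: "processed" };
  };
}
```

With this wrapper a queue that delivers at least once can retry freely without charging a customer twice.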

Deployment and rollback

Immutable artifacts (container digest, build ID)
Blue/green or rolling deploys with automatic rollback on health check failure
Database migrations backward-compatible when possible (expand-contract pattern)
Feature flags decouple deploy from release

Never SSH to prod to edit files. Patch image, redeploy.

Testing pyramid in production context

Unit tests for domain logic
Integration tests for DB and API contracts
Smoke tests post-deploy hitting /health/ready and one critical user journey
Load tests before major events (sales, launches)

Synthetic monitoring from outside your VPC catches DNS and certificate issues.
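A post-deploy smoke test can be a short script that the pipeline runs against the new version. The sketch below assumes Node 18+ for the global fetch; the /api/login path and its payload are hypothetical placeholders for your own critical journey.

```javascript
// Runs post-deploy checks against `baseUrl` and collects failures instead of
// stopping at the first one. `fetchFn` is injectable so the script can be
// tested without a network.
async function smokeTest(baseUrl, fetchFn = fetch) {
  const failures = [];

  const ready = await fetchFn(`${baseUrl}/health/ready`);
  if (ready.status !== 200) {
    failures.push(`/health/ready returned ${ready.status}`);
  }

  // Hypothetical critical journey: a dedicated smoke-test account logs in.
  const login = await fetchFn(`${baseUrl}/api/login`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ email: "smoke@example.com", password: "not-a-real-secret" }),
  });
  if (login.status !== 200) {
    failures.push(`login journey returned ${login.status}`);
  }

  return { ok: failures.length === 0, failures };
}
```

Exit non-zero when ok is false so the pipeline can trigger the rollback path. Network errors are left to throw, which fails the step as well.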

Incident response preparedness

On-call rotation with escalation
Runbooks for top five alerts (DB CPU, 5xx spike, queue backlog)
Status page communication template
Blameless postmortems with action items tracked

Practice once per quarter. Incidents are inevitable; chaos is optional.

Performance and capacity

Load test to 2× expected peak
Autoscaling policies with cooldowns
CDN for static assets
Budget Core Web Vitals for customer-facing apps

Capacity planning includes cost—autoscale max limits prevent surprise bills.
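The interaction between a scaling rule, a cooldown, and a max cap can be sketched in a few lines. The proportional rule below has the same shape as the one Kubernetes documents for its Horizontal Pod Autoscaler; the function names and the policy object are illustrative assumptions.

```javascript
// Proportional rule: scale replicas by observed / target utilization,
// then clamp to [min, max]. The max cap is what bounds the bill.
function desiredReplicas({ current, observedUtilization, targetUtilization, min, max }) {
  const raw = Math.ceil(current * (observedUtilization / targetUtilization));
  return Math.min(max, Math.max(min, raw));
}

// Applies a cooldown so the fleet does not flap between sizes.
// `state` is { replicas, lastScaledAtMs }; `policy` holds the target,
// bounds, and cooldownMs.
function nextReplicaCount(state, metrics, policy, nowMs) {
  const desired = desiredReplicas({ current: state.replicas, ...metrics, ...policy });
  const inCooldown = nowMs - state.lastScaledAtMs < policy.cooldownMs;
  if (desired === state.replicas || inCooldown) return state;
  return { replicas: desired, lastScaledAtMs: nowMs };
}
```

For example, 4 replicas at 90% utilization against a 60% target ask for 6; with max set to 5 the fleet stops at 5 and latency alerts, not the invoice, tell you the cap needs revisiting.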

Compliance and privacy (when applicable)

Data retention policies, encryption at rest, audit logs for admin actions, GDPR export/delete flows if you have EU users. Legal informs; engineering implements.

Documentation that ops actually uses

Architecture diagram (boxes and arrows, not novel)
How to rotate secrets
How to scale workers
Contact tree and severity definitions

Keep docs next to code; review in PRs when behavior changes.

Launch gate checklist

Item	Done?
Secrets not in repo	[ ]
Health + readiness endpoints	[ ]
Structured logging + trace IDs	[ ]
Alerts on 5xx rate and latency SLO	[ ]
Backup + tested restore	[ ]
Rollback procedure documented	[ ]
Rate limits on public API	[ ]
Runbook for top alerts	[ ]

Post-launch habits

Review error logs weekly
Deprecate unused feature flags
Pay down dependency CVEs
Revisit SLOs as usage grows

Production readiness is not a one-time gate—it is continuous operability.

Chaos engineering and failure injection

Game days become more valuable when you automate small experiments. Chaos engineering does not mean breaking production randomly—it means hypothesizing "what if Redis disappears?" and validating behavior in staging weekly.

Start with controlled failures:

Kill one API replica during load test—clients should retry or fail gracefully
Introduce 500ms latency to database calls—timeouts should fire, not pile up threads
Fill disk on a worker node: logging should degrade and the service should keep running or fail loudly, not crash silently

Record observations in the runbook: actual vs expected. Fix timeouts before adding replicas—scaling hides backpressure until everything falls over at once.
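The latency experiment above needs two pieces: something that injects delay and a timeout that fires. Both can be sketched without a chaos tool. The helper names are illustrative, and the timeout only cancels the underlying work if the operation honors the AbortSignal it is given.

```javascript
// Rejects if `operation` has not settled within `ms`. `operation` receives an
// AbortSignal that it may pass to fetch or a DB driver that supports one.
function withTimeout(operation, ms) {
  const controller = new AbortController();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms}ms`));
    }, ms);
    Promise.resolve()
      .then(() => operation(controller.signal))
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// Failure injection for staging: delays every call to `fn` by `delayMs`.
function withInjectedLatency(fn, delayMs) {
  return (...args) =>
    new Promise((resolve) => setTimeout(resolve, delayMs)).then(() => fn(...args));
}
```

Wrap a database call with withInjectedLatency(query, 500) in staging, call it through withTimeout with your production limit, and record in the runbook whether the caller saw a fast error or a pile of pending requests.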

Multi-region and disaster recovery (when you need them)

Not every product needs active-active regions on day one. Document RPO/RTO honestly: if backups are daily, you may lose up to twenty-four hours of data. Stakeholders must accept that or fund continuous archiving.

When you expand regions:

Route users to nearest edge for static assets
Replicate Postgres with understood lag; avoid cross-region synchronous writes unless requirements demand it
Use global load balancing with health checks that reflect readiness, not just TCP open ports

Failover drills should include DNS TTL realities—lowering TTL before migration prevents hour-long stale routes.

Cost and operability tradeoffs

Managed services (RDS, ElastiCache, managed Kafka) trade dollars for on-call hours. A three-person startup might run Postgres on a single VM with backups; a regulated fintech buys multi-AZ RDS and audit logs. Production-ready includes financial sustainability: autoscaling without max caps, log retention without lifecycle policies, and oversized staging environments all create surprise invoices.

Tag cloud resources by service and environment. Finance questions are easier when service=billing env=prod maps to a line item.

Collaboration with platform teams

If a platform team exists, align early on golden paths: approved base images, standard Helm charts, and observability agents preinstalled. Product teams ship features; platform teams shrink the gap between "works" and "operable." When you need an exception to platform requirements, write an RFC. Running a custom message broker without expertise is a future incident report.

Feature flags and progressive delivery

Feature flags separate deploy from release. Kill switches for new checkout flows prevent all-or-nothing rollouts. Flag hygiene matters: remove dead flags quarterly or tech debt accumulates in if (flag) branches nobody understands.

Pair flags with metrics: compare error rates between cohorts. Canary deploys at the infrastructure layer (5% traffic to new version) complement application-level flags.
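A percentage rollout with a kill switch can be sketched with a stable hash, so the same user always lands in the same cohort and error rates stay comparable. The flag config shape and function names are hypothetical; hosted flag services offer the same idea with targeting rules on top.

```javascript
const { createHash } = require("node:crypto");

// Maps (flag, user) to a stable bucket in [0, 100). Including the flag name
// keeps cohorts independent across flags.
function bucketFor(flagName, userId) {
  const digest = createHash("sha256").update(`${flagName}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// `flags` is an in-memory config: { name: { enabled, rolloutPercent } }.
// `enabled: false` is the kill switch and wins over any rollout percentage.
function isEnabled(flags, flagName, userId) {
  const flag = flags[flagName];
  if (!flag || !flag.enabled) return false;
  return bucketFor(flagName, userId) < flag.rolloutPercent;
}
```

Unknown flags evaluate to false, so deleting a dead flag from config is safe even before the last if (flag) branch is removed.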

Dependency readiness

Third-party APIs fail. Circuit breakers, cached fallbacks, and clear degraded-mode UX ("payments delayed, try again") beat infinite spinners. Timeouts on every outbound client are non-negotiable—document defaults in shared HTTP libraries so individual services do not invent their own.
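A circuit breaker with a fallback fits in one small class. This is a sketch of the pattern, not a replacement for a maintained library: the thresholds are assumed defaults, and it does not limit how many trial calls run concurrently once the reset period has passed.

```javascript
// After `failureThreshold` consecutive failures the circuit opens and calls
// go straight to the fallback. After `resetAfterMs` the next call is let
// through as a trial: success closes the circuit, failure reopens it.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30_000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock for tests
    this.failures = 0;
    this.openedAt = null;
  }

  async call(action, fallback) {
    const open = this.openedAt !== null && this.now() - this.openedAt < this.resetAfterMs;
    if (open) return fallback(new Error("circuit open"));
    try {
      const result = await action();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      return fallback(err);
    }
  }
}
```

The fallback is where the degraded-mode UX lives: return cached data or a "payments delayed, try again" response instead of letting the caller wait on a dependency that is known to be down.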

Legal pages and operational contacts

Production includes status page, support email, and privacy policy links users can find. On-call rotation must be staffed before launch—not "we will add paging later."

Minimum viable observability stack

You do not need every vendor on day one. A pragmatic stack:

Logs: structured JSON to your host or Loki/Datadog
Metrics: Prometheus or hosted equivalent with RED dashboards per service
Traces: OpenTelemetry SDK exporting to Tempo/Jaeger/Datadog
Uptime: synthetic checks on /health/ready and one login path

Add profiling (continuous or on-demand) when CPU or memory SLOs slip. The stack matters less than every engineer knowing where to click during an incident.

Runbooks should be executable, not prose-only: copy-paste commands, expected outputs, and links to dashboards. During outages, cognitive load is high—friction in docs becomes downtime.

Conclusion

Production-ready applications fail safely, tell you when they hurt, and recover without folklore. Invest in health checks, observability, secrets hygiene, backups, and practiced incident response alongside features.

Run a game day before your biggest launch marketing push. The checklist you complete that afternoon is cheaper than the outage tweet you avoid. Production readiness is proven under failure, not declared in slide decks.

Production readiness review

Schedule a game day before launch: simulate database failover, kill a random API pod, and verify graceful degradation messages reach users. Confirm feature flags can disable risky paths without redeploying. Validate backup restore drills quarterly—not only backup creation. Ensure on-call runbooks link to dashboards for golden signals (latency, traffic, errors, saturation) per service. Production readiness is less about a single checklist moment and more about rehearsed habits your team repeats every release train.

Workshop: apply this week

Pick one idea from this article and ship it before Friday. Write a short internal note explaining what changed, what metric you expect to move, and how you will verify the result. Share the note with your team so the learning compounds. If the experiment fails, document the failure mode—it is as valuable as success for the next engineer reading this guide.

