Building Production-Ready Applications: A Practical Checklist
From health checks and graceful shutdown to secrets, backups, and incident response—ship software you can run at 3 a.m. without heroics.
"Works on my machine" and "ready for production" are separated by a gap teams feel during the first real traffic spike, the first leaked API key, or the first database restore that takes twelve hours because nobody practiced.
Production-ready means operable: you can deploy safely, observe behavior, limit blast radius, recover data, and respond to incidents without guessing. This checklist is framework-agnostic: apply it whether you ship a Next.js SaaS, a Node API, or workers on Kubernetes.
Configuration and twelve-factor basics
- Config in environment variables, not committed .env files
- One codebase, many deploys via config only
- Stateless processes; session in Redis or DB if needed
- Logs as event streams to stdout for aggregation
- Graceful shutdown: handle SIGTERM, drain in-flight requests, close DB pools
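// Assumes `server` is an http.Server and `db` is a pool exposing end(), such as pg.Pool.
process.on("SIGTERM", () => {
  // Stop accepting new connections; the callback runs once in-flight requests finish.
  server.close(async () => {
    await db.end(); // close DB connections only after requests have drained
    process.exit(0);
  });
});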
Document required env vars in .env.example with descriptions.
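One way to keep that documentation honest is to fail fast at startup when a required variable is missing. The following is a minimal sketch, not a prescribed implementation: the variable names `DATABASE_URL` and `PORT` and the helper `loadConfig` are assumptions chosen for illustration.

```javascript
// Hypothetical list of required variables; mirror whatever .env.example documents.
const REQUIRED_VARS = ["DATABASE_URL", "PORT"];

// Reads config from an env-like object and throws if anything required is missing.
function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  const port = Number.parseInt(env.PORT, 10);
  if (Number.isNaN(port)) {
    throw new Error(`PORT must be an integer, got "${env.PORT}"`);
  }
  return Object.freeze({ databaseUrl: env.DATABASE_URL, port });
}

// Typical use: call once before the server starts listening.
// const config = loadConfig();
```

Passing the environment in as an argument keeps the function testable and makes it easy to validate a staging or production config in CI without booting the app.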
Health checks and readiness
Expose:
- /health/live: process up (for restart if hung)
- /health/ready: dependencies reachable (DB, cache)
Orchestrators use readiness to route traffic; liveness to restart. Do not conflate them—ready should fail when Postgres is down even if Node responds.
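The split between liveness and readiness can be sketched with Node's built-in HTTP handler signature. This is an illustration under assumptions: `checkReadiness`, `createHealthHandler`, and the shape of the `checks` object are hypothetical names, and the Postgres check in the wiring comment is a placeholder.

```javascript
// Runs every dependency check; ready only if all of them resolve.
// `checks` maps a name to an async function that throws when the dependency is down.
async function checkReadiness(checks) {
  const results = {};
  let ready = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
      results[name] = "ok";
    } catch {
      results[name] = "down";
      ready = false;
    }
  }
  return { status: ready ? 200 : 503, body: { ready, checks: results } };
}

// Builds a request handler; liveness never touches dependencies.
function createHealthHandler(checks) {
  return async (req, res) => {
    if (req.url === "/health/live") {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ live: true }));
    } else if (req.url === "/health/ready") {
      const { status, body } = await checkReadiness(checks);
      res.writeHead(status, { "Content-Type": "application/json" });
      res.end(JSON.stringify(body));
    } else {
      res.writeHead(404);
      res.end();
    }
  };
}

// Example wiring (not started here); `db` is a placeholder for your pool:
// const http = require("node:http");
// http.createServer(createHealthHandler({ postgres: () => db.query("SELECT 1") })).listen(3000);
```

Because the liveness branch does no I/O, a database outage takes the instance out of rotation through the 503 on readiness without triggering a restart loop.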
Observability trio
Logs: structured JSON with level, message, requestId, userId (hashed if needed).
Metrics: request rate, latency percentiles, error rate, queue depth, DB pool usage.
Traces: OpenTelemetry across HTTP and DB calls.
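A structured log line with those fields might look like the sketch below. The helper names `formatLog` and `hashUserId` are hypothetical, and the truncated, unsalted SHA-256 is an assumption made for brevity; a keyed hash (HMAC) would resist guessing better when IDs are low-entropy.

```javascript
const crypto = require("node:crypto");

// Hashes a user ID so logs can be correlated without storing the raw identifier.
function hashUserId(userId) {
  return crypto.createHash("sha256").update(String(userId)).digest("hex").slice(0, 16);
}

// Builds one JSON log line; callers write it to stdout for the aggregator.
function formatLog(level, message, { requestId, userId, ...extra } = {}) {
  return JSON.stringify({
    time: new Date().toISOString(),
    level,
    message,
    requestId,
    // Undefined values are dropped by JSON.stringify, so the field is omitted when absent.
    userId: userId === undefined ? undefined : hashUserId(userId),
    ...extra,
  });
}

console.log(formatLog("info", "order created", { requestId: "req-123", userId: 42, orderId: "o-9" }));
```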
Define SLOs (e.g., 99.9% availability, p95 < 300ms). Alert on error budget burn, not every CPU blip.
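To make "error budget burn" concrete, the arithmetic can be sketched as two small functions. The function names and the 30-day window are assumptions for illustration; pick the window your SLO actually uses.

```javascript
// Allowed downtime (in minutes) for a given availability SLO over a window.
function errorBudgetMinutes(slo, windowDays) {
  return (1 - slo) * windowDays * 24 * 60;
}

// Burn rate: how fast the budget is being consumed relative to plan.
// 1 means the budget lasts exactly the window; 10 means it is gone in a tenth of it.
function burnRate(failedRequests, totalRequests, slo) {
  if (totalRequests === 0) return 0;
  return failedRequests / totalRequests / (1 - slo);
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // prints 43.2
console.log(burnRate(50, 10000, 0.999).toFixed(1));    // prints 5.0
```

A 99.9% target over 30 days leaves about 43 minutes of downtime. A 0.5% error rate against that target burns the budget five times faster than planned, which is a more useful paging signal than a momentary CPU spike.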
Security baseline
- HTTPS terminated correctly; HSTS in production
- Dependency scanning (Dependabot, Snyk) in CI
- Secret managers (Vault, cloud SM)—rotate keys
- Principle of least privilege IAM for services
- CSP and XSS defenses for web surfaces
- Rate limiting and WAF where exposed publicly
Run lightweight threat modeling for auth, payments, and admin paths.
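For the rate-limiting item above, a token bucket is one common approach. The sketch below is in-memory and per-process, which is an assumption that only holds for a single replica; with several replicas the counters would need a shared store such as Redis. The name `createRateLimiter` and its options are hypothetical.

```javascript
// Token bucket: each key gets `capacity` tokens, refilled at `refillPerSecond`.
// `now` is injectable so the limiter can be tested without real clocks.
function createRateLimiter({ capacity, refillPerSecond, now = () => Date.now() }) {
  const buckets = new Map(); // key -> { tokens, updatedAt }

  return function allow(key) {
    const t = now();
    const bucket = buckets.get(key) ?? { tokens: capacity, updatedAt: t };
    const elapsedSeconds = (t - bucket.updatedAt) / 1000;
    // Refill based on elapsed time, never exceeding capacity.
    bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
    bucket.updatedAt = t;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    buckets.set(key, bucket);
    return allowed;
  };
}

// Example: 5 requests of burst per client, refilling one token per second.
const allowRequest = createRateLimiter({ capacity: 5, refillPerSecond: 1 });
console.log(allowRequest("client-a")); // first call for a new key is allowed
```

The map grows with the number of distinct keys, so a production version would also evict idle buckets.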
Data layer reliability
- Migrations versioned and applied in CI/CD before app rollout
- Backups automated; restore tested quarterly
- Connection pooling sized to reality
- Indexes reviewed for production query shapes
- Idempotent consumers for queues
Document RPO/RTO targets (how much data loss and downtime are acceptable).
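The idempotent-consumer item in the list above can be sketched as a wrapper that remembers processed message IDs. This assumes each message carries a stable `id` field, and it uses an in-memory `Set` purely for illustration; a real consumer would record IDs in a durable store, ideally in the same transaction as the side effect.

```javascript
// Wraps a handler so each message ID is processed at most once.
function createIdempotentConsumer(handler, seen = new Set()) {
  return async function consume(message) {
    if (seen.has(message.id)) {
      return { processed: false, reason: "duplicate" };
    }
    await handler(message);
    // Mark as seen only after success so a failed attempt can be retried.
    seen.add(message.id);
    return { processed: true };
  };
}

// Example: a redelivered message is acknowledged without repeating the side effect.
// const consume = createIdempotentConsumer(async (msg) => chargeCustomer(msg)); // chargeCustomer is hypothetical
```

Note that two concurrent deliveries of the same ID could both pass the check in this sketch; a unique constraint in the durable store is what closes that gap.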
Deployment and rollback
- Immutable artifacts (container digest, build ID)
- Blue/green or rolling deploys with automatic rollback on health check failure
- Database migrations backward-compatible when possible (expand-contract pattern)
- Feature flags decouple deploy from release
Never SSH to prod to edit files. Patch image, redeploy.
Testing pyramid in production context
- Unit tests for domain logic
- Integration tests for DB and API contracts
- Smoke tests post-deploy hitting /health/ready and one critical user journey
- Load tests before major events (sales, launches)
Synthetic monitoring from outside your VPC catches DNS and certificate issues.
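A post-deploy smoke test can be as small as the sketch below, which assumes Node 18+ for the global `fetch`. The function name `smokeTest`, the 5-second timeout, and the example URL and paths are assumptions for illustration.

```javascript
// Post-deploy smoke test: returns the list of failed checks (empty means pass).
// `fetchImpl` defaults to the global fetch and is injectable for testing.
async function smokeTest(baseUrl, paths, fetchImpl = fetch) {
  const failures = [];
  for (const path of paths) {
    try {
      const res = await fetchImpl(new URL(path, baseUrl), {
        signal: AbortSignal.timeout(5000), // never let a smoke test hang the pipeline
      });
      if (!res.ok) failures.push(`${path}: HTTP ${res.status}`);
    } catch (err) {
      failures.push(`${path}: ${err.message}`);
    }
  }
  return failures;
}

// Example use in a deploy script (hypothetical URL and paths):
// smokeTest("https://staging.example.com", ["/health/ready", "/login"]).then((failures) => {
//   if (failures.length > 0) { console.error(failures); process.exit(1); }
// });
```

Exiting non-zero on any failure lets the pipeline trigger its rollback step instead of relying on someone reading the output.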
Incident response preparedness
- On-call rotation with escalation
- Runbooks for top five alerts (DB CPU, 5xx spike, queue backlog)
- Status page communication template
- Blameless postmortems with action items tracked
Practice once per quarter. Incidents are inevitable; chaos is optional.
Performance and capacity
- Load test to 2× expected peak
- Autoscaling policies with cooldowns
- CDN for static assets
- Budget Core Web Vitals for customer-facing apps
Capacity planning includes cost—autoscale max limits prevent surprise bills.
Compliance and privacy (when applicable)
Data retention policies, encryption at rest, audit logs for admin actions, GDPR export/delete flows if you have EU users. Legal informs; engineering implements.
Documentation that ops actually uses
- Architecture diagram (boxes and arrows, not a novel)
- How to rotate secrets
- How to scale workers
- Contact tree and severity definitions
Keep docs next to code; review in PRs when behavior changes.
Launch gate checklist
| Item | Done? |
|---|---|
| Secrets not in repo | |
| Health + readiness endpoints | |
| Structured logging + trace IDs | |
| Alerts on 5xx rate and latency SLO | |
| Backup + tested restore | |
| Rollback procedure documented | |
| Rate limits on public API | |
| Runbook for top alerts | |
Post-launch habits
- Review error logs weekly
- Deprecate unused feature flags
- Pay down dependency CVEs
- Revisit SLOs as usage grows
Production readiness is not a one-time gate—it is continuous operability.
Chaos engineering and failure injection
Game days become more valuable when you automate small experiments. Chaos engineering does not mean breaking production randomly—it means hypothesizing "what if Redis disappears?" and validating behavior in staging weekly.
Start with controlled failures:
- Kill one API replica during load test—clients should retry or fail gracefully
- Introduce 500ms latency to database calls—timeouts should fire, not pile up threads
- Fill disk on a worker node—logging should degrade, not crash silently
Record observations in the runbook: actual vs expected. Fix timeouts before adding replicas—scaling hides backpressure until everything falls over at once.
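The latency experiment above can be rehearsed in code with two small helpers: one that injects delay and one that enforces a timeout. Both names (`withInjectedLatency`, `withTimeout`) are hypothetical, and the sketch only stops waiting; it does not cancel the underlying work.

```javascript
// Rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Failure injection for staging: delays a call by `delayMs` before running it.
function withInjectedLatency(fn, delayMs) {
  return async (...args) => {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return fn(...args);
  };
}

// Example: a query slowed by 500ms should trip a 200ms timeout.
const slowQuery = withInjectedLatency(async () => ["row"], 500);
withTimeout(slowQuery(), 200).catch((err) => console.log(err.message));
```

If the caller never rejects under injected latency, requests are queuing somewhere, which is the backpressure problem that adding replicas would only hide.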
Multi-region and disaster recovery (when you need them)
Not every product needs active-active regions on day one. Document RPO/RTO honestly: if backups are daily, you may lose up to twenty-four hours of data. Stakeholders must accept that or fund continuous archiving.
When you expand regions:
- Route users to nearest edge for static assets
- Replicate Postgres with understood lag; avoid cross-region synchronous writes unless requirements demand it
- Use global load balancing with health checks that reflect readiness, not just TCP open ports
Failover drills should include DNS TTL realities—lowering TTL before migration prevents hour-long stale routes.
Cost and operability tradeoffs
Managed services (RDS, ElastiCache, managed Kafka) trade dollars for on-call hours. A three-person startup might run Postgres on a single VM with backups; a regulated fintech buys multi-AZ RDS and audit logs. Production-ready includes financial sustainability: autoscaling without max caps, log retention without lifecycle policies, and oversized staging environments all create surprise invoices.
Tag cloud resources by service and environment. Finance questions are easier when service=billing env=prod maps to a line item.
Collaboration with platform teams
If a platform team exists, align early on golden paths: approved base images, standard Helm charts, and observability agents preinstalled. Product teams ship features; platform teams shrink the gap between "works" and "operable." When you need an exception to platform requirements, write an RFC: running a custom message broker without expertise is a future incident report.
Feature flags and progressive delivery
Feature flags separate deploy from release. Kill switches for new checkout flows prevent all-or-nothing rollouts. Flag hygiene matters: remove dead flags quarterly or tech debt accumulates in if (flag) branches nobody understands.
Pair flags with metrics: compare error rates between cohorts. Canary deploys at the infrastructure layer (5% traffic to new version) complement application-level flags.
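A kill switch combined with a percentage rollout can be sketched as follows. The flag shape `{ name, enabled, rolloutPercent }` and the function names are assumptions for illustration, not the API of any particular flag service.

```javascript
const crypto = require("node:crypto");

// Deterministically maps a user to a bucket in [0, 100) so the same user
// always lands in the same cohort for a given flag.
function bucketFor(flagName, userId) {
  const hash = crypto.createHash("sha256").update(`${flagName}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100;
}

// Hypothetical flag shape: { name, enabled, rolloutPercent }.
function isEnabled(flag, userId) {
  if (!flag.enabled) return false; // kill switch wins over any rollout
  return bucketFor(flag.name, userId) < flag.rolloutPercent;
}

const newCheckout = { name: "new-checkout", enabled: true, rolloutPercent: 5 };
console.log(isEnabled(newCheckout, "user-123"));
```

Hashing the flag name together with the user ID keeps cohorts stable across requests, which is what makes comparing error rates between them meaningful, and it avoids exposing the same users to every experiment.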
Dependency readiness
Third-party APIs fail. Circuit breakers, cached fallbacks, and clear degraded-mode UX ("payments delayed, try again") beat infinite spinners. Timeouts on every outbound client are non-negotiable—document defaults in shared HTTP libraries so individual services do not invent their own.
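A circuit breaker with a fallback can be sketched in a few lines. This is a deliberately minimal version under assumptions: the name `createCircuitBreaker` and its options are hypothetical, it counts consecutive failures only, and after the reset window it lets calls through without limiting how many trial calls run at once.

```javascript
// After `failureThreshold` consecutive failures the circuit opens and calls
// go straight to `fallback` until `resetMs` has passed.
function createCircuitBreaker(fn, { failureThreshold, resetMs, fallback, now = () => Date.now() }) {
  let failures = 0;
  let openedAt = null;

  return async function call(...args) {
    if (openedAt !== null && now() - openedAt < resetMs) {
      return fallback(...args); // open: skip the dependency entirely
    }
    try {
      const result = await fn(...args); // closed, or a trial call after resetMs
      failures = 0;
      openedAt = null;
      return result;
    } catch {
      failures += 1;
      if (failures >= failureThreshold) openedAt = now();
      return fallback(...args);
    }
  };
}

// Example: serve a cached price list while the pricing API is failing.
// const getPrices = createCircuitBreaker(fetchPrices, {   // fetchPrices is hypothetical
//   failureThreshold: 3, resetMs: 30000, fallback: () => cachedPrices,
// });
```

The wrapped function should carry its own timeout, since a call that hangs never counts as a failure here.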
Legal pages and operational contacts
Production includes status page, support email, and privacy policy links users can find. On-call rotation must be staffed before launch—not "we will add paging later."
Minimum viable observability stack
You do not need every vendor on day one. A pragmatic stack:
- Logs: structured JSON to your host or Loki/Datadog
- Metrics: Prometheus or hosted equivalent with RED dashboards per service
- Traces: OpenTelemetry SDK exporting to Tempo/Jaeger/Datadog
- Uptime: synthetic checks on /health/ready and one login path
Add profiling (continuous or on-demand) when CPU or memory SLOs slip. The stack matters less than every engineer knowing where to click during an incident.
Runbooks should be executable, not prose-only: copy-paste commands, expected outputs, and links to dashboards. During outages, cognitive load is high—friction in docs becomes downtime.
Conclusion
Production-ready applications fail safely, tell you when they hurt, and recover without folklore. Invest in health checks, observability, secrets hygiene, backups, and practiced incident response alongside features.
Run a game day before your biggest launch marketing push. The checklist you complete that afternoon is cheaper than the outage tweet you avoid. Production readiness is proven under failure, not declared in slide decks.
Production readiness review
Schedule a game day before launch: simulate database failover, kill a random API pod, and verify graceful degradation messages reach users. Confirm feature flags can disable risky paths without redeploying. Validate backup restore drills quarterly—not only backup creation. Ensure on-call runbooks link to dashboards for golden signals (latency, traffic, errors, saturation) per service. Production readiness is less about a single checklist moment and more about rehearsed habits your team repeats every release train.
Workshop: apply this week
Pick one idea from this article and ship it before Friday. Write a short internal note explaining what changed, what metric you expect to move, and how you will verify the result. Share the note with your team so the learning compounds. If the experiment fails, document the failure mode—it is as valuable as success for the next engineer reading this guide.
Frequently asked questions
- What is the minimum bar for production-ready?
- Automated tests on critical paths, monitored health endpoints, secrets outside code, rollback path, backups with tested restore, and runbooks for top failure modes. Fancy microservices optional; operability not optional.
- How do I know if my app is production-ready?
- Run a game day: simulate dependency failure, restore from backup, deploy a bad build and roll back, and page yourself. Gaps you feel during the exercise are backlog items, not shame.
- Should I prioritize features or production hardening?
- Harden incrementally alongside features—health checks and structured logging belong in the first sprint, not a mythical hardening week before launch. Untested launches borrow time from future incidents.