VPC Design Guide for Production Workloads

Your VPC is the foundation of everything you run in AWS. Get it wrong and you will spend months untangling subnet conflicts, routing issues, and security gaps. Get it right from the start and your infrastructure scales cleanly as your business grows. This guide covers the decisions that matter for production workloads, from CIDR planning to multi-VPC architectures.

Why VPC Design Matters

A VPC is not something you set up once and forget. It defines your network boundaries, controls traffic flow, and determines how services communicate. Poor VPC design creates problems that compound over time:

Overlapping CIDR ranges prevent you from connecting VPCs together later. If your production VPC uses 10.0.0.0/16 and your development VPC also uses 10.0.0.0/16, you cannot peer them or route traffic between them through Transit Gateway. Rearchitecting CIDR ranges means migrating every resource.

Undersized address space means you run out of IP addresses as workloads grow. Adding secondary CIDR blocks works but adds routing complexity. Planning for growth upfront avoids this entirely.

Flat network architectures where everything lives in public subnets with overly permissive security groups create an attack surface that grows with every new resource you deploy.

CIDR Planning: Leave Room to Grow

Choose your VPC CIDR block carefully because changing it later is painful. The key principles:

Use a /16 for production VPCs. A /16 is the largest block a VPC allows and gives you 65,536 IP addresses. That sounds excessive until you factor in multiple subnets across three AZs, EKS pod networking, Lambda ENIs, and future growth. You are unlikely to use all of them, but the headroom keeps you from running out of addresses as workloads grow.

Avoid overlaps across all VPCs and on-premises networks. Document every CIDR range in use. Use a consistent scheme like 10.0.0.0/16 for production, 10.1.0.0/16 for staging, 10.2.0.0/16 for development. If you have on-premises networks using 10.x ranges, shift to 172.16.0.0/12 space.

Plan subnets in advance. Divide your /16 into /20 or /24 subnets depending on workload density. A /20 contains 4,096 addresses per subnet (4,091 usable, since AWS reserves five addresses in every subnet), which handles most production workloads comfortably. Reserve CIDR blocks for future subnet tiers even if you do not need them today.
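The overlap and sizing rules above can be checked mechanically before any VPC is created. The following is a minimal sketch using Python's standard ipaddress module. The registry contents, including the conflicting on-premises range, are hypothetical values chosen to show a failure, and the function names are illustrative rather than part of any AWS tooling.

```python
import ipaddress
from itertools import combinations

# Hypothetical CIDR registry; names and ranges are illustrative only.
CIDR_REGISTRY = {
    "production": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "development": "10.2.0.0/16",
    "on-premises": "10.2.128.0/17",  # deliberately conflicts with development
}

# AWS reserves five addresses in every subnet (the first four and the last).
AWS_RESERVED_PER_SUBNET = 5


def find_overlaps(registry):
    """Return pairs of registry names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in registry.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


def usable_addresses(cidr):
    """Addresses left in a subnet after the AWS-reserved ones are removed."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# How many /20 subnets fit in a /16 VPC.
slash_20_blocks = list(ipaddress.ip_network("10.0.0.0/16").subnets(new_prefix=20))

print(find_overlaps(CIDR_REGISTRY))       # [('development', 'on-premises')]
print(len(slash_20_blocks))               # 16
print(usable_addresses("10.0.0.0/20"))    # 4091
```

Running a check like this against the documented list of ranges whenever a new VPC or on-premises network is proposed is one way to keep the registry honest.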

Subnet Strategy: Public, Private App, Private Data

A well-designed VPC uses at least three subnet tiers, each replicated across multiple Availability Zones:

Public subnets contain resources that need direct internet access: Application Load Balancers, NAT Gateways, and bastion hosts. Nothing else belongs here. Keep public subnets small since they host few resources.

Private application subnets host your compute workloads: EC2 instances, ECS tasks, EKS pods, and Lambda functions. These subnets route outbound traffic through NAT Gateways but are not directly reachable from the internet.

Private data subnets host databases, ElastiCache clusters, and other data stores. These have no internet route at all, not even through a NAT Gateway. They communicate only with the application tier subnets above them.

Replicate across AZs. Create each subnet tier in at least two, ideally three, Availability Zones. This gives you high availability and allows services like RDS Multi-AZ and ALB to function properly. A three-AZ, three-tier design gives you nine subnets minimum.
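As a rough illustration of the nine-subnet layout, the sketch below carves one block per tier and AZ out of a /16. It assumes uniform /20 subnets for simplicity, even though public subnets can be smaller in practice, and the tier names and AZ suffixes are placeholders.

```python
import ipaddress

TIERS = ("public", "private-app", "private-data")
AZS = ("a", "b", "c")  # hypothetical AZ suffixes


def plan_subnets(vpc_cidr, new_prefix=20):
    """Assign one subnet per (tier, AZ) pair, in order, from the VPC block."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for tier in TIERS:
        for az in AZS:
            plan[(tier, az)] = next(blocks)
    return plan


plan = plan_subnets("10.0.0.0/16")
for (tier, az), block in plan.items():
    print(f"{tier:<13} {az}  {block}")
```

With /20 blocks this uses nine of the sixteen available subnets, which leaves seven unallocated for future tiers.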

Routing Tables: Controlling Traffic Flow

Each subnet tier needs its own route table with appropriate routes:

Public subnet route table: Local VPC route (automatic) plus a default route (0.0.0.0/0) pointing to the Internet Gateway. This allows ALBs and NAT Gateways to communicate with the internet.

Private application route tables: Local VPC route plus a default route pointing to the NAT Gateway in the same AZ. Deploy one NAT Gateway per AZ for high availability. Because a route table holds only one default route, this means one private application route table per AZ. If one AZ fails, traffic from other AZs is unaffected.

Private data route table: Local VPC route only. No default route. Data tier resources should not initiate outbound connections to the internet. If a data tier resource needs to reach AWS APIs (for example, a self-managed database writing backups to S3), use VPC endpoints instead of NAT Gateways.
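The three route table types can be written down as plain data, which makes the intent easy to review. This is only a sketch: the targets "igw" and "nat-a" are placeholder labels rather than real AWS resource IDs, and it assumes one private application route table per AZ so that each can point at its own NAT Gateway.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def build_route_tables(vpc_cidr, azs):
    """Return a dict of route tables, each mapping destination to target."""
    tables = {"public": {vpc_cidr: "local", DEFAULT_ROUTE: "igw"}}
    for az in azs:
        # Application subnets in each AZ use the NAT Gateway in the same AZ.
        tables[f"private-app-{az}"] = {vpc_cidr: "local", DEFAULT_ROUTE: f"nat-{az}"}
    # The data tier gets the local route only, so it has no internet path.
    tables["private-data"] = {vpc_cidr: "local"}
    return tables


def has_internet_path(route_table):
    """True if the table has a default route of any kind."""
    return DEFAULT_ROUTE in route_table


tables = build_route_tables("10.0.0.0/16", ["a", "b", "c"])
for name, routes in tables.items():
    print(name, routes, "internet path:", has_internet_path(routes))
```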

Security Groups vs NACLs

Both control traffic, but they work differently and serve different purposes:

Security groups are stateful firewalls attached to individual resources. They are your primary traffic control mechanism. Use them to define which resources can talk to each other. Reference other security groups as sources to create dynamic, self-maintaining rules. For example, your application security group allows inbound on port 443 from the ALB security group.

NACLs are stateless firewalls at the subnet level. They evaluate rules in ascending rule-number order, apply the first rule that matches, and require explicit allow rules for both inbound and outbound traffic, including return traffic on ephemeral ports. Use NACLs as a coarse-grained backup layer, for example blocking known malicious IP ranges or restricting entire port ranges at the subnet boundary.

Best practice: Do most of your filtering with security groups. Keep NACLs simple with broad allow rules and specific deny rules for known threats. Overly complex NACLs are difficult to troubleshoot and easy to misconfigure.
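To make the "specific deny rules, broad allow rules" advice concrete, here is a simplified model of how an ordered NACL rule list is evaluated: lowest rule number first, first match wins, and anything unmatched is denied. It is a sketch, not a reproduction of AWS behavior in full. It ignores protocol, covers one direction only, and the NaclRule type is invented for this example. The blocked range is a documentation-only address block standing in for a real threat list.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    cidr: str
    port_range: tuple  # (low, high), inclusive
    action: str        # "allow" or "deny"


def evaluate(rules, source_ip, port):
    """Return the action of the lowest-numbered matching rule, else 'deny'."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        low, high = rule.port_range
        if src in ipaddress.ip_network(rule.cidr) and low <= port <= high:
            return rule.action
    return "deny"  # mirrors the implicit final deny


INBOUND = [
    NaclRule(200, "0.0.0.0/0", (443, 443), "allow"),
    NaclRule(100, "198.51.100.0/24", (0, 65535), "deny"),  # stand-in blocked range
]

print(evaluate(INBOUND, "198.51.100.7", 443))  # deny: rule 100 matches first
print(evaluate(INBOUND, "203.0.113.9", 443))   # allow: rule 200
print(evaluate(INBOUND, "203.0.113.9", 22))    # deny: no rule matches
```

Because NACLs are stateless, a real configuration would need a matching outbound rule list for the responses, which is one reason to keep them short.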

VPC Peering vs Transit Gateway

When you need to connect multiple VPCs, you have two main options:

VPC peering creates a direct one-to-one connection between two VPCs. It is simple, low-cost, and low-latency. Traffic stays on the AWS backbone. Peering works well when you have two or three VPCs that need to communicate. The limitation is that peering is non-transitive. If VPC A peers with VPC B, and VPC B peers with VPC C, traffic from A cannot reach C through B.

Transit Gateway is a regional hub that connects multiple VPCs and on-premises networks through a single gateway. It supports transitive routing, route tables for segmentation, and scales to thousands of connections. Use Transit Gateway when you have more than three VPCs, need transitive routing, or connect to on-premises via VPN or Direct Connect.

Cost consideration: VPC peering has no hourly charge. You pay only for data transfer. Transit Gateway charges about $0.05/hour per attachment plus $0.02/GB of data processed (rates vary by Region, so check current pricing). For low-traffic connections between a few VPCs, peering is cheaper. For complex multi-VPC architectures, Transit Gateway's operational simplicity justifies the cost.
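The trade-off can be put into numbers with a little arithmetic. The sketch below uses the Transit Gateway prices quoted in this section as illustrative inputs and assumes a 730-hour month. It also counts how many peering connections a full mesh needs, since non-transitive peering requires one connection per pair of VPCs.

```python
HOURS_PER_MONTH = 730  # assumption: average month length for estimates

# Prices quoted in this section; treat them as illustrative inputs.
TGW_ATTACHMENT_PER_HOUR = 0.05
TGW_PER_GB = 0.02


def full_mesh_peerings(vpc_count):
    """Peering is non-transitive, so a full mesh needs n*(n-1)/2 connections."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_monthly_cost(attachments, gb_processed):
    """Attachment-hours plus data processing, rounded to cents."""
    hourly = attachments * TGW_ATTACHMENT_PER_HOUR * HOURS_PER_MONTH
    return round(hourly + gb_processed * TGW_PER_GB, 2)


for n in (3, 5, 10):
    print(n, "VPCs:", full_mesh_peerings(n), "peerings vs",
          tgw_monthly_cost(n, gb_processed=1000), "USD/month on Transit Gateway")
```

Three VPCs need only three peering connections, but ten need forty-five, which is where managing a mesh starts to cost more in effort than the Transit Gateway attachment fees.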

VPC Endpoints: Avoid NAT Gateway Costs

VPC endpoints let your private resources access AWS services without sending traffic through a NAT Gateway or the public internet:

Gateway endpoints are free and support S3 and DynamoDB. They add a route to your route table that directs traffic to the service through AWS internal networks. Every production VPC should have gateway endpoints for S3 and DynamoDB. There is no reason not to since they cost nothing and reduce NAT Gateway data processing charges.

Interface endpoints create an ENI in your subnet with a private IP address. They support hundreds of AWS services including SQS, SNS, KMS, Secrets Manager, and CloudWatch. They cost about $0.01/hour per endpoint in each AZ plus a per-GB data processing charge, but eliminate NAT Gateway charges for that traffic. Add interface endpoints for services your workloads call frequently.

Private DNS: Enable private DNS on interface endpoints so that the standard AWS service endpoint (e.g., sqs.us-east-1.amazonaws.com) resolves to your private endpoint automatically. No application code changes required.
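Whether an interface endpoint pays for itself depends on traffic volume. The following sketch estimates the break-even point against NAT Gateway processing. It assumes a 730-hour month, the $0.045/GB NAT figure quoted elsewhere in this guide, hourly billing per AZ, and an endpoint data processing rate of $0.01/GB. That last rate is not stated in this guide and should be replaced with current pricing for your Region.

```python
HOURS_PER_MONTH = 730       # assumption: average month length
NAT_PER_GB = 0.045          # NAT Gateway processing figure quoted in this guide
ENDPOINT_PER_HOUR = 0.01    # interface endpoint hourly figure, billed per AZ here
ENDPOINT_PER_GB = 0.01      # assumption: not stated in this guide


def endpoint_monthly_cost(gb, az_count):
    """Fixed hourly charge across AZs plus per-GB processing."""
    return ENDPOINT_PER_HOUR * HOURS_PER_MONTH * az_count + ENDPOINT_PER_GB * gb


def nat_processing_cost(gb):
    """NAT Gateway data processing only; hourly NAT charges are excluded."""
    return NAT_PER_GB * gb


def break_even_gb(az_count):
    """Monthly GB above which the endpoint is cheaper than NAT processing."""
    fixed = ENDPOINT_PER_HOUR * HOURS_PER_MONTH * az_count
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


print(round(break_even_gb(3)))  # about 626 GB/month for a three-AZ endpoint
```

Under these assumptions, a service that moves less than a few hundred GB per month through the NAT Gateway may not justify its own interface endpoint on cost alone, although the security benefit of keeping traffic off the public path still applies.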

DNS Resolution and Route 53 Resolver

DNS in a VPC is handled by the Amazon-provided DNS server at the VPC CIDR base plus two (e.g., 10.0.0.2). For most workloads this just works. Things get more complex with hybrid environments:

Route 53 Resolver allows bidirectional DNS resolution between your VPC and on-premises networks. Inbound endpoints let on-premises clients resolve private hosted zones in your VPC. Outbound endpoints let VPC resources resolve DNS names in your corporate domain.

Private hosted zones give your internal services human-readable DNS names (e.g., api.internal.company.com) that resolve only within associated VPCs. Use these instead of hardcoding private IP addresses. Associate private hosted zones with all VPCs that need to resolve those names.
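The "base plus two" rule for the Amazon-provided DNS server mentioned at the start of this section is simple address arithmetic. A small sketch, with a hypothetical helper name:

```python
import ipaddress


def vpc_dns_resolver(vpc_cidr):
    """Return the VPC network base address plus two."""
    return ipaddress.ip_network(vpc_cidr).network_address + 2


print(vpc_dns_resolver("10.0.0.0/16"))  # 10.0.0.2
print(vpc_dns_resolver("10.1.0.0/16"))  # 10.1.0.2
```

This is handy when on-premises forwarders or firewall rules need the resolver address for each VPC in a CIDR registry.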

VPC Flow Logs for Visibility

VPC Flow Logs capture metadata about IP traffic flowing through your network interfaces. They are essential for security monitoring, troubleshooting connectivity issues, and understanding traffic patterns.

Enable flow logs at the VPC level to capture all traffic across all subnets and ENIs. Send them to CloudWatch Logs for real-time analysis or to S3 for long-term storage and cost-effective querying with Athena.

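Use custom log formats to capture additional fields such as TCP flags, flow direction, traffic path, and the packet-level source and destination addresses. The default format captures source and destination addresses and ports, protocol, packet and byte counts, and action (ACCEPT/REJECT). Flow logs do not name the security group or NACL rule that dropped a packet, but the extra fields help you narrow it down during troubleshooting. For example, an accepted inbound flow whose response is rejected points to a NACL, because security groups are stateful and allow return traffic automatically.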
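As a starting point for troubleshooting, default-format records are easy to parse because they are space-separated with a fixed field order. The sketch below counts rejected flows per destination port. The sample records are made up for illustration, and the function names are not part of any AWS SDK.

```python
from collections import Counter

# Field order of the default (version 2) flow log format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_record(line):
    """Split one space-separated default-format record into a dict."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    return dict(zip(DEFAULT_FIELDS, values))


def rejected_by_port(lines):
    """Count REJECT records per destination port."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record["action"] == "REJECT":
            counts[record["dstport"]] += 1
    return counts


# Made-up sample records in the default format.
SAMPLE = [
    "2 123456789010 eni-0abc 10.0.48.12 10.0.96.5 49152 5432 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789010 eni-0abc 198.51.100.7 10.0.0.10 40000 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789010 eni-0abc 198.51.100.8 10.0.0.10 40001 22 6 2 120 1700000000 1700000060 REJECT OK",
]

print(rejected_by_port(SAMPLE))  # Counter({'22': 2})
```

For logs stored in S3 at any real volume, the same grouping is better expressed as an Athena query. This version is only meant for spot checks on a handful of records.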

Multi-VPC vs Single-VPC Architectures

The question of whether to use one VPC or many depends on your organization's complexity:

Single VPC is appropriate for small teams running a single application, with dev, staging, and production each in its own account and each account holding one VPC. It is simple to manage and has no cross-VPC networking complexity. Most startups and small businesses start here.

Multi-VPC makes sense when you need network-level isolation between workloads, different security requirements per application, or when teams need independent control over their network configuration. Common patterns include one VPC per environment, one VPC per business unit, or a shared services VPC with spoke VPCs for workloads.

Shared services VPC: A common pattern places shared infrastructure like Active Directory, CI/CD tools, and monitoring in a central VPC connected to workload VPCs via Transit Gateway. This reduces duplication while maintaining workload isolation.

Common VPC Design Mistakes

We see the same mistakes repeatedly when reviewing client environments:

Using the default VPC for production. The default VPC has public subnets, an internet gateway, and auto-assign public IP enabled. It was designed for getting started quickly, not for production workloads.

Putting databases in public subnets. RDS instances should never have public accessibility enabled. Place them in private data subnets with security groups restricting access to the application tier only.

Single NAT Gateway for all AZs. If that AZ fails, all private subnets lose internet access. Deploy one NAT Gateway per AZ for resilience.

Overly permissive security groups. Rules allowing 0.0.0.0/0 on all ports defeat the purpose of having a VPC. Restrict to specific ports and source security groups.

No VPC endpoints for S3. Every request from a private subnet to S3 through a NAT Gateway costs about $0.045/GB in processing fees (the rate varies by Region). Gateway endpoints for S3 are free.

Ignoring CIDR planning. Picking random CIDR ranges without documenting them makes future peering and Transit Gateway connections impossible without migration.
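Several of these mistakes can be caught by a simple automated review. The sketch below checks a plain-dict description of a VPC against five of the items above. The dictionary schema and the audit_vpc function are hypothetical and invented for this example. In practice the input would be assembled from your infrastructure-as-code or from AWS API responses.

```python
def audit_vpc(vpc):
    """Return a list of findings for a plain-dict VPC description."""
    findings = []
    if vpc.get("is_default"):
        findings.append("default VPC used for production")
    # Every AZ in use should have its own NAT Gateway.
    nat_azs = {nat["az"] for nat in vpc.get("nat_gateways", [])}
    missing = sorted(set(vpc.get("azs", [])) - nat_azs)
    if missing:
        findings.append(f"no NAT Gateway in AZ(s): {', '.join(missing)}")
    for db in vpc.get("databases", []):
        if db.get("publicly_accessible"):
            findings.append(f"database {db['name']} is publicly accessible")
    for rule in vpc.get("ingress_rules", []):
        if rule["cidr"] == "0.0.0.0/0" and rule["ports"] == "all":
            findings.append(f"security group {rule['group']} allows all ports from anywhere")
    if "s3" not in vpc.get("gateway_endpoints", []):
        findings.append("no S3 gateway endpoint")
    return findings


# Hypothetical environment that exhibits four of the mistakes.
EXAMPLE = {
    "is_default": False,
    "azs": ["a", "b", "c"],
    "nat_gateways": [{"az": "a"}],
    "databases": [{"name": "orders-db", "publicly_accessible": True}],
    "ingress_rules": [{"group": "app-sg", "cidr": "0.0.0.0/0", "ports": "all"}],
    "gateway_endpoints": [],
}

for finding in audit_vpc(EXAMPLE):
    print("-", finding)
```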

VPC Design Is a Day-One Decision

The cost of rearchitecting a VPC after production workloads are running is significant. Every IP address change means downtime or complex migration. Every CIDR conflict means routing workarounds. Invest the time to design your VPC properly before deploying workloads and you will avoid months of technical debt later.