RKE2 HA Kubernetes on AWS


Overview

This Infrastructure-as-Code (IaC) project provisions a highly available Rancher Kubernetes Engine 2 (RKE2) cluster on AWS using Terraform for infrastructure and Ansible for cluster bootstrap and configuration.

The goal of this project is to demonstrate a production-style Kubernetes deployment with fully automated provisioning and configuration management, suitable for Proof-of-Concept (PoC) and learning environments.

Key features include:

  • Multi-node high availability control plane with embedded etcd
  • NGINX TCP load balancing for Kubernetes API and RKE2 supervisor traffic
  • Ansible roles with Jinja2 templating, idempotent tasks, and handlers
  • Dynamic inventory — Terraform outputs feed directly into Ansible, no manual IP management
  • Bastion-host access model for secure, private-subnet cluster operations
  • End-to-end cluster validation after deployment

Architecture

High-Level Architecture

The cluster is deployed within a dedicated AWS VPC with public and private subnets.

(Architecture diagram: RKE2-Kubernetes-HA-AWS.drawio)

Components

  • NGINX Load Balancer / Bastion (t3.micro, public subnet)

    • SSH jump host for operator and Ansible access
    • TCP load balancer for Kubernetes API (port 6443) and RKE2 supervisor (port 9345)
  • Control Plane Nodes × 3 (t3.medium, private subnet)

    • Run Kubernetes control plane components
    • Form an embedded etcd quorum for high availability
  • Worker Nodes (t3.medium, private subnet)

    • Run application workloads (kubelet + kube-proxy)
  • Networking

    • VPC 10.0.0.0/16 with public (10.0.1.0/24) and private (10.0.2.0/24) subnets
    • NAT Gateway for outbound internet access from private nodes
    • No direct public access to control plane or worker nodes

Stack

Layer            Tool               Purpose
Infrastructure   Terraform          VPC, subnets, EC2, security groups, SSH key, NAT
Inventory        Python (dynamic)   Translates Terraform output → Ansible groups + hostvars
Configuration    Ansible            Node prep, NGINX, RKE2 install, kubeconfig setup
Templates        Jinja2             NGINX config, RKE2 config.yaml files

Project Structure

.
├── terraform/                  # AWS infrastructure
│   ├── vpc.tf                  # VPC, subnets, IGW, NAT, route tables
│   ├── compute.tf              # EC2 instances (bastion, control planes, workers)
│   ├── security.tf             # Security groups
│   ├── ssh.tf                  # Auto-generated RSA key pair
│   ├── inventory.tf            # Generates inventory/inventory.json
│   ├── outputs.tf
│   └── variables.tf
│
├── inventory/
│   └── inventory.json          # Generated by Terraform — do not edit manually
│
├── ansible/
│   ├── ansible.cfg             # Ansible settings (inventory, SSH options)
│   ├── inventory.py            # Dynamic inventory: reads inventory.json,
│   │                           #   outputs Ansible JSON with ProxyCommand per host
│   ├── site.yml                # Master playbook — runs all roles in order
│   ├── group_vars/
│   │   └── all.yml             # Shared variables (timeouts, retry counts)
│   └── roles/
│       ├── common/             # All k8s nodes: disable swap, install deps
│       ├── nginx_lb/           # Bastion: install NGINX, deploy stream config
│       ├── rke2_init/          # CP-1: install RKE2, cluster-init, start etcd
│       ├── rke2_cp/            # CP-2/3: fetch token, join cluster (serial: 1)
│       ├── rke2_worker/        # Workers: install rke2-agent, register node
│       └── kubectl_access/     # Bastion: install kubectl, configure kubeconfig
│
├── install.sh                  # One-command: terraform apply → ansible-playbook
├── start.sh                    # Bootstrap only (assumes infra exists)
└── shutdown.sh                 # terraform destroy
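
As a rough sketch, terraform/inventory.tf can render inventory/inventory.json with a local_file resource. The resource and attribute names below are illustrative assumptions, not the repo's actual code:

```hcl
# Hypothetical sketch — resource names and output shape are assumptions.
resource "local_file" "inventory" {
  filename = "${path.module}/../inventory/inventory.json"
  content = jsonencode({
    bastion_public_ip = aws_instance.bastion.public_ip
    control_plane_ips = aws_instance.control_plane[*].private_ip
    worker_ips        = aws_instance.worker[*].private_ip
  })
}
```

Because the file is regenerated on every terraform apply, it should never be edited by hand.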

Ansible Playbook Flow

ansible/site.yml runs 7 plays in sequence:

Play  Hosts                Role            What it does
1     k8s_nodes            common          Disable swap, install curl/jq
2     bastion              nginx_lb        Install NGINX, deploy Jinja2 stream config, restart via handler
3     control_plane_init   rke2_init       Install RKE2, write cluster-init: true config, start service, wait for port 9345
4     control_plane_join   rke2_cp         Fetch token (delegate_to CP1), install RKE2, join cluster — serial: 1
5     workers              rke2_worker     Fetch token, install rke2-agent, register with cluster
6     bastion              kubectl_access  Install kubectl, slurp kubeconfig from CP1, patch server URL to 127.0.0.1
7     bastion              inline tasks    Wait for all nodes Ready, label workers, verify system pods
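
A condensed sketch of the join play (play 4) shows how the token fetch and serialized join fit together. Task names and variables here are illustrative, not the repo's exact code:

```yaml
# Illustrative sketch of the control-plane join play — not the repo's exact tasks.
- hosts: control_plane_join
  serial: 1                 # join one node at a time to preserve etcd quorum
  tasks:
    - name: Fetch join token from the first control plane
      slurp:
        src: /var/lib/rancher/rke2/server/node-token
      delegate_to: "{{ groups['control_plane_init'][0] }}"
      register: rke2_token

    - name: Write RKE2 config pointing at the load balancer
      copy:
        dest: /etc/rancher/rke2/config.yaml
        content: |
          server: https://{{ lb_private_ip }}:9345
          token: {{ rke2_token.content | b64decode | trim }}
```

With serial: 1, Ansible fully completes the play on one joining node before starting the next, so etcd never sees two members joining simultaneously.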

Dynamic Inventory

ansible/inventory.py reads inventory/inventory.json (generated by Terraform) and outputs Ansible inventory JSON with:

  • Groups: bastion, control_plane_init, control_plane_join, control_plane, workers, k8s_nodes
  • Per-host vars: ansible_ssh_private_key_file, ansible_ssh_common_args with ProxyCommand for private nodes
  • Shared vars: lb_private_ip, rke2_version injected into every host

This means no manual inventory editing — you can destroy and recreate the infrastructure, and Ansible automatically picks up the new IPs.
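
The script's output follows the standard Ansible dynamic-inventory JSON shape; all IPs and the rke2_version value below are illustrative placeholders:

```json
{
  "bastion": { "hosts": ["203.0.113.10"] },
  "control_plane_init": { "hosts": ["10.0.2.10"] },
  "control_plane_join": { "hosts": ["10.0.2.11", "10.0.2.12"] },
  "control_plane": { "children": ["control_plane_init", "control_plane_join"] },
  "workers": { "hosts": ["10.0.2.20"] },
  "k8s_nodes": { "children": ["control_plane", "workers"] },
  "_meta": {
    "hostvars": {
      "10.0.2.10": {
        "ansible_ssh_private_key_file": "terraform/ssh_key.pem",
        "ansible_ssh_common_args": "-o ProxyCommand='ssh -W %h:%p -i terraform/ssh_key.pem ubuntu@203.0.113.10'",
        "lb_private_ip": "10.0.2.1",
        "rke2_version": "v1.28.0+rke2r1"
      }
    }
  }
}
```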


Key Design Decisions

  • Embedded etcd on control planes — no external etcd to manage or back up separately
  • Jinja2 NGINX template — control plane IPs rendered at runtime from inventory groups
  • serial: 1 for CP joins — nodes join the etcd cluster one at a time to preserve quorum
  • delegate_to for token fetch — Ansible retrieves the RKE2 join token from CP1 directly, no intermediate files
  • wait_for after service start — explicit readiness checks on port 9345 before proceeding to the next play
  • TLS SANs include LB IP — kubectl connects via the load balancer without certificate errors
  • Bastion ProxyCommand — all private-node SSH tunnels through bastion, configured per-host in dynamic inventory
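
For illustration, the NGINX stream config rendered from the Jinja2 template looks roughly like this; upstream names and control-plane IPs are illustrative, and the repo's actual template differs:

```nginx
# Illustrative rendering of the Jinja2 stream template — not the repo's exact file.
stream {
    upstream kube_apiserver {
        server 10.0.2.10:6443;
        server 10.0.2.11:6443;
        server 10.0.2.12:6443;
    }
    upstream rke2_supervisor {
        server 10.0.2.10:9345;
        server 10.0.2.11:9345;
        server 10.0.2.12:9345;
    }
    server { listen 6443; proxy_pass kube_apiserver; }
    server { listen 9345; proxy_pass rke2_supervisor; }
}
```

Because this is plain TCP (layer 4) proxying, NGINX never terminates TLS — the Kubernetes API certificates pass through untouched, which is why the LB IP must appear in the TLS SANs.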

Installation

Prerequisites

Local tools

  • Linux or macOS with Bash
  • Terraform v1.5+
  • Ansible (pip install ansible)
  • Python 3
  • AWS CLI configured with valid credentials (aws configure)

AWS account permissions

  • Create VPCs, subnets, and route tables
  • Launch EC2 instances and manage security groups
  • Provision NAT Gateways and Elastic IPs

Verify everything is in place before running:

terraform version && ansible --version && python3 --version && aws sts get-caller-identity

Quick Install — single command

curl -fsSL https://raw.githubusercontent.com/Abhiram-Rakesh/RKE2-Kubernetes-HA-AWS/main/install.sh | bash

If the repo is not already present locally, the script clones it to ~/RKE2-Kubernetes-HA-AWS automatically before proceeding.

install.sh will:

  1. Check all local prerequisites and AWS credentials
  2. Run terraform init and terraform apply to provision the VPC, subnets, EC2 instances, and security groups
  3. Generate inventory/inventory.json from Terraform outputs
  4. Run ansible-playbook ansible/site.yml to bootstrap the full cluster end-to-end

Manual Install — step by step

Use this if you want full control over each phase, or if you need to re-run a specific step.

Step 1 — Clone the repository

git clone https://github.com/Abhiram-Rakesh/RKE2-Kubernetes-HA-AWS.git
cd RKE2-Kubernetes-HA-AWS

Step 2 — Install Ansible

pip install ansible

Step 3 — Provision AWS infrastructure

cd terraform
terraform init
terraform apply
cd ..

Terraform will create the VPC, subnets, IGW, NAT Gateway, security groups, EC2 instances, and write inventory/inventory.json.

Step 4 — Make the dynamic inventory executable

chmod +x ansible/inventory.py

Step 5 — Bootstrap the cluster

cd ansible
ansible-playbook site.yml

This runs all 7 plays in sequence: node prep → NGINX LB → CP init → CP join → workers → kubeconfig → verify.


Access the Cluster

The bastion public IP is printed by Terraform at the end of terraform apply. SSH in using the auto-generated key:

# curl install
ssh -i ~/RKE2-Kubernetes-HA-AWS/terraform/ssh_key.pem ubuntu@<BASTION_PUBLIC_IP>

# manual install (from repo root)
ssh -i terraform/ssh_key.pem ubuntu@<BASTION_PUBLIC_IP>

Then verify the cluster:

kubectl get nodes
kubectl get pods -A

All nodes should be in Ready state.


Re-run Bootstrap Only (infrastructure already exists)

If the AWS infrastructure is already provisioned and you only want to re-run Ansible:

# curl install
bash ~/RKE2-Kubernetes-HA-AWS/start.sh

# manual install (from repo root)
bash start.sh

Teardown

Quick Teardown — single command

curl -fsSL https://raw.githubusercontent.com/Abhiram-Rakesh/RKE2-Kubernetes-HA-AWS/main/shutdown.sh | bash

The script resolves the installation at ~/RKE2-Kubernetes-HA-AWS (where the quick install placed it) and runs terraform destroy -auto-approve against the existing state.

Manual Teardown — step by step

Step 1 — Destroy all AWS infrastructure

cd terraform
terraform destroy

Terraform will prompt for confirmation before destroying. Type yes to proceed.

Step 2 — Clean up local state (optional)

cd ..
rm -f inventory/inventory.json terraform/ssh_key.pem terraform/ssh_key.pem.pub

Warning: terraform destroy is irreversible. All EC2 instances, networking components, and cluster data will be permanently deleted.


Troubleshooting

Terraform Apply Fails

  • Verify AWS credentials: aws sts get-caller-identity
  • Check you have the required IAM permissions
  • Verify the region has sufficient EC2 capacity for t3.medium

Ansible Cannot Connect

  • Confirm the bastion is reachable: ssh -i terraform/ssh_key.pem ubuntu@<BASTION_IP>
  • Ensure terraform/ssh_key.pem has correct permissions: chmod 600 terraform/ssh_key.pem
  • Check inventory/inventory.json exists and contains valid IPs

Kubernetes API Not Reachable

  • Verify NGINX is running on the bastion: sudo systemctl status nginx
  • Check that the control plane security group allows ports 6443 and 9345 from the bastion
  • Review RKE2 server logs: sudo journalctl -u rke2-server -f

Node Stuck in NotReady

  • Check kubelet/agent logs on the affected node: sudo journalctl -u rke2-agent -f (workers) or sudo journalctl -u rke2-server -f (control planes)
  • Ensure the node can reach the LB on port 9345

Recap

This project demonstrates a production-aligned Kubernetes deployment using two of the most widely used DevOps tools:

  • Terraform manages all cloud infrastructure declaratively
  • Ansible handles configuration management with roles, Jinja2 templates, idempotent tasks, and proper handlers

The combination of a Terraform-generated dynamic inventory and Ansible ProxyCommand SSH tunneling reflects real-world patterns used in enterprise Kubernetes deployments.
