Here we need to update our cluster and instance groups, adding a lifecycle label to the on-demand instances in each availability zone. Let's start by checking the instance groups:
kops get instancegroups --state s3://eu-north-1-training-dx-book-kops-state
NAME                ROLE    MACHINETYPE  MIN  MAX  ZONES
master-eu-north-1a  Master  t3.large     1    1    eu-north-1a
master-eu-north-1b  Master  t3.large     1    1    eu-north-1b
master-eu-north-1c  Master  t3.large     1    1    eu-north-1c
nodes-eu-north-1a   Node    t3.large     1    1    eu-north-1a
nodes-eu-north-1b   Node    t3.large     1    1    eu-north-1b
nodes-eu-north-1c   Node    t3.large     1    1    eu-north-1c
And investigate one in detail:
kops get instancegroups nodes-eu-north-1a --state s3://eu-north-1-training-dx-book-kops-state -o yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: eu-north-1.training.dx-book.com
  name: nodes-eu-north-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230616
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3.large
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-north-1a
Here is a script that fetches the current instance groups from our kOps S3 state, writes each one to a YAML file, and patches the node groups. Copy and paste it into the terminal of your preference:
state="s3://eu-north-1-training-dx-book-kops-state"
for ig in $(kops get ig --state "$state" -o json | jq -r '.[].metadata.name'); do
  kops get ig "$ig" --state "$state" -o yaml > "$ig.yaml"
  # Patch only the worker groups (names starting with nodes*)
  if [[ $ig = nodes* ]]; then
    yq eval '.spec.nodeLabels += {"kops.k8s.io/lifecycle": "OnDemand"}' -i "$ig.yaml"
    yq eval '.spec.minSize = 1' -i "$ig.yaml"
    yq eval '.spec.maxSize = 3' -i "$ig.yaml"
    yq eval '.spec.manager = "Karpenter"' -i "$ig.yaml"
    kops replace --state "$state" -f "$ig.yaml"
  fi
done
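Before running the loop, it can help to see what a patched manifest should contain. The sanity check below uses a stub file (an assumption standing in for one of the real `<instance-group>.yaml` files the loop writes) and confirms every field the yq edits set is present:

```shell
# Stub manifest standing in for a generated nodes-*.yaml file (illustrative;
# the real files are written per instance group by the loop above).
cat > nodes-example.yaml <<'EOF'
spec:
  manager: Karpenter
  maxSize: 3
  minSize: 1
  nodeLabels:
    kops.k8s.io/lifecycle: OnDemand
EOF

# Confirm every field the yq edits should have set is present.
for field in minSize maxSize manager lifecycle; do
  grep -q "$field" nodes-example.yaml && echo "$field: ok"
done
```

Running the same grep loop against the real files before `kops replace` catches a typo'd yq expression early.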
Let's now validate the cluster and confirm that maxSize was increased:
kops validate cluster --wait 10m --state s3://eu-north-1-training-dx-book-kops-state
Spot Instances
Because Spot Instances can be reclaimed by AWS at any time, we also activate the node termination handler in the cluster spec:
spec:
  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
Now we need to create a new instance group with the desired node counts (minimum and maximum) and a desired instance type, which in this case will be m5.large.
The following examples find instance types similar to m5.large and t3.medium:
state="s3://eu-north-1-training-dx-book-kops-state"
kops toolbox instance-selector "spot-group-base-m5-large" \
  --usage-class spot --cluster-autoscaler \
  --base-instance-type "m5.large" --burst-support=false \
  --deny-list '^?[1-3].*\..*' --gpus 0 \
  --node-count-max 3 --node-count-min 1 \
  --name ${NAME} --state ${state}
Here is another example, using t3.medium as the base for more cost-efficient autoscaling:
kops toolbox instance-selector "spot-group-base-t3-medium" \
  --usage-class spot --cluster-autoscaler \
  --base-instance-type "t3.medium" --burst-support=false \
  --deny-list '^?[1-3].*\..*' --gpus 0 \
  --node-count-max 2 --node-count-min 1 \
  --name ${NAME} --state ${state}
These commands will return instance types that match the base instance type, but have different specs. This is helpful when looking for Spot Instances, which are available at up to a 90% discount compared to On-Demand Instances.
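The instance-selector writes a new InstanceGroup into the state store. The result looks roughly like the following sketch; the exact `instances` list is an assumption here, since it depends on what the selector discovers in your region:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: eu-north-1.training.dx-book.com
  name: spot-group-base-m5-large
spec:
  machineType: m5.large
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    # Similar instance types discovered by the selector (illustrative list)
    instances:
    - m5.large
    - m5a.large
    - m5d.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  role: Node
  subnets:
  - eu-north-1a
  - eu-north-1b
  - eu-north-1c
```

The mixedInstancesPolicy is what lets the Auto Scaling group pull capacity from several Spot pools instead of depending on a single instance type.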
Let’s see again our instance groups now:
kops get instancegroups --state s3://eu-north-1-training-dx-book-kops-state
NAME                      ROLE  MACHINETYPE  MIN  MAX  ZONES
spot-group-base-m5-large  Node  m5.large     1    3    eu-north-1a,eu-north-1b,eu-north-1c
Let’s now validate our cluster and see what’s the current state after the instance groups modifications:
kops validate cluster --wait 10m --state s3://eu-north-1-training-dx-book-kops-state
VALIDATION ERRORS
KIND           NAME                      MESSAGE
InstanceGroup  spot-group-base-m5-large  InstanceGroup "spot-group-base-m5-large" is missing from the cloud provider
Time to roll out the modifications to the cloud:
kops update cluster --state s3://eu-north-1-training-dx-book-kops-state --yes
kops rolling-update cluster --state s3://eu-north-1-training-dx-book-kops-state --yes
While kOps updates the cluster and launches new EC2 instances, we can watch progress using the AWS CLI, querying PublicDnsName, State.Name, and LaunchTime:
aws ec2 describe-instances --query 'Reservations[*].Instances[*].[PublicDnsName, State.Name, LaunchTime]' --output text --region eu-north-1
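To surface the most recently launched instances first, the text output can be piped through sort on the LaunchTime column. Shown here on stub output (an assumption, since the real command needs live EC2 instances):

```shell
# Stub of the describe-instances text output (PublicDnsName, state, LaunchTime);
# sort -r -k3 orders by the third column, so the newest launch comes first.
cat <<'EOF' | sort -r -k3
ec2-13-48-1-10.eu-north-1.compute.amazonaws.com running 2023-07-01T10:00:00+00:00
ec2-13-48-1-12.eu-north-1.compute.amazonaws.com pending 2023-07-01T12:45:00+00:00
ec2-13-48-1-11.eu-north-1.compute.amazonaws.com running 2023-07-01T12:30:00+00:00
EOF
```

The ISO 8601 timestamps sort correctly as plain strings, which is what makes this one-liner work.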
And also watch the nodes:
kubectl get nodes -w
NAME                  STATUS  ROLES             AGE    VERSION
i-089e8c4a860dcd203   Ready   node,spot-worker  4m14s  v1.25.11
i-014ec53c7396ae56b   Ready   node              3m11s  v1.25.11
i-04fa865118cfd5758   Ready   node              45h    v1.25.11
i-077af190f94301180   Ready   node              45h    v1.25.11
i-06e72faf0b0c2e6c4   Ready   control-plane     45h    v1.25.11
i-0a0cf6a563bc29a5e   Ready   control-plane     45h    v1.25.11
i-0d4cc13d84c394245   Ready   control-plane     45h    v1.25.11
Now we have a new spot instance group, labeled with the spot-worker role and ready to auto scale.
Load test
And buckle up, we're going to do some load testing now. We'll create a dummy CPU-burning deployment and then scale it up to exercise the autoscaler, using kubectl.
First, let's create a deployment based on the polinux/stress image, requesting 500m of CPU per replica so the scheduler has to find real capacity:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress
spec:
  replicas: 10
  selector:
    matchLabels:
      app: cpu-stress
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      containers:
      - name: cpu-stress
        image: polinux/stress
        command: ["/bin/sh"]
        args: ["-c", "stress --cpu 1 --vm 1"]
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "500m"
            memory: "500Mi"
EOF
This creates a deployment named cpu-stress running the polinux/stress image with 10 replicas. Next, we can scale it to 20 replicas using the kubectl scale command:
kubectl scale deployment cpu-stress --replicas=20
After running these commands, you should have 20 cpu-stress pods. Kubernetes will start creating them, and this should trigger the autoscaler if the current nodes don't have enough capacity to run all the pods.
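As a back-of-the-envelope check on why this triggers a scale-up (assuming each stress pod requests roughly 500m of CPU and the workers are t3.large nodes with 2 vCPUs each):

```shell
# 20 replicas x 500m CPU = 10 full cores of requests; a t3.large exposes
# 2 vCPUs, so at least 5 nodes' worth of CPU is needed, before counting
# system pods and kubelet overhead.
replicas=20
cpu_per_pod_m=500          # millicores requested per pod (assumption)
vcpus_per_node=2           # t3.large
total_m=$((replicas * cpu_per_pod_m))
nodes_needed=$(( (total_m + vcpus_per_node * 1000 - 1) / (vcpus_per_node * 1000) ))
echo "$nodes_needed"       # prints 5
```

Since the node groups were capped at maxSize 3, a good share of these pods will sit Pending until the spot group scales out to absorb them.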
You have now successfully created a Kubernetes cluster using kOps on AWS!
Remember to periodically check your cluster's health with the kops validate cluster command. If you need to change the cluster, use the kops edit command. If you need to delete the cluster, use the kops delete cluster --name ${NAME} command.
Please note that costs will accrue for as long as the cluster is running. Be sure to clean up and delete any resources you no longer need to avoid unexpected charges.
Activating the Metrics Server
The Metrics Server is an important tool for managing and scaling workloads in a Kubernetes cluster. Here are several reasons why it’s commonly used:
Resource Metrics Pipeline: Metrics Server is a critical part of the resource metrics pipeline in Kubernetes, which is the primary avenue through which CPU and memory usage data (for nodes and pods) gets exposed to the Kubernetes scheduler and other system components.
Auto-Scaling: The Metrics Server is a prerequisite for the Kubernetes Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), which automatically scales the number of pods or resources based on observed metrics. Without Metrics Server, these autoscaling features will not work.
Node Resource Management: Metrics Server helps the Kubernetes scheduler in making better decisions while scheduling pods. It provides current resource usage metrics of nodes and pods, enabling the scheduler to avoid placing pods on nodes that are running out of resources.
Visibility and Monitoring: While Metrics Server itself doesn’t store metrics data, it’s used as an in-cluster API for fetching the latest relevant metrics, which can then be displayed in CLI tools like kubectl top or GUI dashboards.
Cluster Health: It provides necessary data to observe and ensure the health of applications running on the cluster and the cluster itself.
Overall, the Metrics Server provides valuable insights into how workloads are performing in a cluster, and it’s fundamental to the effective scaling and management of applications in Kubernetes.
You can enable the Metrics Server by editing your cluster configuration.
state=s3://eu-north-1-training-dx-book-kops-state
kops get cluster --state $state -o yaml > cluster.yaml
yq e '.spec.metricsServer.enabled = true' -i cluster.yaml
yq e '.spec.metricsServer.insecure = true' -i cluster.yaml
kops replace -f cluster.yaml --state $state
kops update cluster --state $state --yes
kops rolling-update cluster --state $state --yes
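With the Metrics Server running, a HorizontalPodAutoscaler can drive a deployment such as the earlier load-test workload from observed CPU usage. A minimal sketch (the target name and the 70% threshold are illustrative choices, not values from this setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-stress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

The HPA adjusts replica counts, while the cluster autoscaler (or Karpenter) adds nodes when those replicas no longer fit: the two mechanisms work in tandem.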
Next Steps
Now that you have a Kubernetes cluster running, you can install workloads, explore other AWS services, or set up a CI/CD pipeline.
Please see the official kOps documentation for more information and usage examples.
Node Termination Handler
The termination handler supports additional tuning in cluster.yaml; apply these with the same get/replace/update cycle used for the Metrics Server:
yq e '.spec.nodeTerminationHandler.cpuRequest = "200m"' -i cluster.yaml
yq e '.spec.nodeTerminationHandler.enabled = true' -i cluster.yaml
yq e '.spec.nodeTerminationHandler.enableRebalanceMonitoring = true' -i cluster.yaml
yq e '.spec.nodeTerminationHandler.enableSQSTerminationDraining = true' -i cluster.yaml
yq e '.spec.nodeTerminationHandler.managedASGTag = "aws-node-termination-handler/managed"' -i cluster.yaml
yq e '.spec.nodeTerminationHandler.prometheusEnable = true' -i cluster.yaml
Enabling Karpenter
export KOPS_FEATURE_FLAGS="Karpenter"
state=s3://eu-north-1-training-dx-book-kops-state
kops get cluster --state $state -o yaml > cluster.yaml
yq e '.spec.karpenter.enabled = true' -i cluster.yaml
kops replace -f cluster.yaml --state $state
kops update cluster --state $state --yes
kops rolling-update cluster --state $state --yes
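Once the feature flag and `spec.karpenter.enabled` are in place, any instance group carrying `spec.manager: Karpenter` (as set by the earlier patching script) is provisioned by Karpenter instead of an Auto Scaling group. A minimal sketch of such a group:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: eu-north-1.training.dx-book.com
  name: nodes-eu-north-1a
spec:
  manager: Karpenter
  machineType: t3.large
  maxSize: 3
  minSize: 1
  role: Node
  subnets:
  - eu-north-1a
```

Groups without the manager field keep their existing Auto Scaling group behavior, so the two provisioning models can coexist in one cluster.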