Prerequisites

If you’re using our workspace, the following prerequisites are already available on your local machine:

  • AWS CLI, kOps, yq, and more

Additionally, your AWS credentials were already configured during the workspace preparation.
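
To verify that the credentials work, you can ask AWS who you are:

aws sts get-caller-identity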

Creating the Cluster and Instance Groups

Whenever the workspace starts, we set some variables for convenience: the up-to-date list of availability zones for your AWS_DEFAULT_REGION, and the AWS_ACCOUNT, both populated by the postStartCommand in .devcontainer.json. To make sure the variables are available in your current shell, run:

source ~/.zshrc
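
To confirm they are available, you can print the variables used throughout this section:

echo "${AWS_ACCOUNT}"
echo "${AWS_REGION_AZS}"
echo "${CLUSTER_NAME}"
echo "${KOPS_STATE_STORE}"
echo "${KOPS_OIDC_STORE}"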

With the environment set up, let’s create a cluster definition using the kOps CLI. Karpenter support in kOps is gated behind a feature flag, so we export KOPS_FEATURE_FLAGS first:

export KOPS_FEATURE_FLAGS="Karpenter"
kops create cluster \
  --name=${CLUSTER_NAME} \
  --state=${KOPS_STATE_STORE} \
  --cloud=aws \
  --control-plane-size=t3.large \
  --control-plane-count=3 \
  --control-plane-zones=${AWS_REGION_AZS} \
  --zones=${AWS_REGION_AZS} \
  --node-size=t3.large \
  --instance-manager=karpenter \
  --discovery-store=s3://${KOPS_OIDC_STORE}/${CLUSTER_NAME}/discovery \
  --node-count=4 \
  --dry-run \
  -o yaml > cluster.yaml

This was only a dry run: nothing was created, and the full cluster definition was written to cluster.yaml instead. Let’s configure some add-ons before creating the cluster for real.
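
Because --dry-run with -o yaml writes a multi-document file (the Cluster spec followed by its InstanceGroups), you can quickly list what was generated:

yq e '.kind + "/" + .metadata.name' cluster.yaml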

Add-ons

The following add-ons are managed by kOps: they are upgraded along with the kOps and Kubernetes lifecycle, and configured based on your cluster spec. Where applicable, kOps considers both the configuration of the add-on itself and any other settings you may have configured. Here are the ones we’ll configure:

Karpenter

Karpenter is a Kubernetes-native capacity manager that directly provisions Nodes and the underlying instances based on Pod requirements. We can activate it by setting the parameters below in cluster.yaml:

spec:
  karpenter:
    enabled: true

Or by passing --instance-manager=karpenter, as we did in the kops create cluster command above. We can also change which instance types get autoscaled. Here we set the base machine type to t3.xlarge and allow a mix of t3.xlarge, t3.2xlarge, m5.xlarge, and c5.xlarge:

yq eval -i 'select(.metadata.name == "nodes").spec.machineType = "t3.xlarge"' cluster.yaml
yq eval -i 'select(.metadata.name == "nodes").spec.mixedInstancesPolicy.instances = ["t3.xlarge", "t3.2xlarge", "m5.xlarge", "c5.xlarge"]' cluster.yaml
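
Nothing has been applied yet, so you can double-check the edits by printing the instance group back out of the file:

yq e 'select(.metadata.name == "nodes") | .spec' cluster.yaml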

Cert-manager

The following configuration enables the kOps-managed cert-manager add-on, which other add-ons (such as the secure Metrics Server below) depend on. Please note that those add-ons might run into errors until cert-manager is deployed.

spec:
  certManager:
    enabled: true
    defaultIssuer: letsencrypt-production

Here are the snippets for updating cluster.yaml. The select(.kind == "Cluster") scopes each edit to the Cluster document, since cluster.yaml also contains the InstanceGroup documents:

yq e 'select(.kind == "Cluster").spec.certManager.enabled = true' -i cluster.yaml
yq e 'select(.kind == "Cluster").spec.certManager.defaultIssuer = "letsencrypt-production"' -i cluster.yaml
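
Note that defaultIssuer only points cert-manager at an issuer by name; the issuer itself still has to exist. Below is a minimal ClusterIssuer sketch you could apply once the cluster is up, assuming an HTTP-01 solver behind an nginx ingress controller; the email is a placeholder:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    # Let's Encrypt production ACME endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com  # placeholder: use a real contact address
    privateKeySecretRef:
      name: letsencrypt-production-account-key
    solvers:
      - http01:
          ingress:
            class: nginx  # assumes an nginx ingress controller is installed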

Metrics Server

The Metrics Server is an important tool for managing and scaling workloads in a Kubernetes cluster. Here are several reasons why it’s commonly used:

  • Resource Metrics Pipeline: Metrics Server is the core of the Kubernetes resource metrics pipeline, the primary avenue through which CPU and memory usage data for nodes and pods is exposed (via the metrics.k8s.io API) to autoscalers and other system components.

  • Auto-Scaling: The Metrics Server is a prerequisite for the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), which automatically scale the number of pods or their resource requests based on observed metrics. Without Metrics Server, these autoscalers will not work (a minimal HPA sketch follows after this list).

  • Node Resource Management: Metrics Server surfaces the current resource usage of nodes and pods, giving operators and cluster components visibility into nodes that are running out of resources.

  • Visibility and Monitoring: While Metrics Server itself doesn’t store metrics data, it’s used as an in-cluster API for fetching the latest relevant metrics, which can then be displayed in CLI tools like kubectl top or GUI dashboards.

  • Cluster Health: It provides necessary data to observe and ensure the health of applications running on the cluster and the cluster itself.

Overall, the Metrics Server provides valuable insights into how workloads are performing in a cluster, and it’s fundamental to the effective scaling and management of applications in Kubernetes.
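
To make the autoscaling point above concrete, here is a minimal HorizontalPodAutoscaler sketch that depends on Metrics Server for its CPU readings; the Deployment name my-app is a placeholder:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app  # placeholder workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% average CPU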

You can enable the Metrics Server by editing your cluster configuration. Note that running it in secure mode (insecure: false) relies on cert-manager, which we enabled above:

yq e 'select(.kind == "Cluster").spec.metricsServer.enabled = true' -i cluster.yaml
yq e 'select(.kind == "Cluster").spec.metricsServer.insecure = false' -i cluster.yaml
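
Once the cluster is deployed, you can confirm Metrics Server is serving data through the resource metrics API:

kubectl top nodes
kubectl top pods -A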

Node Local DNS Cache

Node Local DNS Cache improves cluster DNS performance by running a DNS caching agent on every node as a DaemonSet.

yq e -i 'select(.kind == "Cluster").spec.kubeDNS.provider = "CoreDNS" | select(.kind == "Cluster").spec.kubeDNS.nodeLocalDNS.enabled = true | select(.kind == "Cluster").spec.kubeDNS.nodeLocalDNS.memoryRequest = "5Mi" | select(.kind == "Cluster").spec.kubeDNS.nodeLocalDNS.cpuRequest = "25m"' cluster.yaml
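
After the cluster is deployed, the cache should show up as a DaemonSet in kube-system; the name node-local-dns below is the upstream default and may differ between kOps versions:

kubectl get daemonset -n kube-system node-local-dns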

Node Termination Handler

Node Termination Handler ensures that the Kubernetes control plane responds appropriately to events that can make your EC2 instances unavailable, such as EC2 maintenance events, EC2 Spot interruptions, and EC2 instance rebalance recommendations. If these events are not handled, your application may not stop gracefully, may take longer to recover full availability, or may accidentally schedule work onto nodes that are going down. Here is what a default config looks like:

spec:
  nodeTerminationHandler:
    cpuRequest: 200m
    enabled: true
    enableRebalanceMonitoring: false
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    prometheusEnable: true

Let’s apply this configuration by editing cluster.yaml:

yq e -i 'select(.kind == "Cluster").spec.nodeTerminationHandler.cpuRequest = "200m" | select(.kind == "Cluster").spec.nodeTerminationHandler.enabled = true | select(.kind == "Cluster").spec.nodeTerminationHandler.enableRebalanceMonitoring = false | select(.kind == "Cluster").spec.nodeTerminationHandler.enableSQSTerminationDraining = true | select(.kind == "Cluster").spec.nodeTerminationHandler.managedASGTag = "aws-node-termination-handler/managed" | select(.kind == "Cluster").spec.nodeTerminationHandler.prometheusEnable = true' cluster.yaml
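
With enableSQSTerminationDraining: true, the handler runs in queue-processor mode as a single Deployment rather than a DaemonSet. Once the cluster is up, you can check it with the command below (the deployment name is the upstream default):

kubectl get deployment -n kube-system aws-node-termination-handler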

Node Problem Detector

Node Problem Detector (NPD) makes various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the API server.

spec:
  nodeProblemDetector:
    enabled: true
    memoryRequest: 32Mi
    cpuRequest: 10m

yq e -i 'select(.kind == "Cluster").spec.nodeProblemDetector.enabled = true | select(.kind == "Cluster").spec.nodeProblemDetector.memoryRequest = "32Mi" | select(.kind == "Cluster").spec.nodeProblemDetector.cpuRequest = "10m"' cluster.yaml
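
Once deployed, NPD reports problems as node conditions and events; a quick way to check that it is running and see what it has found (the DaemonSet name assumes the upstream default):

kubectl get daemonset -n kube-system node-problem-detector
kubectl describe nodes | grep -A8 'Conditions:'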

CronJobs

On older Kubernetes versions, the CronJob resource lived in the batch/v2alpha1 API group, which had to be switched on explicitly through the API server’s runtimeConfig; recent versions serve CronJobs from the stable batch/v1 API, so this step is only needed on legacy clusters:

spec:
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"

yq e 'select(.kind == "Cluster").spec.kubeAPIServer.runtimeConfig."batch/v2alpha1" = "true"' -i cluster.yaml
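
After enabling it (and deploying the cluster), you can check which batch API versions the API server actually serves:

kubectl api-versions | grep batch

Before we deploy the whole cluster with all of these add-ons, let’s first look at configuring authentication in the next section.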