
Karpenter: Rethinking Node Autoscaling on Kubernetes

If you run Kubernetes for any length of time, you eventually hit the same problem: pods are pending, there's nowhere to put them, and you need more nodes. The two most common answers to that problem are the Cluster Autoscaler and Karpenter. Both are open-source projects, both solve node-level autoscaling, and both are widely deployed on managed clusters like EKS. The difference is in how they get nodes onto your cluster, and that difference turns out to matter a great deal for cost, speed, and operational simplicity.

How the Cluster Autoscaler works

The Cluster Autoscaler doesn't provision instances itself. It works through Auto Scaling Groups (ASGs); on EKS, each node group is backed by one. When pods can't be scheduled, the Cluster Autoscaler increases the desired count of the appropriate ASG, and the ASG launches another instance.

The catch is in the shape of a node group. A node group is typically pinned to a single instance type, so the ASG behind it can only ever launch more of that one type. When the Cluster Autoscaler scales a node group up, it has no choice in what it gets — it gets another copy of whatever that group was configured with. If your pending pod needs 2 vCPUs and the node group is built on a large general-purpose instance, you'll get the large instance regardless. That mismatch between what the workload needs and what the node group offers is where resources (and money) get wasted. The usual workaround is to maintain many node groups for different instance shapes, which pushes the complexity back onto you.
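To make the "many node groups" workaround concrete, here is a rough sketch of what it can look like when the node groups are declared with eksctl. The tool choice, cluster name, group names, instance types, and sizes are all assumptions for illustration; nothing here is prescribed by the Cluster Autoscaler itself.

```yaml
# Hypothetical eksctl ClusterConfig: one managed node group per instance shape.
# Every name, instance type, and size below is illustrative.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-west-1
managedNodeGroups:
  - name: general-large        # small general-purpose workloads
    instanceType: m5.large
    minSize: 1
    maxSize: 10
  - name: compute-xlarge       # CPU-heavy workloads
    instanceType: c5.xlarge
    minSize: 0
    maxSize: 10
  - name: memory-2xlarge       # memory-heavy workloads
    instanceType: r5.2xlarge
    minSize: 0
    maxSize: 5
```

Each new workload shape means another entry in that list, plus the scheduling rules needed to steer pods to the right group.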

What Karpenter does differently

Karpenter throws out the intermediary entirely. There are no node groups and no Auto Scaling Groups. Instead, Karpenter watches for unschedulable pods, reads their requirements straight from the pod spec, and provisions an instance that actually fits — directly through the cloud provider's API. Because it's not constrained by a pre-baked set of node groups and isn't waiting on an ASG to reconcile a desired count, it can pick the right instance and bring it online faster than the Cluster Autoscaler.

In short: the Cluster Autoscaler answers "how many of this fixed instance type do I need?" while Karpenter answers "what instance, out of everything available, best fits the pods that are pending right now?"
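The "requirements" Karpenter reads are mostly the resource requests you already put on your containers. A minimal sketch, assuming a hypothetical Deployment named api and a placeholder image:

```yaml
# Hypothetical workload: name, image, and request sizes are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
          resources:
            requests:
              cpu: "2"       # these requests drive instance sizing
              memory: 4Gi
```

If these pods end up pending, the sum of their requests, together with any scheduling constraints on the pod, is what Karpenter has to fit onto new capacity.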

The two core CRDs: NodePool and NodeClass

Karpenter introduces two main custom resources that replace the node-group model:

  • NodePool is where you describe the kind of compute you're willing to run: which instance types and families, which availability zones, which CPU architectures, and how large the pool is allowed to grow. It defines the constraints; Karpenter picks the best instance that satisfies them.
  • NodeClass is where the cloud-specific, lower-level configuration lives: networking (subnets and security groups), the AMI / image family the nodes boot from, the instance profile, and similar provider details.

A NodePool references a NodeClass, so the two together give Karpenter everything it needs to launch a node that is both correctly shaped and correctly wired into your environment.

Karpenter does more than scale

It's tempting to think of Karpenter as a faster autoscaler, but the design unlocks a few things node groups never did well:

  • Cost optimization. Because Karpenter evaluates all available instance types rather than a fixed list, it can automatically choose the right-sized, most cost-efficient instance for the workload at hand — including Spot capacity where appropriate.
  • Diverse workloads. The same mechanism that picks a cheap general-purpose instance can pick GPU-backed instances for ML and generative-AI workloads, so specialized compute fits into the same model rather than requiring bespoke node groups.
  • Upgrades and patching. Karpenter participates in node lifecycle management, which makes rolling out new AMIs, patches, and version upgrades less of a manual node-group dance.
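A sketch of how those three points can show up in NodePool manifests. The pool names, the 720h expiry, and the GPU taint are assumptions chosen for illustration; the EC2NodeClass named default is assumed to exist already.

```yaml
# Pool 1: allow Spot as well as on-demand, and recycle nodes periodically.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-first             # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
      expireAfter: 720h        # replace nodes after ~30 days (illustrative value)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
---
# Pool 2: GPU instance categories, tainted so only GPU workloads land there.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu                    # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - g
            - p
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

With both capacity types allowed, Karpenter generally favours Spot when it is available. Pods need a matching toleration before they can be scheduled onto the GPU pool.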

Defining a NodePool

The NodePool spec is where you set the requirements that any node Karpenter provisions must satisfy: sizes, families, generations, CPU architectures, zones, and so on. Karpenter then chooses among the instance types that meet those requirements, prioritizing cost. You can also cap how much compute the pool is allowed to provision via limits, which is a useful guardrail against runaway scaling.

The example below uses operator: NotIn to exclude the categories and sizes you don't want, which leaves Karpenter free to pick from everything else. Note the nodeClassRef at the bottom of the template — that's the link to the NodeClass we define next.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
            - arm64
        - key: karpenter.k8s.aws/instance-category
          operator: NotIn
          values:
            - c
            - m
            - r
            - t
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values:
            - nano
            - micro
            - small
            - medium
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-1a
            - eu-west-1b
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"

Reading this top to bottom: nodes may be either amd64 or arm64; the c, m, r, and t instance categories are off the table; anything medium or smaller is excluded; nodes are restricted to two zones in eu-west-1; the nodeClassRef points at the EC2NodeClass named default; and the pool will never provision more than 100 vCPUs in total.
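Requirements aren't limited to In and NotIn. As a small sketch, the fragment below shows two more entries that could be appended to the requirements list: one that filters by instance generation with Gt, and one that allow-lists specific families. The generation threshold and the family names are illustrative assumptions, not recommendations.

```yaml
# Fragment: extra entries for spec.template.spec.requirements.
# Values are always strings, even for numeric comparisons.
- key: karpenter.k8s.aws/instance-generation
  operator: Gt
  values:
    - "4"                      # only generations newer than 4 (illustrative)
- key: karpenter.k8s.aws/instance-family
  operator: In
  values:
    - m7g                      # illustrative families
    - c7g
```

An allow-list like the second entry is easier to reason about but narrows Karpenter's choices; exclusions with NotIn, as in the example above, keep the candidate set wide.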

Defining the EC2NodeClass

The NodePool decides what shape of instance to launch. The NodeClass decides how that instance is configured and wired into AWS — the details Karpenter needs to actually boot a working node in your VPC. On AWS this resource is the EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

Each field answers a "how do I boot this node correctly" question:

  • role is the IAM role attached to each node's EC2 instance. It grants the instance the permissions it needs (pulling images, talking to the cluster, etc.), and Karpenter builds the node instance profile from it. Alternatively you can supply an existing instanceProfile directly.
  • amiSelectorTerms chooses the operating system image the nodes boot from. Using an alias like al2023@latest pins the image family — here Amazon Linux 2023 — and tells Karpenter to track the latest published AMI for it. You can also select AMIs by ID, name, or tags when you need to lock to a specific build.
  • subnetSelectorTerms tells Karpenter which subnets it may place nodes in, and by extension which availability zones and networking layout the nodes land in. Selecting by tag (rather than hard-coding subnet IDs) means new subnets are discovered automatically as long as they carry the expected tag.
  • securityGroupSelectorTerms attaches the right security groups to each node, again discovered by tag. This is what governs the network traffic allowed to and from the nodes.

The karpenter.sh/discovery tag used in the selectors is a common convention: you tag the subnets and security groups you want Karpenter to use with that key, and the selectors match on it. This keeps the NodeClass declarative and lets your infrastructure evolve without editing the manifest.
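When you do need to lock things down, the same selectors accept explicit IDs instead of tags. A minimal sketch of a pinned variant; the resource name and every ID below are placeholders, not real resources:

```yaml
# Hypothetical pinned EC2NodeClass: all IDs are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: pinned
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - id: ami-0123456789abcdef0        # one specific AMI build
  subnetSelectorTerms:
    - id: subnet-0123456789abcdef0     # one specific subnet
  securityGroupSelectorTerms:
    - id: sg-0123456789abcdef0         # one specific security group
```

The trade-off is the reverse of tag-based discovery: nothing changes underneath you, but every new subnet or AMI means editing the manifest.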

The division of labour is worth restating, because it's the whole point of having two resources. The NodePool is the provisioning policy — the constraints and limits that decide which instance to pick — and is the resource application teams tend to interact with. The NodeClass is the infrastructure binding — the AMI, IAM, subnets, and security groups that are largely a property of the cluster and its VPC. One NodeClass is commonly shared by many NodePools, since the networking and IAM story is usually the same across pools even when their instance requirements differ.

It still plays by Kubernetes' scheduling rules

Karpenter doesn't replace the Kubernetes scheduler or invent its own placement logic — it works alongside the scheduling mechanisms you already use. Node selectors, node affinity, taints and tolerations, pod disruption budgets, and pod topology spread constraints are all respected. That means the constraints you've already expressed on your workloads continue to govern where pods land; Karpenter simply makes sure there's an appropriate node available for them to land on.
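As an illustration, here is a hypothetical workload that uses a few of those mechanisms together. The names, image, replica count, and budget are assumptions; the zone and architecture labels are the standard Kubernetes ones already used in the NodePool above.

```yaml
# Hypothetical Deployment: pinned to arm64 and spread evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.0.0
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
---
# Keep at least 3 replicas running during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web
```

For these pods, Karpenter has to find arm64 capacity in more than one of the allowed zones, and any later node removal has to stay within the disruption budget.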

Consolidation: scaling back down

Provisioning nodes is only half the job. With consolidation enabled, Karpenter continuously looks for opportunities to reduce node count or replace nodes to achieve better bin-packing: fitting pods onto fewer, better-sized instances and removing what's no longer needed. Two policies and one timing setting, all configured on the NodePool, control how aggressive it is:

  • WhenEmpty — the conservative option. Karpenter only removes nodes that have no workload pods left running on them. It will never disrupt a running pod to consolidate, so the trade-off is minimal churn but fewer savings: a node that's only 10% utilized will sit there as long as something is on it.
  • WhenEmptyOrUnderutilized — the cost-optimizing option. In addition to reaping empty nodes, Karpenter will actively reschedule pods off underutilized nodes onto cheaper or more tightly packed alternatives, then remove or replace the now-redundant nodes. This recovers the savings that WhenEmpty leaves on the table, at the cost of more pod movement, so it pairs best with sensible pod disruption budgets.
  • ConsolidateAfter — a duration rather than a mode. It tells Karpenter to wait a set amount of time after the last pod is added to or removed from a node before considering it for consolidation. This is your anti-churn dial: a longer window keeps Karpenter from constantly reshuffling nodes in response to short-lived spikes and lulls, trading a little efficiency for a lot of stability.

In practice, most teams enable WhenEmptyOrUnderutilized for the cost wins and then tune ConsolidateAfter until the amount of node churn feels acceptable for their workloads.
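In the manifest, these settings live in the NodePool's disruption block, as consolidationPolicy and consolidateAfter. A minimal sketch of that starting point; the 5m window is an arbitrary illustrative value to tune, not a recommendation:

```yaml
# Fragment of a NodePool: sits under spec, next to template and limits.
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m       # wait 5 minutes after the last pod change (illustrative)
```

Lengthening consolidateAfter is the first thing to try if nodes are being replaced more often than your workloads tolerate.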

Wrapping up

The Cluster Autoscaler and Karpenter aim at the same target, but Karpenter gets there by removing the node-group and ASG layer rather than managing it. The result is faster, right-sized, cost-aware node provisioning driven directly by what your pods actually need — plus consolidation to claw back the capacity you stop using. For teams already wrestling with a sprawl of node groups, that shift in model is usually the whole reason to make the switch.
