
Deploying Java applications on Arm64 with Kubernetes

In the first part of this two-part series on tuning Java applications for Ampere®-powered cloud instances, we concentrated on tuning your Java environment for cloud applications, including picking the right Java version, tuning your default heap and garbage collector, and some options that enable your application to take advantage of underlying Arm64 features. In this article, we look more closely at the operating system and Kubernetes configuration. In particular, we take a deep dive into container awareness in recent versions of Java, how to restrict the system resources made available to Java containers, and some common Linux configuration options to optimize your system for specific workloads. Much of the advice related to operating system tuning and workload placement applies to all workloads, not just JVM workloads, but since our focus is on deploying Java applications on Arm64 to Kubernetes, that is the use case we address here.

Resource Allocation in Kubernetes

In this section, we’ll step outside the JVM and look at the infrastructure layer. Understanding how Kubernetes allocates resources, and how your Java application perceives those allocations, is fundamental to ensuring that you allocate the right amount of resources to your JVM.

In Kubernetes, container templates can include resource “requests” for initial scheduling, and “limits” that are enforced by Kubernetes for CPU and memory. Since Java 11, the JVM is container aware by default, meaning that resource limits imposed on containers by the container orchestration are reflected in what the JVM sees as available resources. However, there are some nuances that can lead to suboptimal resource usage if you are not careful.
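A quick way to see container awareness in action is a small program that prints what the runtime reports. Run it on the host and again inside a resource-limited container, and compare the results — a minimal sketch:

```java
// Minimal sketch: print the resources the JVM detects. Inside a
// container-aware JVM, these reflect the cgroup limits imposed by
// Kubernetes rather than the host's totals.
public class ContainerResources {
    static int detectedProcessors() {
        // Reflects the container CPU limit (or -XX:ActiveProcessorCount)
        return Runtime.getRuntime().availableProcessors();
    }

    static long detectedMaxHeapBytes() {
        // Reflects the heap sized from the container memory limit
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("Available processors: " + detectedProcessors());
        System.out.println("Max heap (MiB): " + detectedMaxHeapBytes() / (1024 * 1024));
    }
}
```

In a container limited to 2000 millicores and 4GB, a container-aware JVM will report two processors and a heap derived from the 4GB limit, not the node's 16 cores and 64GB.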

Let’s look at an example. We will assume that we are running applications in a Kubernetes environment where the compute nodes for the cluster are 16-core VMs with 64GB of memory. When we allocate 2000 millicores and 4GB of memory to our container, the JVM process running in that container will report that it has two available processors and 4GB of available memory. However, the Kubernetes scheduler may be giving our container time slices of CPU across all 16 cores in the VM (see the CPU pinning section below for more details). If your container tries to use more than its resource allotment, the scheduler throttles the workload: the kernel on the compute node pauses processes that are running in the container context, regardless of whether there are idle CPU cores available. In this scenario, your Java application may spend as little as 12.5% of the time running (2/16ths of the total CPU for the compute node) and 87.5% waiting for CPU time. In addition, if the process is throttled during a garbage collection run, this may completely block your Java process across multiple scheduling windows and result in very high latency for your application threads.

A recent study by Akamas and Microsoft found that over 50% of all Java applications running on Kubernetes have a resource limit of less than 2 CPUs and a heap allocation of less than 1 gigabyte. The same report compared a reference Java application running in two pods, each with 3 CPUs allocated, to the same application running across six pods with a resource limit of only 1 CPU. They found that this can result in a massive increase in tail latency and a decrease in throughput. From a practical perspective, this means that you should almost always be choosing a multi-threaded garbage collector such as G1GC for your cloud workloads, and allocating multiple cores to it, rather than running multiple copies of your application in smaller containers with a quota of 1 CPU or less.

With older versions of Java, a JVM running in a container would report that it had all the CPU cores and memory available on the host when making decisions about resource allocations like garbage collector threads or heap sizes, regardless of how many other applications were running on the host. This could result in unanticipated application throttling or out of memory errors. We recommend running a more recent version of Java to take advantage of all the new features that have been added in the past 10 years, in addition to the many performance enhancements for Arm64 servers. If you are running an older JVM, you can turn on early container awareness in versions of Java before Java 11 with -XX:+UseContainerSupport; this enables cgroup awareness, but the functionality is much more mature in more recent versions of Java.

You can see what resources the JVM believes it has available to it by running:

java -XX:+PrintFlagsFinal -version | grep -E 'RAM|HeapSize|Container|Processor'

This command prints the type, name, and value of the relevant configuration options calculated for the JVM, and indicates whether they were detected, set as defaults by JVM ergonomics, or set explicitly by the user on the command line. In particular, the value of ActiveProcessorCount indicates how many CPU cores the JVM believes it has available, which dictates both the default choice of garbage collector and how many threads will be used for it. UseContainerSupport tells you whether your JVM is container aware. MaxRAM, MaxHeapSize, InitialHeapSize, InitialRAMPercentage, and MinRAMPercentage all reflect the detected amount of RAM available, and how much of it will be allocated to the heap in the JVM.
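The output will look something like the following. The exact values, types, and annotations vary by JDK version and environment, so treat these lines as illustrative only; an ActiveProcessorCount of -1 means the JVM derives the count from the detected container limits rather than an explicit setting.

```
   int ActiveProcessorCount     = -1           {product} {default}
  bool UseContainerSupport      = true         {product} {default}
size_t MaxHeapSize              = 1073741824   {product} {ergonomic}
size_t InitialHeapSize          = 67108864     {product} {ergonomic}
```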

For predictable performance, we recommend setting requests equal to limits for both CPU and memory, and explicitly setting the number of CPU cores used by your JVM with the -XX:ActiveProcessorCount=<n> option. We also recommend using the MaxRAMPercentage and InitialRAMPercentage flags, which, unlike the -Xms and -Xmx options, are container aware, to set the initial and maximum heap size for your application to 80-85% of the available RAM.

Pulling this all together, the Kubernetes yaml for your application workload might look like this:

apiVersion: apps/v1
kind: Deployment
spec:
  # The 'template' field specifies which containers to run in this
  # deployment, and defines workload placement and quota options for
  # containers of this type.
  template:
    spec:
      containers:
      - name: app
        image: your-java-app:latest
        env:
        - name: JAVA_OPTS
          # Use '|' or '>' for multiline strings in YAML
          value: >
            -XX:+UseG1GC
            -XX:+UseContainerSupport
            -XX:MaxRAMPercentage=80.0
            -XX:InitialRAMPercentage=80.0
            -XX:ActiveProcessorCount=2
            -XX:ParallelGCThreads=2
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"

Workload Placement in Kubernetes

Once resource allocation is under control, the next lever to consider is where your workload runs in the cluster. Kubernetes gives you several mechanisms to influence workload placement without hard-coding node identities, allowing you to express capabilities rather than locations.

At a high level, workload placement is about matching application requirements with node characteristics. For Java applications, this often means paying attention to CPU architecture, memory configuration, kernel settings, or other host-level properties that materially affect runtime behavior.

Node Labels and Affinity

Kubernetes allows nodes to be annotated with arbitrary key–value labels. These labels can then be referenced by pods using nodeSelector or the more expressive nodeAffinity rules.

For example, nodes might be labeled to reflect:

  • CPU architecture or microarchitecture
  • NUMA topology
  • Kernel configuration
  • Kernel page size (4K vs 64K)
  • Performance profile applied to the host using tuned

A simple example label might look like:

kubectl label nodes node1 disktype=ssd

Pods can then express a preference or requirement for nodes with that capability:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - "ssd"

This allows Kubernetes to place workloads on nodes that are known to be configured in a way that benefits them, without embedding node names or relying on manual scheduling.

Page Size as a Placement Signal

One concrete example where node-level configuration can matter is kernel page size. On Arm64 systems, Linux supports base kernel page sizes of 4K (the default), 16K, and 64K. On x86 systems, the base kernel page size is fixed at 4K.

Larger page sizes can reduce TLB pressure and improve performance for applications with large heaps or memory footprints — including many JVM-based workloads. When adding new compute nodes to your Kubernetes cluster, it is possible to provision nodes with different kernel page sizes and label them accordingly, allowing you to steer suitable workloads to those nodes.
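You can check a node's base page size with getconf, which is one way to decide which page-size label to apply when a node joins the cluster:

```shell
# Print the base kernel page size in bytes:
# 4096 on a 4K-page kernel, 65536 on a 64K-page kernel.
getconf PAGESIZE
```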

However, this is an area where managed Kubernetes services impose real constraints. On hosted Kubernetes platforms such as OKE, the changes that you can make to kernel configuration on managed node pools are limited. In those environments, workload placement must focus on higher-level characteristics such as instance type, CPU count, or memory capacity.

Required vs Preferred Placement

Workload placement in Kubernetes can happen both during initial scheduling and during the regular operation of an application – for example, when a compute node is put into maintenance mode or experiences CPU or memory pressure, workloads can be moved from one node to another. Kubernetes uses node affinity rules to determine on which nodes workloads may be scheduled during these events.

Kubernetes lets you choose between:

  • Hard requirements (requiredDuringSchedulingIgnoredDuringExecution)
  • Soft preferences (preferredDuringSchedulingIgnoredDuringExecution)

For characteristics that are essential to correct behavior, hard requirements make sense. For performance optimizations, such as preferring nodes with larger page sizes or specific tuning profiles, it’s often better to express a preference, allowing the scheduler to fall back gracefully if ideal nodes are temporarily unavailable.
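As a sketch, a soft preference for 64K-page-size nodes might look like this. The kernel-pagesize label key and its values are assumptions for illustration, not standard Kubernetes labels; they must match whatever labels you applied to your nodes:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: kernel-pagesize
          operator: In
          values:
          - "64k"
```

Pods with this preference are steered toward 64K-page nodes when they are available, but can still be scheduled elsewhere when they are not.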

This distinction is particularly important for Java services in elastic environments, where strict placement rules can reduce scheduling flexibility and impact availability. From a practical standpoint:

  • Use node labels to describe capabilities, not roles
  • Prefer affinity rules over hard-coded node names
  • Treat advanced host characteristics (like page size) as opt-in optimizations
  • Accept that managed Kubernetes services limit how far down the stack you can tune

Workload placement won’t magically fix an under-provisioned or poorly tuned JVM, but when combined with explicit resource allocation, it ensures your application is running on infrastructure that can actually deliver the performance you expect.

Dedicating Resources to Containers: CPU Core Pinning

Up to this point, we’ve focused on where a workload runs in the cluster and how much CPU and memory it is allowed to consume. The next layer down is how that CPU time is delivered, and whether execution is free to move across all the cores on a compute node or is restricted to a fixed set of physical CPUs.

By default, Kubernetes enforces CPU limits using time-based quotas, not by assigning specific cores. CPU pinning changes this model by trading scheduling flexibility for stronger isolation and predictability.

In the default Kubernetes configuration, CPU limits are enforced using Linux cgroups and the Completely Fair Scheduler (CFS). When a container is limited to, for example, two CPUs, it is allowed to consume two cores’ worth of CPU time, but its threads may be scheduled on any available core on the node.
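For a container limited to two CPUs, the CFS quota is visible from inside the container. The path below assumes cgroup v2, and the values shown in the comment are illustrative:

```shell
# cgroup v2: prints "<quota> <period>" in microseconds.
# "200000 100000" means 200ms of CPU time per 100ms period,
# i.e. two CPUs' worth of time, deliverable on any of the node's cores.
cat /sys/fs/cgroup/cpu.max
```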

As we saw in the section on resource allocation, this means that threads can migrate between cores during execution. As a result, cache locality is not guaranteed, performance can be impacted by expensive context switches, and CPU time may be throttled even when idle cores exist.

CPU pinning replaces this temporal isolation model with spatial isolation. Instead of allowing a container to run anywhere for a limited amount of time, Kubernetes assigns it exclusive access to a fixed set of physical cores.

How CPU Pinning Works in Kubernetes

CPU pinning is enabled through the combination of:

  • Guaranteed quality of service
  • Static CPU Manager policy

The CPU Manager capability allows administrators to reserve CPU cores exclusively for a container. To be eligible for CPU pinning, a pod must have a CPU request and limit that are identical and an integer number of CPUs; fractional CPU requests (for example, 1500 millicores) are not eligible for pinning. In the resource specification for a pod (or ReplicaSet/Deployment), this looks like the following:

resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

On the node, the kubelet must be configured with:

--cpu-manager-policy=static

With this policy enabled, Kubernetes assigns exclusive cores to eligible pods and removes those cores from the shared scheduling pool. These cores are then exposed to the container via a CPU set.

Inside the container, this is visible as:

cat /sys/fs/cgroup/cpuset.cpus

The container and all processes inside it may execute only on those cores.

Interaction with CPU Quotas

Pinned containers still respect CPU limits, but the enforcement model changes. For pinned workloads, CPU throttling largely disappears. The container can use 100% of its assigned cores, and CPU usage is bounded by which cores it owns, not by time slices. This distinction matters for performance-sensitive workloads. When CPU time is delivered consistently on the same cores, applications benefit from:

  • Improved cache locality
  • Reduced scheduler migration and context switches
  • More stable latency under load

For JVM-based workloads, this also means that the JVM’s view of available processors aligns closely with the physical execution model, making GC and JIT heuristics more predictable.

CPU pinning is not a universal recommendation. It makes the most sense when workloads are latency-sensitive, and CPU usage is sustained and predictable – otherwise you are reserving resources that could be used by other tasks. For highly elastic services that scale up or down based on demand, CPU pinning can reduce scheduling flexibility and make capacity management harder.

Compute Node Tuning with Tuned

So far, we’ve focused on how Kubernetes allocates resources and places workloads. The final lever to consider lives one layer below the scheduler: how the operating system on the compute node itself is configured. Even with identical hardware and identical container resource limits, differences in host tuning can materially affect latency, throughput, and predictability.

On Linux, the easiest way to manage host-level performance tuning is the tuned service, administered with the tuned-adm tool. tuned provides a set of predefined profiles out of the box that adjust kernel and system settings (such as CPU frequency scaling, scheduler behavior, power management, and I/O parameters) to favor different workload characteristics. Some of the profiles that are well suited to Kubernetes compute nodes are:

  • throughput-performance: Optimizes for maximum sustained throughput. This profile typically disables aggressive power-saving features, keeps CPUs at higher frequencies, and favors throughput-oriented scheduler behavior.
  • latency-performance: Prioritizes reduced scheduling latency and jitter. This profile often disables CPU power management features and may adjust scheduler parameters to reduce wake-up latency.
  • balanced: A compromise between power efficiency and performance that offers the best experience on desktops and laptops. This is usually the default profile on Linux distributions, including many cloud images, but it is not the best match for performance-sensitive workloads.

For most production Kubernetes clusters running performance-sensitive Java services, throughput-performance is a reasonable default. For latency-sensitive workloads, such as RPC-heavy services or event-driven systems, latency-performance may be a better fit, particularly when combined with CPU pinning and explicit resource allocation.
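Assuming the tuned package is installed, switching and verifying profiles is done with tuned-adm. These commands are a sketch of the workflow and require root privileges on the node:

```shell
# List the profiles available on this node
tuned-adm list

# Apply the throughput-oriented profile
tuned-adm profile throughput-performance

# Confirm which profile is currently active
tuned-adm active
```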

The key point is consistency: all nodes intended to run a given class of workload should use the same tuned profile, and be grouped into a single node pool, with a shared tag for workload placement. Mixing profiles within a node pool can lead to difficult-to-diagnose performance variability.

Applying Tuned Profiles to Kubernetes Nodes

In self-managed Kubernetes environments, tuned is typically configured directly on the node operating system using automation tooling. A typical workflow looks like this:

  • Infrastructure provisioning: Nodes are created as virtual machines or bare metal instances using an infrastructure-as-code tool such as Terraform, choosing the instance type, disk layout, and base operating system image
  • Operating system configuration: As part of node initialization, tools like cloud-init, Ansible, or Puppet are used to configure the operating system. At this stage, you can install and activate the tuned service and choose a tuned profile using tuned-adm profile <profile-name>. Note that when using tuned inside a virtual machine or cloud instance, settings apply to the virtualized CPU cores – some settings, such as the CPU governor settings for the CPU core, are controlled at the host operating system level.
  • Attach the compute node to Kubernetes: Once the OS is tuned, and the kubelet is installed and started, the node joins the cluster and can be labeled immediately with the tuned profile that has been applied.
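As an example of the node-initialization step, a cloud-init fragment might install and activate tuned before the node joins the cluster. The package name and profile below assume a RHEL-family image with systemd; adjust for your distribution:

```yaml
#cloud-config
packages:
  - tuned
runcmd:
  # Start tuned now and on every boot, then select a profile
  - systemctl enable --now tuned
  - tuned-adm profile throughput-performance
```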

Once a tuned profile is applied, it affects all workloads running on that node. Kubernetes itself is unaware of the tuning change, which makes tagging critical when adding new nodes. New compute nodes can be tagged with:

kubectl label nodes node1 tuned-profile=throughput-performance

or whichever tuned profile was applied.

Workloads can then express placement preferences using node affinity, with hard requirements for workloads that depend on a specific tuning, or with soft requirements when a certain profile is preferable but not required.

On managed Kubernetes services, your ability to control tuned varies. Some platforms expose limited tuning options via node pools or instance templates; others lock down host configuration entirely. In those environments, tuned-based tuning may not be possible, and workload placement must rely on higher-level characteristics such as instance type or CPU generation.

A warning: tuned will not compensate for undersized resource requests or poorly chosen JVM settings, and it will not turn a general-purpose Kubernetes node into a real-time system. However, when used deliberately, it provides a clean and supportable way to align host behavior with workload intent.

Bringing it all Together

Across all these sections, the common theme is enabling operators to make their intent explicit. By clearly expressing resource requirements, placement constraints, and host-level tuning, you reduce ambiguity in how Kubernetes and the operating system make scheduling decisions on your behalf.

While this tutorial does not include performance benchmarks, experience shows that thoughtful use of these hardware and operating system “levers” can unlock meaningful improvements in efficiency, throughput, and predictability—often without changing a single line of application code.

To learn more about our developer efforts and find best practices, visit Ampere’s Developer Center and join the conversation in the Ampere Developer Community. 

Created At : March 9th 2026, 5:50:07 pm
Last Updated At : March 18th 2026, 4:58:21 pm
© 2025 Ampere Computing LLC. All rights reserved. Ampere, Altra and the A and Ampere logos are registered trademarks or trademarks of Ampere Computing.