Optimizing Java Applications for Arm64 in the Cloud
Java remains one of the most popular languages for enterprise applications running in the cloud. While languages like Go, Rust, JavaScript, and Python have a high profile among cloud application developers, the RedMonk language rankings have placed Java in the top three most popular languages throughout the history of the ranking.
When deploying applications to the cloud, there are a few key differences between deployment environments and development environments. Whether you’re spinning up a microservice application on Kubernetes or launching virtual machine instances, it is important to tune your Java Virtual Machine (JVM) to ensure that you are getting your money’s worth from your cloud spend. It pays to know how the JVM allocates resources, and to ensure that you are using the resources efficiently.
Most of the information and advice in this series is platform independent and will work just as well on x86_64 and Arm64 CPUs. As Java was designed as a platform-independent language, this is not surprising. As the Java community has spent effort optimizing the JVM for Arm64 (which you will also see called aarch64, for “64-bit Arm architecture”), Java developers should see the performance of their applications improve on that architecture without doing anything special. However, we will point out some areas where the Arm64 and x86_64 architectures differ, and how to take advantage of those differences for your applications. Additionally, we will generally only refer to long-term supported versions of Java tooling. For example, G1GC was introduced as the default garbage collector in the Java 9 development cycle but was not available in a long-term supported JDK (Java Development Kit) until Java 11. As most enterprise Java developers are using LTS versions of the JDK, we will limit version references to those (at the time of writing those are Java 8, 11, 17, 21, and 25).
In this two-part series on tuning Java applications for the cloud, we come at the problem from two different perspectives. In part 1 (this article), we will concentrate on how the JVM allocates resources and identify some of the options and operating system configurations that can result in higher performance on Ampere-powered instances in the cloud or on dedicated bare metal hardware. In part 2, we will look more closely at the infrastructure side, in particular Kubernetes and Linux kernel configuration. We will walk through some of the architectural differences between the Arm64 architecture and x86, and how to ensure that your Kubernetes cluster, operating system, and JVM are all tuned to maximize the bang per buck that you are getting from your Java application.
When running Java applications in the cloud, tuning the JVM is not necessarily one of the things that is front of mind for deployment teams, but getting things wrong, or running with just default options, can impact the performance and cost for your cloud applications.
In this article, we will walk through some of the more helpful tunable elements in the JVM, covering:
- why the version of Java you run matters for Arm64 performance
- how the JVM sizes its heap by default, and why those defaults are a poor fit for dedicated cloud instances
- the handful of flags (garbage collector, processor count, heap percentages) worth setting for most cloud workloads
- more advanced options such as huge pages, memory pre-touch, and compilation settings
Arm64 support was first introduced into the Java ecosystem with Java 8, and it has been steadily improving since then. If you are still using Java 8, your Java applications can run up to 30% slower than if you are using a more recent version of Java like Java 21 or the recently released Java 25. The reason is two-fold: newer releases include Arm64-specific improvements such as better code generation and optimized intrinsics for math, crypto, and vector operations, and they also include general JVM improvements such as G1GC becoming the default garbage collector.
It is worth noting that it is possible to develop applications with the Java 8 language syntax, while taking advantage of the performance improvements of a more recent JVM, using Oracle’s Java SE Enterprise Performance Pack. This is (simplifying slightly) a distribution of tools that compiles Java 8 applications to run on a JVM from the Java 17 JDK. That said, the language has included a great many improvements over the past 10 years, and we recommend updating your Java applications to run on a more recent distribution of Java.
What is the Difference between Cloud Instances and Developer Desktops?
The JVM’s default ergonomics were designed with the assumption that your Java application is just one of many processes running on a shared host. On a developer laptop or a multi-tenant server, the JVM intentionally plays nice, limiting itself to a relatively small percentage of system memory and leaving headroom for everything else. That works fine on a workstation where the JVM is competing with your IDE, your browser, and background services, but in cloud environments your Java application will typically be the only application you care about in that VM or Docker (more generally OCI) container instance.
By default, if you don’t explicitly set initial and max heap size, the JVM uses a tiered formula to size the heap based on “available memory”. You can see what the heap size is by default for your cloud instances using Java logging:
```
java -Xlog:gc+heap=debug -version
[0.005s][debug][gc,heap] Minimum heap 8388608  Initial heap 524288000  Maximum heap 8342470656
```
The defaults for heap sizing, based on the system RAM available, are:
- 384 MB of RAM or less: maximum heap is 50% of RAM
- between 384 MB and 768 MB: maximum heap is a fixed 192 MB
- 768 MB of RAM or more: maximum heap is 25% of RAM
The initial heap defaults to roughly 1/64th of memory, capped at 1 GB.
So, for a VM with 512 MB RAM, the JVM will still only allow 192 MB for the heap. On a laptop with 16 GB RAM, the default cap is ~4 GB. On a container with a 2 GB memory limit, the heap defaults to ~512 MB.
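You can also ask the JVM to print the final values it has computed for these settings without running your application. A quick check on a Linux shell looks something like this:

```
# Print the heap sizes the JVM computed from its ergonomics (values are in bytes)
java -XX:+PrintFlagsFinal -version | grep -E 'InitialHeapSize|MaxHeapSize'
```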
That’s a perfectly reasonable choice if your JVM is sharing a machine with dozens of other processes. But in the cloud, when you spin up a dedicated VM or a container instance, the JVM is often the only significant process running. Instead of trying to be a good neighbor and leave resources for other applications, you want it to use the majority of the resources you’ve provisioned — otherwise, you’re paying for idle memory and under-utilized CPU.
This shift has two key implications: you should size the heap explicitly as a fraction of the memory you have provisioned, rather than relying on conservative defaults, and you should make sure the JVM's view of the available CPUs matches the resources assigned to your container or VM. The table below summarizes the default ergonomics and our recommendations for cloud workloads:
| Scenario | Default Ergonomics (no flags) | Recommended for Cloud Workloads |
|---|---|---|
| Initial heap (-Xms or -XX:InitialRAMPercentage) | ~1/64th of memory (capped at 1 GB) | Match initial heap close to max heap (stable long-lived services): -XX:InitialRAMPercentage=80 |
| Max heap (-Xmx or -XX:MaxRAMPercentage) | ≤ 384 MB RAM → 50% of RAM; 384–768 MB → fixed 192 MB; ≥ 768 MB → 25% of RAM | Set heap to 80-85% of container/VM limit: -XX:MaxRAMPercentage=80 |
| GC choice | G1GC (default in Java 11+) or Parallel GC (Java 8) when processor count ≥ 2; SerialGC when processor count < 2 | G1GC (-XX:+UseG1GC) is a sensible default for most cloud services |
| CPU count | JVM detects host cores, may overshoot container quota | -XX:ActiveProcessorCount=<cpu_limit with minimum 2> |
| Cgroup awareness | Java 11+ detects container limits | Set explicit percentages as you would for VMs |
Regardless of your target architecture, if you only tweak a few JVM options for cloud workloads, start here. These settings prevent the most common pitfalls and align the JVM with the resources you’ve explicitly provisioned:
Garbage Collector: Use G1GC (-XX:+UseG1GC) for most cloud services. It balances throughput and latency, scales well with heap sizes in the multi-GB range, and is the JVM’s default in recent releases when you have more than one CPU core.
Active Processor Count:
-XX:ActiveProcessorCount=<cpu_limit with minimum 2>
Match this value to the number of CPUs or millicores assigned to the underlying compute hosting your container. For example, even if Kubernetes allocates a quota of 1024 millicores to your container, if it is running on a 16-core virtual machine, you should set ActiveProcessorCount to 2 or more. This allows the JVM to size its thread pools appropriately and to choose a garbage collector like G1GC instead of SerialGC, which halts your application entirely during GC runs. The optimal value will depend on what else is running in the virtual machine: if you set the number too high, you will have noisy-neighbor impacts on other applications running on the same compute node.
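If you want to confirm what the running JVM actually ends up with, a small sketch like the one below (the class name is ours, not part of any standard API) prints the processor count and maximum heap size the JVM will use. On Java 11 or later you can run it directly with your chosen flags, for example `java -XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=80 VerifyJvmSizing.java`:

```java
public class VerifyJvmSizing {
    public static void main(String[] args) {
        // Reflects -XX:ActiveProcessorCount, or the detected container CPU quota.
        System.out.println("Available processors: "
                + Runtime.getRuntime().availableProcessors());
        // Reflects -Xmx / -XX:MaxRAMPercentage, reported here in MiB.
        System.out.println("Max heap (MiB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```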
Heap Sizing:
-XX:InitialRAMPercentage=80 -XX:MaxRAMPercentage=85
These options tell the JVM to scale its heap based on the container’s memory limits rather than host memory, and to claim a larger fraction than the desktop defaults. Use 80% as a safe baseline; push closer to 85% if your workload is steady-state.
Consistency Between Init and Max: For long-lived services, set InitialRAMPercentage equal to or slightly smaller than MaxRAMPercentage. This avoids the performance penalty of gradual heap expansion under load.
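Putting the three knobs together, a cloud service launched along the following lines (app.jar is a placeholder for your application) uses G1GC, reports at least two processors to the JVM, and sizes the heap from the container or VM limit rather than the desktop defaults:

```
java -XX:+UseG1GC \
     -XX:ActiveProcessorCount=2 \
     -XX:InitialRAMPercentage=80 \
     -XX:MaxRAMPercentage=85 \
     -jar app.jar
```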
With these three knobs, most Java applications running in Kubernetes or cloud VMs will achieve predictable performance and avoid out-of-memory crashes.
Beyond heap sizing and CPU alignment, a handful of JVM options can give you measurable improvements for servers running Ampere’s Arm64 CPUs. These are not “one size fits all”. They depend on workload characteristics like RAM usage, latency vs throughput trade-offs, and network I/O, but they’re worth testing to see if they improve your application performance.
Enabling Huge Pages: Transparent Huge Pages (THP) let the kernel back a large, contiguous region of application memory with a single huge page made up of multiple standard kernel pages, so the application sees one large memory page instead of many small ones. Enabling large pages in the kernel and telling the JVM to use them with -XX:+UseTransparentHugePages, so that the heap is allocated in large, contiguous blocks, can offer a massive performance boost for workloads that can take advantage of them.
Using a 64K-page kernel: Booting your host OS with a 64K kernel page size ensures that memory is allocated and managed by the kernel in larger blocks than the 4K default. This reduces TLB misses and speeds up memory access for workloads that tend to use large contiguous blocks of memory. Note that booting kernels with a specific page size and configuring Transparent Huge Pages require OS support and configuration, so they are best handled in coordination with your ops team.
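As a quick sanity check before flipping the JVM flag, the commands below (a sketch for a typical Linux host; app.jar is a placeholder) show the current THP mode and the kernel page size, then opt the JVM in to huge pages:

```
# Show whether Transparent Huge Pages are enabled: [always], [madvise], or [never]
cat /sys/kernel/mm/transparent_hugepage/enabled
# Show the kernel page size in bytes: 4096 for a 4K kernel, 65536 for a 64K kernel
getconf PAGESIZE
# Ask the JVM to use Transparent Huge Pages for its heap
java -XX:+UseTransparentHugePages -jar app.jar
```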
Memory Prefetch Some workloads benefit from pre-touching memory pages on startup. By default, virtual memory pages are not mapped to physical memory until they are needed. The first time a physical memory page is needed, the operating system generates a page fault, which fetches a physical memory page, maps the virtual address to the physical address, and stores the pair of addresses in the kernel page table. Pre-touch maps virtual memory addresses to physical memory addresses at start-up time, which makes the first access of those memory pages at run-time faster. Adding the option:
-XX:+AlwaysPreTouch
forces the JVM to commit and map all heap pages at startup, avoiding page faults later under load. The tradeoff: slightly longer startup time, but more consistent latency once running. This option is good for latency-sensitive services that stay up for a long time. This has the additional benefit of ensuring a fast failure at start-up time, if you are requesting more memory than can be made available to your application.
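For example, a launch line along these lines (app.jar is again a placeholder) combines pre-touch with a fixed heap size, so the full heap is committed and mapped before the service starts taking traffic:

```
# Equal -Xms and -Xmx plus pre-touch: the whole heap is committed and touched at startup
java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar app.jar
```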
Tiered Compilation vs. Ahead-of-Time Compilation: The JVM normally compiles hot code paths incrementally at runtime. Options like -XX:+TieredCompilation (enabled by default) balance startup speed with steady-state performance. For cloud workloads where startup time is less important than throughput, you can bias toward compiling more aggressively up front. In some cases, doing work ahead of time, for example with Class Data Sharing (CDS) archives, can reduce startup and warm-up overhead (the experimental jaotc AOT compiler was removed in JDK 17). However, ahead-of-time compilation comes with both risks and constraints. Just-In-Time (JIT, or runtime) compilation takes advantage of profiling information gathered while running the application to identify hot methods, calls that can be devirtualized or inlined, hot loops within methods, constant parameters, branch frequencies, and so on. An ahead-of-time (AOT) compiler is missing all that information and may produce less optimal code. In addition, language features that rely on dynamic class loading, where class definitions are not available ahead of time or are generated at run-time, cannot be used with ahead-of-time compilation.
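As a concrete illustration of the CDS approach, the sketch below (app.jar and the archive name are placeholders) records an application class-data archive on one run and reuses it on subsequent runs; this reduces class-loading work at startup rather than replacing JIT compilation:

```
# Trial run: record the classes the application loads into an archive
java -XX:ArchiveClassesAtExit=app-cds.jsa -jar app.jar
# Subsequent runs: reuse the archive to speed up class loading at startup
java -XX:SharedArchiveFile=app-cds.jsa -jar app.jar
```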
Vectorization and Intrinsics: Modern JVMs on Arm64 include optimized intrinsics for math, crypto, and vector operations. No flags are needed to enable these, but it is worth validating that you are running at least Java 17 to take advantage of them.
The JVM has a long history of making conservative assumptions, tuned for developer laptops and multi-tenant servers rather than dedicated cloud instances. On Ampere®-powered VMs and containers, those defaults often leave memory and CPU cycles unused. By explicitly setting heap percentages and processor counts, and choosing the right garbage collector, you can ensure your applications take full advantage of the hardware beneath them. By using a more recent version of the JVM, you are benefiting from the incremental improvements that have been made since Arm64 support was first added in Java 8.
That’s just the beginning, though. JVM flags and tuning deliver real wins, but the bigger picture includes the operating system and Kubernetes itself. How Linux allocates memory pages, how Kubernetes enforces CPU and memory quotas, and how containers perceive their share of the host all have a direct impact on JVM performance.
In the next article in this series, we’ll step outside the JVM and look at the infrastructure layer: how Linux allocates and manages memory pages, how Kubernetes enforces CPU and memory quotas, and how containers perceive their share of the host.
Part 1 provides guidance on getting the JVM to “use what you’ve paid for”; part 2 will ensure your OS and container platform are tuned for optimal performance.
We invite you to learn more about Ampere developer efforts, find best practices, insights, and give us feedback at: https://developer.amperecomputing.com and https://community.amperecomputing.com/