Ampere and NVIDIA Android cloud gaming SDK
With cloud gaming gaining popularity and the metaverse emerging as a major XR use case, there is high demand for centralized resources built to render, encode, and stream from the cloud to devices such as mobile handsets and wearables. When these devices are connected over 5G, the combination of low latency and high bandwidth provides a better experience for many use cases, including cloud gaming and the metaverse.
The Ampere Altra (80 cores) and Ampere Altra Max (128 cores) AArch64 processors are complete system-on-chip (SoC) solutions built for cloud native applications, including application streaming, gaming, and XR. In addition to incorporating many high-performance cores, Ampere’s architecture delivers predictable high performance, linear scaling, and high energy efficiency. More importantly, the high IO bandwidth provides direct connections to multiple PCIe devices such as GPUs, NICs, and video encoders, which are essential for the high-density application streaming required by cloud service providers (CSPs).
The growing ODM ecosystem of Ampere platform suppliers provides a variety of single socket (1P) and dual socket (2P) platforms for CSP customers. Several platforms are designed for high-density peripheral devices such as GPUs. These platforms provide slots for single- or double-width, full- or half-length, and full- or half-height cards, auxiliary power for cards drawing up to 300W each, and a balance of CPU/GPU configurations.
Android Versions: 9 included, others can be built with drivers
Components: GPU Driver, SurfaceFlinger, input, audio
APIs: frame capturing, encoding, and streaming
Android streaming client
Docker container management scripts
NVIDIA Gaming SDK
The NVIDIA gaming stack has an efficient and balanced rendering, encoding, and streaming pipeline. After the GPU renders and composes frames, only the file descriptors of the frame buffers, not the frame data, are copied into shared memory, which allows high performance frame capturing, encoding, and streaming. For peak efficiency, these modules run on the host, not inside the Android containers. A frame capturing module reads the shared frame buffer properties and passes them to the CUDA based video encode drivers, which can directly read and encode frames into video streams. Encoded video streams can then be copied from GPU to CPU memory and streamed. User input and audio are processed with similar approaches.
Fig 1: Rendering, encoding, and streaming pipeline
Benchmarks were run on both Mt Collins, a dual socket Ampere Altra server, and Mt Snow, a single socket Ampere Altra Max server, with the following configurations:
Mt Collins:
- 2x Altra with total 160 cores @3.0GHz
- 512GB DRAM
- 1TB NVMe
- 4x NVIDIA T4 (2 on each socket)
Mt Snow:
- 1x Altra Max with 128 cores @3.0GHz
- 512GB DRAM
- 1TB NVMe
- 4x NVIDIA T4 (4 on one socket)
On Mt Collins, a 2P system, each instance is pinned to a specific GPU and to CPU cores on the socket that GPU is attached to, which keeps the two CPU sockets isolated from each other.
Based on the rendering, encoding, and streaming pipeline shown in Figure 1, the data of an Android instance flows from CPU to GPU and back to CPU. Good isolation practices are employed for high density Android game instances. No data is shared between GPUs once GPU resources are allocated, and this holds until the resources are released. The CPU cores and the GPU allocated to an Android instance all reside on the same socket, so cross-socket data traffic is minimized. The exceptions to this isolation principle are the GPU resource allocation and context related functions, which are still global to each instance of a GPU driver and could become a bottleneck. Special attention is paid to prevent overloading these functions.
On Mt Collins, each GPU renders 30 instances of Android containers and encodes the final surfaces, for a total of 120 instances. The 80 cores of each CPU socket are dedicated to the 60 instances running on the two GPUs attached to that socket. On Mt Snow, Android instances are partitioned at 30 instances per GPU, with all instances sharing the 128 cores.
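The Mt Collins partitioning described above can be illustrated with a short sketch. It only computes which GPU, socket, and CPU range an instance would land on; it is not the tooling used in the tests. The function name `placement` and the assumption that cores are numbered contiguously per socket (0-79 on socket 0, 80-159 on socket 1) are hypothetical and should be checked against the actual topology, for example with `lscpu`.

```python
# Topology constants taken from the Mt Collins configuration in the text.
SOCKETS = 2
CORES_PER_SOCKET = 80
GPUS_PER_SOCKET = 2
INSTANCES_PER_GPU = 30


def placement(instance: int) -> dict:
    """Return the GPU, socket, and CPU range for a zero-based instance index."""
    total = SOCKETS * GPUS_PER_SOCKET * INSTANCES_PER_GPU
    if not 0 <= instance < total:
        raise ValueError(f"instance must be in [0, {total})")
    gpu = instance // INSTANCES_PER_GPU   # 30 instances share each GPU
    socket = gpu // GPUS_PER_SOCKET       # 2 GPUs hang off each socket
    # ASSUMPTION: core IDs are contiguous per socket; verify on real hardware.
    first = socket * CORES_PER_SOCKET
    last = first + CORES_PER_SOCKET - 1
    return {"gpu": gpu, "socket": socket, "cpuset": f"{first}-{last}"}


if __name__ == "__main__":
    for i in (0, 59, 60, 119):
        print(i, placement(i))
```

The resulting `cpuset` string has the form accepted by CPU affinity options such as Docker's `--cpuset-cpus`, although the source does not say which mechanism was used for pinning.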
Game frames are rendered at 1280x720@30fps. After encoding, they are streamed via UDP based RTSP. Data packets are transmitted via NICs if clients are connected and are discarded otherwise. Connecting remote clients does not materially change the results.
The CPU and GPU performance data shown were collected with 40 to 120 containers running games in steady state and no clients attached, i.e., all frames were discarded after encoding.
Game | 1P Altra Max | 2P Altra |
---|---|---|
Platformer3D | 0.48 cores/instance | 0.53 cores/instance |
BombSquad | 0.18 cores/instance | 0.24 cores/instance |
With these settings, benchmark results show that CPU utilization is low. When running 120 instances of the Java based Platformer3D, CPU utilization is ~45% on 1P Altra Max and ~40% on 2P Altra. For the NDK based BombSquad, CPU utilization is ~17% on 1P Altra Max and ~18% on 2P Altra. Expressed per instance, as in the table above, Platformer3D consumes 0.48 cores on 1P Altra Max and 0.53 cores on 2P Altra, and BombSquad consumes 0.18 cores on 1P Altra Max and 0.24 cores on 2P Altra. In both cases, Altra Max has marginally better CPU efficiency than Altra. Note that the CPU cores required per instance depend on the game title: on Altra Max, Platformer3D needs 0.48 cores per instance while BombSquad needs 0.18.
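The per-instance figures follow from the utilization percentages by simple arithmetic: total cores times utilization, divided by the number of instances. The sketch below reproduces the table from the rounded utilization values quoted in the text; the function name `cores_per_instance` is illustrative, and because the inputs are approximate percentages the results only match to two decimal places.

```python
def cores_per_instance(total_cores: int, utilization: float, instances: int) -> float:
    """Average number of CPU cores consumed by each game instance."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return total_cores * utilization / instances


# Core counts and approximate utilizations as quoted in the text.
CORES = {"1P Altra Max": 128, "2P Altra": 160}
UTILIZATION = {
    ("Platformer3D", "1P Altra Max"): 0.45,
    ("Platformer3D", "2P Altra"): 0.40,
    ("BombSquad", "1P Altra Max"): 0.17,
    ("BombSquad", "2P Altra"): 0.18,
}
INSTANCES = 120

if __name__ == "__main__":
    for (game, system), util in UTILIZATION.items():
        value = cores_per_instance(CORES[system], util, INSTANCES)
        print(f"{game:13s} {system:13s} {value:.2f} cores/instance")
```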
As mentioned above, GPU context and resource allocation related functions are also key factors in determining system wide performance, particularly during stages such as starting and quitting games, starting and stopping containers, or starting and stopping video encode operations. Instance density also depends on GPU resource contention. The NVIDIA T4 provides 15GB of GPU memory, which is shared between instances at roughly 500MB per instance. This is title dependent, but 500MB is a good rule of thumb for middle range Android games. These considerations limit the density of instances on a given machine to the total capacity of the GPU context and resource allocations mentioned. Typically, an NVIDIA T4 runs ~30 instances of middle range games, which limits the density in these tests to a maximum of 120 instances on servers with 4x NVIDIA T4 GPUs. Future examinations of other GPU devices may yield greater density.
For cloud hosted Android applications, instance density is the key metric. Ampere is the first platform to support 120 or more 3D cloud gaming instances per server. Ampere processors natively support both 32-bit and 64-bit Android applications and require no binary translation, which helps maximize instance density. Ampere Altra and Altra Max processors are extremely efficient at running 3D game titles such as Platformer3D and BombSquad, leaving plenty of CPU headroom for the value added services that complete a cloud gaming or cloud phone solution.
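The memory side of this density limit can be checked with a quick calculation. The sketch below uses only the figures quoted in the text (15GB per T4, 500MB per instance, 4 GPUs per server); the helper names are hypothetical, and treating 1GB as 1024MB is an assumption that does not change the result here. It covers the memory bound only and says nothing about the GPU context contention discussed above.

```python
def instances_per_gpu(gpu_memory_mb: int, per_instance_mb: int) -> int:
    """Upper bound on instances per GPU from memory capacity alone."""
    if per_instance_mb <= 0:
        raise ValueError("per_instance_mb must be positive")
    return gpu_memory_mb // per_instance_mb


def instances_per_server(gpus: int, gpu_memory_mb: int, per_instance_mb: int) -> int:
    """Memory-bound instance count for a server with identical GPUs."""
    return gpus * instances_per_gpu(gpu_memory_mb, per_instance_mb)


# Figures from the text; ASSUMPTION: 1GB is counted as 1024MB.
T4_MEMORY_MB = 15 * 1024
PER_INSTANCE_MB = 500
GPUS_PER_SERVER = 4

if __name__ == "__main__":
    print("per GPU:", instances_per_gpu(T4_MEMORY_MB, PER_INSTANCE_MB))
    print("per server:", instances_per_server(GPUS_PER_SERVER, T4_MEMORY_MB, PER_INSTANCE_MB))
```

A title that needs more than 500MB per instance would lower this bound proportionally, which is why the text describes the figure as title dependent.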
All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.
©2022 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com