DEVELOPER STORY

Building a 300 Channel Video Encoding Server

NETINT VPU Technology with Ampere® Altra® Processors sets new operational cost and efficiency standards

02 August 2024

Overview

The exponential growth in demand for high-quality live video streaming has placed immense pressure on operational costs and user expectations across all markets. To address this, NETINT collaborated with Supermicro and Ampere Computing to reimagine the video transcoding server. This live video server uses NETINT VPUs for intensive encoding and transcoding, while Ampere® Altra® CPUs handle additional functions such as deinterlacing, software decoding for various formats, and demanding AI inference tasks like real-time automated subtitling using OpenAI’s Whisper. This hybrid architecture delivers dense, high-performance, and cost-effective video processing within a compact 1U server footprint. It expands system functionality beyond what legacy x86 processors offered, enabling efficient live broadcasting and content delivery by handling a large, diverse set of video streams simultaneously.

Challenges

  • Legacy x86 systems faced limited CPU processing power and skyrocketing operational costs for live video transcoding.
  • The goal was to achieve a 20x throughput increase and 80% operational cost reduction in a smaller, faster server.
  • Overcoming performance degradation over time, initially traced to IOMMU-related issues impacting TLB miss rates.
  • Enabling simultaneous deinterlacing and processing of diverse video formats (e.g., 100x 576i, 100x 720i, 10x 1080i) not natively supported by VPUs.

Solution

The collaborative solution involved integrating NETINT's Quadra VPUs with Ampere Altra processors within a Supermicro 1U server, establishing a powerful hybrid architecture. Ampere engineers optimized FFmpeg for Arm64 NEON SIMD instructions, achieving a 2.9x deinterlacing speedup. A critical performance degradation issue, linked to IOMMU, was resolved by implementing the Linux kernel iommu.passthrough=1 boot option, significantly reducing TLB miss rates. Subsequent Arm64 deinterlacing optimizations by NETINT engineers further refined performance, reducing CPU utilization to 50-60% while exceeding all aggressive targets.

Results

The NETINT 300 Channel Live Stream Video Server delivered a 20x throughput increase and 80% operational cost reduction compared to x86 software. This 1U Supermicro server simultaneously transcodes highly diverse workloads, such as 95x 1080i30 streams or a combined 100x 576i, 100x 720i, and 10x 1080i stream setup. It expanded functionality by supporting CPU-intensive formats like decoding 96x 1080i30 H.264/H.265 streams. All this is achieved efficiently within a dense, power-effective 1U footprint, with CPU utilization remaining at an impressive 50-60%.

“The punchline is that with an Ampere Altra Max Processor and NETINT VPU, a Supermicro 1U server unlocks a whole new world of value.”
Alex Liu, Co-founder, NETINT

NETINT’s Vision

Responding to customers’ concerns about limited CPU processing and skyrocketing power costs, NETINT built a custom ASIC for one purpose: highest-quality, lowest-cost video processing and encoding. NETINT reinvented the live video transcoding server by combining NETINT Quadra VPUs with Ampere’s Altra Max processor to create a smaller and faster server that costs 80% less to operate and increases throughput by 20x compared to software on x86.

Requirements to Reinvent the Video Server

1. Engineer it smaller and faster.
2. Make it cost 80% less to operate.
3. Increase throughput by 20x.

Why NETINT Chose Ampere Processors

NETINT was already familiar with Ampere Computing’s high-performance and low-power processors, which complement NETINT’s Quadra VPUs. The Ampere® Altra® Max Cloud Native Processor is designed for a new era of computing and an energy-constrained world, delivering unprecedented efficiency and performance. From web and video service infrastructure to CDNs to demanding AI inference, Ampere products are the most efficient dense computing platforms on the market. The benefits of using a Cloud Native Processor like Ampere Altra Max include improved efficiency and scalability, which pair well with NETINT’s high-performance and energy-efficient VPUs.

Problem
Could Ampere Altra Max simultaneously deinterlace 100 576i, 100 720i, and 10 1080i video streams, a workload legacy x86 processors couldn’t handle, in a cost-effective 1RU form factor?

How Ampere Responded
Engineers from NETINT, Supermicro, and Ampere unlocked the high performance available with NETINT’s Quadra VPU and Ampere Altra Max 96-core processor to redefine the live stream video server. Initial results with Ampere Altra Max using FFmpeg 5.0 were encouraging compared to legacy x86 processors but didn’t meet NETINT’s goal to increase throughput by 20x while reducing costs by 80%.

Ampere engineers studied the deinterlacing filters available in FFmpeg and investigated the Arm64 optimizations in recent FFmpeg releases. An FFmpeg avfilter patch that provides an optimized assembly implementation using Arm64 NEON SIMD instructions delivered a significant gain in video deinterlacing, with up to a 2.9x speedup in FFmpeg 6.0 compared to FFmpeg 5.0. On all architectures, and especially on Arm64, using the latest versions of software is recommended to take advantage of performance improvements.
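As a rough illustration of the kind of job being measured, the sketch below builds an FFmpeg command line that deinterlaces one input on the CPU. The text does not say which deinterlacing filter or options were used, so the `yadif` default, the file names, and the helper name `build_deinterlace_command` are all assumptions made for the example.

```python
from typing import List


def build_deinterlace_command(
    input_path: str,
    output_path: str,
    deinterlace_filter: str = "yadif",  # assumption: the source does not name the filter
) -> List[str]:
    """Build an FFmpeg argument list that applies one video filter to one input.

    The command is only constructed here, not executed, so the sketch
    runs even on a machine without FFmpeg installed.
    """
    if not deinterlace_filter:
        raise ValueError("a filter name is required")
    return [
        "ffmpeg",
        "-i", input_path,           # interlaced source (hypothetical file name)
        "-vf", deinterlace_filter,  # video filter chain containing the deinterlacer
        output_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_deinterlace_command("input_576i.ts", "output_576p.mp4")))
```

Timing the same command under two FFmpeg builds is one simple way to compare releases, although the 2.9x figure above comes from Ampere's own measurements rather than from this sketch.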

Performance Challenges

NETINT, Supermicro, and Ampere engineers then ran the full video workload, combining CPU-based video deinterlacing with transcoding on NETINT’s Quadra VPUs. Although the deinterlacing jobs alone produced outstanding results, initial runs of the full workload didn’t meet the performance target. Combining their expertise in hardware and software optimization, the team analyzed the problem, found the root cause, and met the aggressive requirements. In the end, the workload used just 50-60% of the Ampere Altra Max processor’s CPU capacity, leaving headroom for future features.

The initial results didn’t meet the target of simultaneously transcoding 100x 576i, 100x 720i, 10x 1080i, 40x 1080p30, 40x 720p30, and 10x 576p input videos. Performance started close to the goal yet unexpectedly slowed down over time. We followed the methodology outlined in Ampere’s tutorial, “Performance Analysis Methodology for Optimizing Altra Family CPUs,” and began by characterizing platform-level performance metrics. Figure 2 shows the mpstat utility data: initially, the system was running within ~4% of the performance target yet was only at ~71% overall CPU utilization, with ~36% in user space (mpstat %usr) and ~35% in system-related tasks: kernel time (mpstat %sys), waiting for IO (mpstat %iowait), and soft interrupts (mpstat %soft). The fact that the system was idle ~29% of the time indicated that something was blocking performance.

mpstat utility output showing that the system was idle 100.0 - 71.4 = 28.6% of the time during initial performance analysis, when it wasn’t meeting the performance target. This showed us that we needed to determine what was limiting system performance.
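The idle figure is simple arithmetic on the mpstat percentages. A minimal sketch, using only the approximate values quoted in the text (the function name `idle_percent` is made up for the example):

```python
def idle_percent(usr: float, system_related: float) -> float:
    """Return the idle CPU share given user and system-related percentages.

    `system_related` is the combined kernel, IO-wait, and soft-interrupt
    share, matching how the text groups the mpstat columns.
    """
    busy = usr + system_related
    if not 0.0 <= busy <= 100.0:
        raise ValueError("percentages must sum to a value between 0 and 100")
    return 100.0 - busy


if __name__ == "__main__":
    # Approximate figures from the text: ~36% user, ~35% system-related.
    print(f"idle: ~{idle_percent(36.0, 35.0):.0f}%")
```

A system that misses its throughput target while roughly 29% idle is waiting on something rather than running out of cores, which is why the investigation turned to interrupts and IO next.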

Given the large percentage of time spent in software interrupts and IO wait, we first investigated interrupts using the softirq tool in BCC, which provides BPF-based Linux IO analysis, networking, monitoring, and more. The softirq tool traces Linux kernel calls to measure the latency of each software interrupt on the system and outputs a histogram of the latency distribution. The BCC tools are very powerful and easy to run. The tool showed ~20 microsecond average latency in the driver used by NETINT’s VPU while handling ~40K interrupts/s. As our performance problem was on the order of milliseconds, this result showed that software interrupts weren’t limiting performance, so we continued to investigate.

BCC softirq tool measures software interrupt latency. softirq block device output showing block IRQ average latency of ~12 usecs and thus not critical for the overall performance when running at 30 FPS or 33 milliseconds per frame.
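The reasoning that interrupt latency was too small to matter can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures quoted in the text (30 FPS, ~20 microseconds average latency, ~40K interrupts/s); spreading the interrupt time evenly over 96 cores is a simplifying assumption, not something the source measured.

```python
def frame_budget_us(fps: float) -> float:
    """Time available per frame, in microseconds."""
    return 1_000_000 / fps


def irq_cpu_seconds_per_second(latency_us: float, interrupts_per_s: float) -> float:
    """Total CPU time spent in the handler per wall-clock second."""
    return latency_us * interrupts_per_s / 1_000_000


if __name__ == "__main__":
    budget = frame_budget_us(30)                   # about 33,333 us per frame
    per_irq_share = 20 / budget                    # one interrupt vs one frame
    total = irq_cpu_seconds_per_second(20, 40_000)
    per_core = total / 96                          # assumption: evenly spread over 96 cores
    print(f"frame budget: {budget:.0f} us")
    print(f"one interrupt uses {per_irq_share:.2%} of a frame budget")
    print(f"handler time: {total:.1f} CPU-s per second, {per_core:.2%} per core")
```

Each interrupt costs well under a tenth of a percent of a frame interval, which is consistent with the conclusion that the slowdown had to come from somewhere else.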

Next, we used the perf record/perf report utilities to collect Performance Monitoring Unit (PMU) counters and characterize the low-level details of how the application was running on the CPU, looking to pinpoint the performance bottleneck(s). As we initially didn’t know what was limiting performance, we collected PMU counter data covering CPU utilization (CPU cycles, CPU instructions, instructions per clock, frontend and backend stalls), cache and memory access, memory bandwidth, and TLB access. Because the system reached ~96% of the performance target after reboot and degraded to ~60% after running many jobs, we collected perf data both right after reboot and when performance was poor. Comparing the PMU data for the biggest differences between the good and poor cases, the kernel function __alloc_and_insert_iova_range stood out by taking 40x more CPU cycles in the poor performance case. Searching the Linux kernel source code via the livegrep website showed that this function is related to the IOMMU. Rebooting the kernel with the iommu.passthrough=1 option resolved the degradation over time by reducing the TLB miss rate. We were at ~96% of the performance target, so we were close but needed extra performance to meet our goals!
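After changing a boot option, it is worth confirming that the running kernel actually received it. A minimal sketch that looks for `iommu.passthrough=1` on the kernel command line: the helper name is hypothetical, and treating the last occurrence as the effective one is an assumption about precedence rather than a statement about kernel behavior.

```python
from pathlib import Path


def iommu_passthrough_enabled(cmdline: str) -> bool:
    """Return True if the kernel command line sets iommu.passthrough=1.

    If the option appears more than once, the last value is used here
    (an assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("iommu.passthrough="):
            value = token.split("=", 1)[1]
    return value == "1"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # standard location on Linux
    if proc_cmdline.exists():
        print("passthrough enabled:", iommu_passthrough_enabled(proc_cmdline.read_text()))
    else:
        print("/proc/cmdline not found; not a Linux system?")
```

How the option is made persistent depends on the bootloader and distribution, which the text does not specify.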

NETINT engineers delivered the final speedup. They found additional Arm64 deinterlacing optimizations available in FFmpeg mainline; with these applied, the server met our performance goals while overall CPU utilization dropped to 50-60%, down from 70%.

perf utility output showing the performance-critical functions when the system was running slow and fast. The function __alloc_and_insert_iova_range shows a very large increase in CPU cycles and frontend stalls. This led us to solve the performance degradation over time by using the Linux kernel boot option iommu.passthrough=1.
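The comparison described above, looking for the functions whose cost changed most between the fast and slow runs, can be sketched as a ratio ranking. The sample numbers are placeholders in arbitrary units, chosen only so that the one ratio stated in the text (40x for `__alloc_and_insert_iova_range`) is reproduced; `hypothetical_other_function` and `rank_by_slowdown` are invented names, and no real perf output is parsed.

```python
from typing import Dict, List, Tuple


def rank_by_slowdown(
    good: Dict[str, float], poor: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank functions by poor/good cycle ratio, largest increase first.

    Functions missing from the good run, or with zero cycles there,
    are skipped because no ratio can be formed.
    """
    ratios = {}
    for name, poor_cycles in poor.items():
        good_cycles = good.get(name)
        if good_cycles:
            ratios[name] = poor_cycles / good_cycles
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder values in arbitrary units, not measured data.
    good_run = {"__alloc_and_insert_iova_range": 1.0, "hypothetical_other_function": 50.0}
    poor_run = {"__alloc_and_insert_iova_range": 40.0, "hypothetical_other_function": 55.0}
    for name, ratio in rank_by_slowdown(good_run, poor_run):
        print(f"{name}: {ratio:.1f}x")
```

Ranking by ratio rather than by absolute cycles surfaces functions that were cheap in the good run and expensive in the poor one, which is the pattern that pointed at the IOMMU here.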

The Results

The result is the NETINT 300 Channel Live Stream Video Server Ampere Edition, based on a collaboration of NETINT, Supermicro, and Ampere. It can simultaneously transcode 95x 1080i30 streams, 195x 720i30 streams, 365x 576i30 streams, or a combined 100x 576i, 100x 720i, 10x 1080i, 40x 1080p30, 40x 720p30, and 10x 576p streams in a Supermicro MegaDC SuperServer ARS-110M-NR 1U server. This server expands system functionality to run video workloads that require high CPU performance in a dense, power-efficient, and cost-effective 1U server.
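The "300 channel" figure follows directly from the combined workload listed above. A short check, using only the stream counts given in the text (the dictionary layout is just one convenient way to write them down):

```python
# Stream counts from the combined workload described in the text.
combined_workload = {
    "576i": 100,
    "720i": 100,
    "1080i": 10,
    "1080p30": 40,
    "720p30": 40,
    "576p": 10,
}

# Formats ending in "i" are interlaced and need CPU deinterlacing first.
interlaced = sum(n for fmt, n in combined_workload.items() if fmt.endswith("i"))
progressive = sum(n for fmt, n in combined_workload.items() if not fmt.endswith("i"))
total = interlaced + progressive

print(f"interlaced: {interlaced}, progressive: {progressive}, total: {total}")
```

Of the 300 channels, 210 are interlaced inputs, which is why CPU-side deinterlacing throughput was the deciding factor in the design.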

Call to Action

NETINT’s vision to reimagine the live video server based on customer demands resulted in the NETINT Quadra Video Server Ampere Edition in a Supermicro 1U server chassis, unlocking a whole new world of value for customers who need to run video workloads that require high-performance CPU processing in addition to video transcoding with NETINT’s VPUs.

Alex Liu and Mark Donnigan from NETINT, Sean Varley from Ampere Computing, and Ben Lee from Supermicro present additional information in a webinar available on NETINT’s YouTube channel, “How to Build a Live Streaming Server that delivers 300 HD interlaced channels.”

Other video workloads well suited to this server include AI inference processing. At NAB 2024, NETINT announced and demonstrated what it describes as an industry-first automated subtitling feature using OpenAI Whisper running on Ampere.

About the Companies

NETINT
Founded in 2015, NETINT’s big dream of combining the benefits of silicon with the quality and flexibility of software for video encoding using proprietary ASICs is now a reality. As the first commercial vendor for video processing-specific silicon, NETINT pioneered the development of the video processing unit (VPU). Nearly 100,000 NETINT VPUs are deployed globally, processing over 300 billion minutes of video.

Supermicro
Supermicro is a global technology leader committed to delivering first-to-market innovation for Enterprise, Cloud, AI, Metaverse, and 5G Telco/Edge IT Infrastructure, with a focus on environmentally friendly and energy-saving products. Supermicro uses a building blocks approach to allow for combinations of different form factors, making it flexible and adaptable to various customer needs. Their expertise includes system engineering, focused on the importance of validation, and ensuring that all components work together seamlessly to meet expected performance levels. Additionally, they optimize costs through different configurations, including choices in memory, hard drives, and CPUs, which together make a significant difference in the overall solutions that Supermicro provides.

Ampere Computing
Ampere is a modern semiconductor company designing the future of cloud computing with the world’s first Cloud Native Processors. Built for the sustainable Cloud with the highest performance and best performance per watt, Ampere processors accelerate the delivery of all cloud computing applications. Ampere Cloud Native Processors provide industry-leading cloud performance, power efficiency and scalability. For more information visit https://amperecomputing.com.


All data and information contained in or disclosed by this document are for informational purposes only and are subject to change. This document is not to be used, copied, or reproduced in its entirety, or presented to others without the express written permission of Ampere®. © 2025 Ampere® Computing LLC. All rights reserved. Ampere®, Ampere® Computing, Altra and the Ampere® logo are all trademarks of Ampere® Computing LLC or its affiliates. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Created At : October 23rd 2025, 6:43:05 pm
Last Updated At : February 3rd 2026, 10:09:34 pm