AI infrastructure is moving fast. Large language models are everywhere, frameworks are changing rapidly, and new optimization techniques show up every few weeks. Keeping up requires more than fast hardware; it demands focused, adaptable engineering, especially in software.
Long before this wave of AI took off, Ampere made a key decision to invest early in AI software. In 2021, we acquired OnSpecta, a startup focused on accelerating inference performance through software. For a company known for hardware, this was a deliberate move and a recognition that real-world AI workloads would be won or lost through software optimization.
The team formed from that acquisition, led by Victor Jakubiuk, now plays a central role in Ampere’s AI software efforts. Their job is to make inference fast, efficient and easy to deploy without asking developers to rewrite code or adopt proprietary tools.
Working Through a Shifting AI Landscape
AI frameworks change fast. New compilers, quantization methods and runtime layers come and go. Instead of chasing every trend, this specialized team built a system to evaluate which technologies would stick and which ones wouldn’t. That focus helped them prioritize work where it had the biggest impact, improving the core frameworks and libraries that most developers rely on.
They contributed optimizations to TensorFlow, PyTorch and ONNX Runtime, helping those frameworks run better on Ampere hardware without requiring special toolchains or retraining. Their focus was on practical performance gains, including higher throughput, lower latency and better energy efficiency, especially at scale.
Solving Real Bottlenecks
As models got larger, inference workloads hit a new memory wall. The team responded by implementing techniques like flash attention to reduce memory bandwidth pressure and unnecessary data movement. They tuned the full software stack to keep data closer to where it’s processed, which helped increase performance without requiring changes to models or developer workflows.
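The core idea behind flash attention is to tile the computation so the full sequence-length-by-sequence-length score matrix is never materialized in memory: key/value blocks are streamed through, and a running ("online") softmax folds each block into the accumulated result. The sketch below is an illustrative NumPy rendition of that idea only, not Ampere's implementation; the function name and block size are invented for the example.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention computed block-by-block over K/V with an online
    softmax, so only (n x block) score tiles exist at any time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = (Q @ Kb.T) * scale              # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)           # rescales earlier tiles
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

The result is numerically identical to ordinary softmax attention; the win is that memory traffic scales with the tile size rather than the full sequence length, which is exactly the bandwidth pressure the paragraph above describes.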
Their insights also informed hardware design, particularly around memory layout and system architecture. Even there, the focus stayed on software: how to get more out of the infrastructure already in place.
Seeing the Shift to Smaller Models
While many in the industry were scaling up to 70B+ parameter models, the team saw a different need growing: smaller, fine-tuned models built for specific tasks. These models offered better performance-per-dollar and were easier to deploy, but they needed new software strategies to reach their full potential.
The team shifted focus to support this next wave of workloads. They tuned inference pipelines for customers running compact models in high-throughput, low-latency environments. They also made those improvements work out of the box by building the Ampere Optimized AI Framework (AIO), a library that plugs into major frameworks and speeds up inference on Ampere hardware without requiring model conversion or custom integration.
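To see what "no model conversion or custom integration" means in practice, consider ordinary PyTorch eager-mode inference. The toy model below is made up for illustration; the point is that a drop-in framework accelerator like AIO operates beneath the framework API, so a script like this would be the entire integration, unchanged.

```python
import torch
import torch.nn as nn

# A small stand-in model; with a drop-in accelerated framework build,
# this exact script runs faster with no code or model changes.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
model.eval()

with torch.inference_mode():
    x = torch.randn(32, 64)   # a batch of 32 feature vectors
    logits = model(x)         # standard eager-mode inference

print(logits.shape)  # torch.Size([32, 8])
```

No export, quantization pass, or runtime-specific graph format is required at the application layer, which is what makes this deployment model attractive for the compact, task-specific models described above.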
A Software Bet That’s Paying Off
Ampere’s early investment in AI software is now a clear advantage. While the OnSpecta acquisition brought important tools, it also brought a team with deep experience in AI systems, capable of working across frameworks, compilers and hardware interfaces.
Today, their work is deployed across real production environments, from cloud services to social platforms. Customers are running transformer models, recommendation systems and vector search on Ampere hardware, all powered by software that’s been tuned for performance, cost and energy use.
This team is just one part of Ampere’s AI effort, but it’s proof that software matters, and that investing early, even as a hardware company, can create lasting technical and competitive advantages.
Disclaimer
All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.
©2025 Ampere Computing LLC. All Rights Reserved. Ampere, Ampere Computing, AmpereOne and the Ampere logo are all registered trademarks or trademarks of Ampere Computing LLC or its affiliates. All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.