LLM Inference with Ampere-based OCI A1
Meet Your Performance Needs While Minimizing TCO
Ampere Cloud Native Processors with Ampere Optimized AI Frameworks are uniquely positioned to offer Large Language Model (LLM) inference at performance levels that meet client needs, both in tokens per second (tps) and time to first token (TTFT), while providing the lowest cost per million tokens.
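Both metrics are easy to measure yourself. Below is a minimal sketch that times a streamed completion against a llama.cpp server's OpenAI-compatible API; the endpoint URL, port, and prompt are assumptions for illustration, and counting one token per streamed chunk is an approximation.

```python
import json
import time

import requests

# Assumed endpoint of a locally running llama.cpp server
# (started separately, e.g. from the Ampere Optimized llama.cpp image).
URL = "http://localhost:8080/v1/completions"

payload = {
    "prompt": "Explain cloud native processors in one paragraph.",
    "max_tokens": 128,
    "stream": True,  # stream tokens so the first one can be timed
}

start = time.monotonic()
first_token_at = None
n_tokens = 0

# The server streams Server-Sent Events: lines of the form "data: {...}".
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            if first_token_at is None:
                first_token_at = time.monotonic()
            n_tokens += 1  # each streamed chunk is roughly one token

elapsed = time.monotonic() - start
print(f"time to first token: {first_token_at - start:.2f} s")
print(f"tokens per second:   {n_tokens / elapsed:.1f} tps")
```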
Serge Chat
This demo shows the Ampere-developed chatbot Serge running Llama 2 7B on Ampere-based OCI A1, matching the user experience provided by ChatGPT 3.5. Serge, a simple chatbot built solely for showcase purposes, rivals the performance and output quality of ChatGPT 3.5 while running GPU-free on efficient, scalable Ampere-based OCI A1 cloud instances.
Quickstart Guide
Ampere Optimized llama.cpp on Docker Hub
This Docker image can be run on bare metal Ampere® CPUs and Ampere®-based VMs available in the cloud.
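As a sketch of what launching the image can look like, the following uses the Docker SDK for Python. The image tag, model file name, and entrypoint flags are assumptions for illustration; consult the image's Docker Hub page for the exact invocation.

```python
import docker

client = docker.from_env()

# Image name/tag and llama.cpp flags below are illustrative assumptions;
# see the image's Docker Hub page for the actual entrypoint and options.
IMAGE = "amperecomputingai/llama.cpp:latest"
MODEL_DIR = "/path/to/models"  # host directory containing a GGUF model

output = client.containers.run(
    IMAGE,
    command=[
        "-m", "/models/llama-2-7b.Q4_K_M.gguf",     # model file (assumed name)
        "-p", "Hello from an Ampere A1 instance!",  # prompt
        "-n", "64",                                 # tokens to generate
    ],
    volumes={MODEL_DIR: {"bind": "/models", "mode": "ro"}},
    remove=True,  # clean up the container when the run finishes
)
print(output.decode())
```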
Resources