LLM Inference with Ampere-based OCI A1
Meet Your Performance Needs While Minimizing TCO
Ampere Cloud Native Processors with Ampere Optimized AI Frameworks are uniquely positioned to offer Large Language Model (LLM) inference at performance levels that meet client needs, both in tokens per second (tps) and time to first token (TTFT), while providing the lowest cost per million tokens.
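Both metrics are easy to measure yourself. Below is a minimal sketch that times a streamed completion against a llama.cpp server's OpenAI-compatible API; the endpoint URL, port, and prompt are assumptions for illustration, and counting one token per streamed chunk is an approximation.

```python
import json
import time

import requests

# Assumed endpoint of a locally running llama.cpp server
# (started separately, e.g. from the Ampere Optimized llama.cpp image).
URL = "http://localhost:8080/v1/completions"

payload = {
    "prompt": "Explain cloud native processors in one paragraph.",
    "max_tokens": 128,
    "stream": True,  # stream tokens so the first one can be timed
}

start = time.monotonic()
first_token_at = None
n_tokens = 0

# The server streams Server-Sent Events: lines of the form "data: {...}".
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            if first_token_at is None:
                first_token_at = time.monotonic()
            n_tokens += 1  # each streamed chunk is roughly one token

elapsed = time.monotonic() - start
print(f"time to first token: {first_token_at - start:.2f} s")
print(f"tokens per second:   {n_tokens / elapsed:.1f} tps")
```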
Serge Chat
This demo shows the Ampere-developed chatbot Serge running Llama 2 7B on Ampere-based OCI A1, matching the user experience provided by ChatGPT 3.5. Serge, a simple chatbot built solely for showcase purposes, rivals the performance and output quality of ChatGPT 3.5 while running GPU-free on efficient, scalable Ampere-based OCI A1 cloud instances.
Quickstart Guide
Ampere Optimized llama.cpp on Docker Hub
This Docker image can be run on bare metal Ampere® CPUs and Ampere®-based VMs available in the cloud.
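As a sketch of what launching the image can look like, the following uses the Docker SDK for Python. The image tag, model file name, and entrypoint flags are assumptions for illustration; consult the image's Docker Hub page for the exact invocation.

```python
import docker

client = docker.from_env()

# Image name/tag and llama.cpp flags below are illustrative assumptions;
# see the image's Docker Hub page for the actual entrypoint and options.
IMAGE = "amperecomputingai/llama.cpp:latest"
MODEL_DIR = "/path/to/models"  # host directory containing a GGUF model

output = client.containers.run(
    IMAGE,
    command=[
        "-m", "/models/llama-2-7b.Q4_K_M.gguf",     # model file (assumed name)
        "-p", "Hello from an Ampere A1 instance!",  # prompt
        "-n", "64",                                 # tokens to generate
    ],
    volumes={MODEL_DIR: {"bind": "/models", "mode": "ro"}},
    remove=True,  # clean up the container when the run finishes
)
print(output.decode())
```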
Resources