Footnotes:
Data Center Efficiency: Data for the Efficiency claims and Carbon equivalency analysis in the roadmap video (05/18/2023) is based on a composite Web Service study used in the Ampere Efficiency campaign and based on single node performance comparisons measured and published by Ampere Computing. Performance data and the test configurations used to gather the data for each application is published on our web site. The following table shows the composition of a modeled web service based on performance data to determine scale-out behavior through projections and calculations at both Rack and Data center level. Total data center power consumption is based this Web Services study and scaled to 100,000 ft2 data center. Total power difference is then used to complete the Carbon equivalencies. The primary applications used in this analysis are:
Rack-level evaluation is based on the total performance required to scale out to one rack of power budget for Ampere® Altra® Max processors under the weighted load of the stated application composition above. The rack is a standard 42U rack with a total power budget of ~14 kW, including a ~10% overhead buffer for networking, management, and PDU. Per-server power is socket-level power measured during fully loaded operation for each architecture, combined with an equivalent system-level overhead typical of motherboard, peripheral, and memory power draws. All socket power figures were measured by Ampere during live stress testing. The relative power efficiency ratings can be found at the links provided in the table above.
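The rack-level packing math can be expressed compactly. The sketch below is a minimal illustration (not Ampere's actual tooling): the ~14 kW budget and ~10% overhead figures come from this section, and the 534 W example per-server power is one of the measured figures quoted later in these footnotes.

```python
# Minimal sketch of the rack-level sizing described above.
# Assumes a 42U rack, ~14 kW total budget, and ~10% of the budget
# reserved for networking, management, and PDU overhead.

RACK_BUDGET_W = 14_000
OVERHEAD_FRACTION = 0.10

def servers_per_rack(server_power_w: float) -> int:
    """Number of 1U single-socket servers that fit in the usable rack power."""
    usable_w = RACK_BUDGET_W * (1 - OVERHEAD_FRACTION)
    return int(usable_w // server_power_w)

# Example: a server drawing 534 W (socket power plus typical
# system-level overhead, as measured above).
print(servers_per_rack(534))  # -> 23
```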
Data center level analysis is calculated from the rack-level analysis and scaled linearly to fit a medium-sized data center specification based approximately on publicly available data for the NSA facility in Bluffdale, UT.(1) The modeled data center is 100k ft2, with 65% of the space reserved for the server room built on an 8-tile pitch. The total power capacity is roughly 66 MW based on a PUE assumption of 1.2. More information on data center rack pitch densities can be found through a variety of publicly available analyses.(2)
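As a rough illustration of the scaling step (a back-of-the-envelope sketch, not the exact model): with a 1.2 PUE, the ~66 MW facility supplies about 66 / 1.2 = 55 MW of IT power, and dividing by the ~14 kW rack budget gives the approximate rack count used for linear scaling.

```python
# Back-of-the-envelope version of the data-center scaling described above.
# All figures come from this section; the calculation is only a sketch of
# the stated "scale linearly from rack level" approach.

TOTAL_FACILITY_POWER_MW = 66.0   # stated total power capacity
PUE = 1.2                        # stated PUE assumption
RACK_BUDGET_KW = 14.0            # rack power budget from the rack-level analysis

it_power_mw = TOTAL_FACILITY_POWER_MW / PUE        # power available to IT gear
racks = int(it_power_mw * 1000 // RACK_BUDGET_KW)  # racks the facility can power

print(f"IT power: {it_power_mw:.1f} MW, ~{racks} racks")  # ~55.0 MW, ~3928 racks
```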
Carbon equivalencies were calculated using the EPA equivalency calculator.(3)
(1) https://www.npr.org/sections/alltechconsidered/2013/09/23/225381596/booting-up-new-nsa-data-farm-takes-root-in-utah
(2) https://www.racksolutions.com/news/blog/how-many-servers-does-a-data-center-have/
(3) https://www.epa.gov/energy/greenhouse-gases-equivalencies-calculator-calculations-and-references
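For readers who want to reproduce the final step, the sketch below shows the shape of the conversion from a sustained power difference to an annual CO2 equivalency. The emission factor shown is a placeholder for illustration only; the authoritative, current factors are those in the EPA calculator linked above (3).

```python
# Sketch of converting a data-center power delta into a CO2 equivalency.
# NOTE: the emission factor below is a placeholder for illustration only;
# use the current factors from the EPA calculator (reference 3) instead.

HOURS_PER_YEAR = 8760
TONS_CO2_PER_KWH = 7.09e-4   # placeholder value; check the EPA references

def annual_co2_tons(power_delta_kw: float) -> float:
    """Metric tons of CO2 per year for a sustained power difference."""
    return power_delta_kw * HOURS_PER_YEAR * TONS_CO2_PER_KWH

print(annual_co2_tons(1000.0))  # e.g. a 1 MW saving -> ~6211 tons/year
```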
VMs/Rack: The number of VMs per rack was calculated based on a 42U, 16.5 kW rack. The load applied is SPECrate 2017 Integer Estimated (SIR) for each system architecture compared. The SIR load is used only to drive each server to its maximum power draw in order to calculate the number of servers possible within the 16.5 kW rack budget. For each architecture, the total number of servers in the rack is calculated based on single-socket 1U servers. The total number of VMs per server is based on the physical core count for each processor, summed to obtain the total number of VMs possible per rack. A VM is assumed to own all available threads present for each core. The raw data is shown in the table below:
| Architecture | Cores/Server | System Power/Server | Servers/Rack | Cores (VMs)/Rack |
|---|---|---|---|---|
| AmpereOne | 192 | 434 W | 38 | 7296 |
| Intel SPR 8480 | 56 | 534 W | 30 | 1680 |
| AMD Genoa 9654 | 96 | 624 W | 26 | 2688 |
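The per-row arithmetic can be reproduced with a short script. This is a sketch under the stated assumptions (16.5 kW budget, 1U single-socket servers, one VM per physical core); the published figures may incorporate additional modeling detail, so treat it as illustrative.

```python
# Sketch of the VMs/rack calculation described above: servers are packed
# into a 16.5 kW rack budget, and each physical core hosts one VM.

RACK_BUDGET_W = 16_500

def vms_per_rack(cores_per_server: int, server_power_w: float) -> tuple[int, int]:
    """(servers per rack, VMs per rack) under the 16.5 kW budget."""
    servers = int(RACK_BUDGET_W // server_power_w)  # 1U single-socket servers
    return servers, servers * cores_per_server      # one VM per physical core

print(vms_per_rack(192, 434))  # AmpereOne row -> (38, 7296)
print(vms_per_rack(56, 534))   # Intel SPR 8480 row -> (30, 1680)
```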
Recommendations/Rack: The number of recommendations (queries) per rack was calculated based on a 42U, 14 kW rack. The load applied is PyTorch running the DLRM recommendation model for each system architecture compared. The AI load applied to each server yields a maximum throughput for the DLRM model. Total power for each server was measured at the socket level and combined with a typical power draw for system components; the rack power budget was then divided by this per-server power to obtain the number of servers possible within the 14 kW budget, with 10% overhead applied for networking, management, and PDU. For each architecture, the total number of servers is based on single-socket 1U servers. The total performance per rack is a simple sum: performance per server * total servers/rack. The raw data is shown in the table below:
| Architecture | Cores/Server | Performance/Server | System Power/Server | Servers/Rack | Performance/Rack |
|---|---|---|---|---|---|
| AmpereOne | 160 | 819,750 queries/s | 534 W | 23 | 18.85 M queries/s |
| AMD Genoa 9654 | 96 | 356,388 queries/s | 512 W | 25 | 8.91 M queries/s |
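The rack totals follow directly from the per-server figures. The sketch below simply multiplies the table values (performance per server times servers per rack):

```python
# Sketch of the rack-level aggregation: performance/rack is simply
# performance/server multiplied by the number of servers in the rack.

rows = {
    # name: (queries/s per server, servers per rack from the table above)
    "AmpereOne":      (819_750, 23),
    "AMD Genoa 9654": (356_388, 25),
}

for name, (qps, servers) in rows.items():
    print(f"{name}: {servers * qps / 1e6:.2f} M queries/s per rack")
# AmpereOne: 18.85 M queries/s per rack
# AMD Genoa 9654: 8.91 M queries/s per rack
```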
AMD 9654 (Genoa) configuration:
- HW: AMD 9654 (96c, 192t, 1P, 256GB mem)
- OS: Ubuntu 22.04
- Linux kernel: 5.18.11-200.fc36.x86_64
- AI SW: AMD ZenDNN PyTorch 1.12.1 - release 4.0.0 - python 3.10/pytorch:1.5.2 release docker image
- Data Format: FP32

AmpereOne configuration:
- HW: 160c, 1P, 512GB mem
- OS: Ubuntu 20.04
- Linux kernel: 6.1.10-amp01.4k (400W system)/5.18.19-200.fc36.aarch64
- AI SW: Ampere Computing AI
- Data Format: FP16

DLRM model details:
- PyTorch implementation based on official facebookresearch/dlrm
- https://github.com/AmpereComputingAI/dlrm/tree/karol/torchscript
- Model hyperparameters:
  - arch_sparse_feature_size = 64
  - arch_mlp_bot = "512-512-64"
  - arch_mlp_top = "1024-1024-1024-1"
  - mini_batch_size = 4032
  - num_batches = 1
  - num_indicies_per_lookup = 100
- ~514M parameters
- Intra threads set to 4 for each parallel process (24 processes on Genoa, 40 on Siryn)
Stable Diffusion Perf/Rack: The number of frames/s per rack was calculated based on a 42U, 14 kW rack. The load applied is PyTorch running the Stable Diffusion V2 model for each system architecture compared. The AI load applied to each server yields a maximum throughput for the Stable Diffusion V2 model. Total power for each server was measured at the socket level and combined with a typical power draw for system components; the rack power budget was then divided by this per-server power to obtain the number of servers possible within the 14 kW budget, with 10% overhead applied for networking, management, and PDU. For each architecture, the total number of servers is based on single-socket 1U servers. The total performance per rack is a simple sum: performance per server * total servers/rack. The raw data is shown in the table below:
| Architecture | Cores/Server | Performance/Server | System Power/Server | Servers/Rack | Performance/Rack |
|---|---|---|---|---|---|
| AmpereOne | 160 | 0.036 frames/s | 534 W | 23 | 0.828 frames/s |
| AMD Genoa 9654 | 96 | 0.014 frames/s | 624 W | 26 | 0.364 frames/s |
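The same aggregation applies here (illustrative, using the table values directly):

```python
# Same rack aggregation applied to the Stable Diffusion results above.
rows = {"AmpereOne": (0.036, 23), "AMD Genoa 9654": (0.014, 26)}
for name, (fps, servers) in rows.items():
    print(f"{name}: {fps * servers:.3f} frames/s per rack")
# AmpereOne: 0.828 frames/s per rack
# AMD Genoa 9654: 0.364 frames/s per rack
```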
AMD 9654 (Genoa) configuration:
- HW: AMD 9654 (96c, 192t, 1P, 256GB mem)
- OS: Ubuntu 22.04
- Linux kernel: 5.18.11-200.fc36.x86_64
- AI SW: AMD ZenDNN PyTorch 1.12.1 - release 4.0.0 - python 3.10/pytorch:1.5.2 release docker image
- Data Format: FP32

AmpereOne configuration:
- HW: 160c, 1P, 512GB mem
- OS: Ubuntu 20.04
- Linux kernel: 6.1.10-amp01.4k (400W system)/5.18.19-200.fc36.aarch64
- AI SW: Ampere Computing AI
- Data Format: FP16

Stable Diffusion model details:
- Graph torch JIT scripted
- V2.1 base variant used - 512 ema pruned weights
- fp32 precision, ~1.3 billion parameters
- txt2img mode
- 50 sampling steps, batch size of 3, generated image resolution 512x512
- num_batches = 1
- Intra threads set to 16 for each parallel process (6 processes on Genoa, 10 on AmpereOne)