Hadoop on Google Cloud Workload Brief
Jan 2023, Big Data Solution Brief
Google Cloud's T2A virtual machines, powered by Ampere® Altra® processors, provide outstanding single-threaded performance at an affordable cost. These VMs come in various predefined sizes, with up to 48 vCPUs per VM and 4 GB of memory per vCPU. T2A VMs are compatible with a wide range of Linux operating systems, including RHEL, Ubuntu, SUSE, and CentOS. Most importantly, they offer several key benefits, such as deterministic performance, linear scalability, and the best price-performance in the market. They are engineered to run scale-out and cloud-native workloads efficiently.
The Apache Hadoop software framework is designed for distributed processing of large data sets and scales out from a single server to thousands of machines, each offering local computation, storage, or both. To optimize cluster deployments, the software has built-in resiliency to handle failures of individual servers or components (PCIe cards, SSDs, etc.). It consists of four main modules: HDFS, YARN, MapReduce, and Hadoop Common. Applications collect data in various formats and load it into the cluster, where HDFS splits it into chunks (blocks) that are distributed across the data nodes. The name node holds the metadata for all these chunks of data. A MapReduce job runs against this data in HDFS across the data nodes.
All of the above tasks are computationally intensive:
1. The data must be pulled from HDFS, which demands high-performance storage.
2. It must be coordinated across different computers, which demands a high-speed network.
3. It must be processed quickly by thousands of tasks.
4. It must be aggregated by reducers to organize the final output.
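The four steps above can be pictured with a small in-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample records are hypothetical, and a Python list of lists stands in for HDFS blocks. It only illustrates how mapped records are routed to reducers by key and then aggregated.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record, 1

def shuffle_phase(pairs, num_reducers):
    """Shuffle: route every pair to a reducer chosen by hashing its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partitions):
    """Reduce: aggregate the values collected for each key."""
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = sum(values)
    return result

def run_job(blocks, num_reducers=2):
    """Run map over every block, then shuffle and reduce the combined output."""
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle_phase(pairs, num_reducers))

# Two hypothetical "blocks" of log-level records stand in for HDFS blocks.
blocks = [["error", "info", "error"], ["info", "warn", "error"]]
print(sorted(run_job(blocks).items()))
# [('error', 3), ('info', 2), ('warn', 1)]
```

In a real cluster each phase runs on many machines at once, which is why storage, network, and per-task compute all affect the end-to-end result.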
Ampere Altra-powered instances in Google Cloud provide the optimal platform for tackling the ever-growing big data challenges found in modern enterprise environments.
Cloud Native: Designed from the ground up for ‘born in the cloud’ workloads, Ampere Altra can deliver much higher price-performance than its x86 peers.
Consistency and Predictability: Ampere Altra processors, designed for cloud-native use, provide consistent and predictable performance for Hadoop solutions, particularly for bursty workloads.
Scalable: With an innovative scale-out architecture, Ampere Altra processors combine a high core count with compelling single-threaded performance and consistent frequency across all cores, which helps big data workloads scale up and scale out efficiently.
Power Efficient: Industry-leading energy efficiency allows Ampere Altra processors to hit competitive levels of raw performance while consuming much less power than the competition.
What it Enables
Ampere Altra-powered T2A instances are generally available in several Google Cloud regions across the US, Europe, and Southeast Asia. T2A virtual machines offer high networking performance, with bandwidth of up to 32 Gbps. Additionally, storage options such as Zonal, Regional, and SSD disks are available for use with these virtual machines.
Ampere Arm technology uses a high number of cores per socket, maximizing core count per rack. This power-efficient design results in lower power consumption and consistent performance for big data applications. T2A VMs based on Ampere Arm processors provide better value for big data applications when compared to x86 processors. They are ideal for Hadoop applications due to their predictable and scalable architecture.
In this Solution Brief, we compare three Google Cloud VMs featuring comparable CPUs from Intel, AMD, and Ampere, respectively.
We used the Intel HiBench benchmarking tool to run the Hadoop TeraSort benchmark on the following three Google Cloud VMs:
1. N2 (Intel Ice Lake Xeon Platinum 8373C)
2. N2D (AMD Milan EPYC 7B13)
3. T2A (Ampere Altra Q64)
TeraGen was used to generate a 250 GB dataset, which was then sorted using TeraSort while capturing throughput in MB/s.
All the virtual machines had identical vCPU, memory, and storage configurations.
The storage size was chosen to limit the bandwidth to 1000 MB/s across all the VMs.
Transparent huge pages were disabled on the guest operating system.
A few configuration parameters in Hadoop were tuned to maximize the utilization of CPU, memory, and storage.
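As a rough illustration of the sizing and the throughput metric described above, the sketch below converts the dataset size into a TeraGen row count and derives MB/s from an elapsed time. TeraGen writes fixed-size 100-byte rows; everything else is an assumption made for illustration: decimal units (1 GB = 10^9 bytes), the helper names, and the elapsed time, which is a placeholder and not a measured result.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows

def teragen_rows(dataset_gb):
    """Rows to request from TeraGen for a dataset of dataset_gb (decimal GB)."""
    return dataset_gb * 10**9 // TERAGEN_ROW_BYTES

def throughput_mb_s(dataset_gb, elapsed_seconds):
    """Throughput as dataset size (decimal MB) divided by elapsed sort time."""
    return dataset_gb * 1000 / elapsed_seconds

print(teragen_rows(250))  # 2500000000

# Placeholder elapsed time, chosen only to show the arithmetic.
hypothetical_elapsed_s = 500.0
print(throughput_mb_s(250, hypothetical_elapsed_s))  # 500.0
```

A benchmark harness may compute its throughput figure slightly differently (for example, from the exact byte count it wrote), so the formula here should be read as an approximation.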
Configuration | N2 | N2D | T2A |
---|---|---|---|
vCPU | 16 | 16 | 16 |
Cores | 8 | 8 | 16 |
Memory | 64 GB | 64 GB | 64 GB |
Arch | x86_64 | x86_64 | aarch64 |
OS | Ubuntu 22.04 | Ubuntu 22.04 | Ubuntu 22.04 |
Storage | 2 x 1024 GB, totaling 1000 MB/s throughput | 2 x 1024 GB, totaling 1000 MB/s throughput | 2 x 1024 GB, totaling 1000 MB/s throughput |
JDK | Oracle JDK 8u345 | Oracle JDK 8u345 | Oracle JDK 8u345 |
YARN Configuration
Parameter | Value |
---|---|
dfs.block.size | 250M |
yarn.scheduler.minimum-allocation-mb | 1024 |
yarn.scheduler.maximum-allocation-mb | 59392 |
yarn.scheduler.minimum-allocation-vcores | 1 |
yarn.scheduler.maximum-allocation-vcores | 15 |
yarn.nodemanager.resource.cpu-vcores | 16 |
yarn.nodemanager.resource.memory-mb | 63488 |
mapreduce.map.memory.mb | 1024 |
mapreduce.reduce.memory.mb | 3072 |
mapred.reduce.parallel.copies | 16 |
mapreduce.reduce.shuffle.parallelcopies | 14 |
mapreduce.map.java.opts | 2048M |
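To see how these settings bound task parallelism on one node, the sketch below estimates how many map or reduce containers fit at once. It rests on several assumptions not stated in the brief: each task requests one vcore, the scheduler accounts for both memory and vcores, and the container used by the ApplicationMaster is ignored. The helper name `containers_per_node` is hypothetical.

```python
def containers_per_node(node_memory_mb, node_vcores, container_mb, container_vcores=1):
    """Containers that fit on one node, limited by whichever resource runs out first."""
    by_memory = node_memory_mb // container_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)

# Values taken from the configuration table above.
NODE_MEMORY_MB = 63488  # yarn.nodemanager.resource.memory-mb
NODE_VCORES = 16        # yarn.nodemanager.resource.cpu-vcores
MAP_MB = 1024           # mapreduce.map.memory.mb
REDUCE_MB = 3072        # mapreduce.reduce.memory.mb

print(NODE_MEMORY_MB // MAP_MB)     # 62 map containers by memory alone
print(NODE_MEMORY_MB // REDUCE_MB)  # 20 reduce containers by memory alone
print(containers_per_node(NODE_MEMORY_MB, NODE_VCORES, MAP_MB))     # 16
print(containers_per_node(NODE_MEMORY_MB, NODE_VCORES, REDUCE_MB))  # 16
```

Under these assumptions the vcore count, not memory, is the binding limit for both task types, which is consistent with a configuration intended to keep all 16 vCPUs busy.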
The relative performance data captured on Google Cloud with Hadoop on YARN is shown below.
1. Ampere VMs performed well compared to Intel and AMD VMs.
2. Ampere VMs deliver 14% better price-performance than Intel VMs and 6% better than AMD VMs.
(VM pricing calculated with Google Cloud’s public pricing calculator)
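Price-performance comparisons of this kind are typically computed as throughput per unit of cost. The sketch below shows one plausible way to derive a relative figure, assuming price-performance is defined as throughput divided by the hourly VM price. The function names and the input values are placeholders chosen to make the arithmetic easy to follow; they are not the measurements or prices behind the percentages above.

```python
def price_performance(throughput_mb_s, price_per_hour):
    """Throughput delivered per unit of hourly cost."""
    return throughput_mb_s / price_per_hour

def relative_gain_percent(candidate, baseline):
    """How much better the candidate is than the baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0

# Placeholder inputs (not measured data): equal throughput, different prices.
candidate = price_performance(throughput_mb_s=100.0, price_per_hour=1.00)
baseline = price_performance(throughput_mb_s=100.0, price_per_hour=1.25)

print(relative_gain_percent(candidate, baseline))  # 25.0
```

With this definition, a VM can lead on price-performance through higher throughput, a lower price, or both.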
Google Cloud VMs equipped with Ampere Altra processors offer exceptional performance for big data solutions such as Hadoop. The processors' ability to scale linearly with workloads complements the linear scale-out architecture of the Hadoop and MapReduce frameworks. The combination of performance and cost benefits makes Google T2A instances an excellent choice for running Hadoop workloads.
We look forward to discussing our customers' unique needs with them.
For more information, please visit:
All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.
©2023 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com