Big Data Solutions
Batch processing and analytics work best on Ampere processors
Big Data solutions require massive computational power along with persistent, high-performance storage and network resources.
Ampere® Altra® processors, designed for cloud native use, provide consistent and predictable performance for big data solutions. Their high core count, compelling single-threaded performance, and consistent frequency let big data workloads scale out very efficiently. Ampere processors consume much less power than legacy x86 processors. Low power and high core density per rack translate directly into both Capex and Opex savings.
Scale out with confidence! In our Hadoop TPCx-HS testing (Graph 1), we observed near-linear scaling through a total of nine nodes. This data was captured on the Ampere Altra based Hammerhead Bare Metal Cluster. If you have a scale-out workload you would like to run on a high-performance cluster, you can request access to our Hammerhead Cluster.
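To make "near-linear scaling" concrete, the following is a minimal sketch of how scaling efficiency could be computed from a throughput-style benchmark score measured at several cluster sizes. The function name and the sample numbers are hypothetical placeholders, not Ampere's measured results; an efficiency of 1.0 would mean perfectly linear scaling.

```python
def scaling_efficiency(throughput_by_nodes):
    """Return {node_count: efficiency}, where 1.0 means perfectly linear scaling.

    Efficiency is measured throughput divided by the throughput that the
    smallest measured cluster would deliver if it scaled linearly.
    """
    base_nodes = min(throughput_by_nodes)
    base_per_node = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: throughput / (nodes * base_per_node)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Placeholder scores for illustration only; these are NOT benchmark results.
example_scores = {1: 10.0, 3: 29.4, 9: 86.4}

for nodes, efficiency in scaling_efficiency(example_scores).items():
    print(f"{nodes} node(s): {efficiency:.0%} of linear")
```

Replacing the placeholder dictionary with your own per-cluster-size scores gives a quick check of how far a workload falls from the linear ideal as nodes are added.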
Cloud Native Performance
Ampere Altra processors are a complete system-on-chip (SoC) solution built for Cloud Native applications. Graphs 2 to 4 depict the Spark and Hadoop TPC-DS and Terasort workloads running on VMs. Ampere VMs performed well above their peers. Spark Terasort performed 73% better than Intel Skylake and 13% better than AMD Milan (Graph 4).
Consistency and Predictability
Ampere processors provide consistent and predictable performance for Big Data solutions and for bursting workloads.
Ampere processors have industry-leading energy efficiency and consume much less power than the competition.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere. The system configurations and components used in our testing are detailed here
Big data architecture is designed to handle the ingestion, processing, and analysis of large and complex data. Big Data workloads manage large amounts of data, analyze it for business purposes, steer data analytics operations for business intelligence, and orchestrate the big data analytics tools to effectively extract vital business information from extremely large data pools.
Big data solutions include the following types of workloads:
Data Sources include:
Distributed data stores are essential components of the solutions. Data stores range in size from gigabytes to petabytes of data in many different formats. Big data applications process these files using long running batch jobs to filter, aggregate and format the data for later consumption by data analytics.
Hadoop Distributed File System (HDFS) is a component of the big data storage layer. Files in HDFS are broken into fixed-size chunks called data blocks, which are replicated across the cluster for storage resiliency.
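The block-and-replica idea can be sketched in a few lines. This is a toy model, not HDFS code: it assumes the common HDFS defaults of a 128 MiB block size and a replication factor of 3, and it places replicas round-robin, which is a deliberate simplification of the rack-aware placement a real cluster uses. The function and node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default HDFS block size (128 MiB)
REPLICATION = 3                 # assumed default HDFS replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block; only the last block may be short."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical 300 MiB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
for block, holders in placement.items():
    print(f"block {block} ({blocks[block]} bytes) -> {holders}")
```

Because every block lives on several distinct nodes, losing any single node in this model still leaves at least two copies of each block available.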
YARN manages the resources for the applications. YARN decouples resource management and scheduling from the MapReduce data processing component. YARN can keep standby nodes ready to resume execution if the active node fails.
MapReduce distributes a job and runs it across the cluster. A single job is divided into multiple tasks that run in parallel on different machines.
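The MapReduce pattern can be illustrated with a single-process word count. This sketch only mimics the three phases (map, shuffle, reduce) in plain Python; it does not use the Hadoop API, and on a real cluster each input chunk and each key group would be handled by tasks on different machines. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, shuffle the pairs, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Each string stands in for a data block processed by a separate map task.
print(word_count(["big data needs big clusters", "data blocks are replicated"]))
```

The map and reduce functions never share state, which is what allows a framework to run them independently on many machines.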
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
Apache Spark is used for executing data engineering, data science, and machine learning on single-node machines or clusters. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides APIs in Java, Scala, and Python, and supports multiple workloads including real-time analytics, batch processing, interactive queries, and machine learning. Spark addresses the limitations of MapReduce by processing data in memory and reusing it across multiple parallel operations. Spark relies on other storage systems such as HDFS, Couchbase, Cassandra, and others.
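The benefit of keeping data in memory and reusing it across operations can be shown with a toy class. This is not the Spark API; `CachedDataset` and its methods are hypothetical names used only to illustrate the idea that an expensive load happens once and every later operation reads the in-memory copy.

```python
class CachedDataset:
    """Toy stand-in for a cached dataset: load once, reuse for every operation."""

    def __init__(self, loader):
        self._loader = loader  # callable that produces the records (the slow step)
        self._data = None      # in-memory copy, filled on first use
        self.loads = 0         # how many times the loader actually ran

    def _materialize(self):
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data

    def map(self, fn):
        return [fn(record) for record in self._materialize()]

    def filter(self, predicate):
        return [record for record in self._materialize() if predicate(record)]

    def total(self):
        return sum(self._materialize())


def slow_read():
    """Pretend to read records from a storage system such as HDFS."""
    return range(1, 6)


dataset = CachedDataset(slow_read)
print(dataset.map(lambda x: x * x))
print(dataset.filter(lambda x: x % 2 == 0))
print(dataset.total())
print("storage reads:", dataset.loads)
```

Three operations run against the dataset, but the loader is invoked only once; a disk-based pipeline without caching would typically re-read the input for each pass.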
Hive is a distributed data warehouse system. Hive is used to process mostly structured data in Hadoop. Hive allows users to read, write, and manage petabytes of data using SQL. Hive can query large datasets leveraging MapReduce.
Pig is used for the analysis of large amounts of data. It is a procedural data flow language that operates on the client side of the cluster. It can handle semi-structured data as well.
HBase is a columnar database that runs on top of HDFS. HBase provides a fault-tolerant way of storing data sets and is well suited to processing large volumes of random reads and writes in real time.
Mahout is a library of machine learning algorithms implemented on top of Apache Hadoop and using MapReduce. Mahout provides the data science tools to automatically find meaningful patterns in big data sets.
HCatalog allows you to access Hive Metastore tables with Pig, Spark, and custom MapReduce applications. It exposes a REST API and a command-line client for creating tables and performing other operations.
Apache Ambari is an open-source platform that simplifies the provisioning, management, monitoring, and security of an Apache Hadoop cluster by providing an easy-to-use web UI and REST API. It provides a step-by-step wizard for installing Hadoop services.
ZooKeeper provides operational services in a Hadoop cluster. Distributed applications use ZooKeeper to store metadata and as a distributed configuration service and naming registry.
Apache Oozie is a workflow tool. You can build a workflow from jobs with dependencies between them; the jobs are bound together and submitted to YARN as one logical entity. Oozie works like cron: it submits the job to YARN, which executes it.
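A workflow with job dependencies is a directed acyclic graph, and the order in which jobs may run follows from a topological sort. The sketch below illustrates that idea with Python's standard `graphlib` module; it is not Oozie code (real Oozie workflows are defined in XML), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}


def execution_order(dependencies):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_batches(dependencies):
    """Group jobs into waves; all jobs in a wave could be submitted together."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_order(workflow))
for wave, jobs in enumerate(ready_batches(workflow), start=1):
    print(f"wave {wave}: {jobs}")
```

Jobs that land in the same wave have no dependency on each other, so a scheduler is free to run them concurrently.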
Big Data Solution Regressions
Big Data infrastructure is used in several analytic applications including Oil and Gas, Healthcare, Retail, Telco and Financial services. The data harnessed is used to improve operational efficiency, demand forecasting, pricing optimization, and other financial and compliance analytics. Regression for Big Data infrastructure components on the latest aarch64 builds is coming soon.
Ampere Altra Systems
Ampere Altra and Ampere Altra Max systems are flexible enough to meet the needs of any cloud deployment and come packed with Ampere's 80-core Altra or 128-core Altra Max processors.
Microsoft offers a comprehensive line of Azure Virtual Machines that can run a diverse and broad set of Linux workloads such as web servers, open-source databases, in-memory applications, big data analytics, gaming, media, and more.
Equinix Metal, an on-demand digital infrastructure platform, has created Gen3 configurations with Ampere Altra for common workloads, available in minutes on bare metal.
Whether your business is early in its journey or well on its way to digital transformation, Google Cloud can help you solve your toughest challenges.
Hewlett Packard Enterprise
The new HPE ProLiant RL300 Gen11 server is the first in a series of HPE ProLiant RL Gen11 servers that deliver next-generation compute performance with higher power efficiency using Ampere® Altra® and Ampere® Altra® Max cloud-native processors.
OCI Ampere A1
Ampere Altra and Oracle Cloud combine predictable performance, near-linear scaling, and secure architecture with the best price-performance in the market in the following shapes: