12-Core ARM Cluster Benchmarked Against Intel Atom, Ivy Bridge, AMD Fusion
Written by Michael Larabel in Processors on 14 June 2012.

Last week I shared my plans to build a low-cost, 12-core, 30-watt ARMv7 cluster running Ubuntu Linux. The ARM cluster that is built around the PandaBoard ES development boards is now online and producing results... Quite surprising results actually for a low-power Cortex-A9 compute cluster. Results include performance-per-Watt comparisons to Intel Atom and Ivy Bridge processors along with AMD's Fusion APU.

As talked about in last week's preview, six PandaBoard ES development boards were used to form this cluster. A single PandaBoard ES is already quite decent in terms of ARMv7 performance when running Ubuntu 12.04 thanks to improvements made in supporting the board's ARM SoC, Ubuntu switching to hardfp packages by default, and other Linux optimizations coming out of upstream and the Linaro camp. The PandaBoard ES uses the OMAP4460 SoC (this is an upgrade over the original PandaBoard bearing an OMAP4430 with 1.0GHz Cortex-A9 MPCore) from Texas Instruments that provides a 1.2GHz dual-core ARM Cortex-A9 processor. (The OMAP4460 also has PowerVR graphics, but that is not important for these cluster purposes.) On the PandaBoard ES there is 1GB of system memory, 10/100 Ethernet, two USB 2.0 ports, HDMI output, and an SD/SDHC slot for storage.

This Phoronix twelve ARM core cluster, which is dubbed "Effimaß", is uniquely constructed out of a dish drying rack. As far as the reasoning for this, "One of the unusual things I'm trying for this build is to assemble it all within a wooden dish drying rack. This isn't the first time that ARM development boards have been used in a cluster, with Ubuntu/Linaro and others using PandaBoard clusters for their build farm, etc. The other approaches to efficiently managing all of the boards with minimal space has been stacking them with spacers between the PCBs, etc...The issues I see with that though is it makes the boards not swappable at all without dismantling the entire stack, time consuming to setup, and requires special parts...These racks can be found for a few dollars on the Internet, can be used almost "out of the box", would allow for multiple different development boards / PCB sizes / mounting hole differences, very easy to swap out boards, could be fabricated from scratch quite easily, and allow for fairly high density clusters in compact space. The shape should also allow for managing cables and placing of AC power supplies (underneath) fairly easy. The size of this dish drying rack though for a current six-board cluster is a bit large, but this concept may end up working quite well for others."

Each PandaBoard ES had a 16GB SDHC Class-10 card for storage and the head node was using NFS to share a home directory to the other nodes. MPICH2 was being used for the MPI cluster configuration atop Ubuntu 12.04. Ubuntu 12.10 offers some remarkable ARM performance gains on the OMAP4 hardware due to the newer Linux kernel (version 3.4 at present, compared to Linux 3.2 on Ubuntu 12.04) and the major compiler upgrade (GCC 4.7 vs. GCC 4.6), but due to some early configuration problems with the post-alpha-one snapshot, the installations were reverted to Ubuntu 12.04 LTS. Ubuntu 12.10 will be loaded up on this compute cluster in the coming weeks and should result in double-digit gains.
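For readers wanting to try something similar, here is a minimal sketch of how a host file for MPICH2's Hydra launcher could be generated for six dual-core nodes. The host names (`panda1` through `panda6`), the `hosts.txt` file name, and the `./ep.C.12` binary path are assumptions for illustration; the article does not list the actual configuration used on this cluster.

```python
# Sketch: build a Hydra-style host file ("host:process_count" per line) for
# six dual-core nodes. Host names and paths are hypothetical placeholders.
NODES = 6
CORES_PER_NODE = 2  # OMAP4460: dual-core Cortex-A9


def hostfile_lines(nodes: int = NODES, cores: int = CORES_PER_NODE) -> list:
    """Return one 'hostname:process_count' entry per node."""
    return [f"panda{i}:{cores}" for i in range(1, nodes + 1)]


def launch_command(binary: str, hostfile: str = "hosts.txt") -> list:
    """Assemble an mpiexec invocation that uses every core in the host file."""
    total = NODES * CORES_PER_NODE
    return ["mpiexec", "-f", hostfile, "-n", str(total), binary]


if __name__ == "__main__":
    print("\n".join(hostfile_lines()))
    print(" ".join(launch_command("./ep.C.12")))
```

With the NFS-shared home directory described above, the same binary path would be visible on every node, which is what makes a single launch command like this workable.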

The PandaBoards can be powered off USB, but I ended up using AC adapters for each of the PandaBoards in order to be able to better monitor the overall power draw in different configurations using a WattsUp USB-based AC power meter that then interfaces with the Phoronix Test Suite for automated power monitoring while benchmarking.

The purpose of Effimaß basically just comes down to having a low-power ARM cluster for some interesting benchmark results, using it as a platform for porting new benchmarks to ARM, bringing new MPI cluster capabilities to the Phoronix Test Suite framework, and making other related improvements for ARM and cluster computing within the open-source Phoronix benchmarking stack.

Before getting any further, thanks go out to Texas Instruments and SVTronics for supporting this cluster by providing the PandaBoard ES development boards at a discount to make it even more affordable. The PandaBoard ES currently retails for about $182 USD, so when everything is said and done, the cluster carries a retail price of around $1,200 USD.

In the Phoronix office tour it was shown where Effimaß was going to be set up, but due to short CAT5 cables interfering, for now the cluster is running on a standalone cart and is attached to a 48-port enterprise-grade network switch. The network switch power consumption wasn't monitored as part of the power monitoring since eventually the cluster will move back to its intended location where it will be tapping an already present 24-port enterprise-grade network switch and thus not lead to any net increase in power draw.

As far as the Effimaß name for this 12-core ARM cluster, from the Bavarian who named it, her reasoning came down to: "1. Effizienz for efficiency, in terms of the low-power ARM cores. 2. Maß: basically the measure of all things." So how efficient is the cluster at the moment? While tweaking has only just begun and more tests are still being conducted, the six PandaBoard ES cluster is idling at 15~16 Watts, under load is averaging about 29 Watts, and the peak power consumption I have seen under load is 31 Watts. This is while delivering some rather surprising numbers. I was expecting the 10/100 Ethernet and/or the SDHC-backed storage to be the main bottleneck for the Ubuntu Linux cluster, but that did not actually seem to be the case, at least with the MPI workloads tested thus far.

Aside from the low-power consumption, another benefit is that the cluster itself is completely silent -- no fans at all or rotating media. However, if you are doing a large cluster backed by an enterprise-grade switch, you may have a fan there, and that can be a bit noisy.

First up are results looking at how well the MPICH2 cluster connected via 10/100 Ethernet is scaling up to 12 ARMv7 cores via the six PandaBoard ES units. Each PandaBoard ES had a 16GB SDHC card with a stock install of the OMAP4 version of Ubuntu 12.04 LTS with the stock packages: Linux 3.2, GCC 4.6, and the EXT4 file-system as the main components. Again, each PandaBoard ES has an OMAP4460 dual-core Cortex-A9 1.2GHz processor and 1GB of system memory.

The primary benchmark used for this testing is the MPI version of the NAS Parallel Benchmarks (NPB) from NASA. This test profile was used since NPB is quite popular for parallel computing, the tests are very reliable, plenty of others use the NAS Parallel Benchmarks, and all around they are just really good Fortran-based benchmarks for testing multiple computing cores.

To look at how well the mini ARM cluster is scaling, there are benchmarks for the main NAS Parallel Benchmarks when using 1, 2, 4, 6, 8, 10, and 12 cores.

First up is the EP.B test, which is NASA's "Embarrassingly Parallel" benchmark with the class B problem size. On one core of the PandaBoard ES the EP.B test runs at 5.27 Mop/s, both cores on a single PandaBoard ES reach 10.3 Mop/s (+95%), two PandaBoard ES nodes reach 19.26 Mop/s (+86%), etc. When all six PandaBoard ES nodes are utilized, EP.B runs at 55.2 Mop/s, which is 10.47x the speed of a single Cortex-A9 1.2GHz core on the OMAP4460 SoC. The scaling is actually better than was originally anticipated for using 10/100 Ethernet and a shared NFS mount from an SDHC card.
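The scaling figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the EP.B numbers quoted in this article; the function names are my own and not part of the Phoronix Test Suite.

```python
# EP.B throughput in Mop/s, keyed by core count (figures quoted above).
EP_B = {1: 5.27, 2: 10.3, 4: 19.26, 12: 55.2}


def speedup(rate: float, baseline: float) -> float:
    """How many times faster a run is than the single-core baseline."""
    return rate / baseline


def parallel_efficiency(rate: float, baseline: float, cores: int) -> float:
    """Speed-up divided by core count; 1.0 would be perfect linear scaling."""
    return speedup(rate, baseline) / cores


if __name__ == "__main__":
    for cores, rate in EP_B.items():
        s = speedup(rate, EP_B[1])
        e = parallel_efficiency(rate, EP_B[1], cores)
        print(f"{cores:>2} cores: {s:5.2f}x speed-up, {e:.0%} efficiency")
```

At twelve cores this works out to roughly 87% parallel efficiency, which is the number to keep in mind when comparing against the less friendly workloads that follow.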

The EP.C test is still Embarrassingly Parallel but with the C problem size, which is about four times larger than EP.B. Going from one to twelve cores, there was a 10.07x speed-up.

For the NPB FT test, which is a discrete 3D fast Fourier Transform with all-to-all communication, there is a problem. When involving MPI across multiple PandaBoards, the performance plummets compared to when utilizing just a single board.

The last NPB test for looking at the scaling is LU.A. The LU pseudo-application is a Lower-Upper Gauss-Seidel solver. This workload did not scale as well across the cluster: going from one to twelve cores resulted in just a 4.8x performance improvement. However, this was not a failure of MPI or the PandaBoards, since even going from one to two cores on a single PandaBoard ES yielded just a 29% improvement.
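One way to read those two LU.A figures is the Karp-Flatt metric, which estimates the serial (non-scaling) fraction of a workload from a measured speed-up. This is not something computed for the article; it is an illustrative sketch that uses only the 29% and 4.8x figures above.

```python
def karp_flatt(speedup: float, processors: int) -> float:
    """Experimentally determined serial fraction: e = (1/S - 1/p) / (1 - 1/p)."""
    if processors < 2:
        raise ValueError("the metric needs at least two processors")
    return (1 / speedup - 1 / processors) / (1 - 1 / processors)


if __name__ == "__main__":
    # +29% on two cores of one board is a 1.29x speed-up.
    print(f"2 cores, one board  : {karp_flatt(1.29, 2):.2f}")
    # 4.8x on twelve cores across six boards.
    print(f"12 cores, six boards: {karp_flatt(4.8, 12):.2f}")
```

The estimate is lower at twelve cores than at two, which is what one would expect if the two-core result is limited by something local to the board rather than by a fixed serial portion of the solver. That reading is an inference from the arithmetic, not a measured result.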

Now to look at the power efficiency, which will then be compared to a few other systems. First is looking at the PandaBoard ES power efficiency when using a single OMAP4460 board followed by the results for maxing out all six PandaBoard ES boards in the cluster.

While idling, the single PandaBoard ES was consuming around 3.9 Watts as measured by the WattsUp AC power adapter.

With the load generated by the EP.C workload from the NAS Parallel Benchmarks and both ARMv7 Cortex-A9 cores fully utilized, the average power consumption was 6.4 Watts with a peak of 6.6 Watts.

For the EP.C workload, the Phoronix Test Suite calculated that the single PandaBoard ES was achieving 1.60 Mop/s per Watt.

For the Lower-Upper Gauss-Seidel solver, the power consumption was similar while achieving 38.27 Mop/s per Watt.

An overview of the power consumption of the single PandaBoard ES for the duration of the testing.

Now it is time to look at the power efficiency results when all six PandaBoard ES boards making up this cluster were fully utilized using MPICH2.

When all six boards were idling, the average power consumption was 16.8 Watts.

With the EP.C workload on all twelve ARM cores, the average power consumption was 30.4 Watts for all six PandaBoards, which is in line with each PandaBoard burning through 5~6 Watts under load. When it comes to the performance-per-Watt, the EP.C test was yielding an average of 1.78 Mop/s per Watt, which was an increase over the single PandaBoard ES at 1.60 Mop/s per Watt.
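The per-board figures can be checked directly from the idle and load measurements. A minimal sketch, using only the wattages quoted in this article; the helper names are my own.

```python
BOARDS = 6

# Average AC power draw in Watts, as quoted in this article.
CLUSTER_IDLE_W = 16.8  # all six boards idling
CLUSTER_LOAD_W = 30.4  # EP.C on all twelve cores
SINGLE_IDLE_W = 3.9    # one board idling
SINGLE_LOAD_W = 6.4    # EP.C on both cores of one board


def per_board(total_watts: float, boards: int = BOARDS) -> float:
    """Average draw per board when the whole cluster is measured together."""
    return total_watts / boards


def load_delta(idle_watts: float, load_watts: float) -> float:
    """Extra power drawn when going from idle to fully loaded."""
    return load_watts - idle_watts


if __name__ == "__main__":
    print(f"cluster, per board idle: {per_board(CLUSTER_IDLE_W):.2f} W")
    print(f"cluster, per board load: {per_board(CLUSTER_LOAD_W):.2f} W")
    print(f"cluster idle-to-load   : {load_delta(CLUSTER_IDLE_W, CLUSTER_LOAD_W):.1f} W")
    print(f"single idle-to-load    : {load_delta(SINGLE_IDLE_W, SINGLE_LOAD_W):.1f} W")
```

The cluster-wide load figure divides out to just over 5 Watts per board, somewhat below the 6.4 Watts measured for a board tested on its own. The measurements here do not explain that gap.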

The first system we have for comparison against the PandaBoard ES hardware is an Intel Atom 330 NetTop from MSI. This system has an Intel Atom 330 that is a dual-core part operating at 1.60GHz plus Hyper Threading to provide four logical cores. The system also has 1GB of RAM, ATI Mobility Radeon HD 4300 graphics, and a 250GB Samsung HDD. This is not the perfect comparison due to the system using an HDD rather than a Secure Digital card, etc, but the results are quite apparent anyhow. The comparison systems were all running stock, clean installations of Ubuntu 12.04 LTS for the respective architecture. All of the systems were also tested without an X.Org Server running.

When the Intel Atom 330 NetTop was idling, 29 Watts was being drawn, which is just below what the entire 12-core ARM cluster draws under load.

With the Embarrassingly Parallel workload, the Atom 330 system was burning through 33.6 Watts for the system with the dual-core Atom that also boasts Hyper Threading -- again, not too far off from the power consumption of all six PandaBoard ES units.

The Atom 330 had an EP.C result of 19.7 Mop/s. For comparison, a single PandaBoard ES with both cores used for the NPB test averaged 10.3 Mop/s, and four cores across two boards reached 18.4 Mop/s. In other words, a single 1.2GHz Cortex-A9 core is close to a single core of an Atom 330 x86 processor. Twelve 1.2GHz ARM cores produced 53.2 Mop/s.

When looking at the performance-per-Watt of the Atom 330 against the PandaBoard ES hardware, ARM wins by an astounding margin. EP.C on the Atom 330 averaged 0.59 Mop/s per Watt, whereas the single PandaBoard ES was nearly three times as efficient with its 1.6 Mop/s per Watt average and the entire 12-core cluster came in at 1.78 Mop/s per Watt.

The next Atom 330 test was NPB LU.A.

The Atom 330 system had an average of 429 Mop/s with LU.A compared to 190 Mop/s on a single PandaBoard ES, 440 Mop/s on two PandaBoard ES, 579 Mop/s on three PandaBoard ES, and 915 Mop/s on all six PandaBoard ES units benchmarked.

The ARM cluster slaughtered the Atom 330 again in power efficiency: 11.98 Mop/s per Watt for the Intel x86_64 CPU while a single PandaBoard ES was at 38 Mop/s per Watt and 30 Mop/s per Watt for the entire cluster.

For a more powerful system, an Intel Core i7 3770K "Ivy Bridge" setup was also benchmarked. The i7-3770K has a base frequency of 3.50GHz with four physical cores plus Hyper Threading. An SSD was in this system with its stock Ubuntu 12.04 x86_64 installation.

While idling on its Ubuntu 12.04 installation, the system averaged 41 Watts of AC power draw, with the HD 4000 integrated graphics and SSD included.

The Core i7 3770K "Ivy Bridge" system produced 277 Mop/s for EP.C, which was more than five times faster than the 12-core ARM cluster.

The average power consumption for the Ivy Bridge system was 107 Watts under this load, which worked out to 2.58 Mop/s per Watt. For this workload the Ivy Bridge system was even more efficient than the PandaBoard ES cluster at 1.78 Mop/s per Watt, and that is with an SSD and other attached components drawing additional power that the ARM setup does not have.
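To put the Ivy Bridge result in perspective, here is a rough back-of-the-envelope projection of how many PandaBoard ES units it would take to match the i7-3770K on EP.C and what they would draw. It assumes perfectly linear scaling beyond six boards, which is optimistic and was not tested; the function names are my own.

```python
import math

# Figures quoted in this article for EP.C.
CLUSTER_BOARDS = 6
CLUSTER_EP_C = 53.2        # Mop/s on all twelve cores
CLUSTER_LOAD_W = 30.4      # Watts under that load
IVY_BRIDGE_EP_C = 277.0    # Mop/s
IVY_BRIDGE_LOAD_W = 107.0  # Watts


def boards_to_match(target_rate: float) -> int:
    """Boards needed to reach target_rate, assuming linear scaling (untested)."""
    per_board_rate = CLUSTER_EP_C / CLUSTER_BOARDS
    return math.ceil(target_rate / per_board_rate)


def projected_watts(boards: int) -> float:
    """Projected draw for that many boards at the measured per-board load."""
    return boards * CLUSTER_LOAD_W / CLUSTER_BOARDS


if __name__ == "__main__":
    n = boards_to_match(IVY_BRIDGE_EP_C)
    print(f"boards needed : {n}")
    print(f"projected draw: {projected_watts(n):.1f} W vs {IVY_BRIDGE_LOAD_W:.0f} W")
```

Even under that generous assumption, the projected ARM setup would need a few dozen boards and draw noticeably more than the single Intel system, which is consistent with the per-Watt figures above.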

LU.A on the i7-3770K came in at 9514 Mop/s, which was more than ten times faster than the six-board PandaBoard ES cluster.

The average power consumption of the Intel system was 111 Watts.

The efficiency was at 85 Mop/s per Watt compared to the Effimaß cluster at 30.79 Mop/s per Watt.

Next up is an Intel Atom Z530 "Poulsbo" system in the form of a CompuLab Fit-PC2 NetTop. The Atom Z530 was paired with 1GB of RAM, Poulsbo graphics, and a 160GB Hitachi HDD.

The Z530 Poulsbo system idled at 8.5 Watts.

Under load on Ubuntu 12.04 with the NPB EP.C workload, the power draw was just 10.7 Watts for the single-core, Hyper-Threaded part.

EP.C ran at just 8.74 Mop/s on the Atom Z530.

The EP.C result translated to 0.82 Mop/s per Watt. This Atom SoC was more efficient than the Atom 330, but still far behind the PandaBoard ES setups at 1.6~1.78 Mop/s per Watt.

The last results for this article are from an AMD Fusion E-350 APU system with integrated Radeon HD graphics. The E-350 "Zacate" is a dual-core part clocked at 1.60GHz.

The Fusion E-350 system idled at about 38 Watts.

Under the NPB EP.C workload, the average power consumption for the low-end Fusion E-Series system jumped to 45 Watts.

While burning through 45 Watts, the dual-core E-350 averaged 23.34 Mop/s. This result comes ahead of the 4-core PandaBoard ES cluster configuration (18.41 Mop/s) but behind the 6-core result (26.97 Mop/s).

While on a per-core basis the Fusion E-350 did much better, the power efficiency put the Effimaß cluster ahead. The E-350 averaged 0.52 Mop/s per Watt, whereas the cluster and single PandaBoard ES configurations were three times more efficient per Watt.
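With all of the EP.C numbers now on the table, the efficiency ranking can be recomputed from the quoted throughput and average power draw. This is a simple sketch dividing one by the other. The results land close to, but not exactly on, the figures reported by the Phoronix Test Suite (for example about 1.75 rather than 1.78 for the cluster), presumably because the test suite works from its own power samples rather than the rounded averages quoted here.

```python
# EP.C results quoted in this article: (Mop/s, average Watts under load).
EP_C = {
    "PandaBoard ES (2 cores)": (10.3, 6.4),
    "Effimaß cluster (12 cores)": (53.2, 30.4),
    "Atom 330": (19.7, 33.6),
    "Core i7 3770K": (277.0, 107.0),
    "Atom Z530": (8.74, 10.7),
    "Fusion E-350": (23.34, 45.0),
}


def mops_per_watt(rate: float, watts: float) -> float:
    """Naive performance-per-Watt: throughput over average power draw."""
    return rate / watts


def ranked(results: dict) -> list:
    """System names sorted from most to least efficient."""
    return sorted(results, key=lambda name: mops_per_watt(*results[name]), reverse=True)


if __name__ == "__main__":
    for name in ranked(EP_C):
        rate, watts = EP_C[name]
        print(f"{name:<28} {mops_per_watt(rate, watts):.2f} Mop/s per Watt")
```

The ordering matches what is described in the text: Ivy Bridge first, the two PandaBoard ES configurations next, then the Atom Z530, Atom 330, and Fusion E-350.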

The LU.A result from the AMD Fusion E-350 was superior to that of the 12-core ARM cluster. However, the ARM cluster still won when it came to energy efficiency for this solver: 30.79 vs. 19.15 Mop/s per Watt.

The PandaBoard ES did better than the Intel Atom hardware that was tested in terms of performance and energy efficiency. Against the Intel Ivy Bridge hardware, however, the ARMv7 Cortex-A9 processors lost out by a wide margin in raw performance (obviously), and the power efficiency was also better for this latest-generation Intel architecture. Besides winning on performance and efficiency, the Core i7 3770K system would cost less than a six PandaBoard ES cluster setup.

Comparing the Effimaß cluster to the AMD Fusion E-350, the Zacate APU had better raw performance but the ARM cluster was the performance-per-Watt leader.

While this do-it-yourself ARM cluster configuration is not the most effective setup right now, it will be interesting to see how the cluster performance works out for the next-generation ARMv8 hardware as well as the many ARM core servers coming out, such as the upcoming products from Calxeda.

Aside from wanting to upgrade the cluster to Ubuntu 12.10 for GCC 4.7 and the other newer packages that boost the ARM Linux performance, other planned optimizations include: investigating performance differences if using a high-speed NAS with NFS mount for the cluster rather than SDHC cards (e.g. using something like the Excito B3) or a USB-based SSD, kernel tweaks, other ARMv7 compiler tuning, and some other modifications to see how far the PandaBoard ES hardware can be pushed while keeping to minimal power use.

More ARM Linux benchmarks are forthcoming. There will also be information soon about a 48 PandaBoard cluster.

About The Author

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 10,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter or contacted via MichaelLarabel.com.
