Announcement

**schmidtbag** · 08 January 2021, 11:34 AM

Seems to scale very nicely.

**DanielG** · 08 January 2021, 12:38 PM

So after comparing some of the results, the main takeaway is that performance per MHz is identical: in all the 64-core test results I checked, the Ampere Altra is about 32% faster at a 32% higher clock rate.
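
A quick way to sanity-check that takeaway is to divide the observed speedup by the clock ratio. This is only a sketch: the 2.5 GHz Graviton2 clock is the figure given further down in the thread, the 3.3 GHz Altra clock is an assumption chosen because it matches the 32% clock difference, and `perf_per_clock_ratio` is a hypothetical helper name.

```python
def perf_per_clock_ratio(speedup: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """Return chip A's per-clock performance relative to chip B.

    speedup is A's throughput divided by B's throughput.
    A result close to 1.0 means both chips do the same work per cycle.
    """
    clock_ratio = clock_a_ghz / clock_b_ghz
    return speedup / clock_ratio


ALTRA_GHZ = 3.3         # assumption, not stated in the post
GRAVITON2_GHZ = 2.5     # figure quoted later in the thread
OBSERVED_SPEEDUP = 1.32  # "about 32% faster"

ratio = perf_per_clock_ratio(OBSERVED_SPEEDUP, ALTRA_GHZ, GRAVITON2_GHZ)
print(f"clock ratio: {ALTRA_GHZ / GRAVITON2_GHZ:.2f}")
print(f"per-clock performance ratio: {ratio:.2f}")
```

A ratio near 1.0 is what "identical performance per MHz" means here; anything noticeably above or below it would point to a per-clock difference between the two chips.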

**tildearrow** · 08 January 2021, 01:29 PM

Oh hey ARM benchmarks! But you forgot about one...

DaCapo?..... I just want to switch to ARM soon (do they even have an 8/16-core version?) but I heard the performance is still bad... (not sure how Apple managed x86 performance levels with the M1)

**luben** · 08 January 2021, 02:50 PM

You should be able to test on C6g instances; they have higher-clocked Graviton2 CPUs.

**onlyLinuxLuvUBack** · 08 January 2021, 03:17 PM

Could you add a test: KVM running an x86 Linux (CentOS/Ubuntu) server that runs a Java JAR command-line app calculating Mandelbrots, with benchmark text output?
Or maybe instead KVM + x86 Linux + a Java JAR running JDBC against an x86 PostgreSQL server (in the same VM), benchmarking the transactions per second?

**qarium** · 08 January 2021, 07:51 PM

Originally posted by tildearrow

(not sure how Apple managed x86 performance levels with the M1)

It is a simple idea that makes the M1 fast...

It is the concept of zero copy, meaning "no buffers" and a fast, direct connection to very fast and parallel RAM. The M1 has 8-channel RAM for its 12 cores: 8 channels of "4266 MT/s LPDDR4X SDRAM" on a 128-bit interface, which means 16 bits per channel. That's a massively parallel design.
What does zero copy mean? A traditional computer copies data from HDD/SSD to RAM, then into L3 cache, then into L2 cache, then into L1 cache, and then copies it to the VRAM of the GPU.
The M1 is different: as soon as the data hits the CPU cache, it is in the same cache that the GPU uses. It is like AMD's HSA APU design, but AMD gave up on that, while Apple has the ecosystem to deliver exactly this.
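
As a rough software analogy for the copy versus zero-copy distinction described above (it says nothing about how the M1 hardware is actually built), Python's `memoryview` shares one buffer between two users, while `bytes(...)` makes a separate copy. The buffer contents and variable names below are made up for illustration.

```python
# One underlying buffer, standing in for memory shared by "CPU" and "GPU".
data = bytearray(b"frame-0000")

copied = bytes(data)        # traditional path: an independent copy of the buffer
shared = memoryview(data)   # zero-copy path: a view onto the same memory

# A producer writes through the shared view; nothing is duplicated.
shared[0:5] = b"FRAME"

print(copied[:5])           # the copy still holds the old bytes
print(bytes(data[:5]))      # the original buffer sees the write immediately
```

The copy has to be refreshed every time the data changes, which is the repeated work the post is complaining about; the view never needs refreshing.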

There is no L3 cache.

The 8 high-performance cores have 12 MB of L2 cache.
The 4 low-performance cores have 4 MB of L2 cache.
The L2 cache is "shared"; in AMD/Intel chip designs only the L3 cache is shared.
Traditional systems waste electrical energy by copying data from one buffer to another, to VRAM, to L3 cache, to L2 cache and so on.
Compared to this, the Apple M1 has only very few buffer copies...
DDR5 RAM has a similar idea: a 128-bit interface has 4 channels with 32 bits per channel. DDR4 was 64 bits per channel.
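
The channel and bandwidth arithmetic behind those memory figures can be sketched as follows. The channel counts and the 4266 MT/s rate are taken from the post as given; the result is a theoretical peak, not a measured number, and `peak_bandwidth_gb_s` is a hypothetical helper name.

```python
def peak_bandwidth_gb_s(transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate and bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_s * bytes_per_transfer / 1e9


# Both layouts described in the post add up to the same 128-bit interface.
lpddr4x_bus_bits = 8 * 16   # 8 channels x 16 bits
ddr5_bus_bits = 4 * 32      # 4 channels x 32 bits

# 4266 MT/s is the LPDDR4X rate quoted in the post.
m1_peak = peak_bandwidth_gb_s(4266e6, lpddr4x_bus_bits)

print(f"LPDDR4X bus: {lpddr4x_bus_bits} bits, DDR5 bus: {ddr5_bus_bits} bits")
print(f"theoretical peak at 4266 MT/s: {m1_peak:.1f} GB/s")  # 68.3 GB/s
```

Splitting the same total width into more, narrower channels does not raise this peak; what it changes is how many independent requests can be in flight at once, which is the parallelism the post is pointing at.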

The idea is that every time you do not copy data from cache to cache you save energy, and the more parallel the RAM gets, the more tasks can run in parallel and the lower the latency for data to reach the CPU.

The big.LITTLE design is less of an improvement than people think. Apple claims there is no need for more than 4 small cores, which means even their M2 design with 16 or 32 fast cores will still have only 4 small cores, and those only matter when the system runs on battery or is idle...

In performance situations the small cores have near-zero impact.

Another reason the Apple M1 is so fast: it is produced on a 5 nm node and can hit 3.2 GHz. For ARM this is very fast; others are at 2.5 GHz.

In the future Apple wants to do even more "zero copy". For example, they want to put RAM on the CPU die too, to even stop the copy between L2 cache and RAM...

Believe it or not, they even plan to put the SSD on the CPU die too, to save the copy between SSD and RAM.

AMD had a similar design in their HSA APUs, but in the end it needs the right software stack, and AMD is forced to support Windows and so on. So AMD had no chance to put HSA everywhere.

Now maybe you still wonder how in hell this can work... 5 nm vs 14 nm (Intel 11900K): this explains it all.
Because of 5 nm, Apple would be faster even with the most boring CPU design and without the zero-copy design...

The problem with this design is not the 12-core version of the M1 but the M2 with 32 cores...
There is a high chance that Apple cannot beat high-end AMD hardware: 5950X + 5900XT... or a 64-core Threadripper.

That's because a zero-copy design works well on small systems, but there is little chance that it works for big systems too... This means even the M2 with 32 cores will not beat a 64-core AMD CPU with a 6900XT...

**_msw_** · 08 January 2021, 08:09 PM

Originally posted by luben

You should be able to test on C6g instances; they have higher-clocked Graviton2 CPUs.

All the currently available AWS Graviton2 based EC2 instances, including C6g, are clocked at 2.5 GHz.

**artivision** · 08 January 2021, 10:06 PM

Originally posted by Qaridarium

It is a simple idea that makes the M1 fast...

It is the concept of zero copy, meaning "no buffers" and a fast, direct connection to very fast and parallel RAM. The M1 has 8-channel RAM for its 12 cores: 8 channels of "4266 MT/s LPDDR4X SDRAM" on a 128-bit interface, which means 16 bits per channel. That's a massively parallel design.
What does zero copy mean? A traditional computer copies data from HDD/SSD to RAM, then into L3 cache, then into L2 cache, then into L1 cache, and then copies it to the VRAM of the GPU.
The M1 is different: as soon as the data hits the CPU cache, it is in the same cache that the GPU uses. It is like AMD's HSA APU design, but AMD gave up on that, while Apple has the ecosystem to deliver exactly this.

There is no L3 cache.

The 8 high-performance cores have 12 MB of L2 cache.
The 4 low-performance cores have 4 MB of L2 cache.
The L2 cache is "shared"; in AMD/Intel chip designs only the L3 cache is shared.
Traditional systems waste electrical energy by copying data from one buffer to another, to VRAM, to L3 cache, to L2 cache and so on.
Compared to this, the Apple M1 has only very few buffer copies...
DDR5 RAM has a similar idea: a 128-bit interface has 4 channels with 32 bits per channel. DDR4 was 64 bits per channel.

The idea is that every time you do not copy data from cache to cache you save energy, and the more parallel the RAM gets, the more tasks can run in parallel and the lower the latency for data to reach the CPU.

The big.LITTLE design is less of an improvement than people think. Apple claims there is no need for more than 4 small cores, which means even their M2 design with 16 or 32 fast cores will still have only 4 small cores, and those only matter when the system runs on battery or is idle...

In performance situations the small cores have near-zero impact.

Another reason the Apple M1 is so fast: it is produced on a 5 nm node and can hit 3.2 GHz. For ARM this is very fast; others are at 2.5 GHz.

In the future Apple wants to do even more "zero copy". For example, they want to put RAM on the CPU die too, to even stop the copy between L2 cache and RAM...

Believe it or not, they even plan to put the SSD on the CPU die too, to save the copy between SSD and RAM.

AMD had a similar design in their HSA APUs, but in the end it needs the right software stack, and AMD is forced to support Windows and so on. So AMD had no chance to put HSA everywhere.

Now maybe you still wonder how in hell this can work... 5 nm vs 14 nm (Intel 11900K): this explains it all.
Because of 5 nm, Apple would be faster even with the most boring CPU design and without the zero-copy design...

The problem with this design is not the 12-core version of the M1 but the M2 with 32 cores...
There is a high chance that Apple cannot beat high-end AMD hardware: 5950X + 5900XT... or a 64-core Threadripper.

That's because a zero-copy design works well on small systems, but there is little chance that it works for big systems too... This means even the M2 with 32 cores will not beat a 64-core AMD CPU with a 6900XT...

Nope. Arc executes the dependencies of many problems first and the main parts afterwards, with many unrestricted units. x86 executes the dependencies of a single problem together with its main part, using advanced maths and fewer complex units, and only goes for the second problem if space remains. x86 = faster per execution unit and low latency; Arc = faster per app for most apps and high throughput. Once cores like Arc and RiskV reach 1024-2048-bit vector registers, GPUs will be in trouble too, if those vectors are loaded with some 3D and color opcodes.

**qarium** · 08 January 2021, 11:10 PM

Originally posted by artivision

Nope. Arc executes the dependencies of many problems first and the main parts afterwards, with many unrestricted units. x86 executes the dependencies of a single problem together with its main part, using advanced maths and fewer complex units, and only goes for the second problem if space remains. x86 = faster per execution unit and low latency; Arc = faster per app for most apps and high throughput. Once cores like Arc and RiskV reach 1024-2048-bit vector registers, GPUs will be in trouble too, if those vectors are loaded with some 3D and color opcodes.

I really wish I could understand what you said... sorry, I do not.

What is "ARC executes"?

According to Google, something like this: "ARC executes U.S. Army Patriot Missile Battery Redeployment to Germany. BREMERHAVEN, GERMANY – Under contract with American Roll-On Roll-Off Carrier (ARC), M/V Merchant discharged a U.S. Army Patriot missile battery in Bremerhaven on January 2 after it had been used in support of ongoing NATO operations in Turkey."

https://www.arcshipping.com/news/arc-executes-u-s-army-patriot-missile-battery-redeployment-to-germany/

Also, Google does not know RiskV... it only knows RISC-V: https://de.wikipedia.org/wiki/RISC-V

Let's assume you mean ARM when you write ARC...

Also, I have to tell you that the RISC-V GPU is already dead: all the RISC-V GPU projects switched to the IBM POWER ISA...

Also, about your 1024-bit and 2048-bit vector units... Intel has 512-bit and it is a great failure, because the core downclocks to 1.7 GHz on a so-called 5.3 GHz CPU.
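
To put rough numbers on that downclocking argument, here is a sketch that multiplies vector width by clock speed. The 1.7 GHz and 5.3 GHz figures are the ones claimed in the post, not measurements; the 256-bit comparison point and the "one vector operation per cycle" model are assumptions added purely for illustration, and `relative_vector_throughput` is a hypothetical helper name.

```python
def relative_vector_throughput(width_bits: int, clock_ghz: float) -> float:
    """Bits processed per nanosecond, assuming one vector op per cycle."""
    return width_bits * clock_ghz


# Figures claimed in the post: 512-bit units running at 1.7 GHz.
wide_downclocked = relative_vector_throughput(512, 1.7)

# Assumed comparison point: 256-bit units at the full 5.3 GHz.
narrow_full_clock = relative_vector_throughput(256, 5.3)

# Clock the 512-bit units would need to hold just to match the 256-bit case.
break_even_ghz = 5.3 * 256 / 512

print(f"512-bit @ 1.7 GHz: {wide_downclocked:.1f} bits/ns")
print(f"256-bit @ 5.3 GHz: {narrow_full_clock:.1f} bits/ns")
print(f"break-even clock for 512-bit: {break_even_ghz:.2f} GHz")
```

Under these assumptions the wider unit only pays off if it can stay above roughly half the narrow unit's clock, so doubling the vector width does not help when the clock drops by more than half.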

Ampere Altra vs. Amazon Graviton2 Linux Performance Benchmarks
