How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs


  • #91
    Originally posted by atomsymbol

    Just some notes:

    The Skylake/Zen µOP cache isn't a trace cache, in the sense of Pentium 4 trace cache. A disadvantage of a trace cache is that there have to be strict limits put on the tracing, otherwise it can end up consuming exponential amounts of bits.

    There is no information about the instruction format used by Skylake/Zen in their µOP caches, so I can only speculate. However, it is probable that the µOP instruction encoding will be slowly diverging from the programmer-visible x86 instruction encoding as time goes on - why wouldn't it?

    Maybe the downloaded binary blobs can be verified against the operational semantics of the original code. The download is a lesser security issue than the upload, in my opinion. Even if the machine uploads the code to trusted providers of the optimization service, the trust is limited. For common Linux apps it wouldn't be an issue because the source code is already open.

    Macro-op fusion (CMP + Jcc) is a minuscule JIT optimization.
    As I said earlier, I'm not entirely sure what we're arguing about, as it seems that for the most part we're agreeing.

    That divergence has been happening for a while now (e.g., to the best of my knowledge, there are internal registers that a micro-op can use, but you don't see them at instruction levels; IOW they are micro-architectural, not architectural).

    As far as I'm concerned, a proper JIT would do some constant folding (e.g., chained adds get collapsed) and maybe some register reallocation as well, utilizing micro-architectural resources of course.
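
    To make the constant-folding point concrete, here is a minimal C sketch of my own (purely illustrative; nothing here is documented behaviour of any shipping front end): a chain of adds with constant operands collapses into a single add, which is the kind of rewrite a µop-level JIT could in principle apply to the decoded stream.

        /* Illustrative only: chained adds with constant operands. */
        int chained(int x)
        {
            /* As written: three dependent add operations. */
            return ((x + 1) + 2) + 3;
            /* After constant folding this is equivalent to a single add:
             *     return x + 6;
             * A compiler does this at build time; the speculation above is
             * that a µop-level optimizer could do the same after decode. */
        }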

    I don't know if it's worth quibbling over whether macro-op fusion, which combines two instructions into a more complex one that the execution engine can handle as such, makes this a JIT or not. If it were one of a dozen optimization techniques, sure.
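
    For reference, the fusion pattern under discussion is the compare-and-branch pair that compilers emit for ordinary loop exits. A small C sketch of my own (the commented assembly is the typical pattern for such a loop, not output generated for this thread):

        /* Loop whose exit test typically compiles to an adjacent CMP + Jcc pair. */
        int sum(const int *a, int n)
        {
            int s = 0;
            for (int i = 0; i < n; i++)  /* typical codegen: cmp i, n ; jl .loop */
                s += a[i];               /* the front end can fuse that cmp+jl   */
            return s;                    /* into a single macro-op before issue  */
        }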

    Does the fact that Intel and AMD completely control the internal micro-op representation allow them to potentially do such optimizations? Yes, absolutely. But this doesn't mean that they actually do it or will do it in the future. And these things are measurable, after all (Dr. Agner Fog uses synthetic code and the micro-op counters in his measurements).

    Do read the paper though.



    • #92
      Originally posted by Toggleton View Post
      Did a run on my N2 (nearly broken: USB hub dead + SD card does not work anymore)
      What happened? And why haven't you requested an RMA yet?

      Also, some results of your N2 benchmarks look anomalous, such as the LibreOffice-to-PDF conversion. Any idea what may have caused this?



      • #93
        Originally posted by vladpetric View Post
        Hoping that you're right (honestly!). Any benchmarks though?
        Sorry, I was on holiday.

        I rely on Anandtech results for SPECCPU 2006/2017: https://www.anandtech.com/show/15603...ania-devices/6


        The Cortex-A77 has about the performance of the Apple A11. The iPhone with the A11 was released in September 2017, while the Qualcomm Snapdragon 865 was released in early 2020. So that's more than the 1 to 1.5 years I previously claimed; it's about 2 years.



        • #94
          Originally posted by ldesnogu View Post
          Sorry, I was on holiday.

          I rely on Anandtech results for SPECCPU 2006/2017: https://www.anandtech.com/show/15603...ania-devices/6


          The Cortex-A77 has about the performance of the Apple A11. The iPhone with the A11 was released in September 2017, while the Qualcomm Snapdragon 865 was released in early 2020. So that's more than the 1 to 1.5 years I previously claimed; it's about 2 years.
          I like this, thanks! Which processor is the A77 in, though?



          • #95
            Originally posted by vladpetric View Post
            I like this, thanks! Which processor is the A77 in, though?
            Cortex-A77 is found in Snapdragon 865.



            • #96
              Originally posted by RussianNeuroMancer View Post
              What happened? And why haven't you requested an RMA yet?

              Also, some results of your N2 benchmarks look anomalous, such as the LibreOffice-to-PDF conversion. Any idea what may have caused this?
              Well, the N2 worked fine for over a year and it is likely my fault that it is broken (it broke while I was re-flashing the eMMC to the mainline kernel, and I was not that careful that day). TBH I still have a use for it with BOINC (right now it is doing some Rosetta@home work).

              I retried it now and the LibreOffice test is even worse today (I guess because it is so hot), but it is a single-threaded run and I would guess it is maybe not optimized for aarch64 yet. The 64-bit RPi test run at https://openbenchmarking.org/result/...NE-2007316NE53 does not have a LibreOffice result, but my ODROID-C4 with 4x ARMv8 Cortex-A55 @ 1.91GHz has a result of 54.088 (31.154 was the original test on the N2 / 38.108 on the N2 today).

              The eMMC speed looks fine, so that should not be the reason. So I guess the RPi/ARMv7 32-bit community optimized some things that have not gotten attention on arm64/aarch64 yet.



              • #97
                Originally posted by starshipeleven View Post
                ARM is an architecture that is supposed to scale up into the high-performance space too.
                ARM is not exactly an architecture; it's an instruction set, or the name of the company. That company also designs Cortex-Ax, a FAMILY of architectures. Apple tried to use one architecture to rule them all but soon realized that's not viable.

                Originally posted by justwhatever View Post
                No, actually Michael tested the RasPi with the old, outdated 32-bit ARMv6 Raspbian.
                That's the official OS; 64-bit is still "EXPERIMENTAL", and we should respect that.

                The horrifying situation now is that high-performance ARM processors such as the Amazon Graviton2 and Apple Ax are not for sale, which is worse than the situation on x86: yes, there are only 2 competitors, but at least they sell bare processors.



                • #98
                  Originally posted by hotaru View Post
                  Why run 32-bit on the Pi 4 instead of 64-bit? 64-bit is faster for a lot of workloads due to having more registers.
                  I am curious how much of an improvement Ubuntu Server for the Pi 4 (compiled for the newer ARM instruction set and not backwards compatible with the Pi 3) would provide in these benchmarks.
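
                  As a rough illustration of the register point (my own sketch, not something from the article): a kernel that keeps many values live at once can usually stay entirely in registers on AArch64, which has 31 general-purpose registers, but is more likely to spill to the stack on 32-bit ARM, which has roughly 14 usable ones.

                      /* Hypothetical kernel with many simultaneously live accumulators.
                       * On AArch64 these typically all stay in registers; on ARMv7 the
                       * compiler is more likely to spill some of them to the stack. */
                      void accumulate8(const int *a, int n, int out[8])
                      {
                          int s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
                          for (int i = 0; i + 8 <= n; i += 8) {
                              s0 += a[i];     s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
                              s4 += a[i + 4]; s5 += a[i + 5]; s6 += a[i + 6]; s7 += a[i + 7];
                          }
                          out[0] = s0; out[1] = s1; out[2] = s2; out[3] = s3;
                          out[4] = s4; out[5] = s5; out[6] = s6; out[7] = s7;
                      }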



                  • #99
                    I think it would be more interesting to compare with, say, an Atom-based SBC, if there is one in a similar price range.

                    The A72 is a cheap core to license, as it's fairly old. The A77 and A78 are far more performant if you want an ARM-designed ARM core - Apple's cores are a step or two ahead of course, and rumours put the A14 significantly faster than the A13 still.

                    The SoC in the RPi is cheap - $5 to $10 at most - and that's why it includes a cheap core and is made on a cheap process (28nm) that limits achievable clock speeds. On the other hand it still runs at a very low TDP, and the cost is tiny.

                    I have no idea when the RPi 5 will be out - maybe 2021, possibly 2022. I'd expect it to bump the core to an A75 (which should be cheap by then), and maybe jump to a better process again (the original Pi's 40nm lasted until the RPi 3) if the costs work out. But they want to keep the cheapest Pi at $35, and that puts fundamental limits on what they can achieve.

                    I think the Pi4 had a good showing in this article, given its circumstances.



                    • Originally posted by atomsymbol

                      I have read both of the suggested papers (1: RENO, 2: Memory Bypassing). A few notes of my own about the paper "Loh, Sami, Friendly: Memory Bypassing: Not Worth the Effort":
                      • I don't trust this paper because it doesn't contain essential information that I would like to see in a paper about this particular topic
                      • If a loop executes 100+ iterations then exactly one of the following options is true:
                        1. Either, a variable's value depends on the value (1..100+) of the loop iteration,
                        2. or, a variable's value does not depend on the value (1..100+) of the loop iteration (loop-invariant variables)
                      • The address used in a particular load-xor-store instruction:
                        1. Either, the address depends on which loop iteration is being executed,
                          • The probability of memory bypassing being applicable (in real-world code) is low (but of course, one can invent contrived non-real-world examples where this probability is high)
                        2. or, the address (=X) is loop-invariant:
                          • Considering the loop as a whole:
                            1. Either, X.num_loads = 0 and X.num_stores > 0:
                              • This is an output variable
                              • No memory bypassing required
                              • Applicable optimization: discard all writes to X except the last write to X
                                • The probability of this optimization being applicable in real-world highly optimized code is very low, but not-highly-optimized code might be amenable to this optimization
                            2. or, X.num_loads > 0 and X.num_stores = 0:
                              • This is an input constant (so it isn't actually a variable, it is a constant)
                              • No memory bypassing required
                            3. or, X.num_loads > 0 and X.num_stores > 0:
                              • Memory bypassing will lead to speedups - but there will be speedup only if load(X) and store(X) are executing in the same clock
                              • A necessary precondition for load(X) and store(X) to be executable in the same clock is the following:
                      1. Thank you! (I mean it.)

                      2. It may take me a bit of time to fully respond, but I wanted to make a few quick points:

                      The latency of the register file is for the most part completely hidden. I.e., if you have two instructions, the first producing register A and the second consuming register A, they can issue back-to-back (no additional delay). If the first instruction takes 1 cycle to execute, the second can issue in the next cycle (the bypass networks and the OoO scheduler take care of that).

                      However, the same is not true for loads. When a load issues, it takes 2 to 4 cycles if it hits in the L1. That latency isn't hidden; an instruction depending on the load needs to wait those 2 to 4 cycles.
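
                      A quick way to see that load-to-use latency for yourself, if anyone is curious: a pointer-chasing loop in which every load address depends on the previous load, so iterations cannot overlap and the time per step approximates the L1 load-to-use latency. This is a sketch of mine (assumes POSIX clock_gettime; absolute numbers vary by core and clock speed):

                          /* Pointer-chasing microbenchmark sketch: each load's address comes
                           * from the previous load, so the loads serialize and the time per
                           * iteration approximates the L1 load-to-use latency. */
                          #include <stdio.h>
                          #include <time.h>

                          #define N 1024          /* 8 KiB of pointers (64-bit), fits in L1 */
                          #define ITERS 100000000UL

                          int main(void)
                          {
                              static void *ring[N];
                              for (int i = 0; i < N; i++)      /* build a circular chain */
                                  ring[i] = &ring[(i + 1) % N];

                              struct timespec t0, t1;
                              void **p = (void **)ring[0];
                              clock_gettime(CLOCK_MONOTONIC, &t0);
                              for (unsigned long i = 0; i < ITERS; i++)
                                  p = (void **)*p;             /* one dependent load per iteration */
                              clock_gettime(CLOCK_MONOTONIC, &t1);

                              double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
                              printf("%p ~%.2f ns per dependent load\n", (void *)p, ns / ITERS);
                              return 0;
                          }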

                      Speculative bypassing means that, if you somehow know (or predict) that the value of the load is already in a specific physical register in the physical register file, you just rename the load to that physical register and don't have to execute it anymore. Essentially, you have converted a memory operation into a register read which, as I said, has no "visible" latency. So you save the 2 to 4 cycles it takes to access (in parallel) the L1 cache and the store queue.

                      In other words, you don't need the store and load to execute in the same cycle for speculative memory bypassing to produce a benefit.
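
                      For anyone following along, here is a very rough software analogy of that renaming idea, my own conceptual sketch rather than how either paper or any real core implements it: keep a small table mapping recent store addresses to the "physical register" that produced the stored value, and let a later load to the same address be renamed to that register instead of accessing the cache/store queue.

                          /* Conceptual model of speculative memory bypassing (illustration only).
                           * A tiny direct-mapped table remembers which "physical register" last
                           * stored to an address; a later load to the same address is renamed to
                           * that register and skips the cache/store-queue access entirely. */
                          #include <stdint.h>
                          #include <stdio.h>

                          #define TABLE_SIZE 16

                          struct bypass_entry { uintptr_t addr; int preg; int valid; };
                          static struct bypass_entry bypass[TABLE_SIZE];

                          static void record_store(uintptr_t addr, int preg)
                          {
                              struct bypass_entry *e = &bypass[addr % TABLE_SIZE];
                              e->addr = addr; e->preg = preg; e->valid = 1;
                          }

                          /* Returns the physical register the load can be renamed to,
                           * or -1 if it must actually go to the L1/store queue. */
                          static int rename_load(uintptr_t addr)
                          {
                              struct bypass_entry *e = &bypass[addr % TABLE_SIZE];
                              return (e->valid && e->addr == addr) ? e->preg : -1;
                          }

                          int main(void)
                          {
                              record_store(0x1000, 42);   /* store whose data came from preg 42 */
                              printf("load 0x1000 -> preg %d\n", rename_load(0x1000)); /* bypassed  */
                              printf("load 0x2000 -> preg %d\n", rename_load(0x2000)); /* -1: no hit */
                              return 0;
                          }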

                      The Loh et al. paper is pretty influential. I do realize that computer architecture papers are hard to read ... And the fact that it's a negative paper - something is not worth doing - means that one needs to understand that "something" first.
                      Last edited by vladpetric; 11 August 2020, 04:41 PM.

