
Fujitsu Begins Adding A64FX Support To GCC Compiler

  • Fujitsu Begins Adding A64FX Support To GCC Compiler

    Phoronix: Fujitsu Begins Adding A64FX Support To GCC Compiler

    The Fujitsu A64FX ARM processor, with 48 cores and 32GB of HBM2 memory per node, currently powers the world's fastest supercomputer and is beginning to see GCC compiler support...

    http://www.phoronix.com/scan.php?pag...64FX-GCC-Start

  • #2
    So Michael, when will we see one of these chips, in workstation form, in your lineup?



    • #3
      Considering that they requested upstreaming into GCC so soon, they probably had both toolchains pretty much functional from the start. It's nice that they decided to do it properly, instead of restricting support to one of them for internal political reasons.

      I suppose this pragmatism comes from needing to support the entire language spectrum for their "customers".
      Some languages are only mature on LLVM, others only on GCC.



      • #4
        It would be interesting to compare perf/watt of this Fujitsu A64FX vs modern x86 CPUs from Intel and AMD.
        And I'd wonder how much of that difference would be related to differences in ISA.

        ARM ISA implementations have:
        - much simpler instruction fetch and decode,
        - usually one fewer decode pipeline stage as a result,
        - less work thrown away on branch mispredictions.

        It all adds up to x86 chips having more transistors, more switching capacitance, more Joules of energy consumed, more heat produced and therefore more thermal throttling.

        On the other hand, x86 has somewhat denser code, and therefore somewhat better cache utilization. But then it needs bigger and wider L0 (uOp) caches.



        • #5
          Originally posted by pkese View Post
          It would be interesting to compare perf/watt of this Fujitsu A64FX vs modern x86 CPUs from Intel and AMD.
          And I'd wonder how much of that difference would be related to differences in ISA.

          ARM ISA implementations have:
          - much simpler instruction fetch and decode,
          - usually one fewer decode pipeline stage as a result,
          - less work thrown away on branch mispredictions.

          It all adds up to x86 chips having more transistors, more switching capacitance, more Joules of energy consumed, more heat produced and therefore more thermal throttling.

          On the other hand, x86 has somewhat denser code, and therefore somewhat better cache utilization. But then it needs bigger and wider L0 (uOp) caches.
          The Fujitsu A64FX is (extremely) optimised for SIMD and a specific type of HPC workload. For general-purpose usage, it would be more interesting to see how Apple Silicon performs.



          • #6
            Being a computational physicist who has written programs to take advantage of CPUs as well as GPUs, I must say that it is easier to achieve near-theoretical peak performance from a wide SIMD unit inside the CPU than from a GPU, especially for not-so-massive simulations. These days, OpenMP allows one to vectorize critical inner loops with relatively little effort. The advantage of not having to move data between the host and device cannot be overstated for latency-sensitive computations. The biggest problem with using GPUs for moderate-size simulations is that the data transfer takes longer than the actual computation itself. Even in an APU, because the CPU and GPU caches are not coherent, you still have to copy/map the results from the GPU, which takes time.

            In that sense, the results from the SIMD unit inside a CPU core are immediately available for further processing/computations. For massively large simulations with millions of atoms, yes, the GPU will be faster and the computational cost will more than offset the data transfer latency. For everything else, a CPU with multiple fast cores and wide SIMD units is better. Mind you, I'm only talking about the HPC space. For general-purpose personal computers, as Linus Torvalds recently commented, AVX-512 is not worth the extra silicon space.

            Case in point: the Ryzen 5 2500U in my laptop gets about 50% more double-precision ops/second from vectorized, multithreaded code than from the best OpenCL code running on the integrated Vega 8 GPU, for some of my computations. In our line of work, we usually abuse double precision more often than we should. But even with single-precision code, there are many use cases where performing the computations in-core, with the data almost never leaving the caches, is faster than offloading to the GPU.
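            A minimal sketch of the kind of OpenMP loop vectorization described above (an illustrative axpy-style kernel, not code from any A64FX toolchain): the pragma is only a hint, and the loop compiles and runs unchanged even without OpenMP support.

            ```c
            #include <stdio.h>

            /* Illustrative axpy-style kernel: y[i] += a * x[i].
               With -fopenmp-simd (GCC/Clang), the pragma asks the compiler to
               vectorize the loop; without it, the pragma is silently ignored. */
            void axpy(int n, double a, const double *x, double *y)
            {
                #pragma omp simd
                for (int i = 0; i < n; i++)
                    y[i] += a * x[i];
            }

            int main(void)
            {
                double x[4] = {1.0, 2.0, 3.0, 4.0};
                double y[4] = {0.0};
                axpy(4, 2.0, x, y);
                printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* prints: 2 4 6 8 */
                return 0;
            }
            ```

            Because the results stay in registers and cache, they are immediately available to the next loop, which is exactly the latency advantage over a GPU offload described above.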

            This is why the A64FX exists. It is a hybrid, purpose-built processor that will trump a traditional CPU+GPU node in terms of perf/watt for the kind of work loads that it is meant to run. So comparing the A64FX with general-purpose hardware for perf/watt is not really an apples-to-apples comparison. And on a related note, it is no longer a question of whether a RISC/CISC ISA is more efficient. You can make either excel in perf/watt for specific workloads, by beefing up specific parts of the design.
            Last edited by arunbupathy; 08-09-2020, 10:29 AM.



            • #7
              Some benchmarks of the A64FX will soon be available in papers at the EAHPC workshop: https://arm-hpc.github.io/IEEECluster2020/. It is still difficult to get access to this platform.
              I have already played with the ARM emulator, but I am a little skeptical about the architecture when I look at the A64FX documentation: https://github.com/fujitsu/A64FX/tree/master/doc
              If you look closely at the instruction latencies, you will see that the latency of AND (SVE vector) is 4, FADD and FMAD/FMLA (SVE vector) is 9, and FSQRT (SVE vector) depends on the vector length (128-bit -> 29, 256-bit -> 52, 512-bit -> 98). So I suspect it is not really 512-bit like AVX-512, but rather something like 4x128-bit: 512-bit registers backed by 128-bit compute units, which would explain the huge latencies.
              If someone has more information on this, let me know!
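              The FSQRT inference above can be checked with back-of-envelope arithmetic: dividing each quoted latency by the number of 128-bit chunks in the vector gives a roughly constant per-chunk cost, which is what you would expect if the hardware iterates over 128-bit slices (this only illustrates the reasoning, not a confirmed description of the A64FX pipeline).

              ```c
              #include <stdio.h>

              /* Per-128-bit-chunk latency for the FSQRT numbers quoted from the
                 A64FX docs. A roughly constant result suggests serialized execution
                 over 128-bit slices rather than a full-width 512-bit unit. */
              double cycles_per_chunk(int vector_bits, int latency)
              {
                  return (double)latency / (vector_bits / 128);
              }

              int main(void)
              {
                  int bits[3]    = {128, 256, 512};
                  int latency[3] = {29, 52, 98};   /* FSQRT (SVE vector) latencies */

                  for (int i = 0; i < 3; i++)
                      printf("%3d-bit: latency %2d -> %.1f cycles per 128-bit chunk\n",
                             bits[i], latency[i],
                             cycles_per_chunk(bits[i], latency[i]));
                  return 0;
              }
              ```

              The per-chunk cost comes out near 25-29 cycles at every width, instead of staying flat in total latency as a true full-width unit would.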



              • #8
                Forgot to say in my previous post that current compilers (GCC 10.1, Clang 10, Armclang 20.1) are not really better at autovectorization with the SVE instruction set, so intrinsics are still required. And using SVE intrinsics is far more tedious than using SSE/AVX/NEON ones (maybe not AVX-512...), since a predicate mask is required and you cannot use arithmetic operators by default. One last thing I have noticed: when you write a = b + c * d with vector types, it is automatically transformed into an FMA instruction with SSE/AVX/NEON, but not with current SVE compilers, so you need to explicitly write a = svmad_z( svptrue_b32(), c, d, b ).
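                For readers without SVE hardware, the semantic difference between a rounded multiply-then-add and a fused multiply-add can be shown portably with C99's fma() (a scalar illustration of the contraction discussed above, not SVE code):

                ```c
                #include <math.h>
                #include <stdio.h>

                /* Returns the rounding error that the plain expression x*x discards:
                   fma(x, x, -p) computes x*x exactly, subtracts the rounded product p,
                   and applies a single final rounding. */
                double fma_residual(double x)
                {
                    double p = x * x;        /* product rounded to double precision */
                    return fma(x, x, -p);    /* exact product minus rounded product */
                }

                int main(void)
                {
                    double x = 1.0 + pow(2.0, -27);
                    double p = x * x;                         /* rounded product */
                    printf("plain: %g\n", p - p);             /* rounded product cancels: 0 */
                    printf("fused: %g\n", fma_residual(x));   /* nonzero: rounding error survives */
                    return 0;
                }
                ```

                On SVE this fusion has to be requested explicitly via svmad_z, as noted above, rather than falling out of operator syntax.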



                • #9
                  When will Torvalds come out and hope that the 512-bit extensions, which are the only reason these chips were made in the first place, die a slow and painful death?

                  In fact, maybe he could reject the entire architecture from the mainline kernel to show how brave he is. But I doubt it.



                  • #10
                    The first benchmarks running on real hardware (most results so far have come from emulators) were presented at APLAT 2020: https://conference-indico.kek.jp/eve...ibutions/2139/

                    In addition, I tried to run the Phoronix Test Suite on the A64FX a few weeks ago, but most of the benchmarks do not work at all on aarch64. The benchmarks that did compile show odd results (high performance variation), which I suspect are related to wrong compiler options (no SVE, etc.) or to the benchmarks being heavily optimized for x86_64.

