Quad-Core ODROID-X Battles NVIDIA Tegra 3


  • #31
    Originally posted by ssvb
    Unfortunately integer ld/st instructions can't dual-issue with NEON instructions for Cortex-A9 anymore.
    I think you are wrong: the L/S instructions have their own pipe, and issue can send instructions both to that pipe and to NEON pipes. Did you try it?


    • #32
      Originally posted by ldesnogu
      I think you are wrong: the L/S instructions have their own pipe, and issue can send instructions both to that pipe and to NEON pipes. Did you try it?
      Yes, of course. I learned long ago that nobody can be trusted (neither random dudes on the Internet nor the people I actually consider to be quite knowledgeable). Documentation can't be trusted without verification either (not to mention that it is often incomplete or vague). It goes without saying that I can't be trusted either.

      I ran into the sad fact that Cortex-A9 is unable to dual-issue NEON instructions with any L/S instructions (both ARM and NEON) in practice long ago. The Cortex-A9 NEON Media Processing Engine Technical Reference Manual says "with the exception of simultaneous loads and stores, the processor can execute VFP and Advanced SIMD instructions in parallel with ARM or Thumb instructions", which is admittedly not very clear. But there is no need for guessing and misinterpreting, because we can easily run a simple benchmark program:

      Code:
      .text
      .arch armv7-a
      .fpu neon
      .global main
      
      #ifndef CPU_CLOCK_FREQUENCY
      #error CPU_CLOCK_FREQUENCY must be defined
      #endif
      
      #define LOOP_UNROLL_FACTOR   20
      
      .func main
      main:
              push        {r4-r12, lr}
              ldr         ip, =(CPU_CLOCK_FREQUENCY / LOOP_UNROLL_FACTOR)
              b           1f
          .balign 64
      1:
          .rept LOOP_UNROLL_FACTOR
              vorr        d30, d30, d30
              vorr        d31, d31, d31
              vorr        d30, d30, d30
              vorr        d31, d31, d31
      #ifdef DO_ARM_LDR
              ldr         r0, [sp]
      #endif
              vorr        d30, d30, d30
              vorr        d31, d31, d31
              vorr        d30, d30, d30
              vorr        d31, d31, d31
      2:
          .endr
              subs        ip, ip, #1
              bne         1b
      
              mov         r0, #0
              pop         {r4-r12, pc}
      .endfunc
      Cortex-A9:
      Code:
      $ gcc -DCPU_CLOCK_FREQUENCY=1200000000 bench_mixed_ldr_neon.S && time ./a.out
      real	0m8.093s
      user	0m8.080s
      sys	0m0.000s
      
      $ gcc -DCPU_CLOCK_FREQUENCY=1200000000 -DDO_ARM_LDR=1 bench_mixed_ldr_neon.S && time ./a.out
      real	0m9.048s
      user	0m9.035s
      sys	0m0.000s
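      Adding the LDR instruction costs one extra cycle per block of 8 NEON instructions on Cortex-A9 (about 9 cycles instead of 8), so the load is not dual-issued with them.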

      Cortex-A8:
      Code:
      $ gcc -DCPU_CLOCK_FREQUENCY=1000000000 bench_mixed_ldr_neon.S && time ./a.out
      real	0m8.018s
      user	0m8.016s
      sys	0m0.000s
      
      $ gcc -DCPU_CLOCK_FREQUENCY=1000000000 -DDO_ARM_LDR=1 bench_mixed_ldr_neon.S && time ./a.out
      real	0m8.019s
      user	0m8.000s
      sys	0m0.008s
      Cortex-A8 can dual-issue L/S instructions with NEON arithmetic instructions perfectly fine.
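
      The benchmark is set up so that the run time in seconds is numerically equal to the number of cycles spent per unrolled block (8 VORR instructions plus the optional LDR). A minimal Python sketch of that arithmetic, assuming the CPU really runs at the frequency passed in CPU_CLOCK_FREQUENCY and that the loop overhead (SUBS/BNE) is negligible; the helper name cycles_per_block is made up for illustration:

```python
# Convert the "user" times reported by `time ./a.out` into CPU cycles
# per unrolled block of the benchmark loop.

LOOP_UNROLL_FACTOR = 20  # same value as in the benchmark source


def cycles_per_block(seconds, cpu_hz):
    """Average cycles spent on one block (8 x vorr, optionally + ldr).

    Assumes the core actually runs at cpu_hz (no frequency scaling).
    """
    # The outer loop counter is CPU_CLOCK_FREQUENCY / LOOP_UNROLL_FACTOR,
    # and each outer iteration contains LOOP_UNROLL_FACTOR blocks.
    outer_iterations = cpu_hz // LOOP_UNROLL_FACTOR
    blocks_executed = outer_iterations * LOOP_UNROLL_FACTOR
    total_cycles = seconds * cpu_hz
    return total_cycles / blocks_executed


if __name__ == "__main__":
    # "user" times copied from the runs above
    runs = [
        ("Cortex-A9, NEON only", 8.080, 1_200_000_000),
        ("Cortex-A9, NEON + LDR", 9.035, 1_200_000_000),
        ("Cortex-A8, NEON only", 8.016, 1_000_000_000),
        ("Cortex-A8, NEON + LDR", 8.000, 1_000_000_000),
    ]
    for name, seconds, hz in runs:
        print(f"{name}: {cycles_per_block(seconds, hz):.2f} cycles per block")
```

      Rounded to whole cycles this gives 8 and 9 for Cortex-A9 and 8 and 8 for Cortex-A8, which is the one-cycle penalty for the LDR discussed above.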


      • #33
        Thanks for clearing that up


        • #34
          SS, were you running your OMAP4430 off of USB / OTG power on your original cpuburn test? I have heard that you can run current a bit above spec on the original Pandaboard (but not the 4460 ES). Still, I would be surprised if you can run it at ~300% above spec... I have had quite a bit of success with OTG power so I will give this a try on a Panda A2 generation. The power peaks I am seeing on ODROID-X seem to be during the c-ray PTS tests which peaks at 7W. How are you getting your direct current measurements? I will instrument my boards if I find a good way to measure that...


          • #35
            Originally posted by SolarNet
            SS, were you running your OMAP4430 off of USB / OTG power on your original cpuburn test? I have heard that you can run current a bit above spec on the original Pandaboard (but not the 4460 ES). Still, I would be surprised if you can run it at ~300% above spec... I have had quite a bit of success with OTG power so I will give this a try on a Panda A2 generation.
            No, I'm using a 5V power supply rated at 3A. OTG just can't provide enough current without violating USB spec. Even the idle system had ~550 mA current draw, which is already too much for OTG.

            The power peaks I am seeing on ODROID-X seem to be during the c-ray PTS tests which peaks at 7W.
            I bet you can run it a lot hotter with a Cortex-A9 tuned cpuburn. Just do the following, and maybe run htop in another terminal or ssh session to verify that all 4 cores are fully loaded. I would not be surprised if the power consumption goes up to 10W or more, which should be easily measurable even with your apparently low-precision power meter:
            Code:
            $ wget https://raw.github.com/ssvb/ssvb.github.com/master/files/2012-04-10/ssvb-cpuburn-a9.S
            $ gcc ssvb-cpuburn-a9.S
            $ ./a.out
            How are you getting your direct current measurements? I will instrument my boards if I find a good way to measure that...
            Just a multimeter connected between the power supply and the 5V barrel jack on the board. Something similar to what is shown in the picture here.
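
            To relate the multimeter readings (amps) to the wattage figures in this thread, P = V * I is enough. A rough Python sketch, assuming the rail stays at exactly 5.0 V under load (real supplies sag a bit) and taking 500 mA as the USB 2.0 per-port limit; the helper names are made up for illustration:

```python
# Convert between current readings and power at the board's 5 V input.

SUPPLY_VOLTS = 5.0      # nominal barrel jack voltage (assumed constant)
SUPPLY_MAX_AMPS = 3.0   # rating of the power supply mentioned above
USB2_MAX_AMPS = 0.5     # assumption: USB 2.0 high-power port limit


def watts_from_amps(amps, volts=SUPPLY_VOLTS):
    """Power drawn for a given current reading."""
    return volts * amps


def amps_from_watts(watts, volts=SUPPLY_VOLTS):
    """Current expected on the multimeter for a given power draw."""
    return watts / volts


if __name__ == "__main__":
    idle_amps = 0.55  # idle current quoted above
    print(f"idle: {watts_from_amps(idle_amps):.2f} W")
    print(f"idle exceeds USB 2.0 budget: {idle_amps > USB2_MAX_AMPS}")

    target_watts = 10.0  # the cpuburn guess quoted above
    print(f"{target_watts:.0f} W needs {amps_from_watts(target_watts):.1f} A")
    print(f"supply limit: {watts_from_amps(SUPPLY_MAX_AMPS):.0f} W")
```

            With these assumptions the idle draw is already above what a USB 2.0 port is allowed to deliver, while a 10W load still fits within the 3A supply.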


            • #36
              I will try this out... see if I can find the true peak power. I have some more benchmarks at
              http://openbenchmarking.org/result/1...AR-1208150AR20
              comparing a dual-core Exynos (Soft-float) to the quad-core Exynos (Hard-float)... the numbers aren't quite twice as good as I thought they would be. Worse PE / RAM ratio might be in play there... thanks...


              • #37
                Sweet. 11W normal with spikes to 12W... and that is in my hypercooled mineral oil bath to be safe (seriously, I'll take a picture). Next test will be to fully engage the rest of the board and measure. Might need to break the nitrogen out for that one...

                Actually, I'm surprised at how stable it has been. I have only just now dropped it into the oil bath.


                • #38
                  Originally posted by SolarNet
                  Next test will be to fully engage the rest of the board and measure.
                  It would be interesting (and probably scary) if somebody could implement something like gpuburn-mali400 and run it on ODROID-X together with cpuburn. Kicking DMA to repeatedly copy something in the background could additionally stress the memory controller. But in any case, it is just a stress test for the cooling system. No real application is ever going to consume as much power.


                  • #39
                    I have some more benchmarks here... the suite keeps crashing at compile bench for some reason... will have to look at that...

                    openbenchmarking.org/result/1208245-AR-1208223AR23


                    • #40
                      Originally posted by SolarNet
                      I have some more benchmarks here... the suite keeps crashing at compile bench for some reason... will have to look at that...
                      It's probably best not to hit compilebench on the ARM hardware with SD cards since compilebench is rather write-intensive on the storage.
                      Michael Larabel
                      http://www.michaellarabel.com/
