The NVIDIA Jetson TX2 Performance Has Evolved Nicely Since Launch


  • #11
    Originally posted by milkylainen View Post
    Does the Jetson TX2 have NVLinks somewhere?
    I'd like a full (and free, preferably) NVLink IP block to integrate into super fast FPGAs.
    That would enable me to move some serious data into the GPU.
    PCIe is just not fast enough.
    NVLink is high-end and specific. Even the dual-socket POWER9 Talos board doesn't have it: the CPU and motherboard both have to be configured for it, so it's only available on specific servers at eye-watering prices. The Talos does have PCIe 4.0 with CAPI 2.0 extensions; at x16 that's roughly 32GB/s per direction, with memory coherency.
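For anyone who wants to sanity-check the 32GB/s figure, here is a rough sketch in Python. The function name and the lookup table are made up for illustration, and it only accounts for the 128b/130b line encoding, not packet or protocol overhead, so real-world throughput will be lower.

```python
# Approximate one-direction PCIe bandwidth from the per-lane transfer rate.
# Only line encoding is modeled; TLP headers and flow control are ignored.

# generation -> (per-lane rate in GT/s, line-encoding efficiency)
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Return the approximate one-direction bandwidth in GB/s."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to convert to bytes.
    return rate_gt_s * efficiency * lanes / 8


if __name__ == "__main__":
    for gen, lanes in [(3, 8), (3, 16), (4, 8), (4, 16)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.1f} GB/s")
```

By this estimate PCIe 4.0 at x16 lands just under 32GB/s each way, and a 3.0 x8 slot gets about a quarter of that.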

    Nvidia Xavier goes around the problem by including tons of hardware on the die.

    Future GPUs will use PCIe 4.0 (e.g. the coming AMD ones; no word on the RTX 2080 as far as I know). Future AMD Zen2 may have PCIe 4.0, but that's speculation.
    If you really need to keep things small, Ryzen APUs (ITX or embedded) are the closest thing to the Tegras, I guess, but they're stuck with a PCIe x8 slot (3.0 on current hardware, probably 4.0 on next gen).

    Comment


    • #12
      Originally posted by schmidtbag View Post
      I'd like to get one of these but they're just so expensive.
      Why? Just get a Gemini Lake board.

      For about $120, you probably get comparable CPU performance and a well-supported GPU (with open source drivers) that has at least half the performance of the TX2's. Power consumption is comparable, but Gemini Lake is available in standard form factors.

      http://www.asrock.com/mb/Intel/J5005-ITX/index.us.asp

      That particular board is passively-cooled and supports HDMI 2.0.

      Best of all, it supports OpenCL (which Tegra SoCs do not)!
      Last edited by coder; 30 August 2018, 01:30 AM.

      Comment


      • #13
        Originally posted by milkylainen View Post
        Does the Jetson TX2 have NVLinks somewhere?
        Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned here (though without details such as the number of links):

        https://www.anandtech.com/show/11913...t-nextgen-gpus

        Comment


        • #14
          Originally posted by grok View Post
          These are the Quadros of ARM boards.
          Sort of. If they were truly analogous to the Quadro workstation cards, they would offer less performance for several times the cost. In fairness, these do have some of the fastest embedded GPUs available.

          Comment


          • #15
            Originally posted by Girolamo_Cavazzoni View Post
            I have a question regarding the Denver cores: How often are the benchmarks run? As far as I know, a software layer optimizes the code fed to the cores, which are very wide in-order designs. Processing speed should grow with each iteration until it hits a maximum.
            Michael should consult the Nvidia docs, or at least run the benchmarks twice.

            In traditional profile-driven compilation, there's usually not much benefit to running them more than twice.

            Comment


            • #16
              Originally posted by coder View Post
              Michael should consult the Nvidia docs, or at least run the benchmarks twice.

              In traditional profile-driven compilation, there's usually not much benefit to running them more than twice.
              I guess that's not what Girolamo_Cavazzoni was talking about: Denver uses a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly. That's much more dynamic than profile-driven compilation where, as you say, you run the program twice and you're done. OTOH, I'm not sure Denver's JIT engine will improve the performance of a program when it's run multiple times.

              Another thing to take care of when benchmarking TX2 is to make sure of where a program is running: the Denver core or the A57 core. When the board boots the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.
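To make the "where is it running" check concrete, here is a minimal sketch in Python that reads the kernel's list of online CPUs and pins the current process to one cluster. The CPU numbering (Denver as CPUs 1 and 2, A57 as 0 and 3 to 5) is an assumption based on what is commonly reported for the TX2, so verify it on your own board, e.g. with /proc/cpuinfo. The helper names are made up for illustration.

```python
import os


def parse_cpu_list(text: str) -> list[int]:
    """Parse a Linux CPU list such as '0,3-5' into [0, 3, 4, 5]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(path: str = "/sys/devices/system/cpu/online") -> list[int]:
    """Return the CPUs the kernel currently reports as online."""
    with open(path) as f:
        return parse_cpu_list(f.read())


# ASSUMPTION: commonly reported TX2 numbering, not taken from this thread.
DENVER_CPUS = {1, 2}
A57_CPUS = {0, 3, 4, 5}


def pin_to(cpus: set[int]) -> None:
    """Restrict this process (and its future children) to the given CPUs."""
    usable = cpus & set(online_cpus())
    if not usable:
        raise RuntimeError("none of the requested CPUs are online")
    os.sched_setaffinity(0, usable)


if __name__ == "__main__":
    print("online CPUs:", online_cpus())
    print("current affinity:", sorted(os.sched_getaffinity(0)))
```

If the Denver cores are still disabled, they simply won't show up in the online list, which is a quick way to tell before starting a benchmark.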

              Comment


              • #17
                Originally posted by ldesnogu View Post
                I guess that's not what Girolamo_Cavazzoni was talking about: Denver uses a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly. That's much more dynamic than profile-driven compilation where, as you say, you run the program twice and you're done. OTOH, I'm not sure Denver's JIT engine will improve the performance of a program when it's run multiple times.
                I know that, though it probably wasn't clear. I'm assuming it performs similarly to profile-driven recompilation. The main benefit is knowing how frequently different branches are taken, which can inform decisions about inlining, loop unrolling, vectorization, etc. I doubt there's much to be gained by runs beyond the second, so long as the input dataset is the same.

                It would be an interesting test to actually measure the performance of consecutive runs, until it plateaus. It'd avoid the need for all this speculation. I always prefer to just try it out, when possible.
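Something like the following would do for that test. It's only a sketch in Python: the function names and the 2% tolerance are arbitrary choices, and the default command is a placeholder, so pass the real benchmark command on the command line.

```python
import subprocess
import sys
import time
from typing import Optional


def time_runs(command: list[str], runs: int) -> list[float]:
    """Run the command `runs` times and return wall-clock seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def plateau_index(durations: list[float], tolerance: float = 0.02) -> Optional[int]:
    """Return the index of the first run within `tolerance` (relative) of the
    run before it, or None if the timings never settle."""
    for i in range(1, len(durations)):
        previous = durations[i - 1]
        if abs(durations[i] - previous) <= tolerance * previous:
            return i
    return None


if __name__ == "__main__":
    # Placeholder workload; replace with the benchmark you care about.
    command = sys.argv[1:] or [sys.executable, "-c", "sum(range(10**6))"]
    timings = time_runs(command, runs=5)
    for n, seconds in enumerate(timings, start=1):
        print(f"run {n}: {seconds:.3f} s")
    print("plateau at run index:", plateau_index(timings))
```

Each run here is a fresh process, so it only shows a gain from earlier runs if something is actually carried over between them.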

                Originally posted by ldesnogu View Post
                Another thing to take care of when benchmarking TX2 is to make sure of where a program is running: the Denver core or the A57 core. When the board boots the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.
                IMO, he should benchmark it the way most people are likely to use it. If there's some obvious configuration that most people do, then fair enough. But Nvidia is really the one on the hook for taking care of these sorts of configuration issues.

                That said, a second set of benchmarks on the optimized configuration would be a bonus.

                Comment


                • #18
                  Originally posted by coder View Post
                  Michael should consult the Nvidia docs, or at least run the benchmarks twice.

                  In traditional profile-driven compilation, there's usually not much benefit to running them more than twice.
                  PTS generally runs each benchmark a minimum of three times for statistical accuracy...
                  Michael Larabel
                  https://www.michaellarabel.com/

                  Comment


                  • #19
                    Originally posted by coder View Post
                    I know that, though it probably wasn't clear. I'm assuming it performs similarly to profile-driven recompilation. The main benefit is knowing how frequently different branches are taken, which can inform decisions about inlining, loop unrolling, vectorization, etc. I doubt there's much to be gained by runs beyond the second, so long as the input dataset is the same.

                    It would be an interesting test to actually measure the performance of consecutive runs, until it plateaus. It'd avoid the need for all this speculation. I always prefer to just try it out, when possible.
                    Consecutive runs shouldn't change anything unless the JIT maintains a DB of hot spots (as far as I know, only FX!32 did that). All optimizations are done on the fly. One benefit over profile-driven recompilation is that if the profiling run isn't the same as the final (measured) run, you still get good optimizations on code paths the profiling run didn't exercise. The obvious drawback is that you have to be careful not to spend too much time optimizing at runtime, or you get freezes.

                    IMO, he should benchmark it the way most people are likely to use it. If there's some obvious configuration that most people do, then fair enough. But Nvidia is really the one on the hook for taking care of these sorts of configuration issues.

                    That said, a second set of benchmarks on the optimized configuration would be a bonus.
                    It's a dev board; people are expected to play with low-level stuff.

                    I'll do some checks on TX2 once I have some free time...

                    Comment


                    • #20
                      Originally posted by Michael View Post

                      PTS generally runs each benchmark a minimum of three times for statistical accuracy...
                      Michael, did you enable the Denver cores as explained here?

                      Comment
