The NVIDIA Jetson TX2 Performance Has Evolved Nicely Since Launch
Originally posted by ldesnogu:
"That's not what Girolamo_Cavazzoni was talking about, I guess: Denver is using a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly; that's much more dynamic than profile-driven compilation, where, as you say, you run the program twice and you're done. OTOH, I'm not sure the JIT engine of Denver will improve the performance of a program when it's run multiple times."
It would be an interesting test to actually measure the performance of consecutive runs until it plateaus. That would avoid the need for all this speculation; I always prefer to just try it out when possible.
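A minimal sketch of that experiment: time the same binary over consecutive runs and watch whether the numbers drop and then flatten out. The `./my_bench` path below is a placeholder for whatever benchmark binary you want to test; nothing here is specific to Denver.

```python
import subprocess
import time

def time_consecutive_runs(cmd, runs=10):
    """Run `cmd` repeatedly, returning the wall-clock time of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings

# Example (hypothetical benchmark binary). If Denver's dynamic code
# optimizer carries anything over between process runs, later timings
# should come out lower than the first and eventually plateau:
#   for i, t in enumerate(time_consecutive_runs(["./my_bench"]), 1):
#       print(f"run {i}: {t:.3f} s")
```

Note this only detects warm-up *across* runs; any recompilation happening within a single run is already folded into each measurement.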
Originally posted by ldesnogu:
"Another thing to take care of when benchmarking the TX2 is to make sure of where a program is running: the Denver cores or the A57 cores. When the board boots, the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters."
That said, a second set of benchmarks on the optimized configuration would be a bonus.
Originally posted by coder:
"Michael should consult the Nvidia docs, or at least run the benchmarks twice. In traditional profile-driven compilation, there's usually not much benefit to running them more than twice."
Another thing to take care of when benchmarking the TX2 is to make sure of where a program is running: the Denver cores or the A57 cores. When the board boots, the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.
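On a stock L4T install, that cluster switching looks roughly like the sketch below. Treat it as a sketch to verify against the Jetson docs: the MAXN mode number and the cpu1/cpu2 numbering of the Denver cores are assumptions about the TX2's default configuration, not something confirmed in this thread.

```shell
# Query the current power model (the default TX2 mode leaves the
# Denver cluster offline):
sudo nvpmodel -q

# Mode 0 (MAXN) enables both the Denver and the A57 clusters:
sudo nvpmodel -m 0

# Check which CPUs are now online:
cat /sys/devices/system/cpu/online

# Then pin the benchmark so you know which cluster it actually ran on.
# On the TX2 the Denver cores usually enumerate as cpu1 and cpu2:
taskset -c 1,2 ./my_bench      # Denver only
taskset -c 0,3,4,5 ./my_bench  # A57 only
```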
Originally posted by Girolamo_Cavazzoni:
"I have a question regarding the Denver cores: how often are the benchmarks run? As far as I know, a software layer optimizes the code fed to the cores, which are very wide in-order designs. Processing speed should grow with each iteration until it hits a maximum."
In traditional profile-driven compilation, there's usually not much benefit to running them more than twice.
Originally posted by grok:
"These are the Quadros of ARM boards."
Originally posted by milkylainen:
"Does the Jetson TX2 have NVLinks somewhere?"
https://www.anandtech.com/show/11913...t-nextgen-gpus
Originally posted by schmidtbag:
"I'd like to get one of these but they're just so expensive."
For about $120, you get probably comparable CPU performance and a well-supported GPU (with open-source drivers) that's at least half of what the TX2 packs. Power consumption is comparable, but Gemini Lake is available in standard form factors.
- ASRock Super Alloy
- Intel Quad-Core Pentium Processor J5005 (up to 2.8 GHz)
- Supports DDR4 2133/2400 SO-DIMM
- 1 PCIe 2.0 x1, 1 M.2 (Key E)
- Graphics Output Options: D-Sub, HDMI, DVI-D
- 7.1 CH HD Audio (Realtek ALC892 Audio Codec), ELNA Audio Caps
- 4 SATA3
- 4 USB 3.1 Gen1 (2 Front, 2 Rear)
- Supports Full Spike Protection, ASRock Live Update & APP Shop
That particular board is passively-cooled and supports HDMI 2.0.
Best of all, it supports OpenCL (which Tegra SoCs do not)!

Last edited by coder; 30 August 2018, 01:30 AM.
Originally posted by milkylainen:
"Does the Jetson TX2 have NVLinks somewhere?
I'd like a full (and free, preferably) NVLink IP block to integrate into super-fast FPGAs.
That would enable me to move some serious data into the GPU.
PCIe is just not fast enough."
Nvidia Xavier gets around the problem by including tons of hardware on the die.
Future GPUs will use PCIe 4.0 (e.g. the coming AMD ones; no word on the RTX 2080 as far as I know), and future AMD Zen 2 may have PCIe 4.0, but that's speculation.
If you really need to keep things small, Ryzen APUs (ITX or embedded) are the closest thing to the Tegras, I guess, but they're stuck with a PCIe x8 slot (3.0 on current hardware, probably 4.0 on the next generation).
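To put rough numbers on "PCIe is just not fast enough": PCIe 3.0 signals at 8 GT/s per lane and PCIe 4.0 at 16 GT/s, both with 128b/130b line coding, so the raw per-direction bandwidth works out as below (real throughput is a bit lower once protocol overhead is included).

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes):
    """Raw per-direction PCIe bandwidth in GB/s.

    gt_per_s: per-lane signaling rate in GT/s (8 for 3.0, 16 for 4.0).
    128b/130b coding carries 128 payload bits per 130 transferred;
    dividing by 8 converts bits to bytes.
    """
    return gt_per_s * lanes * (128 / 130) / 8

print(f"PCIe 3.0 x8:  {pcie_bandwidth_gb_s(8, 8):.2f} GB/s")   # ~7.88
print(f"PCIe 3.0 x16: {pcie_bandwidth_gb_s(8, 16):.2f} GB/s")  # ~15.75
print(f"PCIe 4.0 x8:  {pcie_bandwidth_gb_s(16, 8):.2f} GB/s")  # ~15.75
print(f"PCIe 4.0 x16: {pcie_bandwidth_gb_s(16, 16):.2f} GB/s") # ~31.51
```

For comparison, a single NVLink 2.0 link is specified at roughly 25 GB/s per direction, which is why PCIe 3.0 looks slow for feeding a GPU from an FPGA.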
Originally posted by Girolamo_Cavazzoni:
"I have a question regarding the Denver cores: how often are the benchmarks run? As far as I know, a software layer optimizes the code fed to the cores, which are very wide in-order designs. Processing speed should grow with each iteration until it hits a maximum."

That is the number once maximum number-crunching has occurred.