The NVIDIA Jetson TX2 Performance Has Evolved Nicely Since Launch
Originally posted by coder View Post
Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned here (though without details like the # of links):
https://www.anandtech.com/show/11913...t-nextgen-gpus
Originally posted by ldesnogu View Post
You definitely should read about how Denver works before claiming it's "fancy pancy speak". Here is a starting point: https://www.anandtech.com/show/8701/...e.php?id=11262
It goes much farther than what your typical CPU does in HW.
Yes. It is taking code translation to an internal ISA a bit further.
But market speak makes you believe it will do Ludicrous speed with code optimization shenanigans.
All modern large CPUs marry a really wide/deep backend with an industry-standard ISA.
While this takes it a bit further, it is no magic sauce.
Keeping optimized micro/macro-op translations cached is not a new idea.
You're trading a lot of silicon for smartness. You could spend that silicon on a beefier front end or a wider mem interface, etc.
Unless the smartness results in drastic complexity reduction for the same speedup gains, it's usually not worth it (tiled rendering and rasterization, for example).
Complexity reduction could be reordering done in software, etc.
The brains-vs-brawn discussion has been going on for decades.
It's usually been universal that brawn is the simpler and more generic tradeoff.
Easier to implement, verify etc.
Transmeta did part of this already (part of the team came from Transmeta). They failed miserably.
Their CPU wasn't faster than a contemporary CPU that spent as much silicon on pure brawn.
In the end the customer won't care much for whatever brains if the $$ does not buy enough speed.
Denver could easily do x86 translation from the frontend as well if Nvidia wanted an x86 CPU.
Also, it's not like the cache will hold translations for an entire benchmark that is run a gazillion times to "optimize".
It will most likely hold a couple of tight kernel loops that are used frequently.
As I said. There are advantages and disadvantages.
Originally posted by milkylainen View Post
It's fancy pancy speak for an ISA frontend translated to an internal architecture, which all modern "large" CPUs are today anyway.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: it's easier to hide stupid binary compilation speed issues when moving between CPUs.
It's easier to make old code benefit from a newer CPU.
It goes much farther than what your typical CPU does in HW.
Originally posted by ldesnogu View Post
That's not what Girolamo_Cavazzoni was talking about, I guess: Denver uses a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly; that's much more dynamic than profile-driven compilation where, as you say, you run the program twice and you're done. OTOH I'm not sure the JIT engine of Denver will improve the performance of a program when it's run multiple times.
Another thing to take care of when benchmarking the TX2 is to make sure where a program is actually running: on the Denver cores or the A57 cores. When the board boots, the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: it's easier to hide stupid binary compilation speed issues when moving between CPUs.
It's easier to make old code benefit from a newer CPU.
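ldesnogu's point about checking which cluster a program lands on can be sketched as a shell snippet. Everything here is an assumption about a typical TX2/L4T setup: the usual TX2 core numbering (cpu1 and cpu2 as the Denver cluster, cpu0/3/4/5 as the A57 cluster) and `./mybench` as a hypothetical stand-in benchmark binary:

```shell
# Sketch only: assumes a Jetson TX2 running L4T, and assumes the usual TX2
# core numbering (cpu1,2 = Denver cluster; cpu0,3,4,5 = A57 cluster).
set -eu

BENCH="${1:-./mybench}"   # hypothetical benchmark binary

sudo nvpmodel -q          # query the current power/cluster mode
sudo nvpmodel -m 0        # MAXN mode: all six cores enabled

taskset -c 1,2 "$BENCH"       # pin to the Denver cores only
taskset -c 0,3,4,5 "$BENCH"   # then to the A57 cores for comparison
```

Pinning with taskset rules out the scheduler silently migrating the benchmark between clusters mid-run.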
Originally posted by coder View Post
Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned here (though without details like the # of links):
https://www.anandtech.com/show/11913...t-nextgen-gpus
It's a shame that NVidia spends a bunch on ASIC real estate and BGA balls with nothing to show for it, at least for the general public.
No NVLink core or IP block anywhere in sight.
I'd love an NVidia lab board with a 6x NVLink to a fast FPGA to see what that setup could do. 150 GB/s bidirectional bandwidth with NVLink 2.0.
Originally posted by coder View Post
Why? Just get a Gemini Lake board.
For about $120, you get probably comparable CPU performance and a well-supported GPU (with open source drivers) that's at least half what the TX2 packs. Power consumption is comparable, but Gemini Lake is available in standard form factors.
ASRock Super Alloy
Intel Quad-Core Pentium Processor J5005 (up to 2.8 GHz)
Supports DDR4 2133/2400 SO-DIMM
1 PCIe 2.0 x1, 1 M.2 (Key E)
Graphics Output Options: D-Sub, HDMI, DVI-D
7.1 CH HD Audio (Realtek ALC892 Audio Codec), ELNA Audio Caps
4 SATA3
4 USB 3.1 Gen1 (2 Front, 2 Rear)
Supports Full Spike Protection, ASRock Live Update & APP Shop
That particular board is passively-cooled and supports HDMI 2.0.
Best of all, it supports OpenCL (which Tegra SoC's do not)!
Originally posted by coder View Post
I know that, though it probably wasn't clear. I'm assuming it performs similarly to profile-driven recompilation. The main benefit is knowing how frequently different branches are taken, which can inform decisions about inlining, loop unrolling, vectorization, etc. I doubt there's much to be gained by runs beyond the second, so long as the input dataset is the same.
It would be an interesting test to actually measure the performance of consecutive runs, until it plateaus. It'd avoid the need for all this speculation. I always prefer to just try it out, when possible.
IMO, he should benchmark it the way most people are likely to use it. If there's some obvious configuration that most people do, then fair enough. But Nvidia is really the one on the hook for taking care of these sorts of configuration issues.
That said, a second set of benchmarks on the optimized configuration would be bonus.
I'll do some checks on TX2 once I have some free time...
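The consecutive-run measurement suggested above (run until the times plateau) can be sketched with a small shell loop. `CMD` and `RUNS` are placeholders, and the millisecond timing via `date +%s%N` assumes GNU date:

```shell
# Time consecutive runs of the same workload to see whether later runs get
# faster (e.g. from Denver's dynamic recompilation) until a plateau.
set -eu

CMD="${CMD:-sleep 0.1}"   # placeholder workload
RUNS="${RUNS:-5}"

for i in $(seq 1 "$RUNS"); do
    start=$(date +%s%N)               # nanoseconds since the epoch
    $CMD >/dev/null 2>&1
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
done
```

Keeping the input dataset identical across runs matters here, since that is the scenario where cached translations would actually be reused.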