The NVIDIA Jetson TX2 Performance Has Evolved Nicely Since Launch
Originally posted by coder View Post
Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned here (though without details like the # of links):
https://www.anandtech.com/show/11913...t-nextgen-gpus
Originally posted by ldesnogu View Post
You definitely should read about how Denver works before claiming it's "fancy pancy speak". Here is a starting point: https://www.anandtech.com/show/8701/...e.php?id=11262
It goes much farther than what your typical CPU does in HW.
Yes. It is taking code translation to an internal ISA a bit further.
But market speak makes you believe it will do Ludicrous speed with code optimization shenanigans.
All modern large CPUs marry a really wide/deep backend with an industry-standard ISA.
While this takes it a bit further, it is no magic sauce.
Keeping optimized micro/macro-op translations cached is not a new idea.
You're trading a lot of silicon for smartness. You could spend that silicon on a beefier front end or a wider mem interface, etc.
Unless the smartness results in drastic complexity reduction for the same speedup gains, it's usually not worth it (tiled rendering and rasterization, for example).
Complexity reduction could be reordering done in software, etc.
The brains-vs-brawn discussion has been going on for decades.
It's usually been universal that brawn is the simpler and more generic tradeoff.
Easier to implement, verify etc.
Transmeta did part of this already (part of the team came from Transmeta). They failed miserably.
Their CPU wasn't faster than a contemporary CPU that spent as much silicon on pure brawn.
In the end the customer won't care much for whatever brains if the $$ does not buy enough speed.
Denver could easily do x86 translation from the frontend as well if Nvidia wanted an x86 CPU.
Also, it's not like the cache will hold translations for an entire benchmark that is run a gazillion times to "optimize".
It will most likely hold a couple of tight kernel loops that are used frequently.
As I said. There are advantages and disadvantages.
Originally posted by milkylainen View Post
It's fancy pancy speak for an ISA frontend translated to an internal architecture, which all modern "large" CPUs are today anyway.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: it's easier to hide stupid binary compilation speed issues when moving between CPUs.
It's easier to make old code benefit from a newer CPU.
It goes much farther than what your typical CPU does in HW.
Originally posted by ldesnogu View Post
That's not what Girolamo_Cavazzoni was talking about, I guess: Denver uses a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly; that's much more dynamic than profile-driven compilation where, as you say, you run the program twice and you're done. OTOH I'm not sure the JIT engine of Denver will improve the performance of a program when it's run multiple times.
Another thing to take care of when benchmarking the TX2 is to make sure where a program is actually running: on the Denver cores or the A57 cores. When the board boots, the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: it's easier to hide stupid binary compilation speed issues when moving between CPUs.
It's easier to make old code benefit from a newer CPU.
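ldesnogu's point about checking which cluster a program lands on can be sketched as a shell snippet. Everything here is an assumption about a typical TX2/L4T setup: the usual TX2 core numbering (cpu1 and cpu2 as the Denver cluster, cpu0/3/4/5 as the A57 cluster) and `./mybench` as a hypothetical stand-in benchmark binary:

```shell
# Sketch only: assumes a Jetson TX2 running L4T, and assumes the usual TX2
# core numbering (cpu1,2 = Denver cluster; cpu0,3,4,5 = A57 cluster).
set -eu

BENCH="${1:-./mybench}"   # hypothetical benchmark binary

sudo nvpmodel -q          # query the current power/cluster mode
sudo nvpmodel -m 0        # MAXN mode: all six cores enabled

taskset -c 1,2 "$BENCH"       # pin to the Denver cores only
taskset -c 0,3,4,5 "$BENCH"   # then to the A57 cores for comparison
```

Pinning with taskset rules out the scheduler silently migrating the benchmark between clusters mid-run.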
Originally posted by coder View Post
Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned here (though without details like the # of links):
https://www.anandtech.com/show/11913...t-nextgen-gpus
It's a shame that NVidia spends a bunch on ASIC real estate and BGA balls with nothing to show for it, at least for the general public.
No NVLink core or IP block anywhere in sight.
I'd love an NVidia lab board with a 6x NVLink to a fast FPGA to see what that setup could do. 150 GB/s bidirectional bandwidth with NVLink 2.0.
Originally posted by coder View Post
Why? Just get a Gemini Lake board.
For about $120, you get probably comparable CPU performance and a well-supported GPU (with open source drivers) that's at least half what the TX2 packs. Power consumption is comparable, but Gemini Lake is available in standard form factors.
ASRock Super Alloy
Intel Quad-Core Pentium Processor J5005 (up to 2.8 GHz)
Supports DDR4 2133/2400 SO-DIMM
1 PCIe 2.0 x1, 1 M.2 (Key E)
Graphics Output Options: D-Sub, HDMI, DVI-D
7.1 CH HD Audio (Realtek ALC892 Audio Codec), ELNA Audio Caps
4 SATA3
4 USB 3.1 Gen1 (2 Front, 2 Rear)
Supports Full Spike Protection, ASRock Live Update & APP Shop
That particular board is passively-cooled and supports HDMI 2.0.
Best of all, it supports OpenCL (which Tegra SoC's do not)!
Originally posted by coder View Post
I know that, though it probably wasn't clear. I'm assuming it performs similarly to profile-driven recompilation. The main benefit is knowing how frequently different branches are taken, which can inform decisions about inlining, loop unrolling, vectorization, etc. I doubt there's much to be gained by runs beyond the second, so long as the input dataset is the same.
It would be an interesting test to actually measure the performance of consecutive runs, until it plateaus. It'd avoid the need for all this speculation. I always prefer to just try it out, when possible.
IMO, he should benchmark it the way most people are likely to use it. If there's some obvious configuration that most people do, then fair enough. But Nvidia is really the one on the hook for taking care of these sorts of configuration issues.
That said, a second set of benchmarks on the optimized configuration would be bonus.
I'll do some checks on TX2 once I have some free time...
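The consecutive-run measurement suggested above (run until the times plateau) can be sketched with a small shell loop. `CMD` and `RUNS` are placeholders, and the millisecond timing via `date +%s%N` assumes GNU date:

```shell
# Time consecutive runs of the same workload to see whether later runs get
# faster (e.g. from Denver's dynamic recompilation) until a plateau.
set -eu

CMD="${CMD:-sleep 0.1}"   # placeholder workload
RUNS="${RUNS:-5}"

for i in $(seq 1 "$RUNS"); do
    start=$(date +%s%N)               # nanoseconds since the epoch
    $CMD >/dev/null 2>&1
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
done
```

Keeping the input dataset identical across runs matters here, since that is the scenario where cached translations would actually be reused.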