Announcement

**schmidtbag** · 30 August 2018, 09:21 AM

Originally posted by coder View Post

Why? Just get a Gemini Lake board.

For about $120, you get probably comparable CPU performance and a well-supported GPU (with open source drivers) that's at least half what the TX2 packs. Power consumption is comparable, but Gemini Lake is available in standard form factors.

ASRock > J5005-ITX

http://www.asrock.com/mb/Intel/J5005-ITX/index.us.asp

ASRock Super AlloyIntel Quad-Core Pentium Processor J5005 (up to 2.8 GHz)Supports DDR4 2133/2400 SO-DIMM1 PCIe 2.0 x1, 1 M.2 (Key E)Graphics Output Options: D-Sub, HDMI, DVI-D7.1 CH HD Audio (Realtek ALC892 Audio Codec), ELNA Audio Caps4 SATA34 USB 3.1 Gen1 (2 Front, 2 Rear)Supports Full Spike Protection, ASRock Live Update & APP Shop

That particular board is passively-cooled and supports HDMI 2.0.

Definitely not a bad alternative and I appreciate you pointing that out. But if I'm going to go with a standard form factor, I might as well go for a socketed CPU. What I need this for is a robot, so the form factor doesn't really matter that much (as long as it's small) and neither do most of the connectors related to desktop usage (including HDMI 2.0, surround sound audio, plenty of USB ports, etc).

Best of all, it supports OpenCL (which Tegra SoC's do not)!

I actually wasn't aware the Tegras didn't support OpenCL. That's a shame. However, for the time being I've been [begrudgingly] using CUDA, since there are more resources available for it that do what I need. I'd strongly prefer OpenCL but I'd have to do a lot of code from scratch, which would be a hefty investment of time and effort on my part for a hobbyist project. All that being said, I'm also using OpenCV with the T-API, which by default uses OpenCL, but, I think there's a build of OpenCV specific to Tegra users that can use CUDA instead.

**milkylainen** · 30 August 2018, 11:57 AM

Originally posted by coder View Post

Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned, here (though without details like the # of links):

https://www.anandtech.com/show/11913...t-nextgen-gpus

To bad. Since the Pascal architecture definitely has them.
It's a shame that NVidia spends a bunch on ASIC real estate and BGA balls and there is nothing to show for, for the general public that is.
No NVLink core or IP-block anywhere in sight.
I'd love a NVidia lab-board with a 6x NVLink to a fast FPGA to see what that setup could do.150G bidi bw with NVLink 2.0.

**milkylainen** · 30 August 2018, 12:14 PM

Originally posted by ldesnogu View Post

That's not what Girolamo_Cavazzoni was talking about I guess: Denver is using a JIT that improves performance as the benchmark runs by recompiling hot spots on the fly, that's much more dynamic than profile-driven compilation where, as you say, you run the program twice and you're done. OTOH I'm not sure the JIT engine of Denver will improve performance of a program when it's run multiple times.

Another thing to take care of when benchmarking TX2 is to make sure of where a program is running: the Denver core or the A57 core. When the board boots the Denver cores are disabled and have to be explicitly enabled. The nvpmodel tool can be used to enable either or both clusters.

It's fancy pancy speak for ISA frontend translated to a internal architecture, which all modern "large" CPU's are today anyway.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: It's easier to hide stupid binary compilation speed issues when moving between CPU's.
It's easier to make old code benefit from a newer CPU.

**ldesnogu** · 30 August 2018, 12:29 PM

Originally posted by milkylainen View Post

It's fancy pancy speak for ISA frontend translated to a internal architecture, which all modern "large" CPU's are today anyway.
You can extend the translation a bit, especially if your backend is really, really wide.
But in general I don't expect much performance benefit compared to a more traditional way.
There are other benefits though. For example: It's easier to hide stupid binary compilation speed issues when moving between CPU's.
It's easier to make old code benefit from a newer CPU.

You definitely should read about how Denver works before claiming it's "fancy pancy speak". Here is a starting point: https://www.anandtech.com/show/8701/...xus-9-review/4
It goes much farther than what your typical CPU does in HW.

**milkylainen** · 30 August 2018, 01:47 PM

Originally posted by ldesnogu View Post

You definitely should read about how Denver works before claiming it's "fancy pancy speak". Here is a starting point: https://www.anandtech.com/show/8701/...e.php?id=11262
It goes much farther than what your typical CPU does in HW.

I did and I stand by my opinion.
Yes. It is taking code translation to internal ISA a bit further.
But market speak makes you believe it will do Ludicrous speed with code optimization shenanigans.

All modern and large CPU's marry a really wide/deep backend with a industry standard ISA.
While this is taking it a bit further it is no magic sauce.
Keeping optimized micro/macro-op translations cached is not a new idea.
You're trading a lot of silicon for smartness. You could spend that silicon on a beefier front end or a wider mem interface, etc.
Unless the smartness results in drastic compexity reduction for the same speedup gains, it's usually not worth it (tiled rendering, rasterization for example).
Complexity reduction could be reordering done in software, etc.

The discussion for brain vs brawn has been going on for decades.
It's usually been universal that brawn is the simpler and more generic tradeoff.
Easier to implement, verify etc.
Transmeta did part of this already (part of the team came from Transmeta). They failed miserably.
Their CPU wasn't faster than a contemporary CPU that spent as much silicon on pure brawn.
In the end the customer won't care much for whatever brains if the $$ does not buy enough speed.
Denver could easily do x86 translation from the frontend aswell if Nvidia wanted an x86 CPU.

Also, it's not like the cache will hold translations for an entire benchmark that is run a gazillion times to "optimize".
It will most likely hold a couple of tight kernel loops that are often used frequently.
As I said. There are advantages and disadvantages.

**ldesnogu** · 30 August 2018, 01:56 PM

Thanks for the course on CPU micro-architecture

**grok** · 30 August 2018, 02:07 PM

Originally posted by coder View Post

Definitely not, but I think Xavier might. Their "Drive PX Pegasus" platform links two of them together, somehow. The presence of NVLink is mentioned, here (though without details like the # of links):

https://www.anandtech.com/show/11913...t-nextgen-gpus

Interesting but it's lacking in specifics. The dedicated GPUs could have NVLink between them.

**coder** · 30 August 2018, 11:25 PM

Originally posted by ldesnogu View Post

Consecutive runs shouldn't change anything unless the JIT maintains a DB of hot spots (as far as I know only FX!32 did that). All optimizations are done on the fly,

That would be lame. They don't need to keep profiling data, but should at least keep a persistent cache of JIT-translated/optimized images. Otherwise, load times and power utilization would suffer, both of which would be quite counterproductive to most of this chip's goals.

**coder** · 30 August 2018, 11:34 PM

Originally posted by schmidtbag View Post

Definitely not a bad alternative and I appreciate you pointing that out. But if I'm going to go with a standard form factor, I might as well go for a socketed CPU. What I need this for is a robot, so the form factor doesn't really matter that much (as long as it's small) and neither do most of the connectors related to desktop usage (including HDMI 2.0, surround sound audio, plenty of USB ports, etc).

You seen this?

https://developer.nvidia.com/embedded/buy/jetson-tx2

Here's an Apollo Lake SoC on an embeddable board:

Homepage

http://www.up-board.org/upsquared/

Industrial use and capable of enabling next-generation industrial automation and AI solutions. Wide range of AI Acceleration modules in mPCIe, M.2 2280 and PCIe[x4] form factorPowerful, Industrial, AI-Proof and expendable to scale upPre-installed software package includes Ubuntu and Intel Edge Insi

Originally posted by schmidtbag View Post

I actually wasn't aware the Tegras didn't support OpenCL. That's a shame. However, for the time being I've been [begrudgingly] using CUDA, since there are more resources available for it that do what I need. I'd strongly prefer OpenCL but I'd have to do a lot of code from scratch, which would be a hefty investment of time and effort on my part for a hobbyist project. All that being said, I'm also using OpenCV with the T-API, which by default uses OpenCL, but, I think there's a build of OpenCV specific to Tegra users that can use CUDA instead.

Will it use CUDA, automatically? I thought you had to explicitly use stuff in the cuda namespace.

OpenCV: cv::cuda Namespace Reference

https://docs.opencv.org/3.4.0/d1/d1a/namespacecv_1_1cuda.html

**coder** · 30 August 2018, 11:37 PM

Originally posted by ldesnogu View Post

You definitely should read about how Denver works before claiming it's "fancy pancy speak". Here is a starting point: https://www.anandtech.com/show/8701/...e.php?id=11262
It goes much farther than what your typical CPU does in HW.

True, but that deals with the original Denver, from 2014. This has Denver 2, described here (note that Parker is the code name for TX2):

Hot Chips 2016: NVIDIA Discloses Tegra Parker Details

https://www.anandtech.com/show/10596/hot-chips-2016-nvidia-discloses-tegra-parker-details

Anyway, your link didn't work for me. Try this:

https://www.anandtech.com/show/8701/...xus-9-review/2

Announcement

The NVIDIA Jetson TX2 Performance Has Evolved Nicely Since Launch

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment