Originally posted by defaultUser
The big difference in the system is that you get Broadwell CPUs and PCIe P100s. You can actually roughly match the CPU side of the DGX-1 config in a ~$80k setup (or get 30-40% better CPU/cache at most); on the other hand, the PCIe version of the P100 is clocked ~10% lower and has no NVLink. There is probably little to no space for Ethernet and IB cards, but you can get ConnectX-4 EDR for <=$1k per adapter, so it's not a huge bump in price. Of course, if you need IB, you'll have to pull out some of the GPUs.
Whether PCIe vs. NVLink matters depends on the problem. If you ask the marketing folks, you'll hear one thing, but if you ask the engineers, you might very well hear another. Many problems have enough independent parallelism that, as long as you program things smartly, you can overlap communication and computation and get decent performance even though you have to go over PCIe. What matters a lot, though, is what kind of PCIe architecture/topology they implemented; depending on what that looks like, it can be shit or great (see this blog by R Walker if you're interested in the details: https://exxactcorp.com/blog/explorin...communication/).
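To make the overlap point concrete, here's a minimal sketch of the standard CUDA-streams pattern: chunk the work, give each chunk its own stream, and let one chunk's PCIe transfers run while another chunk computes. The saxpy kernel and the sizes are my own illustrative choices, not anything from the systems discussed above.

[CODE]
// Minimal sketch: overlapping host<->device PCIe transfers with compute
// using CUDA streams. Kernel and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;          // elements per chunk
    const int chunks = 4;           // process the input in 4 chunks
    size_t bytes = n * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h_x, *h_y;
    cudaMallocHost(&h_x, chunks * bytes);
    cudaMallocHost(&h_y, chunks * bytes);
    for (int i = 0; i < chunks * n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, chunks * bytes);
    cudaMalloc(&d_y, chunks * bytes);

    cudaStream_t stream[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&stream[c]);

    // Each chunk's H2D copy, kernel, and D2H copy go into its own stream,
    // so one chunk's transfers overlap with another chunk's compute.
    for (int c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * n;
        cudaMemcpyAsync(d_x + off, h_x + off, bytes, cudaMemcpyHostToDevice, stream[c]);
        cudaMemcpyAsync(d_y + off, h_y + off, bytes, cudaMemcpyHostToDevice, stream[c]);
        saxpy<<<(n + 255) / 256, 256, 0, stream[c]>>>(n, 2.0f, d_x + off, d_y + off);
        cudaMemcpyAsync(h_y + off, d_y + off, bytes, cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", h_y[0]);  // expect 4.0

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(stream[c]);
    cudaFree(d_x); cudaFree(d_y);
    cudaFreeHost(h_x); cudaFreeHost(h_y);
    return 0;
}
[/CODE]

If you want to see what PCIe topology a given box actually implements, nvidia-smi topo -m prints the GPU/NIC connectivity matrix, which tells you whether GPU pairs sit on the same PLX switch, the same socket, or have to cross QPI.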
Secondly, much if not all of the software that NVIDIA develops works equally well on System76, Dell, Supermicro, etc. machines, even on your own hack-box -- as long as you buy Tesla. There are optimizations they can do with NVLink present, but whether you need it and how much difference it makes is i) very problem-dependent and ii) in many cases a non-trivial high-performance engineering (often research) question. AFAIK, model-parallel learning actually scales on PCIe (PLX trees) about as well as on NVLink. Check this talk if you want to learn more, Scott Le Grand is a super-smart guy (second half, from about 18 minutes):
video: http://on-demand.gputechconf.com/gtc...deo/S6492.html
PDF: http://on-demand.gputechconf.com/gtc...r-dynamics.pdf
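On the model-parallel-over-PCIe point: what makes direct GPU<->GPU traffic across a PLX switch (or NVLink) work is CUDA peer-to-peer access. Here's a minimal sketch of checking and enabling it; device IDs 0 and 1 and the 64 MB copy are assumptions for illustration.

[CODE]
// Sketch: query and enable GPU peer-to-peer access, then do a direct
// GPU->GPU copy that bypasses host memory. Device IDs assumed.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can GPU 0 access GPU 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01 && can10) {
        // Enable P2P in both directions; after this, peer copies (and
        // kernels dereferencing remote pointers) skip the host bounce.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        // Illustrative copy: 64 MB straight from GPU 1 to GPU 0.
        size_t bytes = 64 << 20;
        void *d0, *d1;
        cudaSetDevice(0); cudaMalloc(&d0, bytes);
        cudaSetDevice(1); cudaMalloc(&d1, bytes);
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        cudaDeviceSynchronize();
        cudaFree(d1);
        cudaSetDevice(0); cudaFree(d0);
    }
    return 0;
}
[/CODE]

Whether those peer copies route over a PLX switch, across the root complex, or over NVLink is exactly the topology question above; same code, very different bandwidth.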