AMD Rolls Out The Threadripper 1900X: 16 Thread, 4.0GHz Boost

  • #21
    The thing that irks me is that X399 motherboards for Threadripper are so expensive. You're looking at ~$340 minimum for an X399 motherboard versus only ~$215 for an Intel X299 board. That makes Intel's $599 i7-7820X a better deal than the $549 1900X, supposing you can make do with 28 PCIe lanes instead of 60. Add the fact that the i7-7820X performs better on average, has additional AVX instructions, and uses less power. It seems the 1900X should be priced a bit more competitively.



    • #22
      Originally posted by schmidtbag View Post
      Actually, in most cases, it isn't simple at all.
      Please define 'most cases'. How about iterating over all the applications you have and ticking a checkbox for each one that scales to multiple cores? I'll start with a few examples:
      photo applications
      coarse grained parallelism: perform a single operation on a set of images: parallel convert -resize 50% ::: *.jpg
      fine grained parallelism: perform the operation on individual pixels, blocks of pixels, or for each layer.

      audio
      coarse grained parallelism: perform a single operation on a set of audio clips: parallel flac ::: *.wav
      fine grained parallelism: perform the operation on individual samples or blocks of samples

      video
      coarse grained parallelism: perform a single operation on a set of video clips: parallel encoder ::: *.files
      fine grained parallelism: perform the operation on individual frames or blocks of frames: ffmpeg -threads

      compilation
      coarse grained parallelism: make -j16
      fine grained parallelism: a hint about modern threaded compilers: parsing is one of the most time-consuming tasks and, guess what, parsing is 100% embarrassingly parallel. Totally independent. Parsing is a totally pure function String -> AST.

      disk i/o
      coarse grained parallelism: one thread per device / operation
      fine grained parallelism: linux encryption support, multithreaded scrub etc.

      statistics, signal processing, ...
      guess what, Python, R, Julia, and the other cool kids already use fortran optimized multithreaded vector code

      web
      coarse grained parallelism: one thread per tab/browser window
      fine grained parallelism: web workers, threaded & vectorized backend libraries

      Not just anything can easily be made multi-threaded, let alone with a performance improvement.
      Funny thing is, most tasks I do scale just fine. See the examples I listed above.

      Games, for example, don't use more threads because doing so adds to latency (and therefore I/O lag). It's also a challenge because many objects in a game need direct access to each other's variables. Usually, each thread's memory access is isolated, but if you want each thread to share memory access, you either lose a lot of performance or you dramatically increase RAM usage (or both), neither of which is desirable.
      Bullshit, here's an example. Try this (code here https://pastebin.com/Rx6nAUnF)
      Code:
      g++ game.cpp -o game -O3 -fopenmp
      ./game > game.csv
      R --vanilla < game.r
      xdg-open game.png
      Multi-threading is only practical when each thread can comfortably work independently and doesn't depend on the synchronicity of others.
      It depends. You have costs C_Seq (cost of the sequential computation of a task), C_Par (cost of the parallelized tasks), and C_Syn (cost of synchronization). The parallel version is faster iff C_Par + C_Syn < C_Seq. You're almost claiming that this is never the case. Synchronization is pretty cheap with 1 to 8 threads. It just is. A modern CPU can do

      GCC is only a single-threaded task, because multi-threading it would likely have diminishing returns or be waaay too complicated to set up. But by compiling multiple files into separate threads at the same time, you get to maximize CPU usage without slowing down anything, because [usually] none of the tasks depend on each other to complete.
      Totally wrong. Modern compilers are multithreaded. Even most GCC projects use multiple threads via cmake/make/meson/pick your build system. It would always be faster to have the threading inside GCC.



      • #23
        There is another factor about those 10-year articles: there are a lot more techniques for exploiting multi-core performance than simply trying to thread serial algorithms into oblivion:

        1.) inner loop parallelization: OpenMP and OpenACC are great (and very easy) here and should be used when your algorithm has a big enough dataset that can be isolated/instanced for each run inside the loop. For example, this works wonders on (sparse) matrix mathematics, since both tools can also generate vectorization in the background.

        2.) outer loop independent execution: this is great when your inner loop is heavily tied together and is not easy to parallelize without heavy synchronization, but your datasets are not tied to each other before the loop operates on them. This should be used when you have read-ahead blocks; that way you can run each block independently of the others while keeping the inner loop serial.

        3.) concurrency: this is the simplest one: simply have threads (or (side) async triggers) run each process that is independent of other processes. For example, why stall the pipeline waiting for an unrelated matrix multiplication to insert data into a DB? Why stall an unrelated read operation while waiting for an unrelated write operation to finish? etc. <-- very heavily used in enterprise business software.

        4.) (on demand) (instanced) virtualized multi-server concurrency model: why have 1 huge binary trying to do everything when you can segment the job into isolated (on demand) specialized instances?

        5.) Async coding: a cheaper (fewer cycles of penalty) version of 3, but it doesn't guarantee the work will run on another thread/core, just that it will not block or wait for another operation (that's up to the OS), and it can be harder to synchronize if you don't know what you are doing.

        6.) Async thread pools: great for triggering large operations from a single thread with certain guarantees that they will not block or wait for the main (spawning) thread; can be very hard if you don't know what you are doing.

        7.) Event programming: instead of having a myriad of busy-wait loops wasting cycles on any sort of input/event, tell the OS to inform you when that input/event actually happens and spawn the correct code to process it. This can be hard for beginners, though; things like libev etc. can help a lot.

        8.) NUMA/parallel-aware allocations: sometimes how you allocate memory/cache can make a huge difference depending on the techniques and hardware available, even for very serial operations.

        9.) Batch processing: sometimes it is hard to parallelize certain operations simply because they are too small for the advantage to be worth the effort, but in many cases you don't need those results in real time. So if you can accumulate enough small operations in a buffer, parallelization/concurrency can have a huge impact on the final result, or at least improve usage in another department (for example, for most databases it is more efficient to insert 1000 rows in 1 operation than to execute 1000 individual inserts).

        10.) adaptive multi-level algorithms: sometimes it is impossible to find one perfectly optimized solution for a problem. Your algorithm may kill it on a quad core but run like trash on 8 cores, or worse, exhibit exponentially diminishing returns; or your code kills it when handling up to 1000 operations but stops scaling and runs like trash at 10000. So instead of finding a miraculous way to be great at both, simply switch between 2 different algorithms at runtime based on the load/hardware. For example, video uses an algorithm named IDCT, and depending on how you intend to process it there are 3 variants: serial integer (most used by players, because you simply won't need more), and MxM integer and MxM float (both massively parallel), which are used mostly by post-processing tools (normally based on CUDA/OpenCL).

        11.) Redesign your algorithms: this one is obvious, of course: if you have a serial algorithm, think a bit and see whether you can find a more parallel approach.

        And there are many more, and obviously you can even mix them depending on your needs, since there is no magic "parallelization" formula or technique or compiler or tool. Sure, at first it seems very complex, but most decent developers these days already use some of these techniques/tools to some extent, and some higher-level languages can even apply them implicitly; the tools have become very decent in these respects, so it is not 1% anymore.

        Today the problem is more about return on investment than actual difficulty at the programming level. For example, most game studios don't go parallel beyond the absolute minimum, not because it is hard, but because there was not enough market pressure to justify the investment. Just think how many people were gaming on 4+ core machines only 6 months ago (Core i3/i5 are way more popular among gamers than the i7, regardless of what any delusional Intel fanboy may claim) and you will realize it made no economic sense to go beyond maxing out a regular quad core, because the people above that threshold are too few to cover the expense of that work. Now that both AMD and Intel are pushing more cores in all market segments, the situation will start to change.

        Disclaimer: I didn't pick those names from Wikipedia, and I'm not a native English speaker either, so the names I used here are translations of how I remember them or learned them over the years in my native language. If you don't like them, post the correct Wikipedia terms below, but the actual functionality is there.



        • #24
          Originally posted by Espionage724 View Post
          What use is single-threaded performance nowadays? Aside from some (older?) games, I would assume most things are multi-threaded to some extent nowadays.
          I've got an example for you: Emulation.

          Multi-threaded or not, emulating anything recent will take all the single-threaded performance you can throw at it. (That's why I worry that I'll have to have two PCs next time I upgrade. One from the last generation of Opteron to not have a PSP and a quarantined Intel machine for gaming.)



          • #25
            Originally posted by caligula View Post
            Please define 'most cases'. How about iterating over all the applications you have and ticking a checkbox for each one that scales to multiple cores? I'll start with a few examples:
            photo applications
            coarse grained parallelism: perform a single operation on a set of images: parallel convert -resize 50% ::: *.jpg
            fine grained parallelism: perform the operation on individual pixels, blocks of pixels, or for each layer.

            audio
            coarse grained parallelism: perform a single operation on a set of audio clips: parallel flac ::: *.wav
            fine grained parallelism: perform the operation on individual samples or blocks of samples

            video
            coarse grained parallelism: perform a single operation on a set of video clips: parallel encoder ::: *.files
            fine grained parallelism: perform the operation on individual frames or blocks of frames: ffmpeg -threads

            compilation
            coarse grained parallelism: make -j16
            fine grained parallelism: a hint about modern threaded compilers: parsing is one of the most time-consuming tasks and, guess what, parsing is 100% embarrassingly parallel. Totally independent. Parsing is a totally pure function String -> AST.

            disk i/o
            coarse grained parallelism: one thread per device / operation
            fine grained parallelism: linux encryption support, multithreaded scrub etc.

            statistics, signal processing, ...
            guess what, Python, R, Julia, and the other cool kids already use fortran optimized multithreaded vector code

            web
            coarse grained parallelism: one thread per tab/browser window
            fine grained parallelism: web workers, threaded & vectorized backend libraries


            Funny thing is, most tasks I do scale just fine. See the examples I listed above.



            Bullshit, here's an example. Try this (code here https://pastebin.com/Rx6nAUnF)
            Code:
            g++ game.cpp -o game -O3 -fopenmp
            ./game > game.csv
            R --vanilla < game.r
            xdg-open game.png

            It depends. You have costs C_Seq (cost of the sequential computation of a task), C_Par (cost of the parallelized tasks), and C_Syn (cost of synchronization). The parallel version is faster iff C_Par + C_Syn < C_Seq. You're almost claiming that this is never the case. Synchronization is pretty cheap with 1 to 8 threads. It just is. A modern CPU can do


            Totally wrong. Modern compilers are multithreaded. Even most GCC projects use multiple threads via cmake/make/meson/pick your build system. It would always be faster to have the threading inside GCC.
            I think the main problem is that many people are too used to marketing terms like "multi-threaded" and confuse them with words like optimized, optimal, usage, etc.

            In practice, if an application uses at least 2 threads it is "multi-threaded", but is it efficient? Maybe on a single/dual core, but probably not on a 64-core CPU.

            Can an application be efficient on a 64-core CPU without threads? Yes, absolutely. Non-threaded and serial are not synonymous.

            Threads are 1 tool in a whole arsenal of tools that can be used to make your code exhibit "efficient scaling", be it parallel or serial (yes, there is such a thing as serial scaling).

            As another note, there is no such thing as purely parallel/threaded code, just as there is no such thing as purely serial code; all code is a mix of both to some degree, even if your preferred language didn't make you do it explicitly. The question is whether the current ratio of serial to parallel operations is efficient enough on the target system.

            Disclaimer: this is not an answer to you personally but to everyone involved in the discussion; I just picked this post since it was the closest.



            • #26
              Originally posted by schmidtbag View Post
              Actually, in most cases, it isn't simple at all. Not just anything can easily be made multi-threaded, let alone with a performance improvement. Games, for example, don't use more threads because doing so adds to latency (and therefore I/O lag). It's also a challenge because many objects in a game need direct access to each other's variables. Usually, each thread's memory access is isolated, making it so objects can't read the variables of others. But if you want each thread to share memory access, you either lose a lot of performance or you dramatically increase RAM usage (or both), neither of which is desirable.

              Multi-threading is only practical when each thread can comfortably work independently and doesn't depend on the synchronicity of others. Take software compiling for example: in reality, GCC is only a single-threaded task, because multi-threading it would likely have diminishing returns or be waaay too complicated to set up. But by compiling multiple files into separate threads at the same time, you get to maximize CPU usage without slowing down anything, because [usually] none of the tasks depend on each other to complete.
              You're preaching to the choir. I have a lot of experience writing multi-threaded and multi-process code using low-level threading primitives like atomics and rwlocks, in addition to mutexes and higher-level channel-based approaches, along with *nix forks, FD redirections, signal handling, and job control as creator of the Ion shell. It still stands that writing multi-threaded software is incredibly simple today -- especially if you are writing it in Rust, which lets you reason about lifetimes and the mutability of your references across thread boundaries and define thread-ability with the Send + Sync traits, all to ensure that your multi-threaded solution is safe at compile time. You would be surprised at how many opportunities to use threads have been completely overlooked.

              In regards to overhead of creating threads, that's not really relevant to my point. My point was that it's incredibly easy to write software that uses multiple threads. Yet there are ways of circumventing the overhead of threads, and that all comes down to the sort of architecture you've chosen to write your software with. Sure, a program that's written with a single thread in mind may be very difficult to parallelize, but I can guarantee you that there are ways to structure your software so that you can use a couple extra threads. Plenty of games have been released that have been able to take advantage of large amounts of cores, so it's not impossible. It's very doable.

              To create software that takes full advantage of multiple threads, you need to take these points into consideration:
              • Don't choose a programming language that requires a runtime (that includes Go); it greatly minimizes the scenarios where you can use multiple threads.
              • Instead of spawning threads ad-hoc, use a thread pool. There are a number of ways to do that: my rayon example does it, and it can be done with futures.
              • I/O tasks can easily be made async, which doesn't necessarily mean multiple threads, but it is another form of executing code simultaneously.
              • Try to reach for atomics first, then rwlocks, then mutexes, and finally channels.
              • On *nix systems, file descriptors can be a useful form of communication between parent and child. See the pipe & dup2 syscalls.



              • #27
                Originally posted by ssokolow View Post

                I've got an example for you: Emulation.

                Multi-threaded or not, emulating anything recent will take all the single-threaded performance you can throw at it. (That's why I worry that I'll have to have two PCs next time I upgrade. One from the last generation of Opteron to not have a PSP and a quarantined Intel machine for gaming.)
                I'm really happy that someone brought this up. Yes, emulation is one of the most difficult tasks. Some architectures can't be efficiently emulated at all using multiple threads - for example, some special VLIW DSP architectures.



                • #28
                  Originally posted by caligula View Post
                  Please define 'most cases'. How about iterating over all the applications you have, adding a cross in the checkbox if it scales to multiple cores. I'll start with few examples:
                  When I say "applications" I mean individual programs, not potential uses or categories. You seem to be referring to the latter, in which case your point is true. But the fact of the matter is, most software is single-threaded, and if parallelized, it is done so via GPUs or independent threads (like GCC). Many of the examples you gave are also done via separate independent threads, which I have already established is the most efficient way to utilize multiple cores.
                  Funny thing is, most tasks I do scale just fine. See the examples I listed above.
                  Except you're forgetting all the background processes and ignoring specifics. For example, with web browsers, some have 1 thread for the main logic and a 2nd thread for rendering. Other browsers do 1 thread per tab. That's not parallelization, that's just efficient multi-tasking. This is important to distinguish, because what mmstick was trying to say is that you can easily make any single task multi-threaded. But you can't tell Firefox or Chrome to use 4 cores for a single tab, and I don't think either company has any intention of ever changing that. This is my point.

                  Bullshit, here's an example. Try this (code here https://pastebin.com/Rx6nAUnF)
                  Code:
                  g++ game.cpp -o game -O3 -fopenmp
                  ./game > game.csv
                  R --vanilla < game.r
                  xdg-open game.png
                  How about you point out a game that's more than 65 lines where each player does something other than move? Skimming through the code, I don't see anywhere that each player's thread needs to reference data from another player's thread. In a real game that's actually fun to play, variables from other objects need to be accessed after a thread has already been launched. If the threads are running in parallel, they aren't going to have the same information at the same time. For example: if player 1 and player 2 try collecting the same object simultaneously, both threads will think they have the object, when obviously that can't happen. They need to synchronize their data at some point, and doing that is a slow process. This is why AAA titles don't process each player in separate threads - it's just too complicated and inefficient.

                  If you're so certain you're right, explain to me why no studios are doing things the way you claim. Explain to me why, even to this day, 4 cores is still enough to play modern games.

                  It depends. You have costs C_Seq (cost of the sequential computation of a task), C_Par (cost of the parallelized tasks), C_Syn (cost of synchronization. Parallel version is faster iff C_Par + C_Syn < C_Seq. You're almost claiming that this is never the case. Synchronization is pretty cheap with 1 to 8 threads. It just is. A modern CPU can do
                  No, I'm not. I'm just saying that most tasks that are single-threaded will not benefit from being multi-threaded. I'm well aware there are many tasks that improve dramatically from being parallel - for example, particle physics, video encoding, vectors, image rendering, and so on. But not everything scales so smoothly once completion time and complexity vary and synchronicity is a requirement. For example: 5 construction workers can lay bricks much faster than a single body-builder. But that body-builder can carry more bricks than the 5 others combined. When you need a bunch of bricks brought to 1 location, the body-builder doesn't have to communicate with the others about who goes first, how many bricks are left, where the bricks should be placed, etc. It's faster and more efficient to just use the body-builder.

                  Totally wrong. Modern compilers are multithreaded. Even most GCC projects use multiple threads via cmake/make/meson/pick your build system. It would always be faster to have the threading inside GCC.
                  Again, I don't think you're understanding the specifics. To reiterate: parallelization is not the same thing as multitasking. To split a single task across multiple cores, it has to actually be parallelized. Unless you can prove me wrong, GCC doesn't do that - it processes separate files individually. Otherwise, go ahead and tell GCC to compile 1 single .cpp file using 4 cores.
                  Last edited by schmidtbag; 31 August 2017, 01:15 PM.



                  • #29
                    Originally posted by mmstick View Post

                    You're preaching to the choir. I have a lot of experience writing multi-threaded and multi-process code using low-level threading primitives like atomics and rwlocks, in addition to mutexes and higher-level channel-based approaches, along with *nix forks, FD redirections, signal handling, and job control as creator of the Ion shell. It still stands that writing multi-threaded software is incredibly simple today -- especially if you are writing it in Rust, which lets you reason about lifetimes and the mutability of your references across thread boundaries and define thread-ability with the Send + Sync traits, all to ensure that your multi-threaded solution is safe at compile time. You would be surprised at how many opportunities to use threads have been completely overlooked.

                    In regards to overhead of creating threads, that's not really relevant to my point. My point was that it's incredibly easy to write software that uses multiple threads. Yet there are ways of circumventing the overhead of threads, and that all comes down to the sort of architecture you've chosen to write your software with. Sure, a program that's written with a single thread in mind may be very difficult to parallelize, but I can guarantee you that there are ways to structure your software so that you can use a couple extra threads. Plenty of games have been released that have been able to take advantage of large amounts of cores, so it's not impossible. It's very doable.

                    To create software that takes full advantage of multiple threads, you need to take these points into consideration:
                    • Don't choose a programming language that requires a runtime (that includes Go); it greatly minimizes the scenarios where you can use multiple threads.
                    • Instead of spawning threads ad-hoc, use a thread pool. There are a number of ways to do that: my rayon example does it, and it can be done with futures.
                    • I/O tasks can easily be made async, which doesn't necessarily mean multiple threads, but it is another form of executing code simultaneously.
                    • Try to reach for atomics first, then rwlocks, then mutexes, and finally channels.
                    • On *nix systems, file descriptors can be a useful form of communication between parent and child. See the pipe & dup2 syscalls.
                    Agreed; in my previous post I forgot about atomics and the channels approach, but yeah, you are pretty much spot on. The problem is that people hear "threads" and believe every function in your program has to specifically spawn (CPU count) threads or it is not "multi-threaded", which in turn most people confuse with parallel scaling.

                    Btw, on Linux since 3.17 you can use Kay Sievers' memory FDs (memfd_create); I've been testing them and they help tons when doing zero-copy, or even certain types of buffer arrangements where cleanup is hard, since all the actual paging, cleaning, etc. is handled kernel-side, and the seal operation is kinda great as well. Very, very handy. Not sure if Rust can use them, though, since I'm mostly a C/C++ guy and I hate the Rust syntax too much to bother (I'm too old and too used to C/C++, but I'm not saying it's bad or anything; it's just the syntax that makes my eyes bleed).



                    • #30
                      Does Threadripper suffer from the same issues as Ryzen on Linux?

