**quaz0r** · 29 August 2016, 08:34 PM

indeed, openmp is a cheap kludgy hack for lazy programmers. that said, if you are going to use a piece of software anyway, and the choice is between it being single threaded or multi threaded, you may as well be able to use its multi threading.

**caligula** · 29 August 2016, 09:13 PM

Originally posted by coats

I'll see your 10 years and raise you 10 more. And I was running my OpenMP apps efficiently on 32-processor SGI Altix systems 15 years ago, using all 32 processors, and still well on the "good" part of the scaling-curve (that was the largest system the bean-counters would let me have :-). FWIW.

Cool. Yes I haven't had access to the cool old skool systems. My first OpenMP system was a dual socket Pentium 3 workstation. I already had a faster single processor system, but this was pretty bad-ass with its 15k RPM drives and ECC RAM. It took several minutes to boot, but was pretty decent once up and running.

**caligula** · 29 August 2016, 09:16 PM

Originally posted by foobaz

I use FreeBSD and I'm glad they don't enable OpenMP. OpenMP is not the future of parallel computing, it is an abandoned dead end. It's crap and we should be phasing it out, not pushing for its adoption. Pthreads or C11 threads are the way to go.

Not sure if trolling. Pthreads and C11 threads lack most of the functionality provided by the latest OpenMP. Sure, OpenMP is a dead end, but all Fortran/C/C++ frameworks are equally bad. Very few people know which frameworks and languages will be used for multicore programming in 10 years.

**c117152** · 29 August 2016, 10:16 PM

The amount of context switching with OpenMP isn't trivial. Like auto-vectorization, the workloads where it's beneficial are limited to niches and the hardware they excel in is very expensive. And the rest of the time, you're actually getting a performance drop from using those compiler flags on inappropriate hardware.

If you look up the default configure flags of some of the image and video encoding and decoding libraries, you'll find that even there, where they should push for it, they're disabling them. In fact, they especially use manual techniques (manual vectorization and old school Duff's devices & locks), since when you actually have a reason to optimize, you don't play around with cross-platform, automated abstractions.

So, why should the BSDs even bother? Give us one example of a real-world load where OpenMP matters and that the majority of people running BSD should care about. Otherwise, it's a pipe-dream niche for high-level garbage-collected languages to chase after rather than something a C system should bother with.

**nslay** · 30 August 2016, 12:18 AM

This is a regression. FreeBSD 7.0 had OpenMP by default with GCC 4.2. I was using it too.

[base] Index of /stable/7/contrib/gcclibs/libgomp

https://svnweb.freebsd.org/base/stable/7/contrib/gcclibs/libgomp/

OpenMP was there and usable until Clang replaced GCC 4.2. The FreeBSD camp siding against default OpenMP is in the wrong. This was present originally and Clang broke it. It should be a PR. Seriously, file a PR: "FreeBSD 10 doesn't have OpenMP when FreeBSD 6, 7, 8 had it".

**caligula** · 30 August 2016, 05:07 AM

Originally posted by c117152

The amount of context switching with OpenMP isn't trivial. Like auto-vectorization, the workloads where it's beneficial are limited to niches and the hardware they excel in is very expensive. And the rest of the time, you're actually getting a performance drop from using those compiler flags on inappropriate hardware.

If you look up the default configure flags of some of the image and video encoding and decoding libraries, you'll find that even there, where they should push for it, they're disabling them. In fact, they especially use manual techniques (manual vectorization and old school Duff's devices & locks), since when you actually have a reason to optimize, you don't play around with cross-platform, automated abstractions.

So, why should the BSDs even bother? Give us one example of a real-world load where OpenMP matters and that the majority of people running BSD should care about. Otherwise, it's a pipe-dream niche for high-level garbage-collected languages to chase after rather than something a C system should bother with.

I can easily give some simple use cases. For example, many applications deal with multiple files. For instance, your favorite file manager probably renders thumbnails of files. Now, generating one thumbnail could take a second. If you have 8 or 16 cores, would it be faster to use OpenMP's thread pool for the task? Why not? Auto-vectorization is not really an option here: the problem is task parallel, while auto-vectorization with AVX is data parallel.
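A minimal sketch of the task-parallel case described above, assuming each thumbnail is an independent job. `fake_thumbnail` and `make_thumbnails` are hypothetical names, and the per-file work is a CPU-burning stand-in rather than real image decoding. `schedule(dynamic)` is picked on the assumption that files take uneven amounts of time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive, independent per-file job
   (e.g. decoding and scaling one image). It only burns CPU here. */
static unsigned long fake_thumbnail(unsigned long seed)
{
    unsigned long h = seed + 1469598103UL;
    for (int i = 0; i < 200000; i++)
        h = (h ^ (unsigned long)i) * 16777619UL;
    return h;
}

/* Process n independent jobs. Each iteration writes only outputs[i],
   so the iterations can run in any order on any thread. */
void make_thumbnails(const unsigned long *inputs, unsigned long *outputs, size_t n)
{
    assert(inputs != NULL && outputs != NULL);

    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)n; i++)
        outputs[i] = fake_thumbnail(inputs[i]);
}
```

Built with `-fopenmp`, the iterations are handed to the OpenMP runtime's threads; without the flag the pragma is ignored and the same loop runs serially with identical results.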

**c117152** · 30 August 2016, 09:24 AM

Originally posted by caligula

I can easily give some simple use cases. For example, many applications deal with multiple files. For instance, your favorite file manager probably renders thumbnails of files. Now, generating one thumbnail could take a second. If you have 8 or 16 cores, would it be faster to use OpenMP's thread pool for the task? Why not? Auto-vectorization is not really an option here: the problem is task parallel, while auto-vectorization with AVX is data parallel.

So, all I have to do is get an 8-16 core processor and disk/bus I/O fast enough not to bottleneck, and I'll enjoy the benefits of multi-processing... Besides, you wouldn't use OpenMP in that case. Since it is a shared memory model, you'd just serialize on the rendering when the thread returns its output. Look up your web browser's image rendering code. It's been profiled extensively in both Firefox and Chrome and wasn't found to be very useful.

**jrch2k8** · 30 August 2016, 10:10 AM

Oh come on people, there is so much utter FUD here that I believe not one of the posters has ever touched a compiler.

1.) Pthreads/C++11+ threads: They don't have anything to do with OpenMP and cannot be a drop-in replacement for it. Seriously, to replace OpenMP you need surgical mastery of pthreads low-level constructs and syscalls, a very deep understanding of kernel allocator models, and you have to actually hand-write SIMD code (sometimes with helper functions in ASM).

2.) Autovectorization: It is not part of OpenMP but part of the compiler, and regardless, the compiler is nowhere near as good as handwritten SIMD code except for very, very obvious low-hanging fruit. Yeah, not even the almighty ICC. Compilers can show you ASM output for a reason, people ....

3.) OpenMP SIMD loops are very specific case templates with very well written SIMD code (like reductions), nothing else. It will not magically take your crappy code and vectorize it, especially since most geniuses out there fail to understand the basic term "Vector".

4.) OpenMP is really performant and is not dead by any standard; it is just good enough to remain stable for a while.

5.) OpenMP doesn't require 16 cores, for fuck's sake. Even a crappy dual-core Pentium D can see nasty performance jumps with it; it just requires that the developer isn't brain dead.

6.) OpenMP is an INNER LOOP tool. It is not for starting functions, it is not for doing arithmetic on 3 ints, it is not for parallelizing operations, etc.

7.) The only requirement for OpenMP to be badass is that you have enough data and loop iterations. If you use OpenMP to calculate a 4x4 matrix in 2 iterations, you are an idiot and it is not OpenMP's fault; if you want to sort a list of 20 items with OpenMP, you are an idiot and it is not OpenMP's fault, etc.

8.) If your operations require low latency and are small enough, or are not LOOPS, please USE ASYNC THREADPOOLS, there is nothing more efficient. If you are processing enough data that the TIME spent spawning a thread is irrelevant, USE OPENMP, there is nothing more efficient. If even with OpenMP the system is collapsing and slower than you need, USE OPENCL, A FREAKING CPU IS NOT ENOUGH HERE.

OpenMP/ThreadPools/OpenCL/HSA are not the SAME. Each one is a different tool for different PARALLEL SCENARIOS and they were never meant to replace each other, for the same reason nobody uses a fucking Prius to transport fuel or glues wings to a motorcycle to transport people: each vehicle type is designed for a specific purpose. It's just that easy, people, it is not rocket science.
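To make points 3 and 7 concrete, here is a hedged sketch of the kind of inner loop OpenMP targets: a reduction over a large array. The function name `dot` and the threshold `100000` are assumptions chosen for illustration, not measured values. The `if` clause keeps small inputs on a single thread, which is one way to avoid paying threading overhead when there are too few iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two arrays of length n. */
double dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;

    /* Each thread accumulates a private partial sum that OpenMP combines
       at the end. The threshold in if(...) is an illustrative guess. */
    #pragma omp parallel for reduction(+:sum) if(n > 100000)
    for (long i = 0; i < (long)n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

For a 4-element input this stays serial; the parallel path only kicks in once the array is large enough for the work to outweigh the thread startup cost, and the right cutoff would have to be found by profiling on the target machine.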

**cade** · 30 August 2016, 07:45 PM

Originally posted by caligula

Well, I guess FreeBSD doesn't run on multi-core systems then. Seems rather stupid to waste 50-96% of the performance potential since your programming paradigm was invented in the 1970s. The basic work sharing constructs in OpenMP are already years old. I already implemented some programming assignments in school some 10 years ago with OpenMP 2.0.

Wrong.
The threads extension to POSIX (i.e. PTHREADS) dates back to 1995, and FreeBSD etc. supported pthreads in the late 1990s.
Other "UNIX THREAD" APIs also existed. Multiprocessing, i.e. multiple-CPU-based processing, was available in FreeBSD
in 1998 (SMP support in FreeBSD 3.0), before the general availability of multi-core CPUs. Obviously, when
multi-core CPUs appeared years ago, they were instantly supported by OSes like FreeBSD, which already had the threading etc.
infrastructure in place.

"FreeBSD 1" was released in November 1993, not in the 1970s.
Unlike OpenMP, PThreads is generic enough to be used for any kind of parallelism.
The reason for this is that PThreads is a lower-level API while OpenMP represents a higher-level API.
Of course you would have done some OpenMP at school years ago; it is easier but less powerful
due to its higher-level-API nature. As an analogy, in the mid-1980s I was taught the
BASIC computer language on an Apple IIe in high school, not an assembly language.

PThreads offers more "plumbing", so more time is spent on the design/implementation/testing of
threaded solutions, but these solutions can be more creative/novel than what is possible with threaded solutions
implemented using the restricted (less plumbing-like) nature of OpenMP.

There's a reason why threading solutions (paradigms) like PThreads, OpenMP, etc. exist.
They offer different programming paradigms that cater for different types of
thread-solution implementations.
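As a rough illustration of the extra "plumbing" mentioned above, here is a sketch of an array sum written directly against POSIX threads. The names `sum_pthreads`, `sum_chunk`, `struct chunk` and the fixed `NUM_THREADS` are assumptions made for the example; it shows the manual chunking, thread creation and joining that an OpenMP `parallel for` with a reduction would otherwise generate.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: fixed worker count for the sketch */

struct chunk {
    const double *data;
    size_t begin, end;   /* half-open range [begin, end) */
    double partial;      /* result written by the worker */
};

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        s += c->data[i];
    c->partial = s;
    return NULL;
}

double sum_pthreads(const double *data, size_t n)
{
    assert(data != NULL);
    pthread_t tid[NUM_THREADS];
    struct chunk chunks[NUM_THREADS];
    int started[NUM_THREADS];
    double total = 0.0;

    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].data = data;
        chunks[t].begin = n * (size_t)t / NUM_THREADS;
        chunks[t].end = n * (size_t)(t + 1) / NUM_THREADS;
        chunks[t].partial = 0.0;
        started[t] = pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]) == 0;
        if (!started[t])
            sum_chunk(&chunks[t]);   /* could not spawn: do the chunk inline */
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += chunks[t].partial;
    }
    return total;
}
```

This typically needs `-pthread` at compile and link time. The upside of doing it by hand is full control over how work is split and when threads live; the cost is that all of the bookkeeping above is the programmer's responsibility.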

For me, I have developed/used C++ wrappers for the PThreads/Windows thread APIs over the last ~15 years,
amongst other programming tasks, and have had no reason to leave the low-level thread programming model.
Actually, one role of the C++ wrapper is to "hide" the low-level plumbing detail and present
convenient/intuitive interfaces to this detail where required, while still being able to implement
powerful/intricate, but simple/clear, threading solutions.

It is foolish to think that OSes in the class of FreeBSD had a problem
taking advantage of multi-{CPU, core} hardware.

**caligula** · 31 August 2016, 02:44 AM

Originally posted by c117152

So, all I have to do is get an 8-16 core processor and disk/bus I/O fast enough not to bottleneck, and I'll enjoy the benefits of multi-processing... Besides, you wouldn't use OpenMP in that case. Since it is a shared memory model, you'd just serialize on the rendering when the thread returns its output. Look up your web browser's image rendering code. It's been profiled extensively in both Firefox and Chrome and wasn't found to be very useful.

Let me explain what the problem is. When generating thumbnails of various sizes, the input can be, for example, a 50 megapixel RAW image. Such a file can grow up to 50 MB these days. They sometimes come with embedded previews, sometimes not, and sometimes the previews have a useless size. When you scale down a 50 MP image into a Full HD preview or, let's say, a 128x128 pixel icon, the rendering speed is not an issue. Indeed, the icons can be drawn sequentially. The backend task for scaling the image is 100% independent. The program only depends on the output image. You can even do it in a seccomp sandbox. Scaling the image may take a few seconds. Synchronizing two threads takes a few milliseconds at most. If you ever use a program like Adobe Lightroom, you'll see that this operation is really demanding and stresses the CPU. You really want to max out all the cores. If one thumbnail is generated in a second, a 200-picture photoshoot will take over 3 minutes to render. That's way too long when you just wanted to see the contents of one folder. Libjpeg-turbo might speed this up by a factor of 2 or so, but an OpenMP solution can easily speed it up 16x or more on a Xeon.
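The arithmetic in the post can be checked with a small, idealised model. This is only a sketch: `batch_seconds` is a hypothetical helper, and it assumes every job takes the same time and that workers never wait on disk I/O or on each other, which real thumbnailing will not fully satisfy.

```c
#include <assert.h>

/* Idealised wall-clock estimate for a batch of equal-cost, independent jobs
   spread over a fixed number of workers. */
double batch_seconds(unsigned jobs, double seconds_per_job, unsigned workers)
{
    assert(workers > 0);
    unsigned rounds = (jobs + workers - 1) / workers;   /* ceil(jobs / workers) */
    return rounds * seconds_per_job;
}
```

With 200 pictures at one second each, one worker needs 200 s (3 min 20 s), matching the "over 3 minutes" figure. With 16 workers the model gives 13 rounds, so about 13 s; that is slightly short of a full 16x because 200 is not a multiple of 16, and it is an upper bound on what the cores alone can deliver.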

Why FreeBSD Doesn't Aim For OpenMP Support Out-Of-The-Box
