task A -> task B -> task C -> task D -> task E -> task F ...
You may split this into two threads, like this:
1st thread: task A -> task B -> task C -> write the output of C to a work queue.
2nd thread: read the work queue -> task D -> task E -> task F
Now where would you do the split in the current stack? Before st/mesa? After st/mesa and before the driver? Between the driver and libdrm? Somewhere else? Note that the split itself costs a lot of CPU cycles, since every hand-off through the queue needs synchronization.
Of course we could use OpenMP, Threading Building Blocks, etc. for some algorithms, but that would give us very little speedup, nowhere near enough to keep two cores fully loaded.
I'd really like to see an actual plan instead of arguments that we should use some threading library. No library will magically use all your cores. And by the way, Mesa already uses NPTL; it didn't help much, did it?