How would this ever match the performance or power budget of a properly implemented 2D engine and hw overlays?
All people need to do is implement proper buffer synchronization (why was fencing only an afterthought with dma-buf?) and actually write the simple driver code for the fixed-function dedicated engines...
Modern hardware doesn't have that. Vendors run the 2D code on the 3D engine anyway, so if you can run it through an existing API/driver codebase with good enough performance, why waste development time reinventing the wheel?
Of course, it hasn't yet been proven that OpenGL provides that good enough performance, but if it can, there's no reason not to use it.
General code will never reach 100% of the possible speed, but by that same argument Mesa itself shouldn't be used and every driver should be a from-scratch 3D driver with no shared codebase at all. At some point you hit diminishing returns, and it doesn't make sense to spend hundreds of hours just to save an extra clock cycle here and there.
Ask yourself: why does Android have hwcomposer, and why did Wayland suddenly grow an equivalent a bit later on (with the hastily implemented KMS planes to boot)...
A 3D engine is huge, quite powerful, and very, very versatile (these days). That means setting it up for a simple operation wastes a lot of CPU cycles. Then there's the power draw of a 3D engine constantly spinning up for tiny little things, plus the power cost of all those wasted CPU cycles...
A 2D engine is usually nothing more than something which takes two buffers (with their metadata), performs a simple operation on them, and writes the result to another buffer. How much setup does that require, d'you think? How efficient will that silicon be? All you have to do is make sure the buffers are synchronized, and heck, you get nice interrupts for that.
A hw overlay, again, is hardware designed specifically for this task. All you have to do is power it up, point it at the buffers, and tell it where on screen to display them, and it will raise a nice interrupt each time it starts scanning out a new buffer. Again, much, much more efficient.
All one has to do is stop being lazy and shortsighted.
Or be lazy and shortsighted, and keep using X.