And, just maybe, because register and ALU width ain't free. Worse, if you look at how ALU operations are implemented, the critical path length grows *at least* as log2(n) for an n-bit word length. So a wider chip will not only be hotter and bigger (and thus more expensive), but also slower.

Compare that with vector arithmetic: element-wise operations on a k-element vector only occupy k times as much area as the logic and datapath for operating on a single one of those elements, and since the lanes don't depend on each other, the critical path stays the same.
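To make the two scaling claims concrete, here is a rough sketch in Python. It models a Kogge-Stone parallel-prefix adder, which is my choice for illustration and not necessarily what any particular ALU uses. The function names `kogge_stone_add` and `vector_add` are made up for this example, and counting prefix levels is only a crude stand-in for critical path length (it ignores wire delay and fan-out).

```python
def kogge_stone_add(a: int, b: int, n: int) -> tuple[int, int]:
    """Add two n-bit values with a Kogge-Stone prefix network.

    Returns (sum mod 2**n, number of prefix levels used).
    """
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]
    levels = 0
    d = 1
    while d < n:
        # One level of logic: every bit combines with the bit d positions
        # below it. In hardware all of these happen in parallel.
        G_new, P_new = G[:], P[:]
        for i in range(d, n):
            G_new[i] = G[i] | (P[i] & G[i - d])
            P_new[i] = P[i] & P[i - d]
        G, P = G_new, P_new
        levels += 1
        d *= 2

    # After the last level, G[i] is the carry out of bits 0..i,
    # so the carry into bit i is G[i - 1].
    s = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        s |= (p[i] ^ carry_in) << i
    return s, levels


def vector_add(xs: list[int], ys: list[int], n: int) -> tuple[list[int], int]:
    """Element-wise add of n-bit lanes: k independent adders side by side.

    Area scales with the number of lanes, but the depth is that of one lane.
    """
    results = [kogge_stone_add(x, y, n) for x, y in zip(xs, ys)]
    sums = [s for s, _ in results]
    depth = max((lv for _, lv in results), default=0)
    return sums, depth


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        _, levels = kogge_stone_add(1, 1, width)
        print(f"{width}-bit scalar add: {levels} prefix levels")
    _, depth = vector_add([1, 2, 3, 4], [5, 6, 7, 8], 8)
    print(f"4 x 8-bit vector add: {depth} prefix levels")
```

In this model an 8-bit add needs 3 prefix levels and a 64-bit add needs 6, while four 8-bit lanes together still need only 3: more lanes cost more adders, not more depth.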

