Originally posted by PerformanceExpert
One of my early experiences with this was optimizing a large separable convolution on a VLIW DSP*. I got the 1D pass nice and fast. Performance was quite near the theoretical peak. Then, I just had to optimize the transpose, which was obviously thrashing the cache (this CPU had only about 8 KB of L1 cache and no L2... it was a while ago). At the urging of my boss, I measured the relative times of the convolution passes and the transpose... and even though the transpose was horribly inefficient for what it was, it was still insignificant compared with the actual convolution. So, it just goes to show the golden rule of optimization: measure first, because data is king!
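For anyone unfamiliar with the structure being described, here is a minimal sketch of a separable pass: a 1D row convolution, plus a transpose so the second direction can also be processed along contiguous rows. The function names, float types, and the deliberately naive transpose are my own assumptions for illustration, not the original DSP code.

```c
#include <stddef.h>

/* 1D "valid" convolution of one padded row:
 * out[i] = sum over k of row[i + k] * kernel[k], for i in [0, width). */
static void convolve_row(const float *row, float *out, size_t width,
                         const float *kernel, size_t klen)
{
    for (size_t i = 0; i < width; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < klen; k++)
            acc += row[i + k] * kernel[k];
        out[i] = acc;
    }
}

/* Naive transpose: src is rows x cols, dst is cols x rows.
 * On a tiny L1 with no L2, the strided writes thrash the cache, yet the
 * total work is still small next to the convolution itself. */
static void transpose(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}
```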
* In truth, it's a little bit of a misnomer to call this chip a DSP. Most DSPs have a 2:1 load-to-store ratio, but this chip had a 1:1 ratio. However, it also had 64 registers. So, what I did was break the kernel into fixed-size chunks that I could pre-load into a set of registers. Then, I convolved the input row one chunk at a time, accumulating the intermediates in a temporary buffer. The row and the temporary buffer were both small enough to fit in L1 cache. Also, since this was written in C, I was continually looking at the compiler output to make sure it didn't generate spills or have too many nop slots. These fast convolutions were one of our signature features and helped sell a lot of product.
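A rough sketch of that chunking scheme in portable C (not the original DSP code): the kernel is split into fixed-size groups of taps, each group is copied into locals standing in for the pre-loaded registers, and partial sums are accumulated into a temporary row buffer that stays resident in L1. CHUNK, the function name, and the float types are assumptions for illustration.

```c
#include <stddef.h>

#define CHUNK 8  /* taps per chunk; chosen so one chunk fits in registers (assumption) */

/* Chunked 1D convolution: process the kernel CHUNK taps at a time,
 * accumulating partial results for the whole row into acc[]. */
static void convolve_row_chunked(const float *row, float *acc, size_t width,
                                 const float *kernel, size_t klen)
{
    for (size_t i = 0; i < width; i++)
        acc[i] = 0.0f;

    for (size_t base = 0; base < klen; base += CHUNK) {
        size_t n = (klen - base < CHUNK) ? klen - base : CHUNK;

        /* Copy this chunk of taps into locals; with enough architectural
         * registers the compiler can keep them resident (and spill-free)
         * for the entire sweep over the row. */
        float taps[CHUNK];
        for (size_t k = 0; k < n; k++)
            taps[k] = kernel[base + k];

        /* Sweep the row once per chunk, adding this chunk's contribution. */
        for (size_t i = 0; i < width; i++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += row[i + base + k] * taps[k];
            acc[i] += sum;
        }
    }
}
```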