But every kernel has to load its inputs from main memory and write its results back after the computation. So splitting the work into too many small kernels can make your code memory-bound....
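To make the memory cost concrete, here is a minimal sketch that counts main-memory traffic for the same elementwise computation, `max(a * x + b, 0)`, run either as three small kernels or as one combined kernel. It is a pure-Python model rather than real GPU code, and every name in it (`elementwise_kernel`, `scale_shift_relu_unfused`, `scale_shift_relu_fused`, the `traffic` counter) is hypothetical and chosen only for illustration. It assumes each kernel reads every input element once and writes every output element once, with no caching between launches.

```python
def elementwise_kernel(src, fn, traffic):
    """Model one kernel launch: read each input element from main memory,
    apply fn, and write each result back to main memory."""
    out = []
    for value in src:
        traffic["loads"] += 1    # one read from main memory
        out.append(fn(value))
        traffic["stores"] += 1   # one write back to main memory
    return out


def scale_shift_relu_unfused(x, a, b):
    """Three small kernels; each intermediate makes a round trip through memory."""
    traffic = {"loads": 0, "stores": 0}
    t = elementwise_kernel(x, lambda v: a * v, traffic)        # kernel 1: scale
    t = elementwise_kernel(t, lambda v: v + b, traffic)        # kernel 2: shift
    y = elementwise_kernel(t, lambda v: max(v, 0.0), traffic)  # kernel 3: clamp at 0
    return y, traffic


def scale_shift_relu_fused(x, a, b):
    """One kernel doing all three steps; intermediates never leave the kernel."""
    traffic = {"loads": 0, "stores": 0}
    y = elementwise_kernel(x, lambda v: max(a * v + b, 0.0), traffic)
    return y, traffic


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_unfused, traffic_unfused = scale_shift_relu_unfused(x, a=2.0, b=1.0)
y_fused, traffic_fused = scale_shift_relu_fused(x, a=2.0, b=1.0)

print("unfused:", traffic_unfused)
print("fused:  ", traffic_fused)
```

Under these assumptions both versions do the same arithmetic and produce the same result, but the three-kernel version moves three times as many elements to and from memory. On real hardware the actual ratio depends on caches, data types, and launch overhead, so treat the counts as a rough model only.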