-mtune= only selects the CPU type to optimize instruction scheduling for. -march= selects the CPU type whose instruction set the compiler is allowed to use.
This certainly makes a difference, but it does not show in just any application, which is what the article is trying to show, by the way. For example, -march=k8 will select the standard x86 instruction set up to and including SSE2, -march=amdfam10 will further include the SSE3 and SSE4A instructions, and -march=bdver1 will also include the SSE4.1, SSE4.2 and AVX instructions.
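You can see this part directly without reading any assembly, because -march also defines the matching preprocessor macros. Here is a minimal sketch (the file name is made up, the macros are the standard ones GCC defines for these extensions):

/* isa_check.c — prints which vector ISA extensions the chosen -march
 * enables. Compile it a few times with different -march values:
 *   gcc -march=k8       isa_check.c && ./a.out
 *   gcc -march=amdfam10 isa_check.c && ./a.out
 *   gcc -march=bdver1   isa_check.c && ./a.out
 */
#include <stdio.h>

int main(void)
{
#ifdef __SSE2__
    puts("SSE2 enabled");
#endif
#ifdef __SSE3__
    puts("SSE3 enabled");
#endif
#ifdef __SSE4A__
    puts("SSE4A enabled");
#endif
#ifdef __SSE4_1__
    puts("SSE4.1 enabled");
#endif
#ifdef __SSE4_2__
    puts("SSE4.2 enabled");
#endif
#ifdef __AVX__
    puts("AVX enabled");
#endif
    return 0;
}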
Whereas -mtune=k8 will tell the compiler to schedule instructions for an L1=64k/L2=512k cache configuration and that moving an MMX/SSE register to an integer register has a cost of 5. Using -mtune=amdfam10 will make it use the same L1/L2 configuration, but lowers the cost of MMX/SSE-to-integer moves to 3. And -mtune=bdver1 will use L1=16k/L2=2048k for the cache sizes and lowers that cost further to 2.
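A rough way to observe this yourself is to compile the same source twice with different -mtune values and diff the assembly: the instruction set stays the same, but the ordering and instruction choices usually differ. A sketch (file name and kernel are just placeholders, the exact diff depends on your GCC version):

/* mtune_demo.c — -mtune does not change which instructions are allowed,
 * only how they are chosen and ordered:
 *   gcc -O2 -march=k8 -mtune=k8     -S -o k8.s     mtune_demo.c
 *   gcc -O2 -march=k8 -mtune=bdver1 -S -o bdver1.s mtune_demo.c
 *   diff k8.s bdver1.s
 */
float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}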
There are a lot more parameters hidden behind these switches; these are just some of the ones GCC uses to make its decisions. The setting "generic" will simply pick good average values for all of them. The differences will not show unless you know exactly what to look for and choose an application that you know will benefit significantly from it. Only with very precise benchmarking tools and setups can one detect the difference this makes for other applications. The results usually vary so much that one needs many runs before a clear difference becomes visible, because the differences are tiny and the run-to-run variation adds a lot of noise to the measurements. Hence the focus on ImageMagick and C-Ray.
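As a rough sketch of the "many runs" idea: time the same kernel repeatedly and keep the best (or median) result, since a single run is far too noisy to show a few-percent tuning difference. The saxpy kernel and run count here are just placeholders:

/* bench.c — repeat the measurement many times and report the minimum.
 * Build the same file with different -march/-mtune flags and compare.
 * (On very old glibc you may additionally need -lrt for clock_gettime.)
 */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
#define RUNS 100

float x[N], y[N];

static void saxpy(float a)
{
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double best = 1e9;
    for (int r = 0; r < RUNS; r++) {
        double t0 = now();
        saxpy(2.0f);
        double t1 = now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best of %d runs: %.6f s (y[0]=%f)\n", RUNS, best, y[0]);
    return 0;
}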
What makes this special for GCC is that it is not a simple thing to just turn code into machine instructions. The compiler needs to detect patterns in the code before it can decide to use the newer SSE4 and AVX instructions over the older MMX/SSE/SSE2 ones. Putting this into a compiler and making it exploit every last new feature of new CPUs is a challenge.
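A small illustration of that pattern detection: the loop below is something the auto-vectorizer can recognize, and with a newer -march the very same C code may end up as wider AVX instructions. This is only a sketch; the file name is made up, and -fopt-info-vec-optimized needs a reasonably recent GCC (older releases used -ftree-vectorizer-verbose instead):

/* vec_demo.c — let GCC report what it vectorized:
 *   gcc -O3 -march=k8     -fopt-info-vec-optimized -c vec_demo.c
 *   gcc -O3 -march=bdver1 -fopt-info-vec-optimized -c vec_demo.c
 */
void scale(float *restrict dst, const float *restrict src, float a, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a * src[i];
}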