Another issue is how caches are built. L1 is the closest (both in physical distance and in CPU cycles) to the math units, the memory unit, and so on. L2 is a bit further away, and L3 (which we usually think of as the shared cache) is the slowest. The electrons have to "walk further" to get the data from the cache to any specific core. So the solution most people would suggest is to give every core 1 MB of L1 (the opposite of an 8 MB shared L3). This would make synchronization between cores impossible (I mean, possible, as in the Athlon X4, but slower, because L3 is also used for syncing data between cores).
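To make the "closer is faster" point concrete, here is a minimal sketch of the usual average-access-time arithmetic for a multi-level cache. The hit rates and cycle counts are made-up illustrative numbers, not measurements of any real CPU, and the function name is hypothetical.

```python
def average_access_cycles(levels, memory_cycles):
    """Estimate the average cost of one memory access, in cycles.

    levels: list of (hit_rate, latency_cycles), ordered from L1 outward.
    memory_cycles: cost of going all the way to main memory.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency in levels:
        total += reach * latency      # everyone who reaches this level pays its latency
        reach *= (1.0 - hit_rate)     # only the misses continue to the next level
    total += reach * memory_cycles    # whatever misses every cache goes to memory
    return total


# Hypothetical numbers: L1 = 4 cycles, L2 = 12 cycles, L3 = 40 cycles, RAM = 200 cycles.
hierarchy = [(0.90, 4), (0.80, 12), (0.50, 40)]
print(average_access_cycles(hierarchy, memory_cycles=200))
```

With these assumed numbers the average comes out at about 8 cycles, even though a trip to memory costs 200. Lowering the L1 hit rate in the list shows how quickly the slower, more distant levels start to dominate.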
The cache hit/miss ratio and branch prediction are very hard to get right (on a cache miss, you want the predictor to have the right computations already running ahead while you wait for memory). Succeeding takes work on two fronts: how the software is written (on Bulldozer, a multi-threaded process should try to keep its most heavily used core logic within 2 MB) and how to avoid running into the architecture's bottlenecks.
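On the software side, one common way to keep the hot data within a cache budget is to process it in chunks, running every step on one chunk before touching the next. The sketch below only shows the loop structure: the 2 MB budget is the figure from the text, the function names and the `element_bytes` parameter are hypothetical, and Python lists do not really occupy memory this way, so treat it as an illustration rather than a tuned implementation.

```python
CACHE_BUDGET_BYTES = 2 * 1024 * 1024  # the 2 MB figure mentioned above


def chunk_length(element_bytes, arrays_in_flight=1, budget=CACHE_BUDGET_BYTES):
    """Largest element count per chunk so that all live arrays fit in the budget."""
    return budget // (element_bytes * arrays_in_flight)


def process_in_chunks(data, element_bytes, passes):
    """Apply every pass to one chunk before moving on to the next chunk."""
    n = chunk_length(element_bytes)
    out = []
    for start in range(0, len(data), n):
        chunk = data[start:start + n]
        for step in passes:
            # The chunk is reused by each pass while it is (ideally) still cached.
            chunk = [step(x) for x in chunk]
        out.extend(chunk)
    return out


# With 8-byte elements, one chunk holds 262144 of them.
print(chunk_length(8))
print(process_in_chunks(list(range(5)), 8, [lambda x: x + 1, lambda x: x * 2]))
```

The alternative, running each pass over the whole data set before starting the next pass, gives the same result but streams everything through the cache once per pass.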