abinstein wrote: Bobcat is not Bulldozer, and I believe they are designed differently.
Of course, I just wanted to point out AMD's naming conventions. With Bulldozer they just stated "pipeline"; with Bobcat they explicitly said 2x INT plus 1 load and 1 store pipeline. Therefore I think it will be the same with Bulldozer. 4-way would also be too much for one core. 2-way is the optimum from a cost/performance standpoint. If you had 4-way, you would need SMT again to be able to utilize all the pipelines optimally.
For example, in the FPU, the A-pipe and M-pipe apparently mean [edit]addition and multiplication[/edit] pipelines. These don't make sense in the Bulldozer graph, where both pipelines are 128-bit FMAC.
Read carefully: AMD said nothing about pipelines in the FPU part. They just wrote 2x 128-bit FMAC, so they mean the ability to process two 128-bit FMAC µOPs. How this is done in detail is not known yet, but there is a patent at Dresdenboy's blog which covers that, too.

The explicit load & store pipelines are probably also specific to Bobcat for power efficiency and simplicity (IIRC this is what Pentium-M did, which propagated to Core and even Core2 as well).
I speculate that they just referred to 2 AGUs. Why should they change so much compared to the K8? If AMD used explicit Ld/Str pipelines and no AGUs, then they would have to re-engineer the µOp sub-system, too. If I were AMD, I would be too lazy to do that.
A K8 core itself is already quite energy efficient and small at 32nm; now if you cut it down to 2-way it is 99% perfect.

A further re-work would not gain much.
But as I said - speculation only. Maybe AMD did indeed do a lot of µOp rework - who knows.
There is really no reason for Bulldozer to have separate pipelines for AGU. Currently, K10 already has an AGU attached to each ALU; this architecture works quite well for x86, where most instructions come with a load/store. There's little reason to change that. But, it's just my guess, and I could be wrong.
Yes - no reason to change that, I agree totally

We then just differ in the counting method for pipelines

In my opinion you could count the AGUs as independent pipelines, thus 2 INT + 2 AGUs per core = 4 "pipelines". You seem to think that there are 4 INT + 4 AGUs, because you do not count the AGUs separately. As I stated above, 4-way for one core would be inefficient, therefore I speculate about 2-way per core and 4-way per module.
Well let's wait and see who is right
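
To make the two counting methods concrete, here is a small Python sketch. The unit counts are only the speculative numbers from this thread, nothing AMD has confirmed, and the function name count_pipelines is made up for illustration.

```python
def count_pipelines(alus, agus, count_agus_separately):
    """Return the number of "pipelines" per core under one counting convention.

    count_agus_separately=True:  every ALU and every AGU is its own pipeline.
    count_agus_separately=False: an AGU is treated as part of its ALU's pipeline.
    """
    return alus + agus if count_agus_separately else alus


# Reading A (speculative): 2 ALUs + 2 AGUs per core, AGUs counted separately.
reading_a = count_pipelines(alus=2, agus=2, count_agus_separately=True)

# Reading B (speculative): 4 ALUs, each with an attached AGU that is not counted.
reading_b = count_pipelines(alus=4, agus=4, count_agus_separately=False)

print(reading_a, reading_b)
```

Both readings come out at 4 "pipelines", which is why the word on the slide alone cannot settle the question. What differs is the integer width behind the number: 2 ALUs per core in reading A against 4 in reading B.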

What I do know is that AMD's instruction decode was designed originally for 4-way. Previously, it was not useful to go above 3 because the fetch unit only got 16 bytes each cycle. In K10, the width is increased to 32 bytes and there will be opportunities to decode 4 instructions per cycle. Whether it's practically feasible is unknown, though.
It sounds to me like you are mixing up instruction fetch with instruction decode. The decoders were always 3-way only. The prefetch width was increased to be able to decode more 64-bit instructions.
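
A rough way to see the difference between fetch width and decode width is the little Python model below. It is a deliberately simplified sketch: it ignores alignment, branches and buffering, and the average instruction lengths are hypothetical values picked only to show the effect, not measurements.

```python
def sustained_decode_rate(fetch_bytes_per_cycle, avg_instr_bytes, decoder_width):
    """Upper bound on instructions decoded per cycle.

    The front end cannot decode more instructions than the decoders accept,
    and not more than the fetch window delivers on average.
    """
    fetch_limit = fetch_bytes_per_cycle / avg_instr_bytes
    return min(decoder_width, fetch_limit)


# Hypothetical average instruction lengths in bytes; longer values stand in
# for 64-bit code with extra prefixes and larger immediates.
for avg_len in (4, 6, 8):
    narrow = sustained_decode_rate(16, avg_len, decoder_width=3)
    wide = sustained_decode_rate(32, avg_len, decoder_width=3)
    print(avg_len, narrow, wide)
```

In this model a 16-byte fetch window becomes the bottleneck once instructions get long, while a 32-byte window keeps three decoders fed. A wider fetch can therefore help without the decoders becoming 4-way.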
@ Valerón:
Because there is no reason for a new socket ... we have had that discussion a few times in the forum already, use the search function ;-)
And no - G34 is not 1207 pins ... it's a "little" bit more

cheers