
JF-AMD wrote:Who in Japan confirmed anything? Chip design does not happen in Japan. We have a sales office in Tokyo.

abinstein wrote:@Dresdenboy: How were you sure that it's 2ALU+2AGU? I have argued previously that it's NOT likely to be 2 ALU + 2 AGU in Bulldozer. But my arguments were based on high-level analysis and could be wrong. OTOH, do you have a reliable source to show otherwise?
abinstein wrote:Now let's look back at Bulldozer. If it had 2 ALU and 2 AGU, plus 2 FPU (per core), then the compute:memory ratio is 2:1. It is clearly unbalanced. Now one may say that in Greyhound the ratio is also 2:1 (3 ALU+AGU and 3 FPU). However, as I've explained earlier, there the ALU+AGU could really be a way to simplify circuit design based on the structure of the macro-ops. IOW, in Greyhound (K10), independent load/store instructions cannot be issued to those AGU concurrently with arithmetic/logic instructions. One has to use x86 complex addressing modes to take advantage of the full capability of the ALU+AGU.
So IMHO BD will either have 4 ALU+AGU, or 4 general ALU where two of them can also perform address generation (basically have their output going to the load-store queue). It's not likely to be fixed 2 ALU plus fixed 2 AGU.
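For anyone who wants to check the 2:1 figures, here is a minimal sketch of the arithmetic. It assumes "compute" means ALU plus FPU pipes and "memory" means AGU pipes, which is how the post appears to count them; the function name is made up for illustration.

```python
from fractions import Fraction


def compute_to_memory(alu: int, fpu: int, agu: int) -> Fraction:
    """Ratio of compute pipes (ALU + FPU) to address-generation pipes (AGU)."""
    return Fraction(alu + fpu, agu)


# Rumoured Bulldozer core as discussed in the post: 2 ALU, 2 AGU, 2 FPU.
bd_rumour = compute_to_memory(alu=2, fpu=2, agu=2)

# Greyhound (K10) as counted in the post: 3 ALU+AGU pairs and 3 FPU.
k10 = compute_to_memory(alu=3, fpu=3, agu=3)

print(bd_rumour, k10)
```

Both come out the same, which is why the argument rests on how the K10 AGUs are actually usable rather than on the raw unit count.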
Wireloop wrote:Pipe 0 -> multiplier, simple ops (add, subtract, logical)
Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
Pipe 2 -> ABM, simple ops
Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
abinstein wrote:My guess is clock rates will be increased dynamically when only 1 core is active. But it won't be double pump. For example, a 2GHz module could have one core at 2.4GHz and the other core idle, or both cores at 2.0GHz. (Numbers are just figurative.)
abinstein wrote:Maybe Sandy Bridge double pump FP because it tries to implement 256-bit AVX using 128-bit pipeline?
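To illustrate what "256-bit AVX using 128-bit pipeline" would mean in practice, here is a rough sketch, not a description of how Sandy Bridge actually works. It assumes eight 32-bit lanes per 256-bit register and a hypothetical 4-lane (128-bit) execution unit that is used twice per instruction.

```python
LANES_256 = 8   # eight 32-bit lanes in a 256-bit register
LANES_128 = 4   # lanes the hypothetical 128-bit pipe handles per pass


def add_128(a, b):
    """One pass through the hypothetical 128-bit adder (4 lanes)."""
    assert len(a) == len(b) == LANES_128
    return [x + y for x, y in zip(a, b)]


def add_256_in_two_passes(a, b):
    """A 256-bit add issued as two 128-bit passes: low half, then high half."""
    assert len(a) == len(b) == LANES_256
    low = add_128(a[:LANES_128], b[:LANES_128])
    high = add_128(a[LANES_128:], b[LANES_128:])
    return low + high


result = add_256_in_two_passes([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
print(result)
```

The result is the same as a native 256-bit add; the cost is that the narrow pipe is occupied for two passes per instruction, unless it runs at twice the clock.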

Dresdenboy wrote:JF-AMD wrote:Who in Japan confirmed anything? Chip design does not happen in Japan. We have a sales office in Tokyo.
The translated article has to be read like Master Yoda's memoirs. As I read it, AMD Japan (AMD Japan Engineering Lab Tokyo?) confirmed via AMD U.S., that the ALU/AGU configuration is as described. You could reach the author per this mail address:


Dresdenboy wrote:A main concern in designing for future processes like 32nm and beyond will be static power consumption, which is roughly a function of the number of implemented gates. I've seen diagrams where static and dynamic power were already equal at 65nm. Having units sitting on the die idling sometimes (which IPC do you expect per BD core?) will cost power without doing work. If the designers reduce the length of some pipeline stages, simplify some logic (e.g. IRF with 6 read/4 write ports instead of 9 read, 8 write ports, like in the K7 microarchitecture and its successors), use some fast but still power efficient adders etc., they could run the integer units much faster (than just +20% or so) with some overhead involved due to the latches and higher clock, but static power would be half of that of the "low speed", wider design.
abinstein wrote:Maybe Sandy Bridge double pump FP because it tries to implement 256-bit AVX using 128-bit pipeline?
That is exactly what Hans de Vries is thinking.
P.S.: edited and extended the text a bit

amdk11 wrote:
has second decode ?




abinstein wrote:Yeah, I figure that, too. The 2ALU+2LSU thing just doesn't look right. At least IMHO. But people like the idea that Bulldozer must have 2 load/store pipes just like SandyBridge. Well, that makes no sense. IMHO.

According to our sources, GPR [General Purpose Registers] were increased to 128-bit. Once we learned of this alleged GPR depth, we asked whether that means we can, theoretically, call Bulldozer a "128-bit CPU" and whether "x86-128" is on the way. I will openly admit that I asked such a question without giving it a second thought.
It was explained to me that the focus of AMD's design was to increase the number of instructions processed on-the-fly, meaning that most instructions should use registers in a 64+64-bit or 32+32+32+32-bit fashion, significantly raising the IPC when compared to the current K10.5 architecture. So, no "x86-128". For now. This new internal architecture enabled AMD to design its first Streaming SIMD Extension set, 128-bit SSE5. Again, according to our sources - this was also the reason why Intel went into a denial frenzy over a possible implementation of the SSE5 instruction set.

Bart Swinnen wrote:Some hints were found regarding Bulldozer cache sizes:
http://www.realworldtech.com/forums/?ac ... 2&roomid=2


abinstein wrote:The L1 seems a bit small, even compared to Intel's, which is 8-way 32KB.
I can see the rationale behind this. And I can see that most people out there really don't.
Thanks for the info.

Dresdenboy wrote:abinstein wrote:The L1 seems a bit small, even compared to Intel's, which is 8-way 32KB.
I can see the rationale behind this. And I can see that most people out there really don't.
Thanks for the info.
IIRC, you don't think that BD could be a high frequency design (at least for parts of it). If so, what's your explanation? One might be: AMD purposely leaks wrong facts.
But John already said that BD will be very different from AMD's current designs. This could include the clock cycle time as well. Now way prediction (patent pending) makes more sense, since more L1 cache ways have to be checked.


Game_boy wrote:I thought commits to public repositories with sensitive data were all screened by legal people first. They are in the case of AMD's contributions to open graphics drivers. I think this is intentional.
Knowing cache size and associativity doesn't help us estimate performance but it probably helps Intel a lot.

abinstein wrote:Game_boy wrote:I thought commits to public repositories with sensitive data were all screened by legal people first. They are in the case of AMD's contributions to open graphics drivers. I think this is intentional.
The file seems to be used by Open64 compiler to optimize data allocation for the cache hierarchy. I don't think AMD would intentionally contribute to a software to let it optimize inefficiently for their CPU.

Game_boy wrote:abinstein wrote:The file seems to be used by Open64 compiler to optimize data allocation for the cache hierarchy. I don't think AMD would intentionally contribute to a software to let it optimize inefficiently for their CPU.
Do we know whether the commit was by AMD or not?

abinstein wrote:The entries on Intel Core and Wolfdale have been there previously. So why is it showing 2-way set associative L1 cache then? I've always thought that it's 8-way 32KB.
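To see what the difference between 2-way and 8-way means for a 32KB L1, here is a small sketch of the geometry. It assumes 64-byte cache lines and power-of-two sizes (neither is stated in the thread), and the function name is just for illustration.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes power-of-two sizes; line_bytes=64 is an assumption, not a
    value taken from the leaked file.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


eight_way = cache_geometry(32 * 1024, ways=8)
two_way = cache_geometry(32 * 1024, ways=2)

print("32KB 8-way:", eight_way)
print("32KB 2-way:", two_way)
```

With these assumptions the 8-way cache needs 6 index bits plus 6 offset bits, i.e. 12 bits, which fits inside the 4KB page offset, so the set can be selected before address translation finishes. The 2-way variant would need 14 bits, which is one reason a 2-way entry for a 32KB L1 looks odd.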


mmarq wrote:Worse than the associativity and the size is the latency...
But I'm convinced that BD will have "dependency prediction" with a sizable store buffer (perhaps in the way of store-sets) and also, if "access decoupled", a load buffer...
The trade-off is that the L1 D in each core/cluster of BD must have more ports than in Barcelona... and if that is the case, "what for?" is the question... if there aren't more structures that are connected to those cache banks!...
OTOH BD should be a speed monster (I guess above 4 GHz is possible)!...
Also the L2 should be bigger
