
AMD Fusion, Bobcat, Bulldozer

The Bulldozer Blog

Discussion about AMD's upcoming CPUs and APUs

Re: The Bulldozer Blog

Postby abinstein » Tue Jan 12, 2010 1:57 am

@Dresdenboy: How were you sure that it's 2ALU+2AGU? I have argued previously that it's NOT likely to be 2 ALU + 2 AGU in Bulldozer. But my arguments were based on high-level analysis and could be wrong. OTOH, do you have a reliable source to show otherwise?
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Dresdenboy » Tue Jan 12, 2010 11:10 am

JF-AMD wrote:Who in Japan confirmed anything? Chip design does not happen in Japan. We have a sales office in Tokyo.

The translated article has to be read like Master Yoda's memoirs. As I read it, AMD Japan (AMD Japan Engineering Lab Tokyo?) confirmed via AMD U.S., that the ALU/AGU configuration is as described. You could reach the author per this mail address: Image

abinstein wrote:@Dresdenboy: How were you sure that it's 2ALU+2AGU? I have argued previously that it's NOT likely to be 2 ALU + 2 AGU in Bulldozer. But my arguments were based on high-level analysis and could be wrong. OTOH, do you have a reliable source to show otherwise?


Ok, let me quote your arguments:
abinstein wrote:Now let's look back at Bulldozer. If it had 2 ALU and 2 AGU, plus 2 FPU (per core), then the compute:memory ratio is 2:1. It is clearly unbalanced. Now one may say that in Greyhound the ratio is also 2:1 (3 ALU+AGU and 3 FPU). However, as I've explained earlier, there the ALU+AGU could really be a way to simplify circuit design based on the structure of the macro-ops. IOW, in Greyhound (K10), independent load/store instructions cannot be issued to those AGUs concurrently with arithmetic/logic instructions. One has to use x86 complex addressing modes to take advantage of the full capability of the ALU+AGU.

So IMHO BD will either have 4 ALU+AGU, or 4 general ALU where two of them can also perform address generation (basically have their output going to the load-store queue). It's not likely to be fixed 2 ALU plus fixed 2 AGU.

When you say 2 FPUs per core, do you mean a separate FADD+FMUL executed per clock cycle and an FMAC unit? If so, you are assuming that the speculation about this will turn out to be true. Then you'd need 2 loads per cycle and core to feed two 128-bit operations or one 256-bit operation. An FMA4 operation would only consume one memory operand, so one 128-bit load for a 128-bit FMA4, or two 128-bit loads for a 256-bit FMA4 operation, would be enough.
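
To make the load count concrete, here is a rough sketch. It assumes 128-bit load ports and one memory operand per FMA4 instruction, as described in the paragraph above; the function name and parameters are made up for illustration and say nothing about the real hardware.

```python
import math

def loads_needed(op_width_bits, mem_operands=1, load_width_bits=128):
    """Loads required to fetch the memory operands of one operation.

    Assumes every memory operand is as wide as the operation itself and
    that each load port delivers load_width_bits per access.
    """
    loads_per_operand = math.ceil(op_width_bits / load_width_bits)
    return mem_operands * loads_per_operand

print(loads_needed(128))  # 128-bit FMA4 with one memory operand
print(loads_needed(256))  # 256-bit FMA4 with one memory operand
```

With these assumptions the 128-bit case needs a single load and the 256-bit case needs two, which matches the counts in the post.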

But there is no necessity to have 4 ALUs+AGUs (8 units). 4 general ALUs would on average act like 2 ALUs + 2 AGUs with the capability to do more ALU ops. This could look like what "Wireloop" wrote in a comment to my blog (which I updated yesterday):
Wireloop wrote:Pipe 0 -> multiplier, simple ops (add, subtract, logical)
Pipe 1 -> AGU-like, barrel shifter, branch (both direct & indirect), simple ops
Pipe 2 -> ABM, simple ops
Pipe 3 -> AGU-like, barrel shifter, branch (both types too), simple ops
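
Wireloop's pipe layout above can be written down as a small capability table, which makes it easy to see how many pipes could accept a given kind of op per cycle. This is only a sketch of the speculated layout; the op-class names are labels I made up, not AMD terminology.

```python
# Speculated pipe capabilities, transcribed from Wireloop's comment.
PIPES = {
    0: {"mul", "simple"},
    1: {"agu", "shift", "branch", "simple"},
    2: {"abm", "simple"},
    3: {"agu", "shift", "branch", "simple"},
}

def pipes_for(op_class):
    """Return the pipes that could execute the given op class."""
    return sorted(p for p, caps in PIPES.items() if op_class in caps)

for op in ("simple", "agu", "mul", "abm"):
    print(op, pipes_for(op))
```

In this layout simple ops can go to any of the four pipes, while address generation is limited to two of them, which is the "on average 2 ALUs + 2 AGUs, with room for more ALU ops" behaviour described above.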


But we are talking about throughput per "module" clock cycle here. What if the integer core has a relative throughput of 4 ALU/AGU pairs per clock cycle at the cost of longer latency for the more complex operations (like LEA with all operands and shift), but maybe with some added capabilities, like executing simple but dependent ops in the same "module" clock cycle, or an increased throughput of 4 loads in that clock cycle?

And some quick answers to this posting:
abinstein wrote:My guess is clock rates will be increased dynamically when only 1 core is active. But it won't be double pump. For example, a 2GHz module could have one core at 2.4GHz and the other core idle, or both cores at 2.0GHz. (Numbers are just figurative.)

That's a different thing, just applied to module components instead of cores the "old way". We might see something like that coming with Thuban (I explained it here). But as mmarq noted, there could be really advanced power management in BD, which will use more options than just clocking cores differently.

A main concern in designing for future processes like 32nm and beyond will be static power consumption, which is roughly a function of the number of implemented gates. I've seen diagrams where static and dynamic power were already equal at 65nm. Having units sitting on the die idling some of the time (which IPC do you expect per BD core?) will cost power without doing work. If the designers reduce the length of some pipeline stages, simplify some logic (e.g. an IRF with 6 read/4 write ports instead of 9 read/8 write ports, as in the K7 microarchitecture and its successors), use some fast but still power-efficient adders etc., they could run the integer units much faster (than just +20% or so), with some overhead due to the latches and higher clock, but static power would be half that of the "low speed", wider design.

abinstein wrote:Maybe Sandy Bridge double pump FP because it tries to implement 256-bit AVX using 128-bit pipeline?

That is exactly what Hans de Vries is thinking.

P.S.: edited and extended the text a bit
Dresdenboy
K6-III Fresh Boarder
 
Posts: 258
Joined: Sun Apr 25, 2004 1:16 pm
Location: Germany

Re: The Bulldozer Blog

Postby mmarq » Wed Jan 13, 2010 3:48 am

Dresdenboy wrote:
JF-AMD wrote:Who in Japan confirmed anything? Chip design does not happen in Japan. We have a sales office in Tokyo.

The translated article has to be read like Master Yoda's memoirs. As I read it, AMD Japan (AMD Japan Engineering Lab Tokyo?) confirmed via AMD U.S., that the ALU/AGU configuration is as described. You could reach the author per this mail address: Image


Exactly as described ?

I was thinking more in terms of Bobcat that has different Load and Store pipes, and in terms of K10 that has 2 LSUs outside of the execution pipes

The rationale is simple:
AGUs don't have dedicated ports
2 ALU pipes capable of more complex INT operations
1 ALU pipe capable of simpler and misc operations and "Loads" (like Bobcat)
1 ALU pipe capable of simpler and misc operations and "Stores" (like Bobcat)

Different from Bobcat, the ALUs that harbor Load/Store functionality should also harbor other operations that are not present in Bobcat's LSU pipes... I left AGUs out, because there can be 4 of those,... or only 2, depending on design trade-offs... but contrary to K10, which "fuses" (packs) address operations, BD will not do it, because most probably it is "access decoupled" and AGU operations can (should) be serialized with other INT operations.

Additionally... (like K10 with 2 LSUs per core), one more LSU "shared" by 2 INT cores.

That would give for each "core" similar capability of K10, i.e., 2 loads and 2 stores per cycle more exactly 2 128bit loads and 2 64/128(?) stores... and it seems enough, because K10 was only partially o-o-o in memory access, it would end-up stalling the LSUs many times. That is my guess of why it has 2 LSU for 3 INt pipes. BD is supposed to have extensive "data speculation" with "value prediction"... LSU should rarely stall... and 1 LSU could very well be shared because, after all, the FPUs also are.

That would give a distribution very similar to what that "Japanese" design seems to indicate: 2 pipes for more complex INT operations + 2 pipes that do Load/Store operations. That is why I asked ( exactly ? ) as the design seems to indicate?... ( or ? ) the 2 pipes that do load/store can do other operations, and not be exclusively LSUs... courtesy of being "access decoupled" and of "data speculation", permitting those operations to be serialized in the pipelines in an efficient manner.

What is probably missing from these earlier sketches is the shared LSU... and it makes sense, IMHO, because upon a 256bit operation the FPU is no longer shared... and so could be the "invisible" LSU...

Funny thing about "generalized" sketches: even Wikipedia's K10 article presents only the LS queue and the number of entries, and skips the "units" arrangement... though Real World Tech gets it right... those LS units tend to be forgotten beasts!...
mmarq
K8 Athlon 64 (Orleans) Expert Boarder
 
Posts: 2337
Joined: Sat Jul 14, 2007 4:31 am

Re: The Bulldozer Blog

Postby Polonium210 » Wed Jan 13, 2010 4:05 am

Dresdenboy wrote:The translated article has to be read like Master Yoda's memoirs. As I read it, AMD Japan (AMD Japan Engineering Lab Tokyo?) confirmed via AMD U.S., that the ALU/AGU configuration is as described.


Lost in translation :?:
Polonium210
K7 Athlon (Argon) Junior Boarder
 
Posts: 387
Joined: Tue Jan 30, 2007 4:21 am

Re: The Bulldozer Blog

Postby mmarq » Wed Jan 13, 2010 4:19 am

Dresdenboy wrote:A main concern in designing for future processes like 32nm and beyond will be static power consumption, which is roughly a function of the number of implemented gates. I've seen diagrams where static and dynamic power were already equal at 65nm. Having units sitting on the die idling some of the time (which IPC do you expect per BD core?) will cost power without doing work. If the designers reduce the length of some pipeline stages, simplify some logic (e.g. an IRF with 6 read/4 write ports instead of 9 read/8 write ports, as in the K7 microarchitecture and its successors), use some fast but still power-efficient adders etc., they could run the integer units much faster (than just +20% or so), with some overhead due to the latches and higher clock, but static power would be half that of the "low speed", wider design.

abinstein wrote:Maybe Sandy Bridge double pump FP because it tries to implement 256-bit AVX using 128-bit pipeline?

That is exactly what Hans de Vries is thinking.

P.S.: edited and extended the text a bit


I understand your view.

But with a shared front end, the problem is whether there will be enough instructions from the fetch engine to feed the equivalent of 4 "cores", so that static power doesn't mount up from idling ? ... The SB fetch engine only feeds one "physical" core prepared for SMT, or the equivalent of 1.3 cores at most... the regular BD single front-end must feed the equivalent of 1.8 cores... if it is double pumped it will be the equivalent of 3.6 cores :roll:

And also, it is not in any of these earlier sketches; nobody seems to be counting on it !?... nevertheless I also have a gut feeling that in this first incarnation there will not be a large trace cache... and if there is one, it is very small and not for "mispredict (redirect) recovery", but simply for sequencing... will it be enough for 3.6 cores ? :roll:

As to power, if one core can be shut down, maybe we can have the other pump up by 600-800MHz more... and being so, more pertinent stages for lower FO4... and much better clock scalability, because power is saved by shutting off or heavy throttling down ... perhaps that is a better trade-off ?
mmarq
K8 Athlon 64 (Orleans) Expert Boarder
 
Posts: 2337
Joined: Sat Jul 14, 2007 4:31 am

Re: The Bulldozer Blog

Postby superrugal » Thu Jan 14, 2010 5:23 pm

amdk11 wrote:Image
has second decode ?


OH MY GOD , Bulldozer has two level decode ?

Does this speculation imply that decode engine of Bulldozer may have revolutionary change ? Just like K6 to K7 ?? :shock:
superrugal
K5 Fresh Boarder
 
Posts: 103
Joined: Tue Nov 17, 2009 9:39 am

Re: The Bulldozer Blog

Postby MKruer » Thu Jan 14, 2010 5:51 pm

This is just speculation. Personally, it doesn't make sense to me, unless the first decode stage determines whether an instruction is FP or INT and then sends it to the appropriate second-stage decoder. Because the instruction type has already been determined, the second stage can be simplified by removing either the FP or the INT component, which that decoder would not need. But then the first stage isn't really a decoder, is it? It's more of a dispatcher.
Lian-Li PC-V2000 Plus Aluminum Case; Seasonic S12 Energy+ 550 PSU; Asus M4A785TD-V EVO; Phenom II X4 965 Black Edition C3 @ 4.0Ghz ; Thermalright Ultra-120 eXtreme Rev.C; 8GB OCZ AMD Black Edition @ 1333Mhz; Sapphire Radeon HD 7870
MKruer
K10 Opteron (Barcelona) Administrator
 
Posts: 2439
Joined: Mon Mar 01, 2004 4:21 am
Location: I am not paid to do this, I don't even like to do this, I wonder why am I still doing this?

Re: The Bulldozer Blog

Postby abinstein » Thu Jan 14, 2010 6:33 pm

All modern x86 processors use 2-level decode. First decode the x86 instructions into macro/micro ops, then decode the macro/micro ops for execution.
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby JF-AMD » Thu Jan 14, 2010 7:53 pm

I won't specify what, but I will say that the diagram is not accurate.
While I work for AMD, my posts are my own opinions.

http://blogs.amd.com/work/author/jfruehe/

Follow AMD Opteron on Twitter: @JF_AMD
JF-AMD
XIP
 
Posts: 1832
Joined: Thu Apr 23, 2009 7:27 am

Re: The Bulldozer Blog

Postby abinstein » Thu Jan 14, 2010 11:36 pm

Yeah, I figure that, too. The 2ALU+2LSU thing just doesn't look right. At least IMHO. But people like the idea that Bulldozer must have 2 load/store pipes just like SandyBridge. Well, that makes no sense. IMHO. :)
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Dresdenboy » Fri Jan 15, 2010 9:57 am

abinstein wrote:Yeah, I figure that, too. The 2ALU+2LSU thing just doesn't look right. At least IMHO. But people like the idea that Bulldozer must have 2 load/store pipes just like SandyBridge. Well, that makes no sense. IMHO. :)

SB has to share this BW for 2 threads and seems capable of 2x16B load and 1x16B store. So far we don't know, how many stores BD can do per cycle and how wide those loads and stores are. There is more in this diagram, which might be wrong.
Dresdenboy
K6-III Fresh Boarder
 
Posts: 258
Joined: Sun Apr 25, 2004 1:16 pm
Location: Germany

Re: The Bulldozer Blog

Postby mmarq » Wed Jan 20, 2010 10:01 pm

There is another possibility besides "double pumped" INT ALU pipes that can accommodate only 2 INT pipes (as in the drawings: 2 ALUs + 2 LSUs)... yet still allow up to 4 INT uOPs per cycle.

The 2 INt pipes are "complex" and allow the serialization/parallelization of "packed" or "fused" x86 uOPS instructions... in the footsteps of what AMD does now with ALU+AGU instructions. A new "packing" or "fusing" of instructions is devised for use with 128bit registers. There would not be new 128bit INt instructions for this first iteration of BD but 64bit+64bit or 32bit+32bit+32bit+32bit will be possible for EACH INT pipe... or in this last case only 32bit+32bit per pipe... That would give the mentioned 4 way INt per cluster/core.

That is exactly Theo Valich's rumor... which needs a very good "packing" or "fusing" of instructions to work for most x86 code... and which would at least put "Intel uOPS fusion" to absolute shame.

According to our sources, GPR [General Purpose Registers] were increased to 128-bit. Once that we learned of this alleged GPR depth, we asked does that mean we can, theoretically, call Bulldozer a "128-bit CPU" and is "x86-128" on the way? I will openly admit that I asked such a question without giving it a second thought.


I was explained that focus of AMD's design was to increase the number of instructions processed on-the-fly, meaning that most instructions should use registers in a 64+64-bit or 32+32+32+32-bit fashion, significantly raising the IPC when compared to current K10.5 architecture. So, no "x86-128". For now. This new internal architecture enabled AMD to design its first Streaming SIMD Extension set, 128-bit SSE5. Again, according to our sources - this was also the reason why Intel went into a denial frenzy over a possible implementation of the SSE5 instruction set.

http://www.brightsideofnews.com/news/20 ... nster.aspx
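
For readers unfamiliar with the idea, using a 128-bit register "in a 64+64-bit or 32+32+32+32-bit fashion" is ordinary lane packing. The sketch below shows the generic technique with Python integers; it is purely illustrative and is not a claim about how Bulldozer works. All names are made up.

```python
REG_BITS = 128  # register width claimed in the rumor

def pack(lanes, lane_bits):
    """Pack equal-width lanes into one integer, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, x in enumerate(lanes):
        value |= (x & mask) << (i * lane_bits)
    return value

def unpack(value, lane_bits):
    """Split a 128-bit value back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask
            for i in range(REG_BITS // lane_bits)]

def lane_add(a, b, lane_bits):
    """Add two packed values lane by lane, wrapping within each lane."""
    mask = (1 << lane_bits) - 1
    sums = [(x + y) & mask
            for x, y in zip(unpack(a, lane_bits), unpack(b, lane_bits))]
    return pack(sums, lane_bits)

a = pack([1, 2, 3, 0xFFFFFFFF], 32)
b = pack([10, 20, 30, 1], 32)
print(unpack(lane_add(a, b, 32), 32))
```

The point of the per-lane mask is that a carry out of one lane never spills into its neighbour, so four independent 32-bit adds (or two 64-bit adds) share one 128-bit value.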

Yet maybe a "NEW" 128bit ISA can be devised in the not so long future... source from Microsoft...
http://www.brightsideofnews.com/news/20 ... swell.aspx
Last edited by mmarq on Wed Jan 27, 2010 3:20 pm, edited 2 times in total.
mmarq
K8 Athlon 64 (Orleans) Expert Boarder
 
Posts: 2337
Joined: Sat Jul 14, 2007 4:31 am

Re: The Bulldozer Blog

Postby Bart Swinnen » Thu Jan 21, 2010 12:23 am

Some hints were found regarding Bulldozer cache sizes:

http://www.realworldtech.com/forums/?ac ... 2&roomid=2
Bart Swinnen
 
Posts: 6
Joined: Mon Jan 04, 2010 1:47 am

Re: The Bulldozer Blog

Postby Эльбрус » Thu Jan 21, 2010 1:28 am

Bart Swinnen wrote:Some hints were found regarding Bulldozer cache sizes:

http://www.realworldtech.com/forums/?ac ... 2&roomid=2


Thanks, it looks great and reliable, summary:

L1D: 16kB 4-way set associative
L2: 2MB 16-way set associative

(Edit: L1 -> L1D)
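
For anyone who wants to see what those figures mean for the cache geometry, here is a small sketch. The 64-byte line size is an assumption on my part (it is not part of the summary above), and the function names are made up.

```python
KB = 1024
MB = 1024 * KB

def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache (line size is assumed)."""
    return size_bytes // (ways * line_bytes)

def way_bytes(size_bytes, ways):
    """Capacity of a single way."""
    return size_bytes // ways

print("L1D sets:", cache_sets(16 * KB, 4), "way size:", way_bytes(16 * KB, 4))
print("L2 sets:", cache_sets(2 * MB, 16), "way size:", way_bytes(2 * MB, 16))
```

Under that assumption the L1D has 64 sets with 4 KB per way (the same as the base x86 page size), and the L2 has 2048 sets.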
Last edited by Эльбрус on Thu Jan 21, 2010 12:13 pm, edited 2 times in total.
Эльбрус
K7 Athlon XP (Palomino) Junior Boarder
 
Posts: 463
Joined: Sat May 02, 2009 7:13 pm

Re: The Bulldozer Blog

Postby abinstein » Thu Jan 21, 2010 4:18 am

The L1 seems a bit small, even compared to Intel's, which is 8-way 32KB.

I can see the rationale behind this. And I can see that most people out there really don't. :mrgreen:

Thanks for the info. :)
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Dresdenboy » Thu Jan 21, 2010 2:09 pm

abinstein wrote:The L1 seems a bit small, even compared to Intel's, which is 8-way 32KB.

I can see the rationale behind this. And I can see that most people out there really don't. :mrgreen:

Thanks for the info. :)

IIRC, you don't think, that BD could be a high frequency design (at least for parts of it). If so, what's your explanation? One might be: AMD purposely leaks wrong facts. But John already said, that BD will be very different from AMD's current designs. This could include the clock cycle time as well. Now way prediction (patent pending) makes more sense since more L1 cache ways have to be checked.
Dresdenboy
K6-III Fresh Boarder
 
Posts: 258
Joined: Sun Apr 25, 2004 1:16 pm
Location: Germany

Re: The Bulldozer Blog

Postby abinstein » Thu Jan 21, 2010 5:40 pm

Dresdenboy wrote:
abinstein wrote:The L1 seems a bit small, even compared to Intel's, which is 8-way 32KB.

I can see the rationale behind this. And I can see that most people out there really don't. :mrgreen:

Thanks for the info. :)

IIRC, you don't think, that BD could be a high frequency design (at least for parts of it). If so, what's your explanation? One might be: AMD purposely leaks wrong facts.

Why would AMD "purposely leak wrong facts"? Are they becoming a minor Intel finally? :lol:

I don't think BD would be a double-pump design. That'd imply a working frequency of at least 6GHz. However, it's entirely possible for a BD module to work in the 4GHz range with one core active, and downclock to 3GHz with both.


But John already said, that BD will be very different from AMD's current designs. This could include the clock cycle time as well. Now way prediction (patent pending) makes more sense since more L1 cache ways have to be checked.

I really don't know. Does Intel's 8-way L1 employ way prediction?
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Game_boy » Thu Jan 21, 2010 6:03 pm

I thought commits to public repositories with sensitive data were all screened by legal people first. They are in the case of AMD's contributions to open graphics drivers. I think this is intentional.

Knowing cache size and associativity doesn't help us estimate performance but it probably helps Intel a lot.

It does kind of confirm that Orochi wasn't replaced by Zambezi, perhaps Orochi is the die codename and Zambezi the desktop variant codename.
Game_boy
K7 Athlon XP (Palomino) Junior Boarder
 
Posts: 401
Joined: Wed Dec 31, 2008 8:41 pm

Re: The Bulldozer Blog

Postby abinstein » Fri Jan 22, 2010 12:06 am

Game_boy wrote:I thought commits to public repositories with sensitive data were all screened by legal people first. They are in the case of AMD's contributions to open graphics drivers. I think this is intentional.

The file seems to be used by the Open64 compiler to optimize data allocation for the cache hierarchy. I don't think AMD would intentionally contribute to software to make it optimize inefficiently for their CPU.

What is weird in that file is that the L1 associativity for Intel Core and Wolfdale seems incorrect. Both should have an 8-way associative L1, rather than the 2-way shown in the file. This could be an oversight or maybe intentional (which would be :twisted: ) from AMD?

There'd better be an update or some explanation...


Knowing cache size and associativity doesn't help us estimate performance but it probably helps Intel a lot.

Well, yeah, it'd help Intel to know that after 4 years since the release of Core 2 Duo, AMD still can't make an L1 cache with 32KB size and 8-way associativity. (Or maybe they don't want/need to.)
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Game_boy » Fri Jan 22, 2010 12:17 am

abinstein wrote:
Game_boy wrote:I thought commits to public repositories with sensitive data were all screened by legal people first. They are in the case of AMD's contributions to open graphics drivers. I think this is intentional.

The file seems to be used by the Open64 compiler to optimize data allocation for the cache hierarchy. I don't think AMD would intentionally contribute to software to make it optimize inefficiently for their CPU.


Do we know whether the commit was by AMD or not?
Game_boy
K7 Athlon XP (Palomino) Junior Boarder
 
Posts: 401
Joined: Wed Dec 31, 2008 8:41 pm

Re: The Bulldozer Blog

Postby abinstein » Fri Jan 22, 2010 12:42 am

Game_boy wrote:
abinstein wrote:The file seems to be used by the Open64 compiler to optimize data allocation for the cache hierarchy. I don't think AMD would intentionally contribute to software to make it optimize inefficiently for their CPU.


Do we know whether the commit was by AMD or not?

Good point.

According to the log, the entries on Orochi were merged from AMD's Open64 4.2.3, so they're obviously contributed by AMD.

The entries on Intel Core and Wolfdale have been there previously. So why is it showing 2-way set associative L1 cache then? I've always thought that it's 8-way 32KB.
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby Lem » Fri Jan 22, 2010 1:55 am

abinstein wrote:The entries on Intel Core and Wolfdale have been there previously. So why is it showing 2-way set associative L1 cache then? I've always thought that it's 8-way 32KB.

There are comments in the source that say the Intel stuff needs "fine tuning" .. perhaps what's there is just a set of defaults that were copy/pasted? heh
Lem
K7 Athlon XP (Thoroughbred) Senior Boarder
 
Posts: 779
Joined: Mon Jul 26, 2004 1:55 pm
Location: Qld, Australia

Re: The Bulldozer Blog

Postby mmarq » Fri Jan 22, 2010 3:11 am

Worse than the associativity and the size is the latency...

But I'm convinced that BD will have "dependency prediction" with a sizable store buffer (perhaps in the way of store-sets) and also, if "access decoupled", a load buffer... the rationale is to have many more "access" instructions near the execution pipes, avoiding snooping the LS queue too often by doing advanced forms of "data speculation and forwarding", and so having to resort to L1D accesses less often than K10... and in the end working well with smaller and slower (latency) caches.

The trade off is that L1 D in each core/cluster of BD must have more ports than in Barcelona... and if that is the case "what for ?" is the question... if there aren't more structures that are connected to those cache banks!...

OTOH BD should be a speed monster( i guess above 4Ghz is possible)!...

Also the L2 should be bigger
mmarq
K8 Athlon 64 (Orleans) Expert Boarder
 
Posts: 2337
Joined: Sat Jul 14, 2007 4:31 am

Re: The Bulldozer Blog

Postby abinstein » Sun Jan 24, 2010 4:17 am

mmarq wrote:Worse than the associativity and the size is the latency...

That is true. There are two scenarios. If the 16KB L1 has 2 cycle latency, then it'd be a great & high performance design. OTOH, if it has 3 cycle latency, then it's even worse than Pentium III (which has a lower miss penalty).


But i'm convinced that BD will have "dependency prediction" with a sizable store buffer (perhaps in the way of store-sets) and also if "access decoupled", a load buffer...

Well, if I understand you correctly, the "dependency prediction" is already in Nehalem, which has improved memory disambiguation over Core 2 Duo. So yes, I do expect (hope) the Bulldozer core will have a similar capability.


The trade off is that L1 D in each core/cluster of BD must have more ports than in Barcelona... and if that is the case "what for ?" is the question... if there aren't more structures that are connected to those cache banks!...

There are 2 read ports and 1 write port in the L1 data of Barcelona. My guess is the number of ports will be the same in Bulldozer, but each port will be 2x wider (128 bits instead of 64 bits).
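
As a quick check of what that guess would mean for peak L1D bandwidth, here is a sketch. The port counts and widths are the ones given in the paragraph above; the Bulldozer line is the guess, not a known figure, and the function name is made up.

```python
def l1_bytes_per_cycle(read_ports, write_ports, port_bits):
    """Peak (read, write) bytes per cycle if every port is used each cycle."""
    port_bytes = port_bits // 8
    return read_ports * port_bytes, write_ports * port_bytes

print("64-bit ports: ", l1_bytes_per_cycle(2, 1, 64))
print("128-bit ports:", l1_bytes_per_cycle(2, 1, 128))
```

Doubling the port width doubles both numbers without adding any ports, which is why the port count alone does not settle the bandwidth question.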


OTOH BD should be a speed monster( i guess above 4Ghz is possible)!...

Also the L2 should be bigger

Well, BD had better have a higher clock frequency if its L1 miss penalty is as high as 18 cycles. That's 50% higher than K10 or Core 2; everything else being equal (if not better), BD should have a 50% higher clock frequency. :P
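
The arithmetic behind that remark, as a sketch: an 18-cycle penalty that is 50% higher implies a 12-cycle reference, and the penalty in wall-clock time only stays the same if the clock scales by the same factor. The 3.0 GHz reference clock below is just an illustrative number, and the function names are made up.

```python
def miss_penalty_ns(cycles, clock_ghz):
    """Miss penalty in nanoseconds for a given cycle count and clock."""
    return cycles / clock_ghz

def break_even_clock_ghz(ref_clock_ghz, ref_cycles, new_cycles):
    """Clock needed so that new_cycles takes as long as ref_cycles did."""
    return ref_clock_ghz * new_cycles / ref_cycles

ref_clock = 3.0  # GHz, illustrative only
needed = break_even_clock_ghz(ref_clock, ref_cycles=12, new_cycles=18)
print(needed, miss_penalty_ns(12, ref_clock), miss_penalty_ns(18, needed))
```

So a 3.0 GHz part with a 12-cycle penalty and a 4.5 GHz part with an 18-cycle penalty would both wait 4 ns on an L1 miss.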

BD seems to have a quite special (L2) cache design. From the conceptual figure released by AMD it seems the L2 could be accessed directly by the shared FP pipeline. Or maybe the FP pipelines will compete with the INT pipelines for the already small L1 cache?
abinstein
K8 Opteron (SledgeHammer) Moderator
 
Posts: 7176
Joined: Sat Oct 30, 2004 9:49 pm

Re: The Bulldozer Blog

Postby amdk11 » Sun Jan 24, 2010 5:02 am

I think AMD is trying to increase IPC (low instruction latency), not to push for high frequency.
And Bulldozer is still only on paper. Is all of this fake ? :lol:
amdk11
 
Posts: 35
Joined: Sun Jul 19, 2009 3:41 pm
