wuttz wrote:
mmarq;
i think i'd agree with your ideas;
1. "fusing x86 INT to XOP" making use of XOP pipes
2. trace cache for I$ (L0 for SB) bypassing decode when in loops
Yeah!... but a compiler, JIT or not, can accomplish that "binary translation" (fusing) much more extensively than a pure hardware method... and optimize the code at the same time. So a JIT can complement hardware fusing of x86 to XOP very well.
But recent AMD patents about trace caches specify a "redirect recovery cache", where traces are formed around branch targets, with possible extensive reuse of instructions ->
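To make the "traces formed around branch targets" idea concrete, here is a toy sketch in Python. It is not how the patents describe the hardware; the class name `TraceCache`, the fixed trace length and the LRU eviction are all my own assumptions, only meant to show that a loop hitting the same branch target reuses already decoded instructions instead of decoding them again.

```python
from collections import OrderedDict


class TraceCache:
    """Toy model (hypothetical): decoded traces keyed by branch-target address."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.traces = OrderedDict()  # target address -> tuple of decoded ops
        self.decodes = 0             # instructions that went through "decode"

    def _decode(self, insn):
        # Stand-in for the x86 decoders; counts the work so reuse is visible.
        self.decodes += 1
        return ("uop", insn)

    def fetch(self, target, program, length=4):
        """Return the decoded trace starting at `target`, decoding only on a miss."""
        if target in self.traces:
            self.traces.move_to_end(target)  # mark as most recently used
            return self.traces[target]
        trace = tuple(self._decode(i) for i in program[target:target + length])
        self.traces[target] = trace
        if len(self.traces) > self.capacity:
            self.traces.popitem(last=False)  # evict the least recently used trace
        return trace


program = ["add", "cmp", "jne", "mov", "sub", "ret"]
cache = TraceCache()
for _ in range(100):          # a loop that keeps branching back to address 0
    cache.fetch(0, program, length=3)
print(cache.decodes)          # only the first iteration paid for decode
```

The loop body is decoded once and then served from the cache, which is the "bypassing decode when in loops" part of point 2 above.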
wuttz wrote:
but not with;
1. spMT, eager execution
2. JIT/ FSAIL
latter group will always have power/performance budget constraints ..
we need methods that will give better results 100% of the time.
spMT is not 100%, JIT txlation has performance overhead = no go.
SpMT is only a generic term that can encompass plenty of different techniques.
Eager execution can be like run-ahead, sequential (like IBM), using checkpointing instead of a thread context. And since you'll have a "redirect recovery cache", 80% of the difficulty of implementing eager execution is already done... how come you don't like it?
Eager execution from a trace cache, reusing decoded instructions, can have quite a low additional power cost and surely will bring good performance, because it's a form of run-ahead or ahead execution, orienting prefetch and warming caches if nothing else...
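A minimal sketch of the checkpoint flavour of eager execution, in Python. Everything here is illustrative: the function name `eager_branch` and the dict used as "register state" are my assumptions, and real hardware would of course overlap the two paths instead of running them one after the other.

```python
import copy


def eager_branch(regs, condition, taken_path, fallthrough_path):
    """Run both sides of a branch from a checkpoint, keep the side that was right."""
    checkpoint = copy.deepcopy(regs)                  # state before the branch
    spec_taken = taken_path(copy.deepcopy(checkpoint))
    spec_fall = fallthrough_path(copy.deepcopy(checkpoint))
    # The branch condition "resolves late"; the losing path is simply discarded.
    return spec_taken if condition(checkpoint) else spec_fall


def taken(r):
    r["y"] = r["x"] * 2
    return r


def fallthrough(r):
    r["y"] = -1
    return r


regs = {"x": 5, "y": 0}
print(eager_branch(regs, lambda r: r["x"] > 3, taken, fallthrough))
```

Since both paths start from copies of the checkpoint, throwing one away needs no rollback, which is why checkpointing can stand in for a full thread context here.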
Nevertheless i think an internal SMT context for this is better, more so because it's an even easier target for checkpointing, and also because larger traces can be employed (up to dozens of instructions if not hundreds), and since you have vertical multithreading, you probably proceed from the "dispatch domain" where the trace cache will be(?), and vertically multithread it with the o-o-o "exec domain"... i believe it's facilitated this way...
A JIT? You already have a JIT, wuttz... if you have a GPU you already have a JIT, wasting plenty of cycles on your CPU, wuttz!
FSAIL you will also have, unless you have nothing but a pure x86 CPU and no AMD GPGPU... because in the future, if you choose an AMD heterogeneous CPU (APU?)... even one with no AMD GPU or graphics in it... a CPU with the likes of compress, encrypt, managed-code, complementing large vectors, as different co-processor blocks, there in the modules [even for Opteron server offerings]... chances are you'll be forced to load FSAIL to make it function properly. FSA/FSAIL, agnostic to CPU and GPU, is not only for GPGPUs, it's for heterogeneous computing in general.
So with a JIT... and FSAIL, which is like a low level VM (LLVM) target... the missing piece is compiler-oriented speculative multithreading... nothing new really... it's not easy, but not new, and one way of mitigating the issues of this "compiler speculative" tech is to orient it to break sequential code into various threads by data speculation (a level above and beyond the hardware forms of data speculation which i believe Steamroller will have), by way of transactions with protected memory regions... and transactional memory with AMD's ASF will provide exactly that...
This is the only form of spMT that has fewer power constraints, because it is more software than hardware, and because it's done mostly by a compiler (JIT) with the added bonus of having a proper target ISA... no matter if it's a "virtual ISA" like FSAIL... so it can more easily have other code optimizations/transformations complementing this form of spMT. Matter of fact, on-the-fly TLDS (thread level data speculation) can be one of those optimizations/transformations, all complementing the above hardware fusing of x86 to XOP.
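A rough sketch of what i mean by TLDS on top of transactions, as a deterministic Python simulation (no real threads, no ASF). The names `run_speculative` and `_execute` and the read-set/write-set bookkeeping are my own assumptions for illustration: sequential code is cut into chunks, every chunk runs speculatively against the same snapshot, and chunks commit in program order, with a re-run for any chunk that read something an earlier chunk wrote.

```python
def _execute(body, state):
    """Run one chunk as a 'transaction': buffer writes, record the read set."""
    reads, writes = set(), {}

    def read(key):
        if key in writes:          # a chunk sees its own buffered writes
            return writes[key]
        reads.add(key)
        return state[key]

    def write(key, value):
        writes[key] = value

    body(read, write)
    return reads, writes


def run_speculative(chunks, memory):
    """Speculate all chunks from one snapshot, commit in order, return abort count."""
    snapshot = dict(memory)
    speculated = [_execute(body, snapshot) for body in chunks]  # the "parallel" phase
    dirty = set()                  # keys written by already committed chunks
    aborts = 0
    for body, (reads, writes) in zip(chunks, speculated):
        if reads & dirty:          # data dependence violated: the value read was stale
            aborts += 1
            reads, writes = _execute(body, memory)  # re-run on committed state
        memory.update(writes)      # commit
        dirty |= writes.keys()
    return aborts


memory = {"a": 1, "b": 2, "c": 0, "d": 0}
chunks = [
    lambda read, write: write("c", read("a") + 1),   # independent
    lambda read, write: write("d", read("b") * 2),   # independent
    lambda read, write: write("a", read("c") + 10),  # depends on the first chunk
]
print(run_speculative(chunks, memory), memory)
```

Only the chunk with a real dependence pays for the mis-speculation, and the final memory is the same as plain sequential execution, which is the whole point of doing the speculation inside transactions.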
Transactional memory will go ahead for sure, AMD and Intel are even cooperating on this, and so will forms of TLDS, because this transactional memory feature is "speculative" in nature, no matter if for now it is only explicitly coded by developers...
a charm!... now AMD only has to have a way to do it on-the-fly without developer intervention, or complement the developer coding (THE BEST)...
Only i don't think this will be by the time of Excavator... but probably 1 or 2 iterations after.
Intel will also have a form of TLDS with HTM... of sorts... in the form of pre-computation with internal thread contexts.
better to invest rather in;
1. higher clocks/ turbo
quite orthogonal, even an internal SMT context, on each core, will not constrain this and most probably will not imply the need for a larger FO4... OTOH your idea of making 2 cluster/cores to work on a single thread might(and not little).
Also an internal SMT context might come in very handy (unavoidable, even) if you want to implement reliable or redundant execution... like IBM has for the big metal... and AMD seems serious about this.
It can be BIOS triggered for those internal SMT contexts:
eager execution/run-ahead for client and small server...
redundant execution for big server jobs
all from the same exact chips...
why is amd allergic to reverse-HTT?
Reverse-HTT is spMT... 1 real thread(context) several cores (achievable by speculating)... (inverse)... HTT is 1 real core several contexts...
So!... will AMD ever implement a form of spMT ?... beyond something like eager execution that is ?... i think yes.
At least AMD, like Intel, seems serious about transactional memory support, and since you'll have a JIT and FSAIL for heterogeneous computing "things" (APUs, HCUs?), which all metal from AMD may very well be in the future, including server chips... why not a good form of data speculative multithreading?