Sunday, May 26, 2013

AMD Steamroller die shot leaked?

Poster Fellix_Bg posted an interesting picture on the Xtreme systems.

It looks like it could be legit die shot but anything is possible. The changes to the core are massive, especially the front end, branch prediction,instr. and data caches and FP unit as well(massively bigger dedicated die area - 2x the die area dedicated for FP in BD/PD).

Higher res. image

There is also some discussion on SA forum about it and comparison of functional units in this die shot and the BD/PD ones.
SA forum member sdlvx made a nice comparison image:


Two posts from knowledgeable people. 1st is Hans de Vries at SA forum:

This seems quite legit and it's a big module indeed.... It seems many resources are doubled:  
Floating point: dual 256 bit FMA instead of dual 128 bit FMA.
Integer: 8 ALU's and 8 AGU's instead of 4 both. Dual 32kB data caches instead of dual 16 kB.
Many other resources are also doubled like rename hardware and so on.

This is how I understand this design (on inf64's request):

The single Bulldozer decoder somehow couldn't handle 2 threads running
at 100% and for benchmarks we see at most a 50% performance increase
when the "second core" becomes active. So it doesn't work good enough
for CMT (but it's more than OK for dual threaded SMT)

Now why not double up the decoder and use the capability to decode
2 threads for SMT instead?

The dual 6 cycle 256 bit FMA FP units "cry out loud" for more threads,
they will be idle and unused otherwise for most of the cycles since you
need 2x6=12 FP operations to go on simultaneously to fully utilize them.
Even with 4 threads that's still 3 FP operations in parallel per thread.

The old 128 bit FMA units used the hardware more efficiently with two
cycles used per AVX operations but I guess one needs full 256 bit AVX
units to score well at these specially designed synthetic, but otherwise
pretty useless, benchmarks.

A single "Integer core" now has 4 ALU's and 4 AGU's which can improve
the integer performance somewhat, but not a lot. Actually I hope they
can still function as dual 2 ALU/AGU integer execution cores to support
4 threads in parallel. That would really help multithreaded performance,
and a little bit of the CMT ideas would survive.

Over time, in subsequent versions, they can now incrementally improve
single threaded performance using 4 ALU as much as possible. But even
then. Integer performance wasn't really Bulldozers problem as can be
seen in the Boinc Dhrystone benchmarks which showed a similar integer
IPC as the Athlon/Phenom cores (as long as the benchmark fits in L1D)

The L1D caches are doubled to 32kB to support 256 bit reads and writes.
A single cache line of 512 bit can now be read and written in a single
cycle freeing up cycles for more program reads and writes. The double
width also reduces bank conflicts.

This strategy to improve Bulldozer/Piledriver is pretty much as I would
have done it. I hope it's indeed AMD's way as well.


Second is from 3dilettante B3D forum:
 For fun, I'll just assume this is a true representation of a future core, or someone put a decent amount of effort into making this up.
There isn't a good shot of this core and Bulldozer of equivalent quality, but here's what it sort of looks like to me.

The L1 has two sections, separated by a pink section that could be the microcode ROM. There is what appears to be a fetch buffer for both sections, so I am left wondering if this is one big L1 I$ or two.
For the instruction section there are two of everything, save the predecoder, branch predictor, and microcode. It think there's a little pink rectangle below the microcode that is a microcode-related engine, and even this is duplicated.

A single integer core looks to have twice the integer ALUs, twice the AGUs, but the multiplier and divider sections don't look replicated. The physical register file doesn't appear to be doubled. If it is bigger, it's not significant enough to split it into new sections or appear to be more than incremental growth.
The portions of the pipeline related to gathering operands and immediates for each instruction are doubled.
Interestingly, it looks like the rename tables and retire structures are doubled in size.
The table that might have to do with waking up/picking instructions could be bigger, but it isn't doubled.
The odd thing from a single-threaded perspective is that the scheduler logic outside of the tables is either much denser or not much larger. It's also not really necessary to have double the retirement tracking or rename tables for a core whose decoder is still 4-wide--if the core is single-threaded.
Perhaps it isn't.
If the integer pipeline is partitioned into halves, it might make the bypassing network less of a nightmare than expanding it from 2 ALUs and 2 AGUs to 4x4 single-cycle.

I've only had fuzzy BD shots to compare with, which makes the load/store section particularly hard to analyze.
The L1 data cache appears to be different, but not necessarily much bigger in area. If it's not bigger, it may be more aggressively banked. The interfacing logic on the side of the L1 doesn't appear to have more subunits, which might mean the port count hasn't changed. I don't know if its bandwidth has changed, but the fuzzily pictured width of that interface doesn't seem to be much different.
The L/S section appears relatively narrower compared to the sections that did grow, which could indicate it has been slightly modified.
There are a few duplicated/grown structures, which might be queues for loads and stores. My die-shot-fu isn't good enough to know which one is which. The more obviously duplicated structure may be a pair of store queues.

The FP unit appears to have rotated a bunch of components 90 degrees, and doubled the capacity of the register file. The two halves of each bank of registers aren't physically identical. This may be a legacy/full vector set distinction.
For what it's worth, what watchimpress stated is the retire control is duplicated. There were already two banks, one per thread with the original dual-threaded FPU. This could mean it's now able to support four.

I think there is still an upper/lower data path split, but I'm not sure which way is best to handle it.
The units themselves have a double pink line through the middle, which may be a way to separately gate each half.
It may be better to rotate the whole FPU 90 degrees. Instead of it being left:right=Hi:Lo, it's left=register file 0, right=register file 1.
Each half would have its own Hi:Lo split. There seems to be some extra routing going on between the upper and lower halves of each side to permit shuffling between them.
There's a slight break in the symmetry of the left and right sides, however, particularly in the lower right. There may be some special functions that use different hardware there. There is some additional routing in the lower half that could explain how one half could still use whatever special hardware is on the other side.
This may also explain the potentially narrower depiction of the FPU in some slides, which went form 2 FMAC+ 2MMX(one being also FSTORE) to 2 FMAC + 1MMX. The hardware would be mostly the same, but the common case would be that each thread can only see its side of the FPU and other ops that could cross to the other side have the one port doing double-duty.