Sunday, December 29, 2013

1st Kaveri leak from JP

I found something for you guys; I've posted it on SA and here for now. Hope you like it. The scores are not out of this world like last time, so this looks like a legit screenshot. The magic date is January 14.


PS: It takes a 4.2Ghz PD to match a 3.7Ghz SR in C15. The 7850K runs C15 at 3.7Ghz; it doesn't turbo up in this workload (at least not in the run that you see in the image above).
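
For those who want the rough math behind that PS, here is a minimal sketch. It assumes (my simplification, not something the screenshot proves) that the C15 score scales linearly with core clock on both chips, so an equal score at different clocks translates directly into a per-clock difference.

```python
# Assumption: Cinebench R15 score scales linearly with core clock on both chips.
pd_clock_ghz = 4.2   # Piledriver clock needed to match the score
sr_clock_ghz = 3.7   # Steamroller (7850K) clock during the leaked run

# Same score at a lower clock means proportionally more work done per clock.
per_clock_gain = pd_clock_ghz / sr_clock_ghz - 1.0

print(f"Implied per-clock gain of SR over PD in C15: {per_clock_gain:.1%}")
```

That works out to roughly 13-14% per clock in this one workload, which says nothing about other benchmarks.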

Wednesday, December 11, 2013

Kaveri is launching soon. What to expect?

Long time since my last blog post. Since AMD is about to launch desktop Kaveri soon (in ~January) I wanted to write a couple of lines about what *I* think about this product launch.

During this year we have had a few "leaks" regarding Kaveri's performance. Conveniently, none of these were solid enough to draw any meaningful conclusion about actual core performance or clock capabilities. We had a few low-clocked and disproportionately low-performing mobile parts listed on a website, and a few Geekbench scores of desktop ES parts (single-channel memory). And that's it. Pretty vague and inconclusive.

Now people do have high expectations when it comes to the Steamroller cores that power Kaveri, and the "blame" largely lies with AMD themselves. I know, since I had them too. AMD built up the hype with their presentations at HC last year and in a few interviews after that. We had hints of "solid" IPC jumps, new ISA support (turns out it's a few more instructions and not AVX2), a quad channel IMC (GDDR5 mode, 4x32bit), 3 module parts, big iGPU perf. gains etc. What we have now is roughly the following:
1) The 3 module version is scrapped, for reasons that will be obvious at launch;
2) Quad channel GDDR5 support is gone as well, although the IMC has the support sitting idle on-die (deactivated); figure that...
3) "IPC" can be substantially better, BUT only if one cherry-picks benchmarks and filters out the "bad" ones (this also relates to point 1 and to what constitutes the total performance of a chip);
4) If Richland had a memory BW problem, it was a minor one compared to what Kaveri will have.
 I think I've said enough.

I've always expected that the 3rd iteration of the Bulldozer core would bring the most tangible results when it comes to removing bottlenecks and getting the best out of the CMT concept. Due to wrong decisions on AMD's part, I'm afraid this will just remain a dream.
As for me, I have my Piledriver based X4 and it's doing fine at 4.2Ghz. I personally see nothing that Kaveri will bring on the CPU side to sway me into swapping it into my board. Yeah, sorry folks.

Thursday, June 6, 2013

AMD Richland (desktop) reviews

AMD has officially launched Richland APUs for desktop PCs. The chips are a solid improvement over Trinity: not earth shattering, but a good perf./watt (and straight x86 performance) upgrade over the Trinity parts.

Here is a summary of most relevant reviews in one topic.

Tuesday, June 4, 2013

What is FX @ 5Ghz exactly and why is it listed as "next generation" FX core by Gigabyte?

Good question huh? Yep sure is. Also what is "TJ" :) ??

Had a better look at the Gigabyte sheet.

It says "AMD next generation FX".
Vishera is no next gen, it is almost two years old already

I assume Gigabyte refers to "TJ", which still should pop into AM3r2.

The sheet is for Rev 3.0 990FXA-UD7 which has some changes to the VRM.
Yet it appears to be a multiplied analog VRM.

Interesting times ahead ;).

Monday, June 3, 2013

AMD's 5Ghz "next generation" FX models under way!

Ok guys, finally some Computex news (still not official from AMD though). Thanks go to Hexus' and overclocknet's CPU forums.
It's semi-official: there will be "high-TDP" FX models presented soon. Gigabyte lists it as a 5Ghz AMD FX, lol.
Original source is donanimhaber.

Some rough perf. and power estimates:
I've just read SweClockers' review of the 4770K/4670K. I wanted to see what kind of power difference we are looking at vs the 8350. It turns out that in OCCT the 4770K draws 140W vs 210W for the 8350, both at stock. When OCed to 4.4Ghz, the 4770K's power draw rises to 190W while performance increases by roughly 11% (logical, since at stock the chip almost always tries to run at its full turbo of 3.9Ghz).

Now when we look at the rumored OCed-out-of-the-box FX models, we see them having a rather high TDP of 220W. This is 95W higher than the official TDP spec for the 8350 (or a bit less than that, considering the 8350 can actually go above 125W). Going by the SweClockers review I linked above, the FX9000 @ 5Ghz should be sitting at 210+95~=305W. This is with a 20% clock (performance) increase, since both the rumored stock and Turbo clocks are ~20% higher than on the 8350. 305W for a "stock" FX9000 vs 190W for a 4770K OCed to a rather expected/normal 4.4Ghz on air: that's 115W of difference, roughly of course. Since the jump with the FX9000 is 20% vs the ~11% perf. increase the OC of the 4770K nets, the FX9000 actually closes the gap somewhat when compared to an OCed 4770K. Versus a stock 4770K, the new "stock" FX should be winning some of the benchmarks and closing the gap significantly in others. In some situations the gap will still be there though, and with a hefty power difference at that.
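
Here is the same back-of-the-envelope estimate as a small sketch. All inputs are the rough figures quoted above; adding the TDP delta straight onto the measured OCCT draw is my own simplification, and the 220W TDP is still only a rumor.

```python
# Inputs are the rough numbers quoted in this post, not new measurements.
fx8350_tdp_w = 125        # official TDP
fx9000_tdp_w = 220        # rumored TDP
fx8350_occt_w = 210       # measured OCCT draw at stock
i7_4770k_oc_w = 190       # measured OCCT draw at 4.4Ghz

# Simplification: measured draw grows by the same amount as the TDP spec.
tdp_delta_w = fx9000_tdp_w - fx8350_tdp_w
fx9000_est_w = fx8350_occt_w + tdp_delta_w
gap_vs_oc_4770k_w = fx9000_est_w - i7_4770k_oc_w

# Relative performance shift: FX gains ~20% from clocks, OCed 4770K gains ~11%.
fx_gain = 0.20
intel_oc_gain = 0.11
fx_relative_shift = (1 + fx_gain) / (1 + intel_oc_gain) - 1

print(f"TDP delta: {tdp_delta_w}W")
print(f"Estimated FX9000 draw: {fx9000_est_w}W")
print(f"Gap vs OCed 4770K: {gap_vs_oc_4770k_w}W")
print(f"FX moves closer by about {fx_relative_shift:.1%} relative to the OCed 4770K")
```

So the FX moves about 8% closer in relative performance, at a cost of roughly 115W more under load.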

Finally, the pricing structure. If AMD prices these parts below the 4770K, somewhere close to the 4670K or a bit higher (250-260$), they could be an "OK" purchase for those wanting great MTed performance and somewhat competitive ST performance (a 5Ghz Vishera is a pretty fast chip in ST tasks, though still slower than IB and co). Lastly, those buyers should be ready to accept the fact that their chip will be drawing roughly 115W or more above an OCed IB/Haswell setup, but will be ~100$ or so cheaper.

4770K/4670K (Haswell) reviews

Haswell has launched and brought us the fastest x86 performance yet. It's great to think we have come so far from 2006. Still, the performance jump from the IvyBridge core is pretty disappointing in legacy code (FMA3/AVX2 is another story, yet to be written when software support arrives).

Here is an XS topic that covers most if not all reviews so far.

Saturday, June 1, 2013

Kaveri is FM2 compatible?

Roy Taylor spoke to newegg TV about Richland. There is an interesting remark from him at around 3:23 into the video: Kaveri will be compatible with FM2 boards, according to him. We know that AMD lists Kaveri as FM2+, so my guess is that it's compatible BUT you would want an FM2+ board with GDDR5 or DDR4 (whichever it supports) in order to get the best possible performance.

Sunday, May 26, 2013

AMD Steamroller die shot leaked?

Poster Fellix_Bg posted an interesting picture on XtremeSystems.

It looks like it could be a legit die shot, but anything is possible. The changes to the core are massive, especially in the front end, branch prediction, instr. and data caches, and the FP unit as well (massively bigger dedicated die area: 2x the die area dedicated to FP in BD/PD).

Higher res. image

There is also some discussion on SA forum about it and comparison of functional units in this die shot and the BD/PD ones.
SA forum member sdlvx made a nice comparison image:


Two posts from knowledgeable people. 1st is Hans de Vries at SA forum:

This seems quite legit and it's a big module indeed.... It seems many resources are doubled:  
Floating point: dual 256 bit FMA instead of dual 128 bit FMA.
Integer: 8 ALU's and 8 AGU's instead of 4 both. Dual 32kB data caches instead of dual 16 kB.
Many other resources are also doubled like rename hardware and so on.

This is how I understand this design (on inf64's request):

The single Bulldozer decoder somehow couldn't handle 2 threads running
at 100% and for benchmarks we see at most a 50% performance increase
when the "second core" becomes active. So it doesn't work good enough
for CMT (but it's more than OK for dual threaded SMT)

Now why not double up the decoder and use the capability to decode
2 threads for SMT instead?

The dual 6 cycle 256 bit FMA FP units "cry out loud" for more threads,
they will be idle and unused otherwise for most of the cycles since you
need 2x6=12 FP operations to go on simultaneously to fully utilize them.
Even with 4 threads that's still 3 FP operations in parallel per thread.

The old 128 bit FMA units used the hardware more efficiently with two
cycles used per AVX operations but I guess one needs full 256 bit AVX
units to score well at these specially designed synthetic, but otherwise
pretty useless, benchmarks.

A single "Integer core" now has 4 ALU's and 4 AGU's which can improve
the integer performance somewhat, but not a lot. Actually I hope they
can still function as dual 2 ALU/AGU integer execution cores to support
4 threads in parallel. That would really help multithreaded performance,
and a little bit of the CMT ideas would survive.

Over time, in subsequent versions, they can now incrementally improve
single threaded performance using 4 ALU as much as possible. But even
then. Integer performance wasn't really Bulldozers problem as can be
seen in the Boinc Dhrystone benchmarks which showed a similar integer
IPC as the Athlon/Phenom cores (as long as the benchmark fits in L1D)

The L1D caches are doubled to 32kB to support 256 bit reads and writes.
A single cache line of 512 bit can now be read and written in a single
cycle freeing up cycles for more program reads and writes. The double
width also reduces bank conflicts.

This strategy to improve Bulldozer/Piledriver is pretty much as I would
have done it. I hope it's indeed AMD's way as well.
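
To put numbers on the FP utilization argument in Hans' post, here is a minimal sketch. The unit count, the 6 cycle latency and the 256 bit width are his reading of the die shot, not confirmed specs; the peak FLOPs figure assumes single precision and counts one FMA as two operations.

```python
# Figures below are taken from the quoted post's reading of the die shot (unconfirmed).
fma_units = 2
fma_latency_cycles = 6
threads = 4

# To keep a pipelined unit busy every cycle, one independent op must be
# issued per cycle for the whole latency window.
ops_in_flight = fma_units * fma_latency_cycles
ops_per_thread = ops_in_flight / threads

def peak_sp_flops_per_cycle(units: int, width_bits: int) -> int:
    """Peak single-precision FLOPs per cycle; an FMA counts as 2 operations."""
    lanes = width_bits // 32
    return units * lanes * 2

old_peak = peak_sp_flops_per_cycle(2, 128)   # BD/PD style dual 128 bit FMA
new_peak = peak_sp_flops_per_cycle(2, 256)   # dual 256 bit FMA as read from the shot

print(f"Independent FP ops needed in flight: {ops_in_flight}")
print(f"Per thread with {threads} threads: {ops_per_thread:.0f}")
print(f"Peak SP FLOPs/cycle per module: {old_peak} -> {new_peak}")
```

This matches the 12 operations in flight and 3 per thread mentioned in the post; the doubled peak only matters if software can actually supply that much independent FP work.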


The second is from 3dilettante at the B3D forum:
For fun, I'll just assume this is a true representation of a future core, or someone put a decent amount of effort into making this up.
There isn't a good shot of this core and Bulldozer of equivalent quality, but here's what it sort of looks like to me.

The L1 has two sections, separated by a pink section that could be the microcode ROM. There is what appears to be a fetch buffer for both sections, so I am left wondering if this is one big L1 I$ or two.
For the instruction section there are two of everything, save the predecoder, branch predictor, and microcode. I think there's a little pink rectangle below the microcode that is a microcode-related engine, and even this is duplicated.

A single integer core looks to have twice the integer ALUs, twice the AGUs, but the multiplier and divider sections don't look replicated. The physical register file doesn't appear to be doubled. If it is bigger, it's not significant enough to split it into new sections or appear to be more than incremental growth.
The portions of the pipeline related to gathering operands and immediates for each instruction are doubled.
Interestingly, it looks like the rename tables and retire structures are doubled in size.
The table that might have to do with waking up/picking instructions could be bigger, but it isn't doubled.
The odd thing from a single-threaded perspective is that the scheduler logic outside of the tables is either much denser or not much larger. It's also not really necessary to have double the retirement tracking or rename tables for a core whose decoder is still 4-wide--if the core is single-threaded.
Perhaps it isn't.
If the integer pipeline is partitioned into halves, it might make the bypassing network less of a nightmare than expanding it from 2 ALUs and 2 AGUs to 4x4 single-cycle.

I've only had fuzzy BD shots to compare with, which makes the load/store section particularly hard to analyze.
The L1 data cache appears to be different, but not necessarily much bigger in area. If it's not bigger, it may be more aggressively banked. The interfacing logic on the side of the L1 doesn't appear to have more subunits, which might mean the port count hasn't changed. I don't know if its bandwidth has changed, but the fuzzily pictured width of that interface doesn't seem to be much different.
The L/S section appears relatively narrower compared to the sections that did grow, which could indicate it has been slightly modified.
There are a few duplicated/grown structures, which might be queues for loads and stores. My die-shot-fu isn't good enough to know which one is which. The more obviously duplicated structure may be a pair of store queues.

The FP unit appears to have rotated a bunch of components 90 degrees, and doubled the capacity of the register file. The two halves of each bank of registers aren't physically identical. This may be a legacy/full vector set distinction.
For what it's worth, what watchimpress stated is the retire control is duplicated. There were already two banks, one per thread with the original dual-threaded FPU. This could mean it's now able to support four.

I think there is still an upper/lower data path split, but I'm not sure which way is best to handle it.
The units themselves have a double pink line through the middle, which may be a way to separately gate each half.
It may be better to rotate the whole FPU 90 degrees. Instead of it being left:right=Hi:Lo, it's left=register file 0, right=register file 1.
Each half would have its own Hi:Lo split. There seems to be some extra routing going on between the upper and lower halves of each side to permit shuffling between them.
There's a slight break in the symmetry of the left and right sides, however, particularly in the lower right. There may be some special functions that use different hardware there. There is some additional routing in the lower half that could explain how one half could still use whatever special hardware is on the other side.
This may also explain the potentially narrower depiction of the FPU in some slides, which went from 2 FMAC + 2 MMX (one being also FSTORE) to 2 FMAC + 1 MMX. The hardware would be mostly the same, but the common case would be that each thread can only see its side of the FPU and other ops that could cross to the other side have the one port doing double-duty.