Thursday, June 9, 2011

Zambezi ES performance weirdness

Sorry in advance for the longer post :).
Disclaimer: This is just my speculation, and it is probably nothing more than that. I have not signed any NDA documents, nor do I have the hardware discussed here.

Since we have all witnessed some very strange Zambezi (and Llano) ES scores, this is my attempt to "predict" and explain what may have been going on with them. I will post what I expect as the end score, so that when Zambezi launches we can see how far off from the real thing I was :).

Let's start with my theory about why Zambezi X8 has such low scores. I do believe there is at least some BIOS microcode patching going on, but mostly it's something else. As dresdenboy suggested before, and I agree with him, there is some power cap pre-programmed into the ES chips we are seeing in Chinese forums. This may explain the frowns of AMD's motherboard partners, who received the same tweaked chips for the validation process ;).
  Just like in Llano's case, actual clocks are being kept really low in order to keep the CPU within the TDP spec (via the Turbo 2.0 interface) that AMD designated in the ES sampling process. This may be 35, 45, 65, 95 or 125W. From the looks of things, current BD ES chips are limited to 45 or 65W, and they keep throttling down whenever the limit is crossed (power is measured and estimated digitally in BD).
What does this mean in practice? Just as in the case of the Llano ES in the "New Llano leaks" thread, the BD ES throttles down in singlethreaded workloads to approx. half the clock speed that CPU-Z shows. It happens in MT workloads too, as the limit is easily reached in this case. There seems to be a limited "Turbo" ability too, so, say, a 2.8GHz ES part may be able to Turbo to what I think is 2.0GHz (10x multi in reality :P) or up to the power limit, which is reached in this case.
So, for example, the 2.8GHz ES (a 1.4GHz chip with 2GHz Turbo and advanced C6 power savings turned ON) scored 23.4s in SuperPi. When the tester disabled C6 and seemingly locked the ES @ 3.2GHz (1.6GHz effectively, while preventing cores from going into deep sleep and thus reaching the TDP limit sooner), the SuperPi result actually got worse, at 26.7s. The cores now did "Turbo" approx. one multiplier up and finished the test at 1.8GHz. This is in line with the worse SuperPi result.

Now that I have explained my theory and what I think is going on here, let's move on to my prediction of scores, all based on the Chinese leaks thread.

We start off with the infamous and useless SuperPi.
Clock speed 2.8GHz, real clock speed 1.4GHz with turbo up to 2GHz, C6 enabled. Score: 23.3s.
My projected 1M score for retail Zambezi @ 3.2GHz without any Turbo engaged: 3.2/2=1.6 correction factor => 23.3/1.6=14.56s. Score with Turbo 2.0, which is rumored to be 1GHz over stock: 14.56/(4.2/3.2)=11.1s.
Compare (same page as above) with C6 disabled and clocks seemingly locked at 3.2GHz (1.6GHz with limited Turbo up to 1.8GHz): 26s.
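
The SuperPi projection above can be re-checked with a few lines of Python. This is only a minimal sketch: it assumes run time is inversely proportional to clock speed (a simplification), and the helper name `scale_time` and the constants are made up for illustration.

```python
def scale_time(measured_s, measured_ghz, target_ghz):
    """Scale a run time to another clock, assuming time is inversely
    proportional to clock speed (a simplifying assumption)."""
    return measured_s * measured_ghz / target_ghz

ES_TIME_S = 23.3     # leaked SuperPi 1M time of the ES
ES_REAL_GHZ = 2.0    # assumed effective (throttled) Turbo clock of the ES
RETAIL_GHZ = 3.2     # assumed retail stock clock
TURBO_GHZ = 4.2      # rumored Turbo 2.0 clock, 1GHz over stock

stock_s = scale_time(ES_TIME_S, ES_REAL_GHZ, RETAIL_GHZ)
turbo_s = scale_time(ES_TIME_S, ES_REAL_GHZ, TURBO_GHZ)
print(f"stock: {stock_s:.2f}s, turbo: {turbo_s:.2f}s")
```

Scaling straight from the 2.0GHz clock gives the same result as the two-step calculation in the text, up to rounding.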

Next one is Fritz chess. This is a tricky one. The 1 core score from here is 1877pts, with C6 enabled and limited turbo to 2GHz. The user runs the MT test with 8 cores and gets a 9454pts result. How is this possible? Well, in my opinion, the TDP limit kicked in again, limiting each core to 1.6GHz while the multithreaded (MT) test was run. We know that the scaling of modules is 80% of a hypothetical native Bulldozer dual core design (as per AMD themselves), meaning a 6.4x factor instead of the perfect 8x => 0.8x8=6.4. We have: 1877 x (1.6/2) x 6.4 = 9610pts, close enough huh? The error is under 2% from the actual score ;).
What do I think the score of a 3.2GHz Zambezi 8C will be in Fritz chess? 19220pts, give or take 2% (1877 x (3.2/2) x 6.4).
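
A quick way to re-check the Fritz numbers is the short Python sketch below. It assumes the score scales linearly with clock speed and that the 80% module scaling applies as a flat per-core penalty. The function name `fritz_mt_estimate` is hypothetical, and the 3.2GHz figure is reproduced on the assumption that it comes from the same formula with no Turbo.

```python
MODULE_SCALING = 0.8   # AMD's claim: a module gives 80% of a native dual core
CORES = 8

def fritz_mt_estimate(single_pts, single_ghz, mt_ghz,
                      cores=CORES, scaling=MODULE_SCALING):
    """Estimate a multithreaded score from a 1-core score, assuming linear
    scaling with clock and a flat sharing penalty per core."""
    return single_pts * (mt_ghz / single_ghz) * cores * scaling

SINGLE_PTS = 1877      # leaked 1-core score (assumed to run at 2.0GHz)
LEAKED_MT_PTS = 9454   # leaked 8-core score

es_pts = fritz_mt_estimate(SINGLE_PTS, 2.0, 1.6)      # throttled ES
retail_pts = fritz_mt_estimate(SINGLE_PTS, 2.0, 3.2)  # 3.2GHz, no Turbo
error_pct = 100 * (es_pts - LEAKED_MT_PTS) / LEAKED_MT_PTS

print(f"ES estimate: {es_pts:.0f}pts ({error_pct:.2f}% off the leak)")
print(f"retail estimate: {retail_pts:.0f}pts")
```

The unrounded ES estimate lands a little under 2% above the leaked 8-core result.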

Next one is the popular Cinebench 11.5. The "gimped" BD ES scored just 4.6pts. Too low? You bet. This is in line with a Phenom X4 @ 3.5GHz, while this was supposed to be a brand new octal core from AMD running at close to 3.2GHz. Well, the explanation is easy and is again, as in the previous case, power capping.
So we have a supposedly 2.8GHz chip scoring 4.6pts in C11.5. As my theory goes, this is actually the score of a 1.4GHz or 1.6GHz 8C Zambezi which is limited via the power cap (I don't know what they set in the BIOS, 2.8GHz or 3.2GHz). What will be the score of a retail 3.2GHz Zambezi in this benchmark? My estimate is as follows: 1) worst case scenario: 4.6pts x 3.2/1.6 = 9.2pts; 2) best case scenario: 4.6pts x 3.2/1.4 = 10.51pts. Now 10.51pts may sound too high since the 980X scores 9.2pts, but remember this slide?



This leaked AMD slide, published by Donanimhaber, states that Zambezi 8C @ XX GHz will be approximately 1.80x faster than the Thuban X6 1100T in C11.5, which scores 5.9pts. This works out to 5.9x1.8=10.6pts. Close enough?
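
The two Cinebench scenarios and the slide figure can be put side by side in a small Python sketch. It assumes the score scales linearly with clock speed, which is a simplification, and the name `scale_score` is invented for illustration.

```python
def scale_score(measured_pts, assumed_real_ghz, target_ghz):
    """Scale a throughput score linearly with clock speed (an assumption)."""
    return measured_pts * target_ghz / assumed_real_ghz

ES_PTS = 4.6        # leaked Cinebench 11.5 score of the ES
RETAIL_GHZ = 3.2    # assumed retail clock

worst = scale_score(ES_PTS, 1.6, RETAIL_GHZ)  # if the ES really ran at 1.6GHz
best = scale_score(ES_PTS, 1.4, RETAIL_GHZ)   # if the ES really ran at 1.4GHz
slide = 5.9 * 1.80                            # 1100T score x ratio from the slide

print(f"worst {worst:.2f}, best {best:.2f}, slide {slide:.2f}")
```

Under these assumptions the slide figure sits within about 0.1pts of the best case scenario.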

Following C11.5 is the famous 3DMark06. The score is also on the same link as all of the above. The "3.2GHz" Zambezi 8C supposedly scores ~4500pts... Yeah, that's right, a score just a bit better than what a Phenom X4 @ 3GHz, launched in Jan 2009, is getting. So we supposedly have a brand new core design, packing 8 cores in total, with IPC improvements, that still somehow sucks so badly that it is 2x slower per core and per clock than the ancient Phenom II X4 (just forget the X6, it's miles ahead of this poor Zambezi).
So what do I think is the real score of Zambezi in this benchmark? It should be between 7800 and 8800pts. The point is that this is one weird test that doesn't scale nicely past 6 cores, so it's a bit trickier to figure out what a "normal" Zambezi will score here. The Donanimhaber slide indicates a 50% better score for the 3+GHz (I assume) X8 vs the 1100T, which scores 5900pts. So we have, as projected by AMD: 1.5x5900=8850pts (roughly 8800pts), and as estimated here by me: 7800-8800pts.
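
For the 3DMark06 numbers, a few lines of Python show how the slide projection compares with the leaked ES score and with the estimated range. It is only a sketch using the figures quoted in this post, and the constant names are made up for illustration.

```python
LEAKED_ES_PTS = 4500      # approximate leaked ES score
THUBAN_1100T_PTS = 5900   # 1100T score quoted in the post
SLIDE_RATIO = 1.5         # "50% better", as read off the leaked slide
POST_RANGE = (7800, 8800) # estimated range given in the post

slide_pts = SLIDE_RATIO * THUBAN_1100T_PTS
ratio_vs_leak = slide_pts / LEAKED_ES_PTS   # how far the leak is from the slide
overshoot = slide_pts - POST_RANGE[1]       # slide vs top of the estimated range

print(f"slide: {slide_pts:.0f}pts, {ratio_vs_leak:.2f}x the leaked score")
print(f"slide exceeds the top of the range by {overshoot:.0f}pts")
```

The slide projection comes out at roughly twice the leaked score, which is about what the half-clock theory would predict if the test scaled linearly with clock speed.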


So there you go, I tried my best to figure out what is going on with those gimped Bulldozer ES chips out in the (Chinese) wilderness. Not much longer to go, around 45 days or so. We shall soon know how wrong I was. Stay tuned for more.

14 comments:

  1. Very interesting. I personally think the HT speed had a top cap @ 100MHz, which would give the results you suggest here.

    Again, very interesting info here :)

  2. Thanks for the comment. And you are right, it looks like HT is actually 100MHz in the BIOS, just as in Llano's case :).

  3. Looks very reasonable, strong arguments. Thanks for the comment. As the Russians say, we'll live, we'll see. :)

  4. Excellent information, all we have to do is wait a little longer and all you have to do is change the blog colors so I don't have to go blind.

    Thank you

  5. Thanks for the comments, guys. Check out my latest "Llano tested" blog post: the multiplier bug (or HT link = 100MHz instead of 200MHz) issue is gone :). Retail Llano performs like it should, and not 2x slower than Deneb (per clock) as the first Chinese leaks indicated ;)

  6. This comment has been removed by the author.

  7. Scaling is more correctly 90% per core, not 80%. Total effectiveness of one module is 180% of a single core; divide that by 2 to get the effectiveness of 1/2 of a module.

    That gives a 7.2x effective increase in performance over a single core. I believe this was confirmed unofficially by one of the AMD guys.

    Sorry for the double post, hit preview and it posted instead.

  8. AMD's slide from FAD2010 claimed 80% of the throughput of a conventional dual core design. 80% of 4 hypothetical native dual core Bulldozer modules (done the "old way", a la A64 X2) gets you exactly 640%, or a 6.4x increase over a single Bulldozer core in the 8C chip. But this sharing penalty is actually good news, since you have to upscale the single thread result (when only one thread is running) by 1.25x :). And on top of this comes Turbo.

  9. As for the double post in comments, don't worry about it :). BTW guys, do you have any suggestions for a better blog color/template? Should I use a white one, or maybe the standard orange color?

  10. If you change to another color/template, never use a white background. Go with standard orange, or white text on a black background if possible.

  11. This comment has been removed by the author.

  12. I changed it, I hope it's OK and not too aggressive on the eyes :)

  13. That AMD slide states "estimates and projections, subject to change".

  14. @lexwalker
    Of course, the slide is from the days when AMD had not finalized the clock speeds but had finalized the design. So if they hit ballpark clock speeds (say 3.5GHz projected and they hit 3.2GHz), the numbers are close enough.
