Monday, June 27, 2011

Llano's multitasking performance

A video made by VR-Zone shows a Llano 3850 overclocked to 3.4GHz running Crysis 2 at great fps while simultaneously running a Linpack test. All 4 cores are loaded, and the game still runs fine.

Monday, June 13, 2011

Llano Vs SB (in progress)

Update: I'm waiting on the complete AT review (it is still in the preview stage). He did retest the A8 with 1600 and 1866MHz RAM and it made a massive difference. On the CPU side, Llano is around 3-8% faster than Deneb at the same clock: a solid improvement but nothing spectacular, and a good result for a slightly tweaked shrink. Turbo may be a bit of a letdown, since the max turbo states are rarely hit: the TDP is shared, and the GPU part is prioritized over the CPU cores. The top desktop part has no Turbo and works at a fixed 2.9GHz clock, with power management P-states in between (800MHz-2.9GHz).

I'm making a chart that summarizes Llano's general performance vs Sandy Bridge parts, stay tuned.

Sunday, June 12, 2011

Revisiting Bulldozer ES weirdness story

You all remember my original blog post about the BD ES weirdness going on in the Far East (and probably elsewhere). The problem is/was that those chips are gimped in many ways, so that the competition is unable to figure out the true potential of the first new AMD core design since the original K7.

Well, we have a sequel now.

Intel announces AVX2 (Haswell) ISA extensions

Intel announced yesterday the new AVX2 ISA extensions that should be introduced with Haswell in 2013. We finally get 256-bit integer AVX instructions, since AVX1 was limited to FP when it comes to 256-bit support. There are other additions like support for FMA (256b/128b, but FP only), but the main one is 256b integer SIMD support.
AVX2 extends Intel AVX by promoting most of the 128-bit SIMD integer instructions
with 256-bit numeric processing capabilities. AVX2 instructions follow the same
programming model as AVX instructions.
In addition, AVX2 provide enhanced functionalities for broadcast/permute operations
on data elements, vector shift instructions with variable-shift count per data
element, and instructions to fetch non-contiguous data elements from memory. 
http://software.intel.com/file/36945

Friday, June 10, 2011

Retail desktop Llano APU tested!

UPDATE:
As per a reader's suggestion, I'm adding numbers for Sandy Bridge (2600K, default) next to Llano's stock results. Sources for the 2600K results: link1, link2.
UPDATE #2:
Found some more real gaming (fps) results for a Llano ES (not retail) here. Massively faster than HD3000. Check the bottom of this blog post for exact numbers.


It seems someone got hold of a retail A8 3850 part and tested it here. Thanks go to dresdenboy for the link.
The part works at 2.9GHz and this model has no turbo (I originally wrote 2.6GHz with turbo up to 2.9GHz, which was wrong). The GPU part works at 600MHz and features 400SP, so it is on par with the discrete 6550M part.

Now on to results!

The user ran a set of Futuremark tests: 3DMark11, Vantage and 3DMark06. He also managed to OC both the CPU and GPU portions of the part to some rather high levels, just via serial bus tuning (45%), using air cooling. Final OC speeds are 3.77GHz (30% OC) for the CPU and 870MHz (45% OC) for the GPU; DDR3 was also OCed to 2320MHz (45% OC). The memory OC is very important, since the GPU still depends on memory bandwidth, and it ensures that GPU performance scales linearly with the (GPU) clock speed increase.
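
The overclock percentages quoted above are easy to sanity-check. Here is a minimal Python sketch, assuming the stock clocks given in this post (2.9GHz CPU, 600MHz GPU). The stock DDR3 speed is not stated here, so the script only back-computes what a 45% memory overclock would imply. The function and variable names are my own, for illustration only.

```python
def oc_percent(stock_mhz, overclocked_mhz):
    """Overclock expressed as a percentage over the stock clock."""
    return (overclocked_mhz / stock_mhz - 1.0) * 100.0

cpu_oc = oc_percent(2900, 3770)   # CPU: 2.9GHz stock -> 3.77GHz
gpu_oc = oc_percent(600, 870)     # GPU: 600MHz stock -> 870MHz

# Assumption: the 45% figure applies to memory as well, so the stock
# DDR3 speed is whatever 2320MHz divided by 1.45 works out to.
implied_ddr3_stock = 2320 / 1.45

print(f"CPU OC: {cpu_oc:.0f}%")
print(f"GPU OC: {gpu_oc:.0f}%")
print(f"Implied stock DDR3 speed: {implied_ddr3_stock:.0f}MHz")
```

The CPU comes out at 30% and the GPU at 45%, so the CPU multiplier must have been lowered while the bus clock went up.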

Thursday, June 9, 2011

Zambezi ES performance weirdness

Sorry in advance for the longer post :).
Disclaimer: This is just my speculation, which is probably just that, speculation. I have not signed any NDA documents, nor do I have the hardware discussed here.

Since we have all witnessed very strange Zambezi (and Llano) ES scores, this is my attempt to "predict" and explain what may have been going on with these scores. I will post what I expect as an end score, so in the end, when Zambezi launches, we can see how far from the real thing I was :).

Let's start with my theory about why Zambezi X8 has such low scores. I do believe there is at least some BIOS microcode patching going on, but mostly it's something else. As dresdenboy suggested before, and I agree with him, there is some power cap pre-programmed in the ES we are seeing in Chinese forums. This may explain the frowning of AMD's motherboard partners, who received the same tweaked chips for the validation process ;).
Just like in Llano's case, actual clocks are being kept really low in order to keep the CPU within the TDP spec (via the Turbo 2.0 interface) that AMD designated in the ES sampling process. This may be 35, 45, 65, 95 or 125W. From the looks of things, current BD ES are limited to 45 or 65W and they keep throttling down whenever the limit is crossed (power is measured and estimated digitally in BD).
What does this mean in practice? Just as in the case of the Llano ES in the "New Llano leaks" thread, the BD ES throttles down to approx. half the clock speed in singlethreaded workloads, compared with what is shown in CPU-Z. This also happens in MT workloads, as the limit is easily reached in this case. There seems to be a limited "Turbo" ability too, so, say, a 2.8GHz ES part may be able to Turbo to what I think is 2.0GHz (a 10x multi in reality :P) or up to the power limit, which is reached in this case.
So, for example, the 2.8GHz ES (a 1.4GHz chip with 2GHz Turbo and advanced C6 power savings turned ON) scored 23.4s in SPi. When the tester disabled C6 and seemingly locked the ES at 3.2GHz (1.6GHz effectively, while preventing cores from going into deep sleep and thus reaching the TDP limit sooner), the SuperPi result actually got worse, at 26.7s. The cores now did "Turbo" only approx. one multiplier up and finished the test at 1.8GHz. This is in line with the worse SPi score.

Now that I have explained my theory and what I think is going on here, let's move on to my prediction of scores, all based on the Chinese leaks thread.

We start off with the infamous and useless SuperPi.
Clock speed 2.8GHz, real clock speed 1.4GHz with turbo up to 2GHz, C6 enabled. Score: 23.3s.
My projected score for retail Zambezi @ 3.2GHz without any Turbo engaged: 3.2/2 = 1.6 correction factor => 1M score: 23.3/1.6 = 14.56s. Score with Turbo 2.0, which is rumored to be 1GHz over stock: 14.56/(4.2/3.2) = 11.1s.
Compare (same page as above) with C6 disabled and clocks seemingly locked at 3.2GHz (1.6GHz with limited Turbo up to 1.8GHz): 26s.
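
The SuperPi projection above boils down to one assumption: run time is inversely proportional to the real clock speed. Here is a small Python sketch of that scaling, using only the numbers from this post. The helper name `scale_time` is my own, and perfect clock scaling is an assumption, not a measured fact.

```python
def scale_time(seconds, from_ghz, to_ghz):
    """Rescale a run time, assuming it is inversely proportional to clock speed."""
    return seconds * from_ghz / to_ghz

es_time = 23.3  # SuperPi 1M on the ES, assumed to really run at 2.0GHz

stock = scale_time(es_time, 2.0, 3.2)    # retail chip at 3.2GHz, no Turbo
turbo = scale_time(stock, 3.2, 4.2)      # rumored Turbo, 1GHz over stock
c6_off = scale_time(es_time, 2.0, 1.8)   # ES with C6 off, stuck at 1.8GHz

print(f"3.2GHz, no Turbo: {stock:.2f}s")
print(f"4.2GHz Turbo:     {turbo:.2f}s")
print(f"ES at 1.8GHz:     {c6_off:.2f}s")
```

The last line is a cross-check of the theory itself: slowing the 23.3s run from 2.0GHz to 1.8GHz gives roughly 25.9s, which is close to the time reported with C6 disabled.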

Next one is Fritz chess. This is a tricky one. The 1-core score from here is 1877pts, with C6 enabled and turbo limited to 2GHz. The user runs the MT test with 8 cores and gets a 9454pts result. How is this possible? Well, in my opinion, the TDP limit kicked in again, limiting each core to 1.6GHz while the multithreaded (MT) test was run. We know that module scaling is 80% of a hypothetical native dual-core Bulldozer design (as per AMD themselves), meaning a 6.4x factor instead of a perfect 8x => 0.8 x 8 = 6.4. We have: 1877 x (1.6/2) x 6.4 = 9610pts, close enough, huh? The error is under 2% from the actual score ;).
What do I think the score of a 3.2GHz Zambezi 8C will be in Fritz chess? 19220pts, give or take 2%.
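
The Fritz estimate can be written out step by step. This is a sketch of my reasoning only, in Python, with the 1.6GHz multithreaded clock and the 80% module scaling taken as assumptions from the text above. The variable names are made up for illustration.

```python
single_core_score = 1877   # 1-core Fritz score, ES assumed to turbo to 2.0GHz
es_turbo_ghz = 2.0         # assumed real clock in the 1-core run
es_mt_ghz = 1.6            # assumed real clock under full 8-core load
cores = 8
module_scaling = 0.8       # AMD's quoted module scaling vs. a native dual core
actual_mt_score = 9454     # what the ES actually scored with 8 cores

# Scale the 1-core score down to the MT clock, then up by the effective core count.
effective_cores = cores * module_scaling
predicted_mt = single_core_score * (es_mt_ghz / es_turbo_ghz) * effective_cores
error_pct = (predicted_mt - actual_mt_score) / actual_mt_score * 100.0

# Projection for a retail part holding 3.2GHz on all cores.
retail_mt = predicted_mt * 3.2 / es_mt_ghz

print(f"Predicted ES MT score: {predicted_mt:.0f}pts ({error_pct:.1f}% off)")
print(f"Projected 3.2GHz score: {retail_mt:.0f}pts")
```

The projected retail figure is simply the predicted ES score doubled, since 3.2GHz is twice the assumed 1.6GHz throttled clock.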

Next one is the popular Cinebench 11.5. The "gimped" BD ES scored just 4.6pts. Too low? You bet. This is in line with a Phenom X4 @ 3.5GHz, while this was supposed to be a brand new octal core from AMD running at close to 3.2GHz. Well, the explanation is easy and is again, as in the previous case, power capping.
So we have a supposedly 2.8GHz part scoring 4.6pts in C11.5. As my theory goes, this is actually the score of a 1.4GHz or 1.6GHz 8C Zambezi which is limited via the power cap (I don't know which clock they set in the BIOS, 2.8GHz or 3.2GHz). What will be the score of a retail 3.2GHz Zambezi in this benchmark? My estimate is as follows: 1) worst case scenario: 4.6pts x 3.2/1.6 = 9.2pts; 2) best case scenario: 4.6 x 3.2/1.4 = 10.51pts. Now 10.51pts may sound too high, since the 980X scores 9.2pts, but remember this slide?



This leaked AMD slide, published by Donanimhaber, states that Zambezi 8C @ XX GHz will be approximately 1.80x faster than the Thuban X6 1100T in C11.5, which scores 5.9pts. That works out to 5.9 x 1.8 = 10.6pts. Close enough?
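
To put the Cinebench numbers side by side, here is a short Python sketch comparing my worst and best case estimates with the figure implied by the slide. It assumes the score scales linearly with clock and that the ES really ran at 1.6GHz or 1.4GHz. Both are my assumptions, not confirmed facts.

```python
es_score = 4.6       # C11.5 score of the power-capped ES
retail_ghz = 3.2     # assumed retail clock

worst_case = es_score * retail_ghz / 1.6   # ES really ran at 1.6GHz
best_case = es_score * retail_ghz / 1.4    # ES really ran at 1.4GHz

thuban_score = 5.9   # X6 1100T in C11.5
slide_ratio = 1.8    # "approximately 1.80x faster" per the leaked slide
slide_projection = thuban_score * slide_ratio

print(f"Worst case:       {worst_case:.2f}pts")
print(f"Best case:        {best_case:.2f}pts")
print(f"Slide projection: {slide_projection:.2f}pts")
```

The slide projection lands within about 1% of the best case estimate, which is why I lean towards the 1.4GHz reading.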

Following C11.5 is the famous 3DMark06. The score is also on the same link as all of the above. The "3.2GHz" Zambezi 8C supposedly scores ~4500pts... Yeah, that's right, a score just a bit better than what a Phenom X4 @ 3GHz, launched in Jan 2009, is getting. So we supposedly have a brand new core design, packing 8 cores in total, with IPC improvements, that still somehow sucks so badly that it is 2x slower per core and per clock than the ancient Phenom II X4 (just forget the X6, it's miles ahead of this poor Zambezi).
So what do I think is the real score of Zambezi in this benchmark? It should be between 7800 and 8800pts. The point is that this is one weird test that doesn't scale nicely past 6 cores, so it's a bit trickier to figure out what a "normal" Zambezi will score here. The Donanimhaber slide indicates a 50% better score for a 3+GHz (I assume) X8 vs the 1100T, which scores 5900pts. So we have, projected by AMD: 1.5 x 5900 = 8850pts, and estimated here by me: 7800-8800pts.
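
As a rough check, the sketch below (Python, names are mine) computes the slide-based projection and the speedup over the ES score that my 7800-8800pts range would imply. It is just arithmetic on the figures quoted in this post, not a model of how 3DMark06 actually scales.

```python
thuban_score = 5900        # X6 1100T in 3DMark06, as quoted above
slide_ratio = 1.5          # "50% better" per the leaked slide
amd_projection = thuban_score * slide_ratio

es_score = 4500            # approximate score of the "3.2GHz" ES
my_low, my_high = 7800, 8800

# How much faster than the ES a retail chip would need to be for my range.
implied_low = my_low / es_score
implied_high = my_high / es_score

print(f"Slide projection: {amd_projection:.0f}pts")
print(f"Implied speedup over the ES: {implied_low:.2f}x to {implied_high:.2f}x")
```

A speedup somewhat below 2x over the ES would fit the halved-clock theory together with a test that scales poorly past 6 cores.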


So there you go, I did my best to figure out what is going on with those gimped Bulldozer ES out in the (Chinese) wilderness. Not much longer to go, around 45 days or so. We shall soon know how wrong I was. Stay tuned for more.