C vs ASM

Discussion of chess software programming and technical issues.

Moderator: Ras

Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

hgm wrote: Well, it seems that results are more random than anything else.
Indeed.
What is the theoretical optimum for this task anyway? That is, to how many clock cycles does this 22 (or 17) sec translate? And how many times is the loop executed? Can you put a counter in there to measure that once and for all?
I could do that, but I already moved on to something with more code to optimize; this bubble sort has too little. I noticed that removing the swap code for key2 | byte1 | byte2 gives the same performance: adding or removing it makes no difference. Amazing, huh? But it confirms what I quoted earlier:
Execution units
Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALU’s (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle.
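As a minimal illustration of the quoted point (a sketch, not from the quoted manual): summing an array through a single accumulator forms one serial dependency chain, while splitting the sum over two independent accumulators lets two ALUs each retire an addition per cycle.

Code: Select all

```c
#include <stddef.h>

/* Illustrative sketch: each add below depends on the previous one,
   so the additions must execute strictly one after another. */
unsigned long sum_one_chain(const unsigned long *a, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                  /* serial dependency chain */
    return s;
}

/* The two accumulators are independent of each other, so a CPU with
   two integer ALUs can issue both adds in the same clock cycle. */
unsigned long sum_two_chains(const unsigned long *a, size_t n)
{
    unsigned long s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                 /* independent of ...          */
        s1 += a[i + 1];             /* ... this add                */
    }
    if (i < n) s0 += a[i];          /* odd tail element            */
    return s0 + s1;
}
```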
Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
Perhaps chapter 9.6 is of some help.

http://www.agner.org/optimize/optimizing_assembly.pdf
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: C vs ASM

Post by rbarreira »

Compilers will only be as good at optimizing as an expert ASM programmer when:

- compilers are able to perform advanced whole-program analysis in order to extract constraints on variables and program flow (this requires quite a high level of intelligence)

- compilers are able to perform a series of benchmarks with different assembly translations of the same code to determine which one is the fastest (profile-guided optimization goes a bit in this direction but doesn't really accomplish the required test-reimplement-test feedback loop)
Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

Moving to quicksort since it has more code to optimize. Given enough memory we are going to sort 100 million random 32-bit numbers. So far I have:

Code: Select all

Digital Mars
Free memory = 1535 Mb
Allocated space for 153 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 23681

Code: Select all

GCC 4.6.1 (32-bit)
Free memory = 1861 Mb
Allocated space for 186 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 18762

Code: Select all

GCC 4.6.1 (64-bit)
Free memory = 3072 Mb
Allocated space for 307 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 14821
Here is the program; the ASM version of quicksort will take me some time to write.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <time.h>       // for clock()

int MAX_ENTRIES;

#define QSORT           0           // 0=C | 1=ASM
#define DEBUG           0
#define ITEMS           100000000   // items to sort (100 million)

struct hash
  { unsigned long key1;
    unsigned long key2;
    unsigned char byte1;
    unsigned char byte2; };
struct hash *cht;

static int time1,time2;

//      Create random 32 bit keys
//      =========================

static unsigned long x_random[55] = {
 1410651636UL, 3012776752UL, 3497475623UL, 2892145026UL, 1571949714UL,
 3253082284UL, 3489895018UL, 387949491UL,  2597396737UL, 1981903553UL,
 3160251843UL, 129444464UL,  1851443344UL, 4156445905UL, 224604922UL,
 1455067070UL, 3953493484UL, 1460937157UL, 2528362617UL, 317430674UL,
 3229354360UL, 117491133UL,  832845075UL,  1961600170UL, 1321557429UL,
 747750121UL,  545747446UL,  810476036UL,  503334515UL,  4088144633UL,
 2824216555UL, 3738252341UL, 3493754131UL, 3672533954UL, 29494241UL,
 1180928407UL, 4213624418UL, 33062851UL,   3221315737UL, 1145213552UL,
 2957984897UL, 4078668503UL, 2262661702UL, 65478801UL,   2527208841UL,
 1960622036UL, 315685891UL,  1196037864UL, 804614524UL,  1421733266UL,
 2017105031UL, 3882325900UL, 810735053UL,  384606609UL,  2393861397UL };

static int init_random = 1;
static unsigned long y_random[55];
static int j_random, k_random;
static unsigned long ul_random;

void zobrist(void);
void quicksort(void);

int main (int argc,char *argv[])

{       int x;

//      Get free memory
//      ===============
        static size_t ms;
        static char *p_memory,*bgn;
        for (ms = (size_t)(3ull*1024*1048576); ms >= 2000; ms -= 100000)        // search down from 3 Gb
         { p_memory = (char*)malloc(ms); if (p_memory) break; }
        printf("Free memory = %d Mb\n\n",(int)(ms>>20));
        free(p_memory);
        p_memory=(char*)malloc(ms);
        cht=(struct hash *)p_memory;
        bgn=p_memory;
        p_memory=p_memory+ms;
        MAX_ENTRIES=(int)((p_memory-bgn)/10);                       // assumes sizeof(struct hash) == 10 (packed)
        MAX_ENTRIES=MAX_ENTRIES-100;                                // create some slack space
        printf("Allocated space for %d Mb entries\n\n",MAX_ENTRIES>>20);
        if (MAX_ENTRIES < ITEMS) { printf("Not enough memory\n\n"); exit(1); }

        MAX_ENTRIES=ITEMS;
        printf("Creating %d Zobrist keys\n\n",MAX_ENTRIES);

        zobrist();                       // make 100 million random 32-bit keys
        printf("Sorting now...  ");

        time1=clock();
        quicksort();
        time2=clock();

        printf("Done in %d\n\n",time2-time1);

        #if DEBUG
        for (x=0; x<20; x++) printf("%8x\t %8x\t (%2x) (%2x)\n",
        cht[x].key1,cht[x].key2,cht[x].byte1,cht[x].byte2);
        #endif
}

void zobrist()

//      Create random 32-bit integers
//      =============================

{       int x;

        if (init_random)
         { int i;

           init_random = 0;
           for (i = 0; i < 55; i++) y_random[i] = x_random[i];
           j_random = 24 - 1;
           k_random = 55 - 1; }

        for (x=0; x<MAX_ENTRIES; x++)
         { ul_random = (y_random[k_random] += y_random[j_random]);
           if (--j_random < 0) j_random = 55 - 1;
           if (--k_random < 0) k_random = 55 - 1;
           cht[x].key1=0;
           cht[x].key1=cht[x].key1 ^ ul_random;      // key1
           cht[x].key2=cht[x].key1;                  // key2 (copy to check proper swapping)
           cht[x].byte1=cht[x].key1;                 // byte1 (lower 8 bits to check proper swapping)
           cht[x].byte2=cht[x].key1>>24;             // byte2 (upper 8 bits to check proper swapping)
           }

        #if DEBUG
        for (x=0; x<20; x++) printf("%8x\t %8x\t (%2x) (%2x)\n",
         cht[x].key1,cht[x].key2,cht[x].byte1,cht[x].byte2);
        printf("\n");
        #endif
}

void quicksort()

{       unsigned int h,zz; unsigned char ch;
        int t1,l,i,j,r;
        int st1[3000],st2[3000];

        t1=1; st1[1]=0; st2[1]=MAX_ENTRIES-1;

lab8:   l=st1[t1]; r=st2[t1]; t1=t1-1;

lab9:   i=l; j=r;
        h=cht[(l+r)>>1].key1;                        // divide by 2

lab10:  if (cht[i].key1>=h  ||  i >= r) goto lab20;
        i=i+1; goto lab10;

lab20:  if (cht[j].key1<=h  ||  j <= l) goto lab30;
        j=j-1; goto lab20;

lab30:  if (i > j)  goto lab40;
        zz=cht[i].key1;   cht[i].key1=cht[j].key1;    cht[j].key1=zz;
        zz=cht[i].key2;   cht[i].key2=cht[j].key2;    cht[j].key2=zz;
        ch=cht[i].byte1;  cht[i].byte1=cht[j].byte1;  cht[j].byte1=ch;
        ch=cht[i].byte2;  cht[i].byte2=cht[j].byte2;  cht[j].byte2=ch;

        i=i+1; j=j-1;

lab40:  if (i <= j) goto lab10;

        if (r-i <= j-l) goto lab60;

        if (l >= j) goto lab50;
        t1=t1+1; st1[t1]=l; st2[t1]=j;

lab50:  l=i; goto lab80;

lab60:  if (i >= r) goto lab70;

        t1=t1+1; st1[t1]=i; st2[t1]=r;

lab70:  r=j;

lab80:  if (r > l)  goto lab9;

        if (t1) goto lab8;

}
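A sketch of the same middle-pivot quicksort in structured form (illustrative only, not part of Ed's post): ordinary loops and recursion express the identical partitioning scheme, and compilers recognize and align such loops more readily than goto-based control flow.

Code: Select all

```c
#include <stddef.h>

/* Struct layout mirrors the one in the post. */
struct hash {
    unsigned long key1, key2;
    unsigned char byte1, byte2;
};

static void swap_entries(struct hash *a, struct hash *b)
{
    struct hash t = *a; *a = *b; *b = t;
}

/* Hoare-style partition around the middle element, recursing on the
   smaller half and iterating on the larger to bound stack depth. */
static void qsort_keys(struct hash *cht, int l, int r)
{
    while (l < r) {
        unsigned long h = cht[(l + r) >> 1].key1;   /* pivot: middle element */
        int i = l, j = r;
        while (i <= j) {
            while (cht[i].key1 < h) i++;            /* scan up              */
            while (cht[j].key1 > h) j--;            /* scan down            */
            if (i <= j) { swap_entries(&cht[i], &cht[j]); i++; j--; }
        }
        if (j - l < r - i) { qsort_keys(cht, l, j); l = i; }
        else               { qsort_keys(cht, i, r); r = j; }
    }
}
```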
Joost Buijs
Posts: 1641
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

Out of curiosity I also ran this one through the Intel compiler.

32 bit:

Code: Select all

Free memory = 1758 Mb

Allocated space for 175 Mb entries

Creating 100000000 Zobrist keys

Sorting now...  Done in 11584
64 bit:

Code: Select all

Free memory = 3072 Mb

Allocated space for 307 Mb entries

Creating 100000000 Zobrist keys

Sorting now...  Done in 10343
The routine that determines free memory does not work properly for a 64-bit build; you probably have to change some variables to 64-bit types. Of course it doesn't matter if you don't need more than 4 GB.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: C vs ASM

Post by Gerd Isenberg »

Rebel wrote:
rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
Perhaps chapter 9.6 is of some help.

http://www.agner.org/optimize/optimizing_assembly.pdf
Playing with these modern processors seems like applied chaos theory, where tiny changes in code and data may yield huge butterfly effects, especially with small test programs.

If I only read chapter 2.2.4 Advanced Memory Access in the Intel optimization guide on speculative loads and stores, DPL, store forwarding, the memory disambiguator with dependency prediction, store retirement competing for cache banks with executing loads, maintained as a background task by the memory order buffer ...

http://www.intel.com/content/www/us/en/ ... anual.html

Happy optimizing ;-)
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I don't know if this is a good idea. By moving to a 400 MB data set you have turned this into a problem that will be totally dominated by DRAM access time. Most likely I would be able to write C code that outperforms your best ASM quicksort by at least an order of magnitude just by optimizing caching. Code quality will have zero effect on speed.
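One concrete way to read that claim (an illustrative sketch, not hgm's actual code, and operating on bare 32-bit keys rather than the 10-byte structs): an LSD radix sort touches memory in four strictly sequential passes, an access pattern that DRAM and the hardware prefetchers handle far better than quicksort's data-dependent jumps.

Code: Select all

```c
#include <stdlib.h>
#include <string.h>

/* LSD radix sort over 32-bit keys, one byte per pass.  Every pass is a
   sequential sweep: histogram, prefix sums, stable scatter, copy back. */
static void radix_sort_u32(unsigned int *a, size_t n)
{
    unsigned int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)          /* histogram this byte   */
            count[(a[i] >> shift) & 0xFF]++;
        size_t pos = 0;                         /* prefix sums -> offsets */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)          /* stable scatter pass   */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```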

This reminds me of the occasion when Dan Corbit tried to speed up my tablebase generator by PGO compiling. He was dismayed that the initial profiling showed that 99% of the execution time was in the single statement dtm=EGT[index]; Not much to optimize on that, and optimizing the other 200 statements could at most save him 1%. :lol:

(Of course it would not really have saved anything, even if he could have shrunk its execution time to zero, as the time saved would simply have been added to the time taken by that single statement. It would only have been 'waiting faster'.)
ZirconiumX
Posts: 1360
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: C vs ASM

Post by ZirconiumX »

The test program seems to hang at 100% CPU load.

Code: Select all

gcc -O3 -fwhole-program -fprofile-generate -s -pipe -o test test.c
Matthew:out
tu ne cede malis, sed contra audentior ito
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I read up a bit on what is new in the Sandy Bridge micro-architecture, and it seems to me that decoding of small loops like the ones we are dealing with here should no longer be an issue at all. Sandy Bridge has a uOp cache for 6K uOps, which should be able to supply the (already decoded) uOps for the entire duration of the loop. In fact the loops are so small that the 28-uOp queue should be able to hold one entirely and act like a Loop Stream Detector, re-supplying the same uOps to the back-end over and over again.
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

Btw, what a weird bubble sort you wrote. On my machine (i7-2600) the C code took 13 sec. I could not compile the assembly, but by adding one line of C code I could accelerate it to 0.031 sec. I am sure I could do even better, because r1 is always r2+1 (even after my change), so there would have been no need to keep both. But a factor of 419 seemed enough for a day's work. :lol:

Code: Select all

void bubbles()

{       unsigned int zz; unsigned char ch; int r1,r2;

sort05: r1=0; r2=-1;
sort10: r1++; r2++;
        if (r1>=MAX_ENTRIES) return;
//cnt1++;
        if (key1[r2] <= key1[r1]) goto sort10;
        zz=key1[r1]; key1[r1]=key1[r2]; key1[r2]=zz;    // swap
        zz=key2[r1]; key2[r1]=key2[r2]; key2[r2]=zz;
        ch=byte1[r1]; byte1[r1]=byte1[r2]; byte1[r2]=ch;
        ch=byte2[r1]; byte2[r1]=byte2[r2]; byte2[r2]=ch;
//cnt2++;
        r2-=2; if(r1-=2) goto sort10;                   // [HGM] accelerating patch
        goto sort05;                                    // again
}
The problem is that you run through the entire list comparing entries to find the first reversed pair. Swapping those of course leaves all preceding entries untouched, but you run through them again nevertheless, while it would have been sufficient to step back one position.
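The step-back fix hgm describes can be sketched like this (illustrative only; essentially gnome sort, on a single key array instead of the four parallel arrays of the post):

Code: Select all

```c
/* After swapping a reversed pair, step back one position instead of
   restarting the scan from the beginning of the array. */
void bubbles_stepback(unsigned int *key1, int max_entries)
{
    int r1 = 1;
    while (r1 < max_entries) {
        if (r1 == 0 || key1[r1 - 1] <= key1[r1]) {
            r1++;                          /* in order: move forward  */
        } else {
            unsigned int zz = key1[r1];    /* swap the reversed pair  */
            key1[r1] = key1[r1 - 1];
            key1[r1 - 1] = zz;
            r1--;                          /* ...and step back one    */
        }
    }
}
```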

So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.