C vs ASM

Discussion of chess software programming and technical issues.

Moderator: Ras

Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

hgm wrote: Well, it seems that results are more random than anything else.
Indeed.
What is the theoretical optimum for this task anyway? That is, to how many clock cycles does this 22 (or 17) sec translate? And how many times is the loop executed? Can you put a counter in there to measure that once and for all?
I could do that, but I already moved on to something with more code to optimize; this bubble sort has too little. I noticed that removing the swap code for key2 | byte1 | byte2 gives the same performance: adding or removing it makes no difference. Amazing, huh? But it confirms what I quoted earlier:
Execution units
Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALU’s (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle.
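As a minimal illustration of the quoted point (a sketch, not from the quoted manual): summing an array through a single accumulator forms one serial dependency chain, while splitting the sum over two independent accumulators lets two ALUs each retire an addition per cycle.

Code: Select all

```c
#include <stddef.h>

/* Illustrative sketch: each add below depends on the previous one,
   so the additions must execute strictly one after another. */
unsigned long sum_one_chain(const unsigned long *a, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                  /* serial dependency chain */
    return s;
}

/* The two accumulators are independent of each other, so a CPU with
   two integer ALUs can issue both adds in the same clock cycle. */
unsigned long sum_two_chains(const unsigned long *a, size_t n)
{
    unsigned long s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                 /* independent of ...          */
        s1 += a[i + 1];             /* ... this add                */
    }
    if (i < n) s0 += a[i];          /* odd tail element            */
    return s0 + s1;
}
```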
Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
Perhaps chapter 9.6 is of some help.

http://www.agner.org/optimize/optimizing_assembly.pdf
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: C vs ASM

Post by rbarreira »

Compilers will only be as good at optimizing as an expert ASM programmer when:

- compilers are able to perform advanced whole-program analysis in order to extract constraints on variables and program flow (this requires quite a high level of intelligence)

- compilers are able to perform a series of benchmarks with different assembly translations of the same code to determine which one is the fastest (profile-guided optimization goes a bit in this direction but doesn't really accomplish the required test-reimplement-test feedback loop)
Rebel
Posts: 7382
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: C vs ASM

Post by Rebel »

Moving to quicksort since it has more code to optimize. Given enough memory we are going to sort 100 million random 32-bit numbers. So far I have:

Code: Select all

Digital Mars
Free memory = 1535 Mb
Allocated space for 153 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 23681

Code: Select all

GCC 4.6.1 (32-bit)
Free memory = 1861 Mb
Allocated space for 186 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 18762

Code: Select all

GCC 4.6.1 (64-bit)
Free memory = 3072 Mb
Allocated space for 307 Mb entries
Creating 100000000 Zobrist keys
Sorting now...  Done in 14821
Here is the program; the ASM version of quicksort will take me some time to write.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <time.h>       // for clock()

int MAX_ENTRIES;

#define QSORT           0           // 0=C | 1=ASM
#define DEBUG           0
#define ITEMS           100000000   // items to sort (100 million)

struct hash
  { unsigned long key1;
    unsigned long key2;
    unsigned char byte1;
    unsigned char byte2; };
struct hash *cht;

static int time1,time2;

//      Create random 32 bit keys
//      =========================

static unsigned long x_random[55] = {
 1410651636UL, 3012776752UL, 3497475623UL, 2892145026UL, 1571949714UL,
 3253082284UL, 3489895018UL, 387949491UL,  2597396737UL, 1981903553UL,
 3160251843UL, 129444464UL,  1851443344UL, 4156445905UL, 224604922UL,
 1455067070UL, 3953493484UL, 1460937157UL, 2528362617UL, 317430674UL,
 3229354360UL, 117491133UL,  832845075UL,  1961600170UL, 1321557429UL,
 747750121UL,  545747446UL,  810476036UL,  503334515UL,  4088144633UL,
 2824216555UL, 3738252341UL, 3493754131UL, 3672533954UL, 29494241UL,
 1180928407UL, 4213624418UL, 33062851UL,   3221315737UL, 1145213552UL,
 2957984897UL, 4078668503UL, 2262661702UL, 65478801UL,   2527208841UL,
 1960622036UL, 315685891UL,  1196037864UL, 804614524UL,  1421733266UL,
 2017105031UL, 3882325900UL, 810735053UL,  384606609UL,  2393861397UL };

static int init_random = 1;
static unsigned long y_random[55];
static int j_random, k_random;
static unsigned long ul_random;

void zobrist(void);
void quicksort(void);

int main (int argc,char *argv[])

{       int x;

//      Get free memory
//      ===============
        static size_t ms;
        static char *p_memory,*bgn;
        for (ms = (size_t)(3ull*1024*1048576); ms >= 2000; ms -= 100000)        // search down from 3 Gb
         { p_memory = (char*)malloc(ms); if (p_memory) break; }
        printf("Free memory = %d Mb\n\n",(int)(ms>>20));
        free(p_memory);
        p_memory=(char*)malloc(ms);
        cht=(struct hash *)p_memory;
        bgn=p_memory;
        p_memory=p_memory+ms;
        MAX_ENTRIES=(int)((p_memory-bgn)/10);                       // assumes sizeof(struct hash) == 10 (packed)
        MAX_ENTRIES=MAX_ENTRIES-100;                                // create some slack space
        printf("Allocated space for %d Mb entries\n\n",MAX_ENTRIES>>20);
        if (MAX_ENTRIES < ITEMS) { printf("Not enough memory\n\n"); exit(1); }

        MAX_ENTRIES=ITEMS;
        printf("Creating %d Zobrist keys\n\n",MAX_ENTRIES);

        zobrist();                       // make 100 million random 32-bit keys
        printf("Sorting now...  ");

        time1=clock();
        quicksort();
        time2=clock();

        printf("Done in %d\n\n",time2-time1);

        #if DEBUG
        for (x=0; x<20; x++) printf("%8x\t %8x\t (%2x) (%2x)\n",
        cht[x].key1,cht[x].key2,cht[x].byte1,cht[x].byte2);
        #endif
}

void zobrist()

//      Create random 32-bit integers
//      =============================

{       int x;

        if (init_random)
         { int i;

           init_random = 0;
           for (i = 0; i < 55; i++) y_random[i] = x_random[i];
           j_random = 24 - 1;
           k_random = 55 - 1; }

        for (x=0; x<MAX_ENTRIES; x++)
         { ul_random = (y_random[k_random] += y_random[j_random]);
           if (--j_random < 0) j_random = 55 - 1;
           if (--k_random < 0) k_random = 55 - 1;
           cht[x].key1=0;
           cht[x].key1=cht[x].key1 ^ ul_random;      // key1
           cht[x].key2=cht[x].key1;                  // key2 (copy to check proper swapping)
           cht[x].byte1=cht[x].key1;                 // byte1 (lower 8 bits to check proper swapping)
           cht[x].byte2=cht[x].key1>>24;             // byte2 (upper 8 bits to check proper swapping)
           }

        #if DEBUG
        for (x=0; x<20; x++) printf("%8x\t %8x\t (%2x) (%2x)\n",
         cht[x].key1,cht[x].key2,cht[x].byte1,cht[x].byte2);
        printf("\n");
        #endif
}

void quicksort()

{       unsigned int h,zz; unsigned char ch;
        int t1,l,i,j,r;
        int st1[3000],st2[3000];

        t1=1; st1[1]=0; st2[1]=MAX_ENTRIES-1;

lab8:   l=st1[t1]; r=st2[t1]; t1=t1-1;

lab9:   i=l; j=r;
        h=cht[(l+r)>>1].key1;                        // divide by 2

lab10:  if (cht[i].key1>=h  ||  i >= r) goto lab20;
        i=i+1; goto lab10;

lab20:  if (cht[j].key1<=h  ||  j <= l) goto lab30;
        j=j-1; goto lab20;

lab30:  if (i > j)  goto lab40;
        zz=cht[i].key1;   cht[i].key1=cht[j].key1;    cht[j].key1=zz;
        zz=cht[i].key2;   cht[i].key2=cht[j].key2;    cht[j].key2=zz;
        ch=cht[i].byte1;  cht[i].byte1=cht[j].byte1;  cht[j].byte1=ch;
        ch=cht[i].byte2;  cht[i].byte2=cht[j].byte2;  cht[j].byte2=ch;

        i=i+1; j=j-1;

lab40:  if (i <= j) goto lab10;

        if (r-i <= j-l) goto lab60;

        if (l >= j) goto lab50;
        t1=t1+1; st1[t1]=l; st2[t1]=j;

lab50:  l=i; goto lab80;

lab60:  if (i >= r) goto lab70;

        t1=t1+1; st1[t1]=i; st2[t1]=r;

lab70:  r=j;

lab80:  if (r > l)  goto lab9;

        if (t1) goto lab8;

}
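A sketch of the same middle-pivot quicksort in structured form (illustrative only, not part of Ed's post): ordinary loops and recursion express the identical partitioning scheme, and compilers recognize and align such loops more readily than goto-based control flow.

Code: Select all

```c
#include <stddef.h>

/* Struct layout mirrors the one in the post. */
struct hash {
    unsigned long key1, key2;
    unsigned char byte1, byte2;
};

static void swap_entries(struct hash *a, struct hash *b)
{
    struct hash t = *a; *a = *b; *b = t;
}

/* Hoare-style partition around the middle element, recursing on the
   smaller half and iterating on the larger to bound stack depth. */
static void qsort_keys(struct hash *cht, int l, int r)
{
    while (l < r) {
        unsigned long h = cht[(l + r) >> 1].key1;   /* pivot: middle element */
        int i = l, j = r;
        while (i <= j) {
            while (cht[i].key1 < h) i++;            /* scan up              */
            while (cht[j].key1 > h) j--;            /* scan down            */
            if (i <= j) { swap_entries(&cht[i], &cht[j]); i++; j--; }
        }
        if (j - l < r - i) { qsort_keys(cht, l, j); l = i; }
        else               { qsort_keys(cht, i, r); r = j; }
    }
}
```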
Joost Buijs
Posts: 1641
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

Out of curiosity I also ran this one through the Intel compiler.

32 bit:

Code: Select all

Free memory = 1758 Mb

Allocated space for 175 Mb entries

Creating 100000000 Zobrist keys

Sorting now...  Done in 11584
64 bit:

Code: Select all

Free memory = 3072 Mb

Allocated space for 307 Mb entries

Creating 100000000 Zobrist keys

Sorting now...  Done in 10343
The routine that determines free memory does not work properly for a 64-bit build; you probably have to change some variables to 64-bit types. Of course it doesn't matter if you don't need more than 4 GB.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: C vs ASM

Post by Gerd Isenberg »

Rebel wrote:
rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
Perhaps chapter 9.6 is of some help.

http://www.agner.org/optimize/optimizing_assembly.pdf
Playing with these modern processors seems like applied chaos theory, where tiny changes in code and data may yield huge butterfly effects, especially with small test programs.

If I only read chapter 2.2.4 Advanced Memory Access in the Intel optimization guide on speculative loads and stores, DPL, store forwarding, the memory disambiguator with dependency prediction, store retirement competing for cache banks with executing loads, maintained as a background task by the memory order buffer ...

http://www.intel.com/content/www/us/en/ ... anual.html

Happy optimizing ;-)
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I don't know if this is a good idea. By moving to a 400 MB data set you have turned this into a problem that will be totally dominated by DRAM access time. Most likely I would be able to write C code that outperforms your best ASM quicksort by at least an order of magnitude just by optimizing caching. Code quality will have zero effect on speed.
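One concrete way to read that claim (an illustrative sketch, not hgm's actual code, and operating on bare 32-bit keys rather than the 10-byte structs): an LSD radix sort touches memory in four strictly sequential passes, an access pattern that DRAM and the hardware prefetchers handle far better than quicksort's data-dependent jumps.

Code: Select all

```c
#include <stdlib.h>
#include <string.h>

/* LSD radix sort over 32-bit keys, one byte per pass.  Every pass is a
   sequential sweep: histogram, prefix sums, stable scatter, copy back. */
static void radix_sort_u32(unsigned int *a, size_t n)
{
    unsigned int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)          /* histogram this byte   */
            count[(a[i] >> shift) & 0xFF]++;
        size_t pos = 0;                         /* prefix sums -> offsets */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)          /* stable scatter pass   */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```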

This reminds me of the occasion when Dan Corbit tried to speed up my tablebase generator by PGO compiling. He was dismayed that the initial profiling showed that 99% of the execution time was in the single statement dtm=EGT[index]; Not much to optimize on that, and optimizing the other 200 statements could at most save him 1%. :lol:

(Of course it would not really have saved anything, even if he could have shrunk its execution time to zero, as the time saved would simply have been added to the time taken by that single statement. It would only have been 'waiting faster'.)
ZirconiumX
Posts: 1360
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: C vs ASM

Post by ZirconiumX »

The test program seems to hang at 100% CPU load.

Code: Select all

gcc -O3 -fwhole-program -fprofile-generate -s -pipe -o test test.c
Matthew:out
tu ne cede malis, sed contra audentior ito
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I read up a bit on what is new in the Sandy Bridge micro-architecture, and it seems to me that decoding of small loops like the ones we are dealing with here should no longer be an issue at all. Sandy Bridge has a uOp cache for 6K uOps, which should be able to supply the (already decoded) uOps for the entire duration of the loop. In fact the loops are so small that the 28-uOp queue should be able to hold one entirely and act like a Loop Stream Detector, re-supplying the same uOps to the back-end over and over again.
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

Btw, what a weird bubble sort you wrote. On my machine (i7-2600) the C code took 13 sec. I could not compile the assembly, but by adding one line of C code I could accelerate it to 0.031 sec. I am sure I could do even better, because r1 is always r2+1 (even after my change), so there would have been no need to keep both. But a factor of 419 seemed enough for a day's work. :lol:

Code: Select all

void bubbles()

{       unsigned int zz; unsigned char ch; int r1,r2;

sort05: r1=0; r2=-1;
sort10: r1++; r2++;
        if (r1>=MAX_ENTRIES) return;
//cnt1++;
        if (key1[r2] <= key1[r1]) goto sort10;
        zz=key1[r1]; key1[r1]=key1[r2]; key1[r2]=zz;    // swap
        zz=key2[r1]; key2[r1]=key2[r2]; key2[r2]=zz;
        ch=byte1[r1]; byte1[r1]=byte1[r2]; byte1[r2]=ch;
        ch=byte2[r1]; byte2[r1]=byte2[r2]; byte2[r2]=ch;
//cnt2++;
        r2-=2; if(r1-=2) goto sort10;                   // [HGM] accelerating patch
        goto sort05;                                    // again
}
The problem is that you run through the entire list comparing entries to find the first reversed pair. Swapping those of course leaves all preceding entries untouched, but you run through them again nevertheless, while it would have been sufficient to step back one position.
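The step-back fix hgm describes can be sketched like this (illustrative only; essentially gnome sort, on a single key array instead of the four parallel arrays of the post):

Code: Select all

```c
/* After swapping a reversed pair, step back one position instead of
   restarting the scan from the beginning of the array. */
void bubbles_stepback(unsigned int *key1, int max_entries)
{
    int r1 = 1;
    while (r1 < max_entries) {
        if (r1 == 0 || key1[r1 - 1] <= key1[r1]) {
            r1++;                          /* in order: move forward  */
        } else {
            unsigned int zz = key1[r1];    /* swap the reversed pair  */
            key1[r1] = key1[r1 - 1];
            key1[r1 - 1] = zz;
            r1--;                          /* ...and step back one    */
        }
    }
}
```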

So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.