speed up or avoiding move sorting

Ras · Post by **Ras** » Mon Mar 20, 2017 1:59 am

The key idea is to eliminate the outest of the 3-fold loops and get rid of all the involved branching by rolling out the inner two loops. The next idea is to avoid the outest loop if less than three runs of the outest loop would be involved. So the whole thing is not exactly equivalent to the regular shellsort.

In lots of testing games, the number of raw moves per position was never more than 93, so the highest useful Ciura number is 57.

Here is a benchmark for x86 command line; I'm using Cygwin under Windows, so it should work as well under Linux. I'm using GCC with the "likely/unlikely" defines copied from the Linux kernel. The "dummy" output is just to keep the compiler from optimising away the benchmark.

I've included the "regular" shellsort implementation so that comparing the differences to the jump table implementation is easy. In fact, the inner two loops are copy-paste.

Code: Select all

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define LIKELY&#40;x&#41;   __builtin_expect&#40;!!&#40;x&#41;,1&#41;
#define UNLIKELY&#40;x&#41; __builtin_expect&#40;!!&#40;x&#41;,0&#41;

#define MV_PER_POS  60l
#define NUM_POS     8000000l

#define GAP_4       57
#define GAP_3       23
#define GAP_2       10
#define GAP_1       4
#define GAP_0       1

struct mvdata &#123;
  uint8_t flag;
  uint8_t from;
  uint8_t to;
  int8_t mvv_lva;
&#125;;

typedef union &#123;
  struct mvdata m;
  uint32_t u;
&#125; MOVE;

double timediff_sec&#40;struct timeval t0, struct timeval t1&#41;
&#123;
    return (&#40;t1.tv_sec - t0.tv_sec&#41; + &#40;t1.tv_usec - t0.tv_usec&#41; / 1000000.0&#41;;
&#125;

static void Shellsort_Opt&#40;MOVE *restrict movelist, int N&#41;
&#123;
    /*note&#58; the jump table does not exactly match the shell sort gaps.
    it does not make sense e.g. with N=11 to run through the loop
    for gap=10 because the loop will run only once, that is not worth
    the overhead.*/
    static void *jump_table&#91;&#93; = &#123;
    /*N = 0..1*/
    &&shellsort_no_action, &&shellsort_no_action,
    /*N = 2..6*/
    &&shellsort_gap_0, &&shellsort_gap_0, &&shellsort_gap_0,
    &&shellsort_gap_0, &&shellsort_gap_0,
    /*N = 7..12*/
    &&shellsort_gap_1, &&shellsort_gap_1, &&shellsort_gap_1,
    &&shellsort_gap_1, &&shellsort_gap_1, &&shellsort_gap_1,
    /*N = 13 .. 25*/
    &&shellsort_gap_2, &&shellsort_gap_2, &&shellsort_gap_2,
    &&shellsort_gap_2, &&shellsort_gap_2, &&shellsort_gap_2,
    &&shellsort_gap_2, &&shellsort_gap_2, &&shellsort_gap_2,
    &&shellsort_gap_2, &&shellsort_gap_2, &&shellsort_gap_2,
    &&shellsort_gap_2,
    /*N = 26 .. 59*/
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3, &&shellsort_gap_3,
    &&shellsort_gap_3, &&shellsort_gap_3,
    /*N = 60*/
    &&shellsort_gap_4
    &#125;;

    /*clip N to 60 for the jump, that's enough
    because the maximum sequence gap is 57.*/
    goto *jump_table&#91;(&#40;N < 61&#41; ? N &#58; 60&#41;&#93;;

shellsort_gap_4&#58;
    for &#40;int i = GAP_4; &#40;LIKELY&#40;i < N&#41;); i++)
    &#123;
        int value = movelist&#91;i&#93;.m.mvv_lva;
        uint32_t tmp_move = movelist&#91;i&#93;.u;
        int j;
        for &#40;j = i - GAP_4; &#40;LIKELY&#40;j >= 0&#41;); j -= GAP_4&#41;
        &#123;
            MOVE current_move;
            current_move.u = movelist&#91;j&#93;.u;
            if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                break;
            movelist&#91;j+GAP_4&#93;.u = current_move.u;
        &#125;
        movelist&#91;j+GAP_4&#93;.u = tmp_move;
    &#125;

shellsort_gap_3&#58;
    for &#40;int i = GAP_3; &#40;LIKELY&#40;i < N&#41;); i++)
    &#123;
        int value = movelist&#91;i&#93;.m.mvv_lva;
        uint32_t tmp_move = movelist&#91;i&#93;.u;
        int j;
        for &#40;j = i - GAP_3; &#40;LIKELY&#40;j >= 0&#41;); j -= GAP_3&#41;
        &#123;
            MOVE current_move;
            current_move.u = movelist&#91;j&#93;.u;
            if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                break;
            movelist&#91;j+GAP_3&#93;.u = current_move.u;
        &#125;
        movelist&#91;j+GAP_3&#93;.u = tmp_move;
    &#125;
    
shellsort_gap_2&#58;
    for &#40;int i = GAP_2; &#40;LIKELY&#40;i < N&#41;); i++)
    &#123;
        int value = movelist&#91;i&#93;.m.mvv_lva;
        uint32_t tmp_move = movelist&#91;i&#93;.u;
        int j;
        for &#40;j = i - GAP_2; &#40;LIKELY&#40;j >= 0&#41;); j -= GAP_2&#41;
        &#123;
            MOVE current_move;
            current_move.u = movelist&#91;j&#93;.u;
            if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                break;
            movelist&#91;j+GAP_2&#93;.u = current_move.u;
        &#125;
        movelist&#91;j+GAP_2&#93;.u = tmp_move;
    &#125;

shellsort_gap_1&#58;
    for &#40;int i = GAP_1; &#40;LIKELY&#40;i < N&#41;); i++)
    &#123;
        int value = movelist&#91;i&#93;.m.mvv_lva;
        uint32_t tmp_move = movelist&#91;i&#93;.u;
        int j;
        for &#40;j = i - GAP_1; &#40;LIKELY&#40;j >= 0&#41;); j -= GAP_1&#41;
        &#123;
            MOVE current_move;
            current_move.u = movelist&#91;j&#93;.u;
            if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                break;
            movelist&#91;j+GAP_1&#93;.u = current_move.u;
        &#125;
        movelist&#91;j+GAP_1&#93;.u = tmp_move;
    &#125;

shellsort_gap_0&#58;
    for &#40;int i = GAP_0; &#40;LIKELY&#40;i < N&#41;); i++)
    &#123;
        int value = movelist&#91;i&#93;.m.mvv_lva;
        uint32_t tmp_move = movelist&#91;i&#93;.u;
        int j;
        for &#40;j = i - GAP_0; &#40;LIKELY&#40;j >= 0&#41;); j -= GAP_0&#41;
        &#123;
            MOVE current_move;
            current_move.u = movelist&#91;j&#93;.u;
            if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                break;
            movelist&#91;j+GAP_0&#93;.u = current_move.u;
        &#125;
        movelist&#91;j+GAP_0&#93;.u = tmp_move;
    &#125;
    
shellsort_no_action&#58; /*for N=0 and N=1*/
    return;
&#125;

static void Shellsort&#40;MOVE *restrict movelist, int N&#41;
&#123;
    static int shell_sort_gaps&#91;&#93; = &#123;1, 4, 10, 23, 57 /*, 132, 301, 701*/&#125;;
    int sizeIndex;
    for &#40;sizeIndex = sizeof&#40;shell_sort_gaps&#41;/sizeof&#40;shell_sort_gaps&#91;0&#93;) - 1; &#40;LIKELY&#40;sizeIndex >= 0&#41;); sizeIndex--)
    &#123;
        int i;
        int gap = shell_sort_gaps&#91;sizeIndex&#93;;
        for &#40;i = gap; &#40;LIKELY&#40;i < N&#41;); i++)
        &#123;
            int value = movelist&#91;i&#93;.m.mvv_lva;
            uint32_t tmp_move = movelist&#91;i&#93;.u;
            int j;
            for &#40;j = i - gap; &#40;LIKELY&#40;j >= 0&#41;); j -= gap&#41;
            &#123;
                MOVE current_move;
                current_move.u = movelist&#91;j&#93;.u;
                if &#40;UNLIKELY&#40;current_move.m.mvv_lva >= value&#41;)
                    break;
                movelist&#91;j+gap&#93;.u = current_move.u;
            &#125;
            movelist&#91;j+gap&#93;.u = tmp_move;
        &#125;
    &#125;
&#125;

int MOVE_cmp_by_mvv_lva&#40;const void *a, const void *b&#41; 
&#123; 
    MOVE *ia = &#40;MOVE *&#41;a;
    MOVE *ib = &#40;MOVE *&#41;b;
    return &#40;int&#41;ib->m.mvv_lva - &#40;int&#41;ia->m.mvv_lva;
&#125; 


void Init_List&#40;MOVE *restrict mv_list, size_t len&#41;
&#123;
    size_t i;
    MOVE amove;
    amove.u = 0;
    srand&#40;0&#41;;   
    for &#40;i = 0; i < len; i++)
    &#123;
        amove.m.mvv_lva = &#40;rand&#40;) & 0xff&#41; - 128;
        mv_list&#91;i&#93;.u = amove.u;
    &#125;
&#125;

int main&#40;void&#41;
&#123;
    size_t i;
    MOVE *move_list;
    struct timeval start_time, stop_time;
    
    move_list = &#40;MOVE *) calloc&#40;MV_PER_POS*NUM_POS, sizeof&#40;MOVE&#41;);
    if &#40;move_list == NULL&#41;
    &#123;
        printf&#40;"insufficient memory.\r\n");
        return&#40;1&#41;;
    &#125;

    Init_List&#40;move_list, MV_PER_POS*NUM_POS&#41;;
    gettimeofday&#40;&start_time, 0&#41;;
    for &#40;i = 0; i < MV_PER_POS*NUM_POS; i += MV_PER_POS&#41;
        Shellsort&#40;move_list + i, MV_PER_POS&#41;;
    gettimeofday&#40;&stop_time, 0&#41;;
    printf&#40;"shellsort&#58;     %.2lf seconds // dummy&#58; %d\r\n", timediff_sec&#40;start_time, stop_time&#41;, move_list&#91;MV_PER_POS*NUM_POS - 1&#93;.m.mvv_lva&#41;;
    
    Init_List&#40;move_list, MV_PER_POS*NUM_POS&#41;;
    gettimeofday&#40;&start_time, 0&#41;;
    for &#40;i = 0; i < MV_PER_POS*NUM_POS; i += MV_PER_POS&#41;
        Shellsort_Opt&#40;move_list + i, MV_PER_POS&#41;;
    gettimeofday&#40;&stop_time, 0&#41;;
    printf&#40;"shellsort opt&#58; %.2lf seconds // dummy&#58; %d\r\n", timediff_sec&#40;start_time, stop_time&#41;, move_list&#91;MV_PER_POS*NUM_POS - 1&#93;.m.mvv_lva&#41;;
    
    Init_List&#40;move_list, MV_PER_POS*NUM_POS&#41;;
    gettimeofday&#40;&start_time, 0&#41;;
    for &#40;i = 0; i < MV_PER_POS*NUM_POS; i += MV_PER_POS&#41;
        qsort&#40;move_list + i, MV_PER_POS, sizeof&#40;MOVE&#41;, MOVE_cmp_by_mvv_lva&#41;;
    gettimeofday&#40;&stop_time, 0&#41;;
    printf&#40;"quicksort&#58;     %.2lf seconds // dummy&#58; %d\r\n", timediff_sec&#40;start_time, stop_time&#41;, move_list&#91;MV_PER_POS*NUM_POS - 1&#93;.m.mvv_lva&#41;;
    
    
    free&#40;move_list&#41;;
    
    return 0;
&#125;

Engin · Post by **Engin** » Mon Mar 20, 2017 8:40 am

pfff, if you are mean that speed is all about gain elo so please continue....

i don't care much about speed, i weight eval knowledge and a good search is much better to gain elo. its slow down the search and nps but who cares about this.

the goal is to shrink down the search tree with less nodes to search in alpha beta.

of course it would be nice to have that a little speed up the whole chess engine, for example speed up with bitboards, pop count, prefetch and such like things.

hgm · Post by **hgm** » Mon Mar 20, 2017 9:53 am

Code: Select all

float history&#91;64&#93;&#91;64&#93;;
float historyMax;

// on beta cutoff

float h = history&#91;fromSqr&#93;&#91;toSqr&#93; += depth*depth;
if&#40;h > historyMax&#41;  historyMax = h;

// in node local variables

int nrOfMoves = 0;
long long int map&#91;16&#93;;
unsigned char first&#91;1024&#93;;
unsigned char next&#91;MAXMOVES&#93;;
Move moveList&#91;MAXMOVES&#93;;

for&#40;i=0; i<16; i++) map&#91;i&#93; = 0; // mark all bins empty

// in non-capture generation

int key = *&#40;int*) (&history&#91;fromSqr&#93;&#91;toSqr&#93;); // read as integer
int max = *&#40;int*) &historyMax;
int bin = &#40;max - key&#41; >> 18 & 1023; // discard lower 18 and upper 4 bits
int binGroup = bin >> 6;
moveList&#91;nrOfMoves&#93; = MOVE&#40;fromSqr, toSqr&#41;;
if&#40;&#40;map&#91;binGroup&#93; & 1LL << bin&#41; == 0&#41; &#123; // maps to empty bin, initialize
  map&#91;binGroup&#93; |= 1LL << bin; // mark used
  next&#91;nrOfMoves&#93; = 0xFF; // invalid move number to mark end of list
&#125; else
  next&#91;rOfMoves&#93; = first&#91;bin&#93;; // insert in list
first&#91;bin&#93; = nrOfMoves;
nrOfMoves++;

// loop over moves

for&#40;binGroup=0; binGroup<16; binGroup++) &#123;
  int m = map&#91;binGroup&#93;;
  while&#40;m&#41; &#123;
    int nextBit = BSF&#40;m&#41;;
    int bin = 64*binGroup + nextBit;
    int moveNr = first&#91;bin&#93;;
    do &#123;
      Move move = moveList&#91;moveNr&#93;;

      MakeMove&#40;move&#41;;
      score = - Search&#40;);
      UnMake&#40;);

      ...

      moveNr = next&#91;moveNr&#93;;
    &#125; while&#40;moveNr != 0xFF&#41;; // next move from bin
    m -= 1LL << nextBit; // this bin done
  &#125; // next bin
&#125; // next group of 64 bins

IIRC the format of a float is:

Code: Select all

MSB   sign exponent&#40;8bit&#41; mantisse&#40;23bit&#41;   LSB

so we extract the lower 5 bits of the exponent plus the upper 5 bits of the mantisse to get the bin number. (The sign bit can be ignored, because history counts are always positive.) This assumes the exponent will never reach above 31. Wich seems reasonable, as it indicates a power of 2, so that history counts can run up to 2^31, which is also what they can do if you use int counters.

Because 5 bits of the mantisse are used, the bins span ranges of 3% to 1.5%. (1/32 - 1/64). E.g. all history scores between 1.00000 and 1.00001 (binary, so 1 + 1/32) times some power of 2 (which replaces the '1' before the binary point) would map to the same bin. That makes the bins so narrow that sorting within a bin (in the unlikely case that multiple moves map to the same bin) seems meaningless.

I guess you could do by far fewer bins, perhaps even just 64, so that no looping over binGroups is ever needed, and 'map' can be a scalar. The bins would then span a factor 1.5 to 1.25.

Stan Arts · Post by **Stan Arts** » Mon Mar 20, 2017 10:20 am

You can always choose a faster and simpler sorting method near the leafs as that is where it hurts the most.
Makes some sense too to invest more effort sorting when there is a lot of depth left to go.

hgm · Post by **hgm** » Mon Mar 20, 2017 10:27 am

Well, in QS you don't have to sort non-captures, so that already helps!

Engin · Post by **Engin** » Mon Mar 20, 2017 2:48 pm

there are well non-captures in king evasions.

anyway i didnt sort all in quiesce, instead i select next best score each time if a move is need by quiesce until it cutoff alpa >= beta

hgm · Post by **hgm** » Mon Mar 20, 2017 3:51 pm

Engin wrote:there are well non-captures in king evasions.

But what criteria could you use for sorting them? I don't expect normal history to be any good.

In drop games it is often good to try King withdrawals first (after capturing the checker, of course). If you interpose, the checking side just 'renews' the check by capturing the interposed piece with the checker. And when you recapture, drop the captured piece with a new check. So you invite series of checks, all extended.

jdart · Post by **jdart** » Mon Mar 20, 2017 6:07 pm

Most of the nodes are near the leaves, so doing a fancier sort farther up in the tree is not likely to help much.

--Jon

DustyMonkey · Post by **DustyMonkey** » Tue Mar 21, 2017 10:35 am

It seems to me that "complete" sorting isnt necessary.

The goal of the sort is to try the most likely best moves first, but because there is already significant noise in the approximation used (values in history table, etc) there is also no reason to get the sort exactly right either, an approximation could be good there too.

My optimization advice would be to try truncated bitonic sorting networks, by excluding the "shortest" comparators of "complete" bitonic sorts. With a ~50% reduction in comparators each item will still be ranked withing +/- 3 of their true rank.

7% sounds awful high unless your benchmark is based on an abnormal set of positions dominated by above average game tree branching factors.

Sven · Post by **Sven** » Tue Mar 21, 2017 11:51 am

DustyMonkey wrote:It seems to me that "complete" sorting isnt necessary.

The goal of the sort is to try the most likely best moves first, but because there is already significant noise in the approximation used (values in history table, etc) there is also no reason to get the sort exactly right either, an approximation could be good there too.

My optimization advice would be to try truncated bitonic sorting networks, by excluding the "shortest" comparators of "complete" bitonic sorts. With a ~50% reduction in comparators each item will still be ranked withing +/- 3 of their true rank.

7% sounds awful high unless your benchmark is based on an abnormal set of positions dominated by above average game tree branching factors.

It may be less important whether move no. 37 appears at rank 34 or 40, since it is very likely that we are at an ALL node when actually dealing with those "very late moves", and sorting at an ALL node does not have high priority.

But I think it is crucial for a near-optimal overall tree size that move no. 1 appears at rank 1, and the same for other top-ranked moves. Therefore I'm not sure whether that a "within +/- 3 of their true rank" solution, as you mentioned, would be appropriate.

speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting