Sorting algorithms

tpetzke · Post by **tpetzke** » Thu Apr 27, 2017 6:10 pm

case 12&#58;compare_swap&#40;1, 2&#41;; compare_swap&#40;0, 2&#41;; compare_swap&#40;0, 1&#41;; compare_swap&#40;4, 5&#41;; compare_swap&#40;3, 5&#41;;
				compare_swap&#40;3, 4&#41;; compare_swap&#40;0, 3&#41;; compare_swap&#40;1, 4&#41;; compare_swap&#40;2, 5&#41;; compare_swap&#40;2, 4&#41;;
				compare_swap&#40;1, 3&#41;; compare_swap&#40;2, 3&#41;; compare_swap&#40;7, 8&#41;; compare_swap&#40;6, 8&#41;; compare_swap&#40;6, 7&#41;;
				compare_swap&#40;10, 11&#41;;compare_swap&#40;9, 11&#41;;compare_swap&#40;9, 10&#41;;compare_swap&#40;6, 9&#41;;compare_swap&#40;7, 10&#41;;
				compare_swap&#40;8, 11&#41;;compare_swap&#40;8, 10&#41;;compare_swap&#40;7, 9&#41;; compare_swap&#40;8, 9&#41;; compare_swap&#40;0, 6&#41;;
				compare_swap&#40;1, 7&#41;; compare_swap&#40;2, 8&#41;; compare_swap&#40;2, 7&#41;; compare_swap&#40;1, 6&#41;; compare_swap&#40;2, 6&#41;;
				compare_swap&#40;3, 9&#41;; compare_swap&#40;4, 10&#41;;compare_swap&#40;5, 11&#41;;compare_swap&#40;5, 10&#41;;compare_swap&#40;4, 9&#41;;
				compare_swap&#40;5, 9&#41;; compare_swap&#40;3, 6&#41;; compare_swap&#40;4, 7&#41;; compare_swap&#40;5, 8&#41;; compare_swap&#40;5, 7&#41;;
				compare_swap&#40;4, 6&#41;; compare_swap&#40;5, 6&#41;;
				return;

Dann Corbit · Post by **Dann Corbit** » Thu Apr 27, 2017 8:45 pm

Do you have a program that writes these or did you do it by hand?

Did you use Knuth's merge insertion algorithm, or something else?

Thanar · Post by **Thanar** » Thu Apr 27, 2017 10:47 pm

I'm not sure what Thomas used, but here is a generator for most of the standard sorting networks: http://pages.ripco.net/~jgamble/nw.html

tpetzke · Post by **tpetzke** » Fri Apr 28, 2017 7:55 am

I used an online generator to generate the required pair operations and then transformed it into source code.

Then I did some unit testing to ensure that the nets perform the way they are supposed to.

mar · Post by **mar** » Fri Apr 28, 2017 2:00 pm

That's very interesting. Are sorting networks stable? I think they might be.

Thanar · Post by **Thanar** » Fri Apr 28, 2017 5:02 pm

Are sorting networks stable?

The standard ones are not stable in general.

Sorting networks that mimic an insertion sort or bubble sort are stable, and work fine for small arrays.

From my testing, the main speed benefit of using sorting networks is the ability to eliminate almost all branches (and in particular difficult-to-predict branches) from the sorting code. To achieve this, you need to make sure that your "compare_swap" routine compiles down to cmov assembly instructions (or something similar), instead of branching code. This is possible with the right C code using gcc, or with inline assembler macros.

he extra "compare_swaps" used in the stable sorting networks compared to the optimal non-stable ones is not that significant for small arrays.

Thanar · Post by **Thanar** » Fri Apr 28, 2017 6:18 pm

Here’s some sample C code for a stable sorting network that just sorts an array of ints.

It uses a sorting network that imitates a bubble sort that compare_swaps through the whole array (ensuring the last element is correctly sorted), then goes through length-1 of the array, etc. This allows us to reuse the code for smaller length arrays using a "Duffs device" style switch statement.

Code: Select all

#define SWAP&#40;x,y&#41;  &#123; int px = p&#91;x&#93;; int py = p&#91;y&#93;; int tmp = p&#91;x&#93; = px <= py ? px &#58; py; p&#91;y&#93; ^= px ^ tmp; &#125; 

static inline void stable_sorting_network&#40;int *begin, int *end&#41;
&#123;
  int *p, *q, tmp;
    
  p=begin;
  switch&#40;end - begin&#41; &#123;
    default&#58;
    case 10&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
             SWAP&#40;4,5&#41;; SWAP&#40;5,6&#41;; SWAP&#40;6,7&#41;; SWAP&#40;7,8&#41;; SWAP&#40;8,9&#41;;
    case 9&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
            SWAP&#40;4,5&#41;; SWAP&#40;5,6&#41;; SWAP&#40;6,7&#41;; SWAP&#40;7,8&#41;;
    case 8&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
            SWAP&#40;4,5&#41;; SWAP&#40;5,6&#41;; SWAP&#40;6,7&#41;;
    case 7&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
            SWAP&#40;4,5&#41;; SWAP&#40;5,6&#41;;
    case 6&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
            SWAP&#40;4,5&#41;;
    case 5&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;; SWAP&#40;3,4&#41;;
    case 4&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;; SWAP&#40;2,3&#41;;
    case 3&#58; SWAP&#40;0,1&#41;; SWAP&#40;1,2&#41;;
    case 2&#58; SWAP&#40;0,1&#41;;
    case 1&#58;
    case 0&#58;
        break;
  &#125;
 
  // Insert remaining elements into sorted list.
  for &#40;p = begin + 10; p < end; ++p&#41; &#123;
    tmp = *p;
    for &#40;q = p; q != begin && *&#40;q-1&#41; > tmp; --q&#41;
      *q = *&#40;q-1&#41;;
    *q = tmp;
  &#125;
&#125;

You can paste the code into https://godbolt.org/ to see that it compiles down to cmovle and xor instructions when compiled with, for example, x86-64 gcc 5.3 –O2

And here is the time used to sort arrays of random ints of various lengths (The time is based on counting cycles with the rdtsc instruction):

Code: Select all

                               Array length
Sort method           4   6   8  10  12  14  16  18  20  22  24  Total cycles
std&#58;&#58;sort            69 112 174 244 338 371 426 626 754 822 968  4905
Insertion            49  94 147 201 258 310 380 457 520 584 706  3706
stable sort network  13  28  50  77 134 206 269 339 408 448 518  2490

So for this simple case, the sorting network is over twice as fast for small array sizes. Of course, the gains aren’t quite as big when you’re sorting structs based on a field in the struct, but it is still faster.

hgm · Post by **hgm** » Fri Apr 28, 2017 6:43 pm

This seems all very expensive compared to a simple disperse & collect. How many different values of the sort key can you expect for these small sets of moves anyway. Is this just for sorting captures MVV/LVA-wise? In Chess that is just 20 possible key values. So dispersing all moves over 20 linked lists, and then running trough the lists in order would already do it.

tpetzke · Post by **tpetzke** » Fri Apr 28, 2017 8:11 pm

How many different values of the sort key can you expect for these small sets of moves anyway.

I sort moves based on different criteria e.g. based on the history score to name one. So I encounter a bit more than 20 different key values. The value is fitted into the upper bits of the unsigned integer I use for move representation so for sorting I just sort integers.

hgm · Post by **hgm** » Fri Apr 28, 2017 8:22 pm

OK, so you have 256 key values. Can still be done without too much overhead through 256 linked lists. As you can hook the moves in their respective lists as soon as you score them, there would be no need to pack the key into the move, earning some of the overhead back. Of course if you are really tight on space you could use the high byte for the link. With 256 key values and maximally 24 moves to sort, the lists would hardly ever contain more than one move, so the end-branch of the loop to run through the lists becomes highly predictably (as 'not taken').

Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms

Re: Sorting algorithms