Some x64 assembler for the curious

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Some x64 assembler for the curious

Post by matthewlai »

Michael Sherwin wrote: Wed Mar 27, 2019 5:57 pm I have already done the proof that handwritten assembler is faster. I have programmed two perft examples. One in pure C and one with handwritten assembler for the move generator, make move and take back.

Both versions make and unmake all moves generated and do no cache counting.
GNUChess 4 style in pure C, bench 7.
42,682,123 nodes per second. (new optimised compile with MSVS 2017)

GNUChess 4 style with move generator, make and unmake in assembler, bench 7.
74,754,475 nodes per second. (old compile with Turbo C and Turbo Assembler)

I guess now the argument could be that my C code must be really poorly done. Oh well there is always something someone can throw back and the burden of proof will always be on me. However, even if my C programming skills are that bad then I'm better off writing in assembler anyway! :lol:
In that case, it would be fun to look at disassembly of the compiler's output and see why it's slower? That would either point to a compiler blindspot in optimization, or something to do with the C code (eg. that the algorithm implemented isn't exactly identical).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Some x64 assembler for the curious

Post by Michael Sherwin »

matthewlai wrote: Thu Mar 28, 2019 12:32 pm
Michael Sherwin wrote: Wed Mar 27, 2019 5:57 pm I have already done the proof that handwritten assembler is faster. I have programmed two perft examples. One in pure C and one with handwritten assembler for the move generator, make move and take back.

Both versions make and unmake all moves generated and do no cache counting.
GNUChess 4 style in pure C, bench 7.
42,682,123 nodes per second. (new optimised compile with MSVS 2017)

GNUChess 4 style with move generator, make and unmake in assembler, bench 7.
74,754,475 nodes per second. (old compile with Turbo C and Turbo Assembler)

I guess now the argument could be that my C code must be really poorly done. Oh well there is always something someone can throw back and the burden of proof will always be on me. However, even if my C programming skills are that bad then I'm better off writing in assembler anyway! :lol:
In that case, it would be fun to look at disassembly of the compiler's output and see why it's slower? That would either point to a compiler blindspot in optimization, or something to do with the C code (eg. that the algorithm implemented isn't exactly identical).
Okay I can point to one crucial point of comparison--the traversal of a slider on an empty board.

Code: Select all

case BISHOP:
             sbns=bns+bol[fs];
             sbnd=bnd+bol[fs];
             ts=*(sbns+fs);
             while(ts<64)
             {
               switch(wtrgt[brd[ts]])
               {
                 case VACANT:
                        link(MOVE);
                        ts=*(sbns+ts);
                        continue;
                 case FRIEND:             
                        ts=*(sbnd+ts);
                        continue;
                 case ENEMY:
                        link(CAPTURE);
                        ts=*(sbnd+ts);
                        continue;
                 default:
                        return FALSE;
               }                            
             }
             continue;
Versus this.

Code: Select all

wbmrm:      mov         [tree.fsq+eax*8],cl
            mov         [tree.tsq+eax*8],bl
            mov         [tree.typ+eax*8],BMOV
            inc         eax
            movsx       ebx,[bns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wbmf+edx*4]
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Some x64 assembler for the curious

Post by Michael Sherwin »

I guess this is the most telling. Traversing all the squares of a slider in capture only mode on an empty board using two moves and a jump.

Code: Select all

wbcns:      movsx       ebx,[bns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wbcf+edx*4]
It is impossible for any C compiler to beat that!
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Some x64 assembler for the curious

Post by Joost Buijs »

Michael Sherwin wrote: Thu Mar 28, 2019 3:39 pm I guess this is the most telling. Traversing all the squares of a slider in capture only mode on an empty board using two moves and a jump.

Code: Select all

wbcns:      movsx       ebx,[bns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wbcf+edx*4]
It is impossible for any C compiler to beat that!
This is typically what humans do when they write in assembler: keep everything in the CPU registers, keep the number of instructions low and try to avoid branches as much as possible. With the advent of superscalar CPU's things are not so simple anymore, I have seen on many occasions that modern compilers outsmart you in several ways with code that looks ugly and stupid at first sight. This is why I gave up on programming in assembler a long time ago.
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Some x64 assembler for the curious

Post by Rebel »

Joost Buijs wrote: Fri Mar 29, 2019 7:56 am
Michael Sherwin wrote: Thu Mar 28, 2019 3:39 pm I guess this is the most telling. Traversing all the squares of a slider in capture only mode on an empty board using two moves and a jump.

Code: Select all

wbcns:      movsx       ebx,[bns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wbcf+edx*4]
It is impossible for any C compiler to beat that!
This is typically what humans do when they write in assembler: keep everything in the CPU registers, keep the number of instructions low and try to avoid branches as much as possible. With the advent of superscalar CPU's things are not so simple anymore, I have seen on many occasions that modern compilers outsmart you in several ways with code that looks ugly and stupid at first sight. This is why I gave up on programming in assembler a long time ago.
And a good compiler will translate a (considerable) switch-case in a jump table anyway.
90% of coding is debugging, the other 10% is writing bugs.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Some x64 assembler for the curious

Post by mar »

Rebel wrote: Fri Mar 29, 2019 2:03 pm And a good compiler will translate a (considerable) switch-case in a jump table anyway.
True, but you still need the switch inside the loop. Also, the compiler typically generates code to verify that the expression you switch on is in a valid range. You can still hint the compiler that default is unreachable (not sure there's a standard way though).

True jump tables are possible as a non-standard extension in gcc/clang, the so-called "computed goto".

There used to be a nice trick for bytecode interpreters where they simply stored a pointer to next instruction handler directly within bytecode (+ optional opcode data), gaining substantial performance gain over traditional interpreters by avoiding to decode the next instruction. Of course, even a bad machine code translation/JIT beats an excellent interpreter anytime.
Why I do I mention this? Well, cleverly designed low-level code using jump tables can easily outperform compiler-generated loop+switch.
Martin Sedlak
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Some x64 assembler for the curious

Post by Michael Sherwin »

Just a bit of an ironic update. I was having some difficulties with the pawn move generator and interfacing the design with the makemove routine. I actually solved the problem very brilliantly but then I forgot that I solved it. That caused me a couple of days of confusion. So I decided to rip the pawn code out and start over. I had almost deleted all the pawn code when my eye caught something that looked like an error that I had made. But it was bothering me so I took a couple of days to study why I did that. It turned out that I had modified the pawn move tables in such a way that made the problem go away. I just forgot that I did that. Luckily I had not saved the code and I was able to restore the pawn code. Then I took some days off because my brain hurt. Anyway I am back on track now.

The idea of the move generation is to do as little as possible until more needs to be done. Therefore I only store the from square and to square when I make a normal move. For special moves I store the from square and a value greater than 63 for the to square. Then in the makemove routine I check if the to square is greater than 63 and if it is I subtract 64 and use the special value to jump through a jump table to the appropriate code that knows what to do.

Okay, I have posted about my learning/memory disability several times. This is why it takes me magnitudes longer to get anything done. And it is not getting any better as I get older. I just wish that someone would take my idea of a different kind of chess engine seriously and partner with me and help me finish this. It can probably be done in a few weeks. I promise it is a very powerful new idea. The single threaded version will be free, however the multithreaded version will be commercial for a moderate price. So there should be some financial reward. No promises though as it remains to be proven. I can see in my mind that it will be very much like the Zeros except smarter and run at the speed of a material only searcher. It won't even have an evaluation function. Can anyone even rap their mind around that? I just do not know if I can finish it on my own.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Some x64 assembler for the curious

Post by Michael Sherwin »

See I forgot to mention that any candidate needs to be able to program in x64 assembler and be able to program smp in C/C++ and is trusted by the talkchess community and I guess follow my design. And probably some other things that I have forgotten, lol. Oh and the source code will remain private.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: Some x64 assembler for the curious

Post by Harald »

This thread reminds me of an old Z80 assembler chess program 'Schachmatt' by Gerhard Scheuermann.
It is a very fast mate solver that was published in the german computer magazine CHIP in 03/1984.
It does not play chess but generates, makes, unmakes moves and looks for mates.
The piece abbreviations and comments in the listing are in german.

It can also be seen as a hex dump with loader in a PDF scan on this link:
https://www.andreaszapf.de/media/ZX81/S ... sgabe3.pdf on page 56.
"Computer-Programme Sinclair ZX 81" number 3 published by CHIP.

I just scanned the 14 pages of the commented Z80 assembler source code into a pdf file (11.5 MByte).
I will not make this public because I may not have the rights to do that
but I can probably share this file via e-mail with interested people on request. (harald.luessen@gmx.de)

Performance on ZX81:
Problem 1: white Kd5 Rh1, black Ka8, white mates in 3
Problem 2: white Ke4 Qf6 Rd8 Bb8 Bf3 Nd7 Pc4, black Kc6 Ra5 Bh6 Ne1 Ne6 Pa7 Pb7 Pb6 Pf7, white mates
Sargon MGS: 88s/24m05s
Chess Chall. Voice: 700s/24m09s
Chess Champ. SSIII: 81s/48m
'Schachmatt' on ZX81: 7.6s/1m53s

Assembler can be very fast.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Some x64 assembler for the curious

Post by Michael Sherwin »

Harald wrote: Fri Apr 05, 2019 5:50 pm Assembler can be very fast.
Indeed! Most programmers underestimate what really good assembler can do. It can't beat a good C compiler like it used to but assembler by a true guru can still teach C a thing or two.

Yes Z80 and 6502 were a lot of fun and back in the day seemed lightning fast. My engine will be more than a mate solver. It will have a secret weapon or two (that I have made post on) that will replace the eval function and give a far better result. It will sort of mine the evaluation info in a more pure form than by rule of thumb. And it will do it on the fly without a significant cost. And it will do Reinforcement Learning in real time while it is thinking. The RL will even help greatly in move ordering. And it will be better move ordering than in current alpha/beta engines that use rule of thumb for that also. I do not understand why I have not been able to interest anyone. So I'll probably have to go it alone if I still can.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through