Back To The Beginning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

fabianVDW wrote: Sun Sep 08, 2019 8:10 pm
fabianVDW wrote: Sat Sep 07, 2019 6:55 pm
Joost Buijs wrote: Sat Sep 07, 2019 12:57 pm
fabianVDW wrote: Sat Sep 07, 2019 12:21 pm AFAIK, movegeneration speed is a good chunk of the runtime in modern engines. Atleast in FabChess, it is about 25-30% of the time used, being on par with the evaluation function (IIRC).
25% to 30% of the time used by the move generation is quite a lot. Maybe it depends upon staged move generation or not and whether you execute the hash move with or without move generation, when I profile my engine move generation takes considerably less than 10% of the total time used.
I also do staged movegen in FabChess. Maybe I remembered the numbers wrong... I will do another profiling session tomorrow.
Okay, I did some profiling. First of all, Fab runs perft 6 on startpos with about 80MNPS (with bulk counting) on my I5-6400. During search, movegen takes up about 5.5% of the time, but it uses a struct which contains all the pseudo legal moves for every piece. This struct is later reused in eval for mobility etc. Calculation of that struct is about 15% of the time.

I am sure my engine is badly tuned in that regard aswell, as I seem not to be able to recognize bottlenecks well
This is more or less comparable to what I see over here, move-gen takes roughly 4% of the time, I calculate the squares attacked by each piece (for both colors) on the fly in the evaluation function, when in check I have a separate routine to calculate the squares attacked by the opponent because my move-gen needs this.

The most time consuming routines in my engine are: Eval(), TT-probe() and See(). I always wonder why probing the transposition table is rather slow, it is a table with four 16 byte entries per bucket, I never managed to get it any faster. Prefetching doesn't help either, with SMP it even slows the engine down a little.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Back To The Beginning

Post by bob »

thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
Downloaded Crafty 25.3 from github. On my machine overclocked to 4GHz (at present) perft 7 74.53 sec. 42,880,744 nodes per second. So your laptop at 2.2 GHz is no slouch! Mine at 4 GHz bench 7 is 48.29 sec, 69,311,177 nodes per sec. 69.3 / 42.9 = 1.615. Mine is also, "pure generate/make/unmake".

The perft in Crafty 25.3 is much faster than the perft in Crafty 19.x. I remember that my bitboard perft example outperformed C19.x by almost double. Now it takes longer, 117.4 seconds.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
It all comes very close to each other, on my 4 GHz. machine with perft 7 from start position I get:

perft: 7
moves: 3195901860 ok!
time: 55.7651 sec.
mps: 57310090

This is with full up/down-dating, no bulk-counting or hashing. It is not my goal to make perft() run as fast as possible, it is just a means to check whether the move logic is working like it should.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Back To The Beginning

Post by hgm »

When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).

The speed of move generation in itself is a bit meaningless. You have to consider it in combination with the time spent on move sorting. Move generators that superficially are equally fast can still lead to a significant difference in performance if one of them generates the moves in the desired search order (e.g. MVV/LVA), so that the moves do not have to be stored in a list, and sorted, but can be searched immediately.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Back To The Beginning

Post by zullil »

hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

zullil wrote: Wed Sep 11, 2019 11:33 am
hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.
Exactly. The two examples Carnivor and Godzilla use the exact same method. Carnivor is written in C and Godzilla has the move generator, MakeMove and TakeBack written in assembler.

Carnivor in C bench 6 3.198 seconds
Godzilla in asm bench 6 1.903 seconds
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

Michael Sherwin wrote: Wed Sep 11, 2019 11:50 am
zullil wrote: Wed Sep 11, 2019 11:33 am
hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.
Exactly. The two examples Carnivor and Godzilla use the exact same method. Carnivor is written in C and Godzilla has the move generator, MakeMove and TakeBack written in assembler.

Carnivor in C bench 6 3.198 seconds
Godzilla in asm bench 6 1.903 seconds
I don't know what Carnivor and Godzilla are, but I'm pretty sure that when you code the C version for speed that the difference with the asm version will be negligible. It's the same with Stockfish, it is coded for readability (sort of) but it is not coded for speed.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Back To The Beginning

Post by hgm »

zullil wrote: Wed Sep 11, 2019 11:33 amAnd yet asmfish was markedly faster than Stockfish with exactly the same serial search.
This could be due to the quality of the original source code. I understood that Stockfish is written in C++.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Back To The Beginning

Post by bob »

Joost Buijs wrote: Wed Sep 11, 2019 8:20 am
bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
It all comes very close to each other, on my 4 GHz. machine with perft 7 from start position I get:

perft: 7
moves: 3195901860 ok!
time: 55.7651 sec.
mps: 57310090

This is with full up/down-dating, no bulk-counting or hashing. It is not my goal to make perft() run as fast as possible, it is just a means to check whether the move logic is working like it should.
I believe that perft first showed up in Crafty. The oldest version I have is 8.11 which was probably 1994/1995ish. It has "perf" rather than "perft". By version 10 (the next version series I have) it had been changed to perft although perf was still present. "perf" seemed to be a static test, while perft worked as it does today. My intent was not to use this as a "tachometer" to measure speed, but to use it as a debugging tool to detect movgen/make/unmake errors when those were changed for optimization. 8.11 actually produces wrong numbers, but there are a zillion warnings and it was really a 32 bit program that probably has lots of bugs in the 64 bit world.