Improve my chess engine (mainly nps and branching factor)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Rein Halbersma »

Richard Allbert wrote:
Bloodbane wrote:"c++ std based map transposition tables"

That is probably slowing you down quite a bit, since inserting stuff into a std::map is very slow. Most people (i.e. everyone) just new or malloc a block of memory for the TT and avoid the overhead of std::map. CPW-Engine should have a perfectly good example of this.
Following on from this, I noticed in your Position class you are using std::vector<int> for your pieces - I did this in my first ever program and it slowed it down a lot. May be better just to use an int array[].
It depends whether your Position can be copied or not. If you don't copy-make moves, and just have a make() / undo(), the vector keeps its memory during a search and there should be no slowdown.

For a TT you can easily adapt the CPW code to use a std::vector<TTEntry> for its storage, and eliminate all the manual memory management. Again zero slowdown.
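To make that concrete, here is a minimal sketch of a vector-backed TT; the TTEntry layout and names are invented for the example, not CPW-Engine's actual code:

Code: Select all

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry layout, only for the sketch.
struct TTEntry {
    std::uint64_t key;
    std::int16_t  score;
    std::uint8_t  depth, flags;
};

class TranspositionTable {
public:
    // Size the table to roughly mbSize megabytes. The vector owns the memory,
    // so there is no malloc/free pair to get wrong and no destructor to write.
    void resize(std::size_t mbSize) {
        entries.assign(mbSize * 1024 * 1024 / sizeof(TTEntry), TTEntry());
    }

    // Call resize() before probing; the index simply wraps around the table size.
    TTEntry& probe(std::uint64_t key) { return entries[key % entries.size()]; }

private:
    std::vector<TTEntry> entries;
};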
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Improve my chess engine (mainly nps and branching factor

Post by mar »

As long as you iterate using indices (no iterator/range loop crap) you should be fine (no performance hit). Of course a vector requires an extra level of indirection, but that can be cached.
Using a vector instead of a plain array means trolling the compiler.
But hey, computers are fast these days ;)
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Rein Halbersma »

mar wrote:Well... how do you expect to perform with stuff like this:

Code: Select all

Move *tmp = new Move(*move);
storing the UCI move string in the Move structure (OK, it's only a static array, but still),
indexing out of bounds (a debug-mode vector should catch that for you),
and who knows what else. Your engine is obviously original, so thumbs up.

My suggestion is to fix the bugs, simplify the move and board structures, use negamax to get rid of the separate min/max code,
and generate moves on the stack (you use depth-first search anyway; I see no point in holding the whole tree in a dynamic structure, and you even dynamically allocate each move!).
The maximum number of moves per position is no more than 256 (there is an exact number you can google; it's less than that, 218 IIRC).
Just be careful to avoid excessive stack usage (typically apps get 1 MB or 512 kB of stack).
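For readers following along, a minimal sketch of that combination (negamax, make/undo, and a fixed move array on the stack); the Position and Move types below are stubs, not anyone's actual engine code:

Code: Select all

// Negamax: one search function instead of separate min and max routines.
struct Move { int from, to; };

struct Position {
    int  generate_moves(Move* out) const { (void)out; return 0; }  // stub: fills 'out', returns a count
    void make(const Move&) {}                                       // stub
    void undo(const Move&) {}                                       // stub
    int  evaluate() const { return 0; }                             // stub, side-to-move relative
};

const int INFINITE_SCORE = 1000000;

int negamax(Position& pos, int depth) {
    if (depth == 0)
        return pos.evaluate();                  // score from the side to move's point of view

    Move moves[256];                            // generated on the stack: no heap traffic per node
    int count = pos.generate_moves(moves);

    int best = -INFINITE_SCORE;
    for (int i = 0; i < count; ++i) {
        pos.make(moves[i]);
        int score = -negamax(pos, depth - 1);   // what is good for the opponent is bad for us
        pos.undo(moves[i]);
        if (score > best)
            best = score;
    }
    return best;
}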
I use a std::vector<Move, StackAlloc> for my move generator, where StackAlloc is Howard Hinnant's stack allocator. It lets you define an array[N] of raw memory on the stack. Then you do e.g. a reserve(N) on the vector, which simply sets the capacity pointer of the vector to the end of the stack array. Only when the generated moves exhaust that array does it go to the heap. Choosing N judiciously (say, to cover 99.9% of generated move list lengths), there is very little overhead compared to a raw stack array, and complete safety in the rare positions whose number of moves exceeds N.

I'm not saying it is the best approach if you know the theoretical upper bound a priori, but for my purposes (a draughts engine that plays on arbitrarily large boards) it works flawlessly.
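For the curious, a simplified sketch of the mechanism (this is not Hinnant's actual short_alloc; alignment handling and reuse of freed stack memory are deliberately ignored to keep it short):

Code: Select all

#include <cstddef>
#include <new>
#include <vector>

// Allocations are carved from a caller-supplied buffer and only spill to the
// heap once that buffer is exhausted.
template <class T>
class StackAlloc {
public:
    using value_type = T;

    template <class U>
    struct rebind { using other = StackAlloc<U>; };

    StackAlloc(char* buf, std::size_t bytes) noexcept
        : buf_(buf), size_(bytes), used_(0) {}

    template <class U>
    StackAlloc(const StackAlloc<U>& other) noexcept
        : buf_(other.buf_), size_(other.size_), used_(other.used_) {}

    T* allocate(std::size_t n) {
        const std::size_t bytes = n * sizeof(T);
        if (used_ + bytes <= size_) {                   // still room in the stack buffer
            T* p = reinterpret_cast<T*>(buf_ + used_);
            used_ += bytes;
            return p;
        }
        return static_cast<T*>(::operator new(bytes));  // rare case: spill to the heap
    }

    void deallocate(T* p, std::size_t) noexcept {
        char* c = reinterpret_cast<char*>(p);
        if (c >= buf_ && c < buf_ + size_)              // stack memory: nothing to free
            return;
        ::operator delete(p);
    }

    char* buf_;
    std::size_t size_, used_;
};

template <class T, class U>
bool operator==(const StackAlloc<T>& a, const StackAlloc<U>& b) { return a.buf_ == b.buf_; }
template <class T, class U>
bool operator!=(const StackAlloc<T>& a, const StackAlloc<U>& b) { return !(a == b); }

// Usage sketch: 'Move' stands in for the engine's real move type; N = 64 slots.
struct Move { int from, to; };

void generate_moves_example() {
    alignas(Move) char buffer[64 * sizeof(Move)];
    StackAlloc<Move> alloc(buffer, sizeof(buffer));
    std::vector<Move, StackAlloc<Move>> moves(alloc);
    moves.reserve(64);   // capacity now lives in 'buffer' on the stack
    // push_back() generated moves; only a position with more than 64 moves
    // ends up touching the heap.
}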
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Improve my chess engine (mainly nps and branching factor

Post by mar »

Rein Halbersma wrote:I use a std::vector<Move, StackAlloc> for my move generator, where StackAlloc is Howard Hinnant's stack allocator. It lets you define an array[N] of raw memory on the stack. Then you do e.g. a reserve(N) on the vector, which simply sets the capacity pointer of the vector to the end of the stack array. Only when the generated moves exhaust that array does it go to the heap. Choosing N judiciously (say, to cover 99.9% of generated move list lengths), there is very little overhead compared to a raw stack array, and complete safety in the rare positions whose number of moves exceeds N.

I'm not saying it is the best approach if you know the theoretical upper bound a priori, but for my purposes (a draughts engine that plays on arbitrarily large boards) it works flawlessly.
I admit that I have no clue about the upper bound on draughts (English draughts is 3 rows, right? I remember we played 2 row draughts as kids here because you could use chess pieces for that).
Unfortunately not all people understand allocators (C++11 allows slimmer allocators, IIRC, with a much simpler interface) and they can (easily) "shoot themselves in the foot", to quote Stroustrup.
Don't get me wrong, I like C++, but I prefer what I call "reasonable or common-sense C++". I don't like "modern" C++ much; I wasn't impressed by Loki at all. Fact is, modern C++ is based on stuff that wasn't even intended to work that way... I remember a colleague who tried to explain his wonderful typelists to me, but the code was so obfuscated that he couldn't understand it himself!!
I feel a bit sad when I see talented people waste their potential on writing "clever code". But that's just my opinion...I'm not a fashion guy :)
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Rein Halbersma »

mar wrote: I admit that I have no clue about the upper bound on draughts (English draughts is 3 rows, right? I remember we played 2 row draughts as kids here because you could use chess pieces for that).
English draughts is played on a chessboard with 3 rows of men per side in the initial position. It's in 10x10 international draughts (4 rows of men per side) that very large move lists can be a problem for engines with fixed move arrays. There exist positions with hundreds of moves (~900), which arise from equivalent capture sequences of 10+ pieces along different routes. You could of course eliminate these by comparing the locations of the captured pieces, but that also has a cost. I typically allocate stack space for 16 moves and go to the heap if more are required.
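A sketch of that comparison (types simplified to plain int squares): two capture sequences count as equivalent when they remove the same set of pieces, so sort each move's captured squares and keep only the first occurrence of each set:

Code: Select all

#include <algorithm>
#include <set>
#include <vector>

using Squares = std::vector<int>;  // squares of the pieces captured by one move

std::vector<Squares> unique_captures(std::vector<Squares> captures) {
    std::set<Squares> seen;
    std::vector<Squares> result;
    for (Squares& c : captures) {
        std::sort(c.begin(), c.end());   // the order along the capture route doesn't matter
        if (seen.insert(c).second)       // first time this set of captured pieces appears
            result.push_back(c);
    }
    return result;
}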
Unfortunately not all people understand allocators (C++11 allows slimmer allocators, IIRC, with a much simpler interface) and they can (easily) "shoot themselves in the foot", to quote Stroustrup.
Don't get me wrong, I like C++, but I prefer what I call "reasonable or common-sense C++". I don't like "modern" C++ much; I wasn't impressed by Loki at all. Fact is, modern C++ is based on stuff that wasn't even intended to work that way... I remember a colleague who tried to explain his wonderful typelists to me, but the code was so obfuscated that he couldn't understand it himself!!
I never used Loki, but I did read the Modern C++ Design book by its author. That code was enormously complicated because of missing language features (e.g. no variadic templates). In C++11/14 all of this goes away and you can do awesome things with very little code. Just things like auto, lambdas, and using instead of typedef make code so much easier to write and read.
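For example (nothing engine-specific, just the three features mentioned):

Code: Select all

#include <algorithm>
#include <vector>

// C++98 spelling:  typedef std::vector<int> ScoreList;
// C++11 alias, easier to read, and it also works for templates:
using ScoreList = std::vector<int>;

int best_score(const ScoreList& scores) {
    // 'auto' deduces the iterator type; the lambda replaces a hand-written functor.
    auto it = std::max_element(scores.begin(), scores.end(),
                               [](int a, int b) { return a < b; });
    return it == scores.end() ? 0 : *it;
}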
I feel a bit sad when I see talented people waste their potential on writing "clever code". But that's just my opinion...I'm not a fashion guy :)
I feel sad when talented people code in assembly or close to it and waste their time instead of using high-level libraries at a small abstraction cost. To this day, it really hurts me to see that a high-quality engine like Stockfish doesn't use a std::vector<TTEntry> for its TT storage, but allocates it using malloc etc. The extra pointer indirection surely isn't going to cost any Elo whatsoever.
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Richard Allbert »

Henk wrote:This is why data encapsulation/data hiding was introduced: to prevent rewriting large pieces of code.

The problem is that it conflicts with efficiency (speed).
No, it doesn't. The issue is using the right tool for the task required. Here, for pieces described as a list of int, a vector is overkill.

Also I agree completely with Lucas.

I'm far from an expert, but I get the feeling half the C++ stuff written is done so simply to say "I used templates / STL / whatever" without considering whether it was needed or not.

And the code ends up unreadable by anyone else.
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Richard Allbert »

Rein Halbersma wrote: It depends whether your Position can be copied or not. If you don't copy-make moves, and just have a make() / undo(), the vector keeps its memory during a search and there should be no slowdown.
This reminds me of something else needed in the code - a copy constructor to prevent copying. The other nice side of C++ - the phantom copies of objects.
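In C++11 that is a two-line fix; a sketch with a stand-in Position class:

Code: Select all

// Forbid the phantom copies at compile time.
class Position {
public:
    Position() = default;
    Position(const Position&) = delete;             // any accidental copy is now a compile error
    Position& operator=(const Position&) = delete;
    // ... make(), undo(), etc.
};

// In C++98 the usual trick is to declare the copy constructor and copy
// assignment operator private and leave them undefined.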
Rein Halbersma wrote: For a TT you can easily adapt the CPW code to use a std::vector<TTEntry> for its storage, and eliminate all the manual memory management. Again zero slowdown.
Please do - I'd like to see if you get the same nps. Genuinely interested. You say it's easy.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Rein Halbersma »

Sure, challenge accepted! What would be an acceptable test for you (a test suite, or a search of a specific position)?
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Richard Allbert »

Rein Halbersma wrote:Sure, challenge accepted! What would be an acceptable test for you (a test suite, or a search of a specific position)?
:roll:

And this is why I hardly post here anymore.

There was no 'challenge', no proving a point, just interest in your comment.

Which, as with everything on this topic, is taken as some kind of contest.

Small wonder the community has dwindled.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Improve my chess engine (mainly nps and branching factor

Post by Rein Halbersma »

Richard Allbert wrote:
Rein Halbersma wrote:Sure, challenge accepted! What would be an acceptable test for you (a test suite, or a search of a specific position)?
:roll:

And this is why I hardly post here anymore.

There was no 'challenge', no proving a point, just interest in your comment.

Which, as with everything on this topic, is taken as some kind of contest.

Small wonder the community has dwindled.
Oh sorry, mine was also just an innocent remark: what's wrong with a little competition? In any case, if you are still interested, here is what I get on the current master branch for ./stockfish bench (I couldn't find a GitHub repo for CPW-Engine).

Code: Select all

Total time (ms) : 4382
Nodes searched  : 7634225
Nodes/second    : 1742178
I then apply the following rather straightforward changes:

tt.h

Code: Select all

 #ifndef TT_H_INCLUDED
 #define TT_H_INCLUDED
 
 #include "misc.h"
 #include "types.h"
+#include <vector>
 
 /// The TTEntry is the 10 bytes transposition table entry, defined as below:
 ///
@@ -83,19 +65,18 @@ struct TTCluster {
 class TranspositionTable {
 
 public:
- ~TranspositionTable() { free(mem); }
   void new_search() { generation += 4; } // Lower 2 bits are used by Bound
 
   const TTEntry* probe(const Key key) const;
-  TTEntry* first_entry(const Key key) const;
+  const TTEntry* first_entry(const Key key) const;
+  TTEntry* first_entry(const Key key);
   void resize(uint64_t mbSize);
   void clear();
   void store(const Key key, Value v, Bound type, Depth d, Move m, Value statV);
 
 private:
   uint32_t clusterCount;
-  TTCluster* table;
-  void* mem;
+  std::vector<TTCluster> table;
   uint8_t generation; // Size must be not bigger than TTEntry::genBound8
 };
 
@@ -106,7 +87,12 @@ extern TranspositionTable TT;
 /// a cluster given a position. The lowest order bits of the key are used to
 /// get the index of the cluster inside the table.
 
-inline TTEntry* TranspositionTable::first_entry(const Key key) const {
+inline const TTEntry* TranspositionTable::first_entry(const Key key) const {
+  return &table[(uint32_t)key & (clusterCount - 1)].entry[0];
+}
+
+inline TTEntry* TranspositionTable::first_entry(const Key key) {
   return &table[(uint32_t)key & (clusterCount - 1)].entry[0];
 }
So I removed the destructor ~TranspositionTable(), removed the void* mem, added a non-const overload for first_entry(), and made the return type of its const counterpart const TTEntry*. Finally, I changed the table to a std::vector<TTCluster>.

tt.cpp

Code: Select all

 #include <cstring>
 #include <iostream>
+#include <algorithm>
 
 #include "bitboard.h"
 #include "tt.h"
@@ -40,18 +41,22 @@ void TranspositionTable::resize(uint64_t mbSize) {
       return;
 
   clusterCount = newClusterCount;
-
-  free(mem);
-  mem = calloc(clusterCount * sizeof(TTCluster) + CACHE_LINE_SIZE - 1, 1);
-
-  if (!mem)
-  {
-      std::cerr << "Failed to allocate " << mbSize
-                << "MB for transposition table." << std::endl;
-      exit(EXIT_FAILURE);
+  table.reserve(clusterCount * TTClusterSize);
+
+ /*
+  * unfortunately, Stockfish compiles with -fno-exceptions
+  *
+  try {
+          table.reserve(clusterCount * TTClusterSize);
+  } catch(std::bad_alloc const&) {
+          std::cerr << "Failed to allocate " << mbSize
+                    << "MB for transposition table." << std::endl;
+          exit(EXIT_FAILURE);
+  } catch(...) {
+          std::cerr << "Unknown exception";
+          exit(EXIT_FAILURE);
   }
-
-  table = (TTCluster*)((uintptr_t(mem) + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1));
+*/
 }
 
 
@@ -61,7 +66,7 @@ void TranspositionTable::resize(uint64_t mbSize) {
 
 void TranspositionTable::clear() {
 
-  std::memset(table, 0, clusterCount * sizeof(TTCluster));
+  std::fill(table.begin(), table.end(), TTCluster());
 }
 
 
@@ -71,7 +76,7 @@ void TranspositionTable::clear() {
 
 const TTEntry* TranspositionTable::probe(const Key key) const {
 
-  TTEntry* tte = first_entry(key);
+  TTEntry* tte = const_cast<TTEntry*>(first_entry(key));
Here I removed the manual memory allocation with calloc() in favor of the std::vector reserve() member function, and the std::memset() in favor of std::fill(). Note that because Stockfish compiles without exceptions, I had to comment out the try/catch clause around the reserve(). I also dropped the manual memory-alignment trick. It is possible to restore this by using a special allocator, e.g. one from Boost.Align, but I haven't tried it. Note also that I put in an ugly const_cast<TTEntry*> to alert innocent viewers to the fact that probe() is a const member function, yet it modifies the generation inside TT entries.
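For the record, that Boost.Align variant would presumably look something like this (untested; the TTCluster and CACHE_LINE_SIZE definitions below are placeholders, not Stockfish's):

Code: Select all

#include <boost/align/aligned_allocator.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder definitions so the snippet stands alone.
const std::size_t CACHE_LINE_SIZE = 64;
struct TTCluster { std::uint8_t entry[64]; };

// The vector's storage is now cache-line aligned, replacing the manual
// pointer-rounding trick, with no other change to the surrounding code.
std::vector<TTCluster,
            boost::alignment::aligned_allocator<TTCluster, CACHE_LINE_SIZE>> table;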

OK, payment time: what's the penalty for using std::vector over raw memory:

Code: Select all

Total time (ms) : 4438
Nodes searched  : 7634225
Nodes/second    : 1720194
That's a 1.3% penalty for not worrying about memory allocation anymore. Not quite the zero overhead that I imagined, but close enough for my taste. Perhaps all that cache alignment is good for something. In any case, I'll take that to go, please ;-)