New Engine announcement: Pygmalion

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 3:00 pm

Hello everybody

since this is my first post on this forum, allow me to introduce myself quickly: My Name is Roland Tomasi, I am working as freelance software developer/architect since approx. 30 years in the field of AI and OCR. I have a strong background in mathematics and theoretical physics. Every now and then I involve myself in "small" programming projects that are totally unrelated to my work, I do that for the fun it brings, but also to keep my programming skill from rusting (which quickly happens if you only work in a narrow segment of the art for a while). I call these projects "Fingerübungen", an analogy to the etudes piano players do to keep their skills up.

The chess engine I am announcing here started exactly as such a project. I am hard pressed to tell you guys the exact year that I started working on it, but it must be somewhere around 2013. Indeed after a few days of coding I had a program that played correct chess, but was about the weakest opponent one can imagine. It was a totally naive implementation written with C#. A few google searches made me discover CPW, which is imho one of the most astounding wikis on the net: so much brain-candy, so many ideas. I became addicted to chess-programming.

Over the last years the "engine" was rewritten several times and ported to C++. I even played around with an OpenCL version in the hopes of harnessing the power of modern GPUs, but (as you may already guess) failed miserably at efficiently parallelizing the search in a way that fits these vector-based architectures of the GPUs. About two years ago I had a version that was playing reasonably strong (it consistently beat the FairyMax version bundled with Winboard at the time), but it was buggy, full of search instabilities and generally not anything I would show around without being ashamed of my work.

A few months ago I started reading Scott Meyers "Effective Modern C++" (highly recommended read if your interested in C++17) and decided that rewriting the engine from scratch with the concepts of that book in mind. This time it is intended to be the real deal and I plan to go public with that engine soon, the work is not finished yet, however. The design goals are the following:

- A library allowing to implement other boardgames (variants, checkers, etc.) with different boardsizes and more than two players from this project. I have never really tried anything other than 2 players yet, so in that direction, aswell as when it comes to variant move generation a lot of work still remains to be done. As proof of concept I have a working tic-tac-toe engine, which is quite nice, since that game is totally solvable (always a draw if you play correctly) and with a 3x3 board and only one piece type and only one move type probably the most minimal boardgame imaginable. I probably own the most overengineered tic-tac-toe engine in the world now

- A chess engine that makes use of the power that C++ gives to the programmer: template-heavy, lot's of constexpr, noexcept, C++17 move semantics, etc. I have to admit, that since I try to not allocate on the heap whenever possible, there aren't so many instances where move semantics are helpful.

- Totally portable code. A cmake project, that compiles against any C++17 capable toolchain and strictly adheres to the C++17 standard. It should not use any third-party libraries, only the standard library included with C++17. I may however, at some point include tablebases lookups, and do not intend to reinvent the wheel there, so at least for that I will probably use some sort of third-party library. Still the goal is that it should be able to compile on any platform with a C++17 capable compiler.

- Achieve a strength of more than 2500 ELO.

The engine's name will be Pygmalion, in reference to to greek mythology (https://en.wikipedia.org/wiki/Pygmalion_(mythology)), because at some point my girlfriend started to complain, that I love the engine more than her, which reminded me of that greek legend.

I do plan to make my source-code public, once it reaches a mature-enough state, however I have not decided on the licensing model yet. Open source is still a new world for me, and I have to sift through the intricacies of the different licensing models before being able to make a decision.

If you are interested in the current development build, I am happy to share executables. Contact me via PM in that case. What I would need to know then however, is for which platform (Windows/Linux?) and for which processor (x64? SSE? AVX?, etc) I will have to build it. Bear in mind: what you will get is a development snap-shot. The current state of the engine is very much comparable to a formula 1 car in the box, whith lot's of components taken out for inspection. It will not play any form of strong chess in its current state

I realize that this post already is quite long. I plan to give more detailed descriptions of the individual engine components in this thread, but I guess it will be one post per component, and I don't want to give any shedule on how fast all these posts will come.

Anyways, glad to be here on the forum. I have been lurking here for quite some while now, and many of you guys have become quite known to me, although that is obviously only one-way atm.

[EDIT: typos]

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 3:01 pm

Oh, and sorry for pestering the admins with the registration.

maksimKorzh · Post by **maksimKorzh** » Thu Sep 02, 2021 5:42 pm

Hi Roland!

First of all welcome to the community. It's always a pleasure to have newcomers especially with solid backgrounds.
I'm running a chess programming related channel which you might be potentially interested in.
Here's one the most popular series I've released so far:

It's a live coding of a extremely didactic chess engine in C.

Also I have many other chess programming related tutorials that can be found here:
https://www.youtube.com/channel/UCB9-pr ... /playlists

Hope it helps, waiting for Pygmalion release!

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 5:48 pm

maksimKorzh wrote: ↑Thu Sep 02, 2021 5:42 pm I'm running a chess programming related channel which you might be potentially interested in.
Here's one the most popular series I've released so far:

It's a live coding of a extremely didactic chess engine in C.

Also I have many other chess programming related tutorials that can be found here:
https://www.youtube.com/channel/UCB9-pr ... /playlists

Hope it helps, waiting for Pygmalion release!

Thank you for providing the links and for the time+effort that you invest in creating such a channel and the tutorials. I will definitely have a look!

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 9:28 pm

pygmalion-intrinsics

This is the first module I will describe here. It is required by all other modules. It's purpose is twofold:

Determine the target platforms properties and enable/disable platform specific intrinsics accordingly. Basically stuff like popcnt, bitscans, etc are abstracted for simplified use.
Provide templates for "basic types" that will be used by the other modules. I think the most noteworthy of these is:
Code: Select all
```
template<size_t COUNT_BITS, bool IS_COMPACT, typename>
	class uint_t;
```
This is a generalization of std::uint8_t, std::uint16_t, ... family of types - it represents an unsigned integer of arbitrary bit length. The COUNT_BITS argument - as the name suggests - is the number of bits the unsigned integer should have. The IS_COMPACT argument determines if performance should be traded for size. Since the engine is bitboard based, arbitrary board sizes will require unsigned integers with different bit counts to represent such a bitboard. For chess uint_t<64,...> is the right choice. For tic-tac-toe, which has a board with 9 squares, it is uint_t<9,...>, for example. It provides all the methods one would expect from an unsigned integer, like arithmetic operations, comparisions, shifts, etc. On top of that it provides methods to perform the stuff you typically need in a bitboard based engine: popcnt, bitscans, PDEP/PEXT, etc. These will - if possible - be optimized using intrinsics available on the target platform.

Provide abstraction classes for things that the engine will need:

Code: Select all

	template<typename VALUE, typename SCORE>
	class sort;

This is what would be called a static class in C#: it only contains static methods. The purpose of this class is to provide a fast sorting algorithm for arbitrary types of scores and values. It will order the values in a list by descending score (highest first). I am aware that the STL provides sorting algorithms, but they are not suitable here: I need an algorithm that is especially performant for small lists (order of magnitude 10 elements) at different locations in my search.

For lengths up to 32 it will use parallel sorting networks (which afaik outperform almost any other algorithm for these sizes). The catch here is, that these sorting networks can perform multiple compare and exchange operations in parallel, which can (if the platform allows for it) be executed truly parallel in a SIMD-wise fashion. Here is an example of the n=32 network:

Code: Select all

constexpr static void sort_N32(VALUE* pValues, SCORE* pScores) noexcept
{
	comparator<0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23, 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31>(pValues, pScores);
	comparator<0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31>(pValues, pScores);
	comparator<8, 16, 9, 17, 10, 18, 11, 19, 12, 20, 13, 21, 14, 22, 15, 23, 0, 4, 1, 5, 2, 6, 3, 7, 24, 28, 25, 29, 26, 30, 27, 31>(pValues, pScores);
	comparator<8, 12, 9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 0, 2, 1, 3, 28, 30, 29, 31>(pValues, pScores);
	comparator<4, 16, 5, 17, 6, 18, 7, 19, 12, 24, 13, 25, 14, 26, 15, 27, 0, 1, 30, 31>(pValues, pScores);
	comparator<4, 8, 5, 9, 6, 10, 7, 11, 12, 16, 13, 17, 14, 18, 15, 19, 20, 24, 21, 25, 22, 26, 23, 27>(pValues, pScores);
	comparator<4, 6, 5, 7, 8, 10, 9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27>(pValues, pScores);
	comparator<2, 16, 3, 17, 6, 20, 7, 21, 10, 24, 11, 25, 14, 28, 15, 29>(pValues, pScores);
	comparator<2, 8, 3, 9, 6, 12, 7, 13, 10, 16, 11, 17, 14, 20, 15, 21, 18, 24, 19, 25, 22, 28, 23, 29>(pValues, pScores);
	comparator<2, 4, 3, 5, 6, 8, 7, 9, 10, 12, 11, 13, 14, 16, 15, 17, 18, 20, 19, 21, 22, 24, 23, 25, 26, 28, 27, 29>(pValues, pScores);
	comparator<2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29>(pValues, pScores);
	comparator<1, 16, 3, 18, 5, 20, 7, 22, 9, 24, 11, 26, 13, 28, 15, 30>(pValues, pScores);
	comparator<1, 8, 3, 10, 5, 12, 7, 14, 9, 16, 11, 18, 13, 20, 15, 22, 17, 24, 19, 26, 21, 28, 23, 30>(pValues, pScores);
	comparator<1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15, 18, 17, 20, 19, 22, 21, 24, 23, 26, 25, 28, 27, 30>(pValues, pScores);
	comparator<1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30>(pValues, pScores);
}

An example for an AVX2 version of such a comparator (here for 8 parallel elements) looks like this (in this case for 32bits scores and 16bit values):

Code: Select all

constexpr static void vector8_AVX2(VALUE* pValues, SCORE* pScores) noexcept
{
	const __m128i score1low{ _mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_undefined_si128(), pScores[IDX1a],0), pScores[IDX1b], 1), pScores[IDX1c], 2), pScores[IDX1d], 3) };
	const __m128i score2low{ _mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_undefined_si128(), pScores[IDX2a],0), pScores[IDX2b], 1), pScores[IDX2c], 2), pScores[IDX2d], 3) };
	const __m128i score1high{ _mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_undefined_si128(), pScores[IDX1e],0), pScores[IDX1f], 1), pScores[IDX1g], 2), pScores[IDX1h], 3) };
	const __m128i score2high{ _mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_insert_epi32(_mm_undefined_si128(), pScores[IDX2e],0), pScores[IDX2f], 1), pScores[IDX2g], 2), pScores[IDX2h], 3) };
	const __m256i score1{ _mm256_inserti128_si256(_mm256_castsi128_si256(score1low),score1high, 1) };
	const __m256i score2{ _mm256_inserti128_si256(_mm256_castsi128_si256(score2low),score2high, 1) };
	const __m128i value1low{ _mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_undefined_si128(), pValues[IDX1a],0), pValues[IDX1b], 1), pValues[IDX1c], 6), pValues[IDX1d], 7) };
	const __m128i value1high{ _mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_undefined_si128(), pValues[IDX1e],0), pValues[IDX1f], 1), pValues[IDX1g], 6), pValues[IDX1h], 7) };
	const __m256i value1{ _mm256_inserti128_si256(_mm256_castsi128_si256(value1low),value1high, 1) };
	const __m128i value2low{ _mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_undefined_si128(), pValues[IDX2a],0), pValues[IDX2b], 1), pValues[IDX2c], 6), pValues[IDX2d], 7) };
	const __m128i value2high{ _mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_insert_epi16(_mm_undefined_si128(), pValues[IDX2e],0), pValues[IDX2f], 1), pValues[IDX2g], 6), pValues[IDX2h], 7) };
	const __m256i value2{ _mm256_inserti128_si256(_mm256_castsi128_si256(value2low),value2high, 1) };
	const __m256i comparision{ _mm256_cmpgt_epi32(score2, score1) };
	const __m256i maskValue{ _mm256_shufflehi_epi16(_mm256_shufflelo_epi16(comparision, 0b00001000), 0b00001000) };
	const __m256i deltaScore{ _mm256_and_si256(_mm256_xor_si256(score1, score2), comparision) };
	const __m256i resultScore1{ _mm256_xor_si256(score1, deltaScore) };
	const __m256i resultScore2{ _mm256_xor_si256(score2, deltaScore) };
	const __m256i deltaValue{ _mm256_and_si256(_mm256_xor_si256(value1, value2), maskValue) };
	const __m256i resultValue1{ _mm256_xor_si256(value1, deltaValue) };
	const __m256i resultValue2{ _mm256_xor_si256(value2, deltaValue) };
	const __m128i resultScore1low{ _mm256_extracti128_si256(resultScore1, 0) };
	const __m128i resultScore1high{ _mm256_extracti128_si256(resultScore1, 1) };
	const __m128i resultScore2low{ _mm256_extracti128_si256(resultScore2, 0) };
	const __m128i resultScore2high{ _mm256_extracti128_si256(resultScore2, 1) };
	const __m128i resultValue1low{ _mm256_extracti128_si256(resultValue1, 0) };
	const __m128i resultValue1high{ _mm256_extracti128_si256(resultValue1, 1) };
	const __m128i resultValue2low{ _mm256_extracti128_si256(resultValue2, 0) };
	const __m128i resultValue2high{ _mm256_extracti128_si256(resultValue2, 1) };
	pScores[IDX1a] = _mm_extract_epi32(resultScore1low, 0);
	pScores[IDX1b] = _mm_extract_epi32(resultScore1low, 1);
	pScores[IDX1c] = _mm_extract_epi32(resultScore1low, 2);
	pScores[IDX1d] = _mm_extract_epi32(resultScore1low, 3);
	pScores[IDX1e] = _mm_extract_epi32(resultScore1high, 0);
	pScores[IDX1f] = _mm_extract_epi32(resultScore1high, 1);
	pScores[IDX1g] = _mm_extract_epi32(resultScore1high, 2);
	pScores[IDX1h] = _mm_extract_epi32(resultScore1high, 3);
	pScores[IDX2a] = _mm_extract_epi32(resultScore2low, 0);
	pScores[IDX2b] = _mm_extract_epi32(resultScore2low, 1);
	pScores[IDX2c] = _mm_extract_epi32(resultScore2low, 2);
	pScores[IDX2d] = _mm_extract_epi32(resultScore2low, 3);
	pScores[IDX2e] = _mm_extract_epi32(resultScore2high, 0);
	pScores[IDX2f] = _mm_extract_epi32(resultScore2high, 1);
	pScores[IDX2g] = _mm_extract_epi32(resultScore2high, 2);
	pScores[IDX2h] = _mm_extract_epi32(resultScore2high, 3);
	pValues[IDX1a] = _mm_extract_epi16(resultValue1low, 0);
	pValues[IDX1b] = _mm_extract_epi16(resultValue1low, 1);
	pValues[IDX1c] = _mm_extract_epi16(resultValue1low, 6);
	pValues[IDX1d] = _mm_extract_epi16(resultValue1low, 7);
	pValues[IDX1e] = _mm_extract_epi16(resultValue1high, 0);
	pValues[IDX1f] = _mm_extract_epi16(resultValue1high, 1);
	pValues[IDX1g] = _mm_extract_epi16(resultValue1high, 6);
	pValues[IDX1h] = _mm_extract_epi16(resultValue1high, 7);
	pValues[IDX2a] = _mm_extract_epi16(resultValue2low, 0);
	pValues[IDX2b] = _mm_extract_epi16(resultValue2low, 1);
	pValues[IDX2c] = _mm_extract_epi16(resultValue2low, 6);
	pValues[IDX2d] = _mm_extract_epi16(resultValue2low, 7);
	pValues[IDX2e] = _mm_extract_epi16(resultValue2high, 0);
	pValues[IDX2f] = _mm_extract_epi16(resultValue2high, 1);
	pValues[IDX2g] = _mm_extract_epi16(resultValue2high, 6);
	pValues[IDX2h] = _mm_extract_epi16(resultValue2high, 7);
}

Feel free to point out flaws in my AVX2 programming (I am not terribly skilled at that).

For larger length the sorting algorithm will default to an implementation of quicksort, however once one leg of the quicksort partition has not more than 32 elements it will then sort that tail using the respective sorting networks.

Code: Select all
```
template<size_t COUNT_MAXPATTERNBITS, size_t COUNT_BITS, typename VALUE, typename INFO, typename INSTANCE>
	class magictable;
```
This is an abstract class/template implementing a magic lookup table. The INSTANCE argument will be the derived class, such that "virtual" methods can be implemented without the performance hit you would usually take from the vtable (see https://en.wikipedia.org/wiki/Barton%E2 ... kman_trick for details).

It will implement the lookup either with PDEP/PEXT like intrinsics or using magic multiplication (it configures itself according to what it thinks is best suited for the target platform).
An exhaustive testing framework, that automatically tests all the types/classes on large random sets and performs sanity checks. (Running all tests takes approx. 2 days on my machine). This helped a lot with debugging, made me find bugs I would otherwise not have detected so easily. These classes have to be 100% trustworthy, because everything else is build on them. Testing is most crucial here, I think.

[Edit: typos+formatting]

Mergi · Post by **Mergi** » Thu Sep 02, 2021 10:14 pm

Very exciting! Since I've started rewriting my own engine recently in C++, I've become a big fan of modern C++, and i'm trying to implement and learn as much as i can. Starting with vectorized sorting algorithms - while i'm a bit skeptical about how useful that will be - ... very ambitious indeed!

I'm honestly very much looking forward to the future of this project.

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 10:18 pm

Mergi wrote: ↑Thu Sep 02, 2021 10:14 pm Very exciting! Since I've started rewriting my own engine recently in C++, I've become a big fan of modern C++, and i'm trying to implement and learn as much as i can. Starting with vectorized sorting algorithms - while i'm a bit skeptical about how useful that will be - ... very ambitious indeed! I'm honestly very much looking forward to the future of this project.

Thank you very much for your feedback. In my defense I have to say that I did not exactly start out with vectorized sorting algorithms. At the moment the "engine" is at a state that already plays chess. So far it has 13 modules. The one I described above is only the first that I describe here, because it is fundamental to the rest of them. Obviously I did not start with the vectorized sorting, but with more basic stuff.

And yes - it is an ambitious project

R. Tomasi · Post by **R. Tomasi** » Thu Sep 02, 2021 10:23 pm

The use case for that sorting algorithms is mostly during move ordering, of course. However, since I use a staged move generator, I also sort the stages according to a kind of history heuristic. Also since it uses an IID framework, I do not regenerate moves that I already generated on a previous depth. These have to be re-sorted in the new depth iteration, though. Typically (I did some statistics on this) we're talking about 10-30 moves here. I am aware that most people use some form of selection sort here, but to my amazement (I really did not expect that), the sorting networks beat the selection sort on average. Which is kind of a mystery to me - I would have expected the opposite, and may be true only for my particular case.

R. Tomasi · Post by **R. Tomasi** » Fri Sep 03, 2021 1:00 pm

pygmalion-state

This module is intended to provide everything required to describe the state of a boardgame, which is essentially the board sate. It is configured using the following template as description:

Code: Select all

template<size_t COUNT_PIECES, size_t COUNT_RANKS, size_t COUNT_FILES, size_t COUNT_FLAGS, size_t COUNT_HASHBITS, typename CUMULATION>
	class descriptor_state;

Most of the arguments are self-explaining, I guess. Many types that are required, like playerType, pieceType, squareType, etc. will be defined accordingly. There will be a COUNT_PLAYERS argument, too. For the moment I have removed that and always set it to two players, since the search function will need some minor adaptations to support games that are not for two players.

The CUMULATION argument can be any type - it will be included to the board state as a member variable and is intended to give the option of extending the board state with stuff that needs to be incrementally updated, but is specific to a certain game. For chess that may be the 50-moves counter and a material score, for example. In the case of tic-tac-toe it is not needed, I therefore just use char in that case (the compilers complain when I use void, because you cannot properly initialize a void member).

And of course the actual template/class that represents the board state will be exported by this module:

Code: Select all

template<typename DESCRIPTOR_STATE, typename INSTANCE>
	class board;

It will provide "virtual" methods that one can override in order to incrementally update the game specific stuff in the cumulation member. (i.e. onClear, onAddedPiece, onSetMovingPlayer, onRemovedPiece, onSetFlag, onClearedFlag, onInitialize). These are again implmented using the Barton-Nockham trick - compilers will be able to efficiently inline/optimize those.

R. Tomasi · Post by **R. Tomasi** » Fri Sep 03, 2021 1:53 pm

Barton–Nackman trick.... apologies

New Engine announcement: Pygmalion

New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion