A new chess engine : m8 (comming not so soon)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A new chess engine : m8 (comming not so soon)

Post by mar »

lucasart wrote: This is remarkable for an algorithm that is so simple. Considering that most people have 4 cores or less, it seems like an ideal tradeoff. Even 4 to 8 scaling is not too bad.

One technicality I don't understand: 1 thread runs depth d, while n-1 run d+1. Seems easy enough, but... how do you manage aspiration windows? Or does running depth d include the aspiration loop?
Actually, every other thread runs depth d+1.
So you have 1 "master" and n-1 helpers (or slaves, if you want).
The master starts at d, the first helper at d+1, the second helper at d again, and so on.

Actually I don't do anything special; I still have the same aspiration loop.
Helpers are simply started each iteration within the aspiration loop, at the root.
Each time a helper finishes (it can happen that a helper finishes before the master), all helpers (plus the master) are aborted and I simply return the score (and best move and PV, if available) from the thread that finished first.
Each thread has a local eval cache/pawn hash (in my case this was quite simple because I already had the search wrapped in a class, including local caches).
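In rough code, the depth staggering looks like this (just a sketch with made-up names, not the actual cheng code; search_root stands in for the engine's root search):

Code: Select all

#include <thread>
#include <vector>

void search_root(int depth) { (void)depth; /* engine's root search goes here */ }

void lazy_smp_iteration(int d, int numHelpers)
{
    std::vector<std::thread> helpers;
    for (int i = 0; i < numHelpers; ++i) {
        // first helper at d+1, second at d again, and so on
        int helperDepth = (i % 2 == 0) ? d + 1 : d;
        helpers.emplace_back(search_root, helperDepth);
    }
    search_root(d);  // the master itself searches at depth d
    for (auto &t : helpers)
        t.join();    // in the real scheme, the first thread to finish aborts the rest
}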
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A new chess engine : m8 (comming not so soon)

Post by mar »

Actually it's even simpler than that, because there is no "master thread" - the master is simply the thread that runs iterative deepening, the aspiration loop and the root search, just like in single-core mode.
Each time a helper finishes its root iteration, it signals the master search via a volatile flag; the master then stops all helpers and collects the data.
I will probably write a thread about how I do lazy SMP in cheng in detail, because it may be useful to others.
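The signaling itself is tiny - roughly this (sketch only; I use a volatile flag, though std::atomic, as discussed below, is the portable way to spell it):

Code: Select all

#include <atomic>

std::atomic<bool> helperFinished{false};

// helper, after completing its root iteration:
void on_helper_done()
{
    helperFinished.store(true, std::memory_order_relaxed);
}

// master, polled from inside the search:
bool should_abort()
{
    return helperFinished.load(std::memory_order_relaxed);
}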
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: A new chess engine : m8 (comming not so soon)

Post by Ferdy »

mathmoi wrote:
Ferdy wrote:Perhaps you may try a chess variant, say Capablanca chess: 8 rows x 10 columns. Two pieces are added to each side: one is a combination of knight and rook, the other a combination of knight and bishop. It is fun and interesting to explore the possibilities of these two new pieces on the expanded board.
I'd like to implement other variants besides FRC, but m8 will be a bitboard engine and I don't think there is an easy way to implement an 8 x 10 variant as a bitboard engine. Is there?
Gothic Vortex by Ed Trice (8x10) is a bitboard engine.

http://talkchess.com/forum/viewtopic.ph ... thic+chess

Have a look also at Spartan chess (8x8): different armies on the two sides, but seemingly equal; very interesting play.

http://www.spartanchessonline.com/vrules.html
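On the 10x8 bitboard question: nothing forces a bitboard to be a single 64-bit word. A two-word bitset covers 80 squares; a purely illustrative sketch (not how Vortex actually does it):

Code: Select all

#include <cstdint>

struct Bitboard80 {
    uint64_t lo = 0, hi = 0;  // squares 0-63 in lo, 64-79 in hi

    void set(int sq)        { (sq < 64 ? lo : hi) |= 1ULL << (sq & 63); }
    bool test(int sq) const { return ((sq < 64 ? lo : hi) >> (sq & 63)) & 1; }
};

The real cost shows up in shift-based attack generation, which has to carry bits across the word boundary.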
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: A new chess engine : m8 (comming not so soon)

Post by matthewlai »

mar wrote:Actually it's even simpler than that, because there is no "master thread" - the master is simply the thread that runs iterative deepening, the aspiration loop and the root search, just like in single-core mode.
Each time a helper finishes its root iteration, it signals the master search via a volatile flag; the master then stops all helpers and collects the data.
I will probably write a thread about how I do lazy SMP in cheng in detail, because it may be useful to others.
I'd just like to point out that using volatile for synchronization almost always results in undefined behaviour. There are multiple reasons for this.

http://stackoverflow.com/questions/2484 ... rogramming

A better option is to use a memory barrier if it's performance-critical, and mutex if not.
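For the non-performance-critical case, the mutex version is only a few lines (sketch):

Code: Select all

#include <mutex>

std::mutex stopMutex;
bool stopFlag = false;  // only ever touched under stopMutex

void signal_stop()
{
    std::lock_guard<std::mutex> lock(stopMutex);
    stopFlag = true;
}

bool check_stop()
{
    std::lock_guard<std::mutex> lock(stopMutex);
    return stopFlag;
}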
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
User avatar
SMIRF
Posts: 91
Joined: Wed Mar 26, 2014 4:29 pm
Location: Buettelborn/Hessen/Germany

Re: A new chess engine : m8 (comming not so soon)

Post by SMIRF »

Every argument for supporting 10x8 and random chess variants is very welcome.
Currently I am (slowly) developing a completely new such engine myself.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A new chess engine : m8 (comming not so soon)

Post by mar »

matthewlai wrote:I'd just like to point out that using volatile for synchronization almost always results in undefined behaviour. There are multiple reasons for this.

http://stackoverflow.com/questions/2484 ... rogramming

A better option is to use a memory barrier if it's performance-critical, and mutex if not.
Yes you're right.

I assume these things hold:
1) synchronization primitives (such as mutexes) act as fences (otherwise I don't see how they could work at all)
2) volatile only prevents the compiler from caching the flag in a register (I'm 100% sure it would work even if I removed volatile) - this may not hold for the Microsoft compiler, where volatile acts as a fence, from what I understood
3) memory is visible between cores sooner or later (L1 is per core) (otherwise I don't see how you could reschedule threads to run on different cores)
4) on ARM I assume that caches will be flushed sooner or later (I assume sooner, because of TT probes)

So I think I should be perfectly fine.

The problem is that I'm actively polling for the flag in the search (say once per node) - I don't care that much whether it aborts the search now or 100 nodes later. Atomics have measurable overhead (I assume the same applies to fences).
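For reference, the per-node poll looks roughly like this (sketch, written with std::atomic here; abortFlag is a made-up name for the shared stop flag):

Code: Select all

#include <atomic>

extern std::atomic<bool> abortFlag;  // shared stop flag (assumed)

int search(int depth, int alpha, int beta)
{
    if (abortFlag.load(std::memory_order_relaxed))  // polled once per node
        return alpha;  // unwind; the aborted result is discarded anyway
    // ... generate moves, recurse ...
    return alpha;
}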
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: A new chess engine : m8 (comming not so soon)

Post by matthewlai »

mar wrote:
matthewlai wrote:I'd just like to point out that using volatile for synchronization almost always results in undefined behaviour. There are multiple reasons for this.

http://stackoverflow.com/questions/2484 ... rogramming

A better option is to use a memory barrier if it's performance-critical, and mutex if not.
Yes you're right.

I assume these things hold:
1) synchronization primitives (such as mutexes) act as fences (otherwise I don't see how they could work at all)
2) volatile only prevents the compiler from caching the flag in a register (I'm 100% sure it would work even if I removed volatile) - this may not hold for the Microsoft compiler, where volatile acts as a fence, from what I understood
3) memory is visible between cores sooner or later (L1 is per core) (otherwise I don't see how you could reschedule threads to run on different cores)
4) on ARM I assume that caches will be flushed sooner or later (I assume sooner, because of TT probes)

So I think I should be perfectly fine.

The problem is that I'm actively polling for the flag in the search (say once per node) - I don't care that much whether it aborts the search now or 100 nodes later. Atomics have measurable overhead (I assume the same applies to fences).
Synchronization primitives like mutexes do definitely act as fences. Otherwise a lot of code would break.

Volatile is not designed to be a synchronization primitive, though. It's supposed to be used for things like memory-mapped IO. Though it does seem like the Microsoft compiler inserts a fence for volatile accesses, probably because too many programmers assume that. It's probably not the case for other compilers.

It's entirely possible and quite likely that it will work perfectly well for you, but it doesn't change the fact that it's undefined behaviour, and may break with future compilers and/or future architectures, or just combinations that you haven't tested on.

If inserting a fence does have a measurable overhead, and you don't care about stopping 100 nodes later, why not just check the shared variable every 100 nodes? That would reduce the overhead by a factor of 100, and your code would no longer rely on undefined behaviour. It does seem odd that a fence would have measurable overhead, though: it should just disable instruction reordering for a few instructions, so the cost shouldn't be more than a few instructions. A search node should take hundreds of instructions if you do move generation (and eval, in the case of q-search).
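Something along these lines (sketch; names are made up):

Code: Select all

#include <atomic>

extern std::atomic<bool> abortFlag;  // shared stop flag (assumed)

int nodesSinceCheck = 0;  // thread-local in a real engine

bool should_stop()
{
    // Amortize the cost: only touch the shared flag every 100 nodes.
    if (++nodesSinceCheck < 100)
        return false;
    nodesSinceCheck = 0;
    return abortFlag.load(std::memory_order_relaxed);
}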

Regarding atomics, what implementation of atomics are you using? For most implementations, if the access size you want is already atomic on the architecture in question, the compiler will optimize it to a plain variable access. There should be zero overhead at run time.

Only on architectures where accesses of the size you want are not atomic should there be an overhead.
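On x86-64, for example, a relaxed load of a std::atomic<bool> typically compiles to a single plain mov - no lock prefix, no fence (easy to check in a disassembler):

Code: Select all

#include <atomic>

std::atomic<bool> stop{false};

bool poll()
{
    // typically one ordinary load instruction on x86-64
    return stop.load(std::memory_order_relaxed);
}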
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A new chess engine : m8 (comming not so soon)

Post by mar »

matthewlai wrote:Volatile is not designed to be a synchronization primitive, though. It's supposed to be used for things like memory-mapped IO. Though it does seem like the Microsoft compiler inserts a fence for volatile accesses, probably because too many programmers assume that.
Or because it makes sense :) Btw, I'm more concerned about reordering on the CPU (compiler barrier != CPU barrier).
If inserting a fence does have a measurable overhead, and you don't care about stopping 100 nodes later, why not just check the shared variable every 100 nodes?
Good idea; I'll probably need to redesign some things when I'm in the mood to do that.
A search node should take hundreds of instructions if you do move generation (and eval, in the case of q-search).
I haven't measured it, but I bet it's more than hundreds per node (even though the pipeline helps a lot here).
Regarding atomics, what implementation of atomics are you using? For most implementations, if the access size you want is already atomic on the architecture in question, the compiler will optimize it to a plain variable access. There should be zero overhead at run time.

Only on architectures where accesses of the size you want are not atomic should there be an overhead.
I'm talking about RMW atomics, which make more sense to me, like inc/dec and CAS. IIRC the overhead for an increment was 8x on x86 (lock xadd), but I don't remember exactly
(this is one of the reasons why atomic refcounting is slower than non-atomic, assuming you change the counter often).
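What I mean, in code (sketch):

Code: Select all

#include <atomic>

std::atomic<int> refsAtomic{0};
int refsPlain = 0;

void bump()
{
    // RMW: compiles to lock xadd on x86 - a full read-modify-write
    // with the cache line locked, hence the large overhead.
    refsAtomic.fetch_add(1, std::memory_order_relaxed);

    // Plain increment: an ordinary add, no locking.
    ++refsPlain;
}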
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: A new chess engine : m8 (comming not so soon)

Post by matthewlai »

mar wrote:
matthewlai wrote:Volatile is not designed to be a synchronization primitive, though. It's supposed to be used for things like memory-mapped IO. Though it does seem like the Microsoft compiler inserts a fence for volatile accesses, probably because too many programmers assume that.
Or because it makes sense :) Btw, I'm more concerned about reordering on the CPU (compiler barrier != CPU barrier).
Well, the end results of CPU reordering and compiler reordering are the same - your memory accesses are reordered. When you issue a fence, it's the compiler's job to make sure the CPU doesn't reorder as well (usually by issuing a memory fence instruction).

It doesn't make sense for people using volatile as intended - to access memory-mapped IO devices. In those cases, a memory barrier introduces extra latency for no reason. If you need to stream a large buffer of data to a hardware address (through a volatile pointer) and don't have DMA, having each access trigger a memory barrier would be detrimental to performance.
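That intended use looks like this (illustrative only; the register address is made up):

Code: Select all

#include <cstddef>
#include <cstdint>

// Hypothetical memory-mapped transmit register; address is made up.
volatile uint32_t *const TX_REG = reinterpret_cast<volatile uint32_t *>(0x40001000);

void stream(const uint8_t *buf, std::size_t n)
{
    // volatile guarantees each store is actually emitted, in order,
    // which is all the device needs; a fence per store would only add latency.
    for (std::size_t i = 0; i < n; ++i)
        *TX_REG = buf[i];
}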

The standard also doesn't say the compiler needs to do that.

Microsoft probably doesn't care about that because their compiler can't target embedded devices anyway (deeply embedded, not phones etc., which are really more like PCs), but other compilers do.
Regarding atomics, what implementation of atomics are you using? For most implementations, if the access size you want is already atomic on the architecture in question, the compiler will optimize it to a plain variable access. There should be zero overhead at run time.

Only on architectures where accesses of the size you want are not atomic should there be an overhead.
I'm talking about RMW atomics, which make more sense to me, like inc/dec and CAS. IIRC the overhead for an increment was 8x on x86 (lock xadd), but I don't remember exactly
(this is one of the reasons why atomic refcounting is slower than non-atomic, assuming you change the counter often).
But why do you need that for a flag? Assignments aren't RMW, and should be atomic on most/all architectures.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A new chess engine : m8 (comming not so soon)

Post by mar »

matthewlai wrote:Well, the end results of CPU reordering and compiler reordering are the same - your memory accesses are reordered. When you issue a fence, it's the compiler's job to make sure the CPU doesn't reorder as well (usually by issuing a memory fence instruction).
A compiler barrier tells the compiler not to reorder the instructions it outputs (asm volatile("" ::: "memory") in gcc), which is different from memory-barrier instructions that do something similar on the CPU (EDIT: wrt memory accesses - for nitpickers).
I would assume, though, that C++11 barriers are hardware barriers (or nothing on some architectures), implying a compiler barrier automatically.
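Side by side (sketch; gcc/clang syntax for the inline asm):

Code: Select all

#include <atomic>

inline void compiler_barrier()
{
    // gcc/clang: stops the *compiler* from reordering memory accesses
    // across this point; emits no instruction at all.
    asm volatile("" ::: "memory");
}

inline void cpu_fence()
{
    // Stops the *CPU* from reordering as well (mfence on x86);
    // also implies a compiler barrier.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}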
But why do you need that for a flag? Assignments aren't RMW, and should be atomic on most/all architectures.
Huh? Where did I say I need an atomic increment for a flag? I was talking about the overhead of atomics in general.
What do you want from me anyway? :) If you want to imply that my code is crap, say it straight. I can take it.