A little speed-up

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

A little speed-up

Post by stegemma »

Only for those who already doesn't know this kind of tricks (so this is not for Muller ;) ), sometime we can have a little speed-up in codes like this one:

Code: Select all


typedef int16_t tsq;

struct clsSimpleMove
{
	union
	{
		uint64_t data;
		struct
		{
			int16_t src, dst;
			tsq taken, alfavalue;
		};
	};
};

bool clsEngineSimple::PushMove(int src, int dst)
{
	tsq taken = board[dst];
	if ((taken & color) == bit_empty)
	{
#if 1
		// Nodes: 3251309062, Time : 59988 ms, Nodes / s : 54198420
		uint64_t data = (uint16_t)taken;
		data = &#40;data << 16&#41; | &#40;uint16_t&#41;dst;
		data = &#40;data << 16&#41; | &#40;uint16_t&#41;src;
		pLastMove->data = data;
#else
		// Nodes&#58; 3251309062, Time &#58; 62965 ms, Nodes / s &#58; 51635947
		pLastMove->src = src;
		pLastMove->dst = dst;
		pLastMove->taken = taken;
#endif
		++pLastMove;
	&#125;
	return taken != bit_empty;
&#125;
This is just a test of the new generator of Satana, where i've removed all the 64 bit stuffs and get back to a simpler "old-way" approach (it is a perft 7 from starting position, still wrong and not complete). The speed-up si not big, only about 5% on my PC but is better than nothing. Of course it is not very portable because it depends on data alignement and endianess.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: A little speed-up

Post by matthewlai »

stegemma wrote:Only for those who already doesn't know this kind of tricks (so this is not for Muller ;) ), sometime we can have a little speed-up in codes like this one:

Code: Select all


typedef int16_t tsq;

struct clsSimpleMove
&#123;
	union
	&#123;
		uint64_t data;
		struct
		&#123;
			int16_t src, dst;
			tsq taken, alfavalue;
		&#125;;
	&#125;;
&#125;;

bool clsEngineSimple&#58;&#58;PushMove&#40;int src, int dst&#41;
&#123;
	tsq taken = board&#91;dst&#93;;
	if (&#40;taken & color&#41; == bit_empty&#41;
	&#123;
#if 1
		// Nodes&#58; 3251309062, Time &#58; 59988 ms, Nodes / s &#58; 54198420
		uint64_t data = &#40;uint16_t&#41;taken;
		data = &#40;data << 16&#41; | &#40;uint16_t&#41;dst;
		data = &#40;data << 16&#41; | &#40;uint16_t&#41;src;
		pLastMove->data = data;
#else
		// Nodes&#58; 3251309062, Time &#58; 62965 ms, Nodes / s &#58; 51635947
		pLastMove->src = src;
		pLastMove->dst = dst;
		pLastMove->taken = taken;
#endif
		++pLastMove;
	&#125;
	return taken != bit_empty;
&#125;
This is just a test of the new generator of Satana, where i've removed all the 64 bit stuffs and get back to a simpler "old-way" approach (it is a perft 7 from starting position, still wrong and not complete). The speed-up si not big, only about 5% on my PC but is better than nothing. Of course it is not very portable because it depends on data alignement and endianess.
What compiler is that with and what optimization level?

I have found that with GCC at least, it doesn't introduce temporaries sometimes (that can be stored in register) when accessing the same element multiple times. This can be something along the same line.

For example, if you try to add all elements into the first of the array, and you try to do it without a temporary, it would look like this -

Code: Select all

for &#40;i = 1; i < N; ++i&#41;
&#123;
     a&#91;0&#93; += a&#91;i&#93;;
&#125;
It will generate code to do a memory store every time, instead of storing the sum in a register, and just do a memory store at the end. Of course, doing a memory store every time makes it much slower.

This behaviour is correct (and it must be done this way) if a is declared volatile, but it does this even when a is not.

This is true with GCC 4.5 or 4.6 (last time I checked), at -O3.

So I always try to minimize accessing memory through pointer (which includes array access), and always work with explicit temporaries when possible. If it's faster the other way, it's much easier for compiler to optimize it to take out the temporary, and I have no doubt GCC can do it. It just doesn't seem to want to introduce new temporaries.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: A little speed-up

Post by AlvaroBegue »

Code: Select all

      uint64_t data = &#40;uint16_t&#41;taken; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;dst; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;src; 
I would have thought that this would be better (but needs to be tested, of course):

Code: Select all

    uint64_t data = &#40;taken << 32&#41; | &#40;dst << 16&#41; | src;
The reason is that modifying `data' multiple times introduces data dependencies that might remove opportunities for parallel computation.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: A little speed-up

Post by brtzsnr »

stegemma wrote:

Code: Select all


typedef int16_t tsq;

struct clsSimpleMove
&#123;
	union
	&#123;
		uint64_t data;
		struct
		&#123;
			int16_t src, dst;
			tsq taken, alfavalue;
		&#125;;
	&#125;;
&#125;;

It's undefined behavior, don't do it. if you write to a union field and read from another field it is not guaranteed to get the same results. See for example http://stackoverflow.com/a/1812359/88622 . Moreover, the compiler is allowed to implement union using #define union struct.
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: A little speed-up

Post by elcabesa »

it's probably better to use c birfield. it isn't undefined behaviour to create a 16 bit varbiles this way

Code: Select all



   union clsSimpleMove
   &#123;
      uint64_t data;
      struct
      &#123;
         unsigned int src&#58;16;
         unsigned int dst&#58;16;
         unsigned int taken&#58;16;
         unsigned int alfavalue&#58;16;

      &#125;;
   &#125;;


User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: A little speed-up

Post by stegemma »

AlvaroBegue wrote:

Code: Select all

      uint64_t data = &#40;uint16_t&#41;taken; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;dst; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;src; 
I would have thought that this would be better (but needs to be tested, of course):

Code: Select all

    uint64_t data = &#40;taken << 32&#41; | &#40;dst << 16&#41; | src;
The reason is that modifying `data' multiple times introduces data dependencies that might remove opportunities for parallel computation.
I've tried this before the post and it was slower. In fact, i've removed this way of doing because i've get different results in perft count. I suppose because you need type cast to uint64_t before the shift, to avoid loosing part of the bits.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: A little speed-up

Post by stegemma »

elcabesa wrote:it's probably better to use c birfield. it isn't undefined behaviour to create a 16 bit varbiles this way

Code: Select all



   union clsSimpleMove
   &#123;
      uint64_t data;
      struct
      &#123;
         unsigned int src&#58;16;
         unsigned int dst&#58;16;
         unsigned int taken&#58;16;
         unsigned int alfavalue&#58;16;

      &#125;;
   &#125;;


On my pc this way it is slower:

Nodes: 3251309062, Time: 61901 ms, Nodes/s: 52523489
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: A little speed-up

Post by stegemma »

I'm using VC++ 2013 community edition, with standard optimization for release 64 bit exe (/O2). I've seen only now that there are another flag: /Ot that could optimize better but i've never tried with this.

I've seen me too that using explicit local variable could help the compiler. This is one of the cases when assembly could help a little; in Intel CPU you could assign the low parts of the register and avoid the shift... but that's not portable at all.

[this was an answer to Matthew]
Last edited by stegemma on Sat Mar 28, 2015 11:10 am, edited 1 time in total.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: A little speed-up

Post by stegemma »

brtzsnr wrote:
stegemma wrote:

Code: Select all


typedef int16_t tsq;

struct clsSimpleMove
&#123;
	union
	&#123;
		uint64_t data;
		struct
		&#123;
			int16_t src, dst;
			tsq taken, alfavalue;
		&#125;;
	&#125;;
&#125;;

It's undefined behavior, don't do it. if you write to a union field and read from another field it is not guaranteed to get the same results. See for example http://stackoverflow.com/a/1812359/88622 . Moreover, the compiler is allowed to implement union using #define union struct.
Yes, you're right but you can use a #define to identify the compiler, and use this trick only when is working.

Of course in the "production" version of the engine i will keep the simplest form, without the trick... but this post is just an hint to study this kind of optimizations.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: A little speed-up

Post by AlvaroBegue »

stegemma wrote:
AlvaroBegue wrote:

Code: Select all

      uint64_t data = &#40;uint16_t&#41;taken; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;dst; 
      data = &#40;data << 16&#41; | &#40;uint16_t&#41;src; 
I would have thought that this would be better (but needs to be tested, of course):

Code: Select all

    uint64_t data = &#40;taken << 32&#41; | &#40;dst << 16&#41; | src;
The reason is that modifying `data' multiple times introduces data dependencies that might remove opportunities for parallel computation.
I've tried this before the post and it was slower. In fact, i've removed this way of doing because i've get different results in perft count. I suppose because you need type cast to uint64_t before the shift, to avoid loosing part of the bits.
Yes, I mean doing that with 64-bit types. My code was meant to be an explanation, not actual code.