Hyperbola Quiesscene: hardly any improvement

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Hyperbola Quiesscene: hardly any improvement

Post by Karlo Bala »

Gerd Isenberg wrote:
trojanfoe wrote:I have undone the 'save attack masks for use in the move generator' and the performance has gone up (by 50% on 32-bit Windows, and 25% on 64-bit Linux - I am not able to test 64-bit Windows at the moment). However I have conditionally compiled Kogge-Stone and HQ and using KS I get slightly better performance? I think I must have other problems I need to address....

It has surprised me however how 'caching' the attack masks causes a performance drop.
You do Kogge-Stone piece wise and four (bishop, rook) or eight (queen) ray-directions per piece in 32-bit mode and it is faster than HQ!? That is amazing - I can't believe ;-)

Possibly the kogge-stone stuff, which however needs a more stores/loads to the processor stack in 32-bit mode due to lack of registers, somehow hides the latency of other memory reads/writes, for instance tt probes, where in HQ that "latency hiding" is not possible due to additional memory references (even if small tables).

Seems your program is rather handicapped by memory latency and stalls - do you have huge position objects on a ply stack and perform copy-make rather than make-unmake?
I wonder how good is this kind of attack generation?

Code: Select all


Single:

BB	UpAttacks(SQ sq, BB block)
{
	BB upm =	upmask[sq];
	block	&=	upm;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;
	return	&#40;block^upm&#41;<<8;
&#125;

Parallel&#58;

BB UpAttacks&#40;BB gen, BB block&#41;
&#123;
	gen		|=	gen<<8;
	gen		|=	gen<<16;
	gen		|=	gen<<32;

	block	&=	gen;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;

	return	&#40;block^gen&#41;<<8;
&#125;
P.S.
I'm not sure that it is correct :oops:
Best Regards,
Karlo Balla Jr.
trojanfoe

Re: Hyperbola Quiesscene: hardly any improvement

Post by trojanfoe »

Karlo Bala wrote:I wonder how good is this kind of attack generation?

Code: Select all


Single&#58;

BB	UpAttacks&#40;SQ sq, BB block&#41;
&#123;
	BB upm =	upmask&#91;sq&#93;;
	block	&=	upm;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;
	return	&#40;block^upm&#41;<<8;
&#125;

Parallel&#58;

BB UpAttacks&#40;BB gen, BB block&#41;
&#123;
	gen		|=	gen<<8;
	gen		|=	gen<<16;
	gen		|=	gen<<32;

	block	&=	gen;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;

	return	&#40;block^gen&#41;<<8;
&#125;
The parallel function looks very similar to the 'occluded fill up' function used in Kogge-Stone.
Aleks Peshkov
Posts: 892
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia

Re: Hyperbola Quiesscene: hardly any improvement

Post by Aleks Peshkov »

Tord Romstad wrote:In this minimalistic benchmark, magic bitboards are much faster, even in 64-bit mode. The magic mulitiplication program took 8.048 seconds, while the hyperbolic quintessence program took 14.448 seconds.
Magic will be significantly faster even for perft tests, but it is not necessary same for real performance in games.

Magic rooks tables fit nicely in 2-level CPU cache, but magic pops out all other stuff that chess program would like to have here (PST, Zobrist keys, recent TT-entries).
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Hyperbola Quiesscene: hardly any improvement

Post by Gerd Isenberg »

Karlo Bala wrote: I wonder how good is this kind of attack generation?

Code: Select all


Single&#58;

BB	UpAttacks&#40;SQ sq, BB block&#41;
&#123;
	BB upm =	upmask&#91;sq&#93;;
	block	&=	upm;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;
	return	&#40;block^upm&#41;<<8;
&#125;

Parallel&#58;

BB UpAttacks&#40;BB gen, BB block&#41;
&#123;
	gen		|=	gen<<8;
	gen		|=	gen<<16;
	gen		|=	gen<<32;

	block	&=	gen;
	block	|=	block<<8;	
	block	|=	block<<16;	
	block	|=	block<<32;

	return	&#40;block^gen&#41;<<8;
&#125;
P.S.
I'm not sure that it is correct :oops:
Interesting idea to fill up the blockers parallel prefix wise, looks like a mixture of Classical Approach and Kogge-Stone. If your code is correct, Kogge-Stone seems slightly superiour, if I count instructions for the parallel version.

Code: Select all

BB UpAttacks&#40;BB gen, BB block&#41;
&#123;
   gen      |=   gen<<8;    //  2
   gen      |=   gen<<16;   //  2
   gen      |=   gen<<32;   //  2

   block   &=   gen;        //  1
   block   |=   block<<8;   //  2
   block   |=   block<<16;  //  2
   block   |=   block<<32;  //  2
   return   &#40;block^gen&#41;<<8; //  2
&#125;                           // 15

BB UpAttacks&#40;BB gen, BB block&#41;
&#123;
   gen   |= block & &#40;gen <<  8&#41;;  //  3
   block &=       &#40;block <<  8&#41;;  //  2
   gen   |= block & &#40;gen << 16&#41;;  //  3
   block &=       &#40;block << 16&#41;;  //  2
   gen   |= block & &#40;gen << 32&#41;;  //  3
   return gen << 8;               //  1
&#125;                                 // 14
For the Single version - one has to consider that other approaches covers two (HQ) or even four (magics) ray-directions in one run...

Gerd
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: Hyperbola Quiesscene: hardly any improvement

Post by tvrzsky »

Tord Romstad wrote:I just hacked up two simple benchmark programs, one for magic bitboards and one for hyperbolic quintessence. Both programs just generate millions of random occupied squares bitboards, loop through all 64 squares for each of the bitboards, and generates queen attacks from all of them. These queen attacks are added together and printed to the standard output when the program exits. It also displays the total computation time, in milliseconds.

In this minimalistic benchmark, magic bitboards are much faster, even in 64-bit mode. The magic mulitiplication program took 8.048 seconds, while the hyperbolic quintessence program took 14.448 seconds.

The source code for the two programs can be downloaded here and here. For use on Windows systems, you probably have to replace the get_system_time() function with whatever function is used to get the time in milliseconds in Windows.

As you can see from the source code, magic bitboards are not really that complicated. The whole source code for magic bitboard attack generation, including initialization code and the huge constant tables, is less than 150 lines of code, and does not require any inline assembly language. As a comparison, the HQ attack generation code (again, including initialization) is about 100 lines of code, without any big tables, but with an ugly little inline assembly language function.

Tord
Hi, is here anybody who tried Tord's benchmark programs in 32 bits? I did and my results were so much slower (for magic bitboards 50 s with binary compiled by MS Visual C++ 2005 Express Edition and 62 s with binary created by gcc 4.02, both with best optimizations to my knowledge) that I wonder what went wrong. My CPU is Intel Core2Duo T5500@1.66GHz.