debugging TT errors

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

debugging TT errors

Post by lucasart »

Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: debugging TT errors

Post by Dann Corbit »

lucasart wrote:Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?
1. What does your transaction table entry look like? Do you store the non-hash address part of the hash as a checksum?
2. Do you have a routine to calculate the entire hash from the board? (I assume you use Zobrist hash like everyone else).
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: debugging TT errors

Post by lucasart »

Dann Corbit wrote: 1. What does your transaction table entry look like? Do you store the non-hash address part of the hash as a checksum?
2. Do you have a routine to calculate the entire hash from the board? (I assume you use Zobrist hash like everyone else).
1. Here is my hash entry type

Code: Select all

enum TTScoreType { ScoreExact, ScoreUbound, ScoreLbound };

struct TTEntry {
	U64 key;
	int score, depth;
	TTScoreType type;
	move_t move;
};
What do you mean by "non hash adress of the hash" ?

2. the hash is calculated dynamically, and stored on the stack. So when I undo a move I do an assert to compare. And that way I'm sure it's never going wrong. I use a very simple Zobrist implementation:

Code: Select all

namespace {	// local declaration

inline U64 rand64();

}	// local declaration

U64 zobrist[NB_COLOR][NB_PIECE][NB_SQUARE],
    zobrist_ep[NB_SQUARE],
    zobrist_castle[16];

void init_zobrist()
{
	static bool done = false;
	if (done)
		return;
	else
		done = true;

	for &#40;int c = White; c <= Black; c++)
		for &#40;int p = Pawn; p <= King; p++)
			for &#40;int sq = A1; sq <= H8; sq++)
				zobrist&#91;c&#93;&#91;p&#93;&#91;sq&#93; = rand64&#40;);

	for &#40;int crights = 0; crights < 16; zobrist_castle&#91;crights++&#93; = rand64&#40;));
	for &#40;int sq = A1; sq <= H8; zobrist_ep&#91;sq++&#93; = rand64&#40;));
&#125;

namespace &#123;	// local implementation

inline U64 rand64&#40;)
&#123;
	static U64 seed = 0ULL;
	return seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
&#125;

&#125;	// local implementation
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: debugging TT errors

Post by bob »

lucasart wrote:Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?
Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: debugging TT errors

Post by Desperado »

Hello Lucas,

well i dont agree with you that this can _only_ be a TT bug, but of course
it can be.
As example: few weeks back i changed my movedata type and got
a similar behavior. it was hard to find out, but finally my moveIsPseudo()
function did not work correctly anymore. And when started to fix the problem, i was not able to reproduce the error in any case !.

i just want to point out that those not _generated_ moves taken from TT or killerT arent the same when you start a search with empty tables or
you have like in games prefilled tables. At the end you cannot produce
the same tree like it was in the game.

- so my first suggestion is to make your searchtree reproducable.
(especially clear _all_ tables killer,history,TT...whatever)

- to verify that the problem is not caused by timeout, you can do
an engine match to fixed depth(5,7,9 or whatever) and wait for
the first obvious blunder. Because you wrote _sometimes_ , it wont
take many games to get 1 or 2 blunders, which you can try to fix.

- nobody knows better your engine than you do. Check out if
there are special types of positions where such blunders occure.
as example: king under pressure(check pins,legality code...).
as example: whenever the position eval is around a certain value
(search issue...)
Depending on the blunder frequency you have, you may think
of your code parts which occur in an equivalent frequency.
(i mean as further example: prevent a check with ep move happens
how often ? , not every fifth game of course...)

- the longer,harder the bug is to find, switch of the parts where you
are _100%_ (not 99% :wink: ! ) sure they are working correctly.
Or exactly the other way around, only enable the basic routines and
include step by step all of your functions until you find the source of
the desaster.

- dont exclude any possibility because you _think_ there is no bug,
and have in mind that it can be several bugs not only one.


i know this is the ugly part of programming, but i fear there is no way
around.

Good luck :)

Michael
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: debugging TT errors

Post by lucasart »

bob wrote: Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...
I use a global variable bool abort_search, and test it before returning a move at the root, or saving anything (in the TTable ot updating the History table). At every node of the search (including QS nodes) I call the following function:

Code: Select all

void poll&#40;)
// increment node count, and abort search if end_clock exceeded
&#123;
	if (!(++node_count & 0xfff&#41; && clock&#40;) >= end_clock&#41;
		abort_search = true;
&#125;
And then at the end of the move look, and before saving any TT or History info, I check the variable abort_search. And my root search also doesn't return a move that hasn't been searched entirely:

Code: Select all

move_t root_loop&#40;Board &B, int time_ms&#41;
&#123;
	move_t result = NoMove;

	start_clock = clock&#40;);
	end_clock = start_clock + time_ms * CLOCKS_PER_SEC / 1000;

	H.clear&#40;);
	node_count = 0;
	abort_search = false;

	SearchInfo si&#91;MAX_PLY&#93;;
	memset&#40;si, 0, sizeof&#40;si&#41;);	// fill with NoMove

	for &#40;int depth = 2; !abort_search && depth < MAX_PLY; depth++) &#123;
		
		std&#58;&#58;cout << "info depth " << depth << std&#58;&#58;endl;
		
		int best_score = root_search&#40;B, si, depth&#41;;
		
		if &#40;abort_search&#41;
			break;

		std&#58;&#58;cout << "info score cp " << best_score
			<< " nodes " << node_count
			<< " time " << &#40;clock&#40;) - start_clock&#41; * 1000 / CLOCKS_PER_SEC
			<< " pv " << si->best << std&#58;&#58;endl;
		result = si->best;
	&#125;

	std&#58;&#58;cout << "bestmove " << result << std&#58;&#58;endl;
	
	return result;
&#125;
I don't really know what has happenned since my last post (I've modified a feqw things herre and there), but the bug doesn't seem to happen anymore. What worries me is that I don;'t understand what fixed it :(

I'll also make all search reductions optional, so I can unplug them for easier debugging.

But I have another question:
Evertyone seems to be using refined pseudo random generators like Mersenne numbers for example. I'm just using a (2^64 periodic) linear congruence generator. Do you think it's a bad idea ? Intuitively I don't see why it would lead to different positions having the same key "easily", but maybe I'm missing something.

What do you think ?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: debugging TT errors

Post by bob »

lucasart wrote:
bob wrote: Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...
I use a global variable bool abort_search, and test it before returning a move at the root, or saving anything (in the TTable ot updating the History table). At every node of the search (including QS nodes) I call the following function:

Code: Select all

void poll&#40;)
// increment node count, and abort search if end_clock exceeded
&#123;
	if (!(++node_count & 0xfff&#41; && clock&#40;) >= end_clock&#41;
		abort_search = true;
&#125;
And then at the end of the move look, and before saving any TT or History info, I check the variable abort_search. And my root search also doesn't return a move that hasn't been searched entirely:

Code: Select all

move_t root_loop&#40;Board &B, int time_ms&#41;
&#123;
	move_t result = NoMove;

	start_clock = clock&#40;);
	end_clock = start_clock + time_ms * CLOCKS_PER_SEC / 1000;

	H.clear&#40;);
	node_count = 0;
	abort_search = false;

	SearchInfo si&#91;MAX_PLY&#93;;
	memset&#40;si, 0, sizeof&#40;si&#41;);	// fill with NoMove

	for &#40;int depth = 2; !abort_search && depth < MAX_PLY; depth++) &#123;
		
		std&#58;&#58;cout << "info depth " << depth << std&#58;&#58;endl;
		
		int best_score = root_search&#40;B, si, depth&#41;;
		
		if &#40;abort_search&#41;
			break;

		std&#58;&#58;cout << "info score cp " << best_score
			<< " nodes " << node_count
			<< " time " << &#40;clock&#40;) - start_clock&#41; * 1000 / CLOCKS_PER_SEC
			<< " pv " << si->best << std&#58;&#58;endl;
		result = si->best;
	&#125;

	std&#58;&#58;cout << "bestmove " << result << std&#58;&#58;endl;
	
	return result;
&#125;
I don't really know what has happenned since my last post (I've modified a feqw things herre and there), but the bug doesn't seem to happen anymore. What worries me is that I don;'t understand what fixed it :(

I'll also make all search reductions optional, so I can unplug them for easier debugging.

But I have another question:
Evertyone seems to be using refined pseudo random generators like Mersenne numbers for example. I'm just using a (2^64 periodic) linear congruence generator. Do you think it's a bad idea ? Intuitively I don't see why it would lead to different positions having the same key "easily", but maybe I'm missing something.

What do you think ?
I don't think it matters much at all, so long as you are getting real 64 bit numbers. I have seen people use double floats, which unfortunately have a pretty non-random upper 16 bits because of the exponent/sign in IEEE numbers. Another issue is "period" some sloppy multiplicative congruential generators have a majorly short period for some seeds. You do not want to have the same random number repeated, as that can certainly cause a serious issue...
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: debugging TT errors

Post by lucasart »

bob wrote: I don't think it matters much at all, so long as you are getting real 64 bit numbers. I have seen people use double floats, which unfortunately have a pretty non-random upper 16 bits because of the exponent/sign in IEEE numbers. Another issue is "period" some sloppy multiplicative congruential generators have a majorly short period for some seeds. You do not want to have the same random number repeated, as that can certainly cause a serious issue...
Ok, so I'll keep my generator: it is a *full* period one (2^64), and the numbers I use are those recommended by Donald Knuth (among many choices of full period), based on some empirical studies (statistic tests of autocorrelations, kolmogorov test of adequation to the uniform distribution etc.)
The only waekness I see of any full period LCG is that, for example, (rand() & 1) is a 2-periodic sequence. And more generally, rand() & (2^N-1) is 2^N periodic, for all N <= 64.
Although it could be an issue when doing statistics, it shouldn't be one for Zobrist purposes.

As for the TT bug. I FOUND IT ! I AM SUCH AN IDIOT...

I forgot to put a zobrist key for the turn of play... So the same position if white or black to play had the same key leading to disastrous results of course...
Last edited by lucasart on Fri Dec 17, 2010 1:38 pm, edited 1 time in total.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: debugging TT errors

Post by rbarreira »

random.org has files with raw random data that are easily converted into an array of zobrist keys... I found it easier to use one of those than to worry about getting a good pseudorandom generator...