debugging TT errors

lucasart · Post by **lucasart** » Tue Dec 14, 2010 9:16 pm

Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?

Dann Corbit · Post by **Dann Corbit** » Tue Dec 14, 2010 9:20 pm

lucasart wrote:Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?

1. What does your transaction table entry look like? Do you store the non-hash address part of the hash as a checksum?
2. Do you have a routine to calculate the entire hash from the board? (I assume you use Zobrist hash like everyone else).

lucasart · Post by **lucasart** » Tue Dec 14, 2010 9:34 pm

Dann Corbit wrote: 1. What does your transaction table entry look like? Do you store the non-hash address part of the hash as a checksum?
2. Do you have a routine to calculate the entire hash from the board? (I assume you use Zobrist hash like everyone else).

1. Here is my hash entry type

Code: Select all

enum TTScoreType &#123; ScoreExact, ScoreUbound, ScoreLbound &#125;;

struct TTEntry &#123;
	U64 key;
	int score, depth;
	TTScoreType type;
	move_t move;
&#125;;

What do you mean by "non hash adress of the hash" ?

2. the hash is calculated dynamically, and stored on the stack. So when I undo a move I do an assert to compare. And that way I'm sure it's never going wrong. I use a very simple Zobrist implementation:

Code: Select all

namespace &#123;	// local declaration

inline U64 rand64&#40;);

&#125;	// local declaration

U64 zobrist&#91;NB_COLOR&#93;&#91;NB_PIECE&#93;&#91;NB_SQUARE&#93;,
    zobrist_ep&#91;NB_SQUARE&#93;,
    zobrist_castle&#91;16&#93;;

void init_zobrist&#40;)
&#123;
	static bool done = false;
	if &#40;done&#41;
		return;
	else
		done = true;

	for &#40;int c = White; c <= Black; c++)
		for &#40;int p = Pawn; p <= King; p++)
			for &#40;int sq = A1; sq <= H8; sq++)
				zobrist&#91;c&#93;&#91;p&#93;&#91;sq&#93; = rand64&#40;);

	for &#40;int crights = 0; crights < 16; zobrist_castle&#91;crights++&#93; = rand64&#40;));
	for &#40;int sq = A1; sq <= H8; zobrist_ep&#91;sq++&#93; = rand64&#40;));
&#125;

namespace &#123;	// local implementation

inline U64 rand64&#40;)
&#123;
	static U64 seed = 0ULL;
	return seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
&#125;

&#125;	// local implementation

bob · Post by **bob** » Wed Dec 15, 2010 4:31 am

lucasart wrote:Hello everyone, and especially those who have actually written an engine and have probably experienced the problems I'm facing now.

Sometimes my engine plays a completely stupid move, just giving a piece for no reason. And when I try to understand what happenned, and start from the position, the results are correct.

Unfortunately this can only mean one thing... That the bug was due to some garbage in the TTable.

The first thing I realised is that in cases of an aborted search by timeout, I was still writing the result in the TTable, which is of course the source of desastrous results. So I fixed that, but it still sometimes plays stupid moves that cannot be reproduced with an empty TTable.

So how would you go about debugging such problems ?

Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...

Desperado · Post by **Desperado** » Wed Dec 15, 2010 10:15 am

Hello Lucas,

well i dont agree with you that this can _only_ be a TT bug, but of course
it can be.
As example: few weeks back i changed my movedata type and got
a similar behavior. it was hard to find out, but finally my moveIsPseudo()
function did not work correctly anymore. And when started to fix the problem, i was not able to reproduce the error in any case !.

i just want to point out that those not _generated_ moves taken from TT or killerT arent the same when you start a search with empty tables or
you have like in games prefilled tables. At the end you cannot produce
the same tree like it was in the game.

- so my first suggestion is to make your searchtree reproducable.
(especially clear _all_ tables killer,history,TT...whatever)

- to verify that the problem is not caused by timeout, you can do
an engine match to fixed depth(5,7,9 or whatever) and wait for
the first obvious blunder. Because you wrote _sometimes_ , it wont
take many games to get 1 or 2 blunders, which you can try to fix.

- nobody knows better your engine than you do. Check out if
there are special types of positions where such blunders occure.
as example: king under pressure(check pins,legality code...).
as example: whenever the position eval is around a certain value
(search issue...)
Depending on the blunder frequency you have, you may think
of your code parts which occur in an equivalent frequency.
(i mean as further example: prevent a check with ep move happens
how often ? , not every fifth game of course...)

- the longer,harder the bug is to find, switch of the parts where you
are _100%_ (not 99%

! ) sure they are working correctly.
Or exactly the other way around, only enable the basic routines and
include step by step all of your functions until you find the source of
the desaster.

- dont exclude any possibility because you _think_ there is no bug,
and have in mind that it can be several bugs not only one.

i know this is the ugly part of programming, but i fear there is no way
around.

Good luck

Michael

lucasart · Post by **lucasart** » Thu Dec 16, 2010 7:43 pm

bob wrote: Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...

I use a global variable bool abort_search, and test it before returning a move at the root, or saving anything (in the TTable ot updating the History table). At every node of the search (including QS nodes) I call the following function:

Code: Select all

void poll&#40;)
// increment node count, and abort search if end_clock exceeded
&#123;
	if (!(++node_count & 0xfff&#41; && clock&#40;) >= end_clock&#41;
		abort_search = true;
&#125;

And then at the end of the move look, and before saving any TT or History info, I check the variable abort_search. And my root search also doesn't return a move that hasn't been searched entirely:

Code: Select all

move_t root_loop&#40;Board &B, int time_ms&#41;
&#123;
	move_t result = NoMove;

	start_clock = clock&#40;);
	end_clock = start_clock + time_ms * CLOCKS_PER_SEC / 1000;

	H.clear&#40;);
	node_count = 0;
	abort_search = false;

	SearchInfo si&#91;MAX_PLY&#93;;
	memset&#40;si, 0, sizeof&#40;si&#41;);	// fill with NoMove

	for &#40;int depth = 2; !abort_search && depth < MAX_PLY; depth++) &#123;
		
		std&#58;&#58;cout << "info depth " << depth << std&#58;&#58;endl;
		
		int best_score = root_search&#40;B, si, depth&#41;;
		
		if &#40;abort_search&#41;
			break;

		std&#58;&#58;cout << "info score cp " << best_score
			<< " nodes " << node_count
			<< " time " << &#40;clock&#40;) - start_clock&#41; * 1000 / CLOCKS_PER_SEC
			<< " pv " << si->best << std&#58;&#58;endl;
		result = si->best;
	&#125;

	std&#58;&#58;cout << "bestmove " << result << std&#58;&#58;endl;
	
	return result;
&#125;

I don't really know what has happenned since my last post (I've modified a feqw things herre and there), but the bug doesn't seem to happen anymore. What worries me is that I don;'t understand what fixed it

I'll also make all search reductions optional, so I can unplug them for easier debugging.

But I have another question:
Evertyone seems to be using refined pseudo random generators like Mersenne numbers for example. I'm just using a (2^64 periodic) linear congruence generator. Do you think it's a bad idea ? Intuitively I don't see why it would lead to different positions having the same key "easily", but maybe I'm missing something.

What do you think ?

bob · Post by **bob** » Thu Dec 16, 2010 11:24 pm

lucasart wrote:
bob wrote: Are you sure this is a TT issue? It is very likely that you are incorrectly "timing out" the search. When time runs out, you can _not_ back anything up inside the tree, _anywhere_ because the searches you have are only partially complete at the various points in the tree.

This is an easy thing to blow, from experience. Once the "time out" flag is set, if you have already found a best move at the root, it is safe to use even if the search at the root is not complete. But once you time out, you had better not back up a new best move to the root or you are going to commit suicide in many games by playing the move at the root that you were just searching, even though you did not complete it.

hash errors are much less likely to cause you to play a bad move...
I use a global variable bool abort_search, and test it before returning a move at the root, or saving anything (in the TTable ot updating the History table). At every node of the search (including QS nodes) I call the following function:
Code: Select all
void poll&#40;)
// increment node count, and abort search if end_clock exceeded
&#123;
	if (!(++node_count & 0xfff&#41; && clock&#40;) >= end_clock&#41;
		abort_search = true;
&#125;
And then at the end of the move look, and before saving any TT or History info, I check the variable abort_search. And my root search also doesn't return a move that hasn't been searched entirely:
Code: Select all
move_t root_loop&#40;Board &B, int time_ms&#41;
&#123;
	move_t result = NoMove;

	start_clock = clock&#40;);
	end_clock = start_clock + time_ms * CLOCKS_PER_SEC / 1000;

	H.clear&#40;);
	node_count = 0;
	abort_search = false;

	SearchInfo si&#91;MAX_PLY&#93;;
	memset&#40;si, 0, sizeof&#40;si&#41;);	// fill with NoMove

	for &#40;int depth = 2; !abort_search && depth < MAX_PLY; depth++) &#123;
		
		std&#58;&#58;cout << "info depth " << depth << std&#58;&#58;endl;
		
		int best_score = root_search&#40;B, si, depth&#41;;
		
		if &#40;abort_search&#41;
			break;

		std&#58;&#58;cout << "info score cp " << best_score
			<< " nodes " << node_count
			<< " time " << &#40;clock&#40;) - start_clock&#41; * 1000 / CLOCKS_PER_SEC
			<< " pv " << si->best << std&#58;&#58;endl;
		result = si->best;
	&#125;

	std&#58;&#58;cout << "bestmove " << result << std&#58;&#58;endl;
	
	return result;
&#125;
I don't really know what has happenned since my last post (I've modified a feqw things herre and there), but the bug doesn't seem to happen anymore. What worries me is that I don;'t understand what fixed it

I'll also make all search reductions optional, so I can unplug them for easier debugging.

But I have another question:
Evertyone seems to be using refined pseudo random generators like Mersenne numbers for example. I'm just using a (2^64 periodic) linear congruence generator. Do you think it's a bad idea ? Intuitively I don't see why it would lead to different positions having the same key "easily", but maybe I'm missing something.

What do you think ?

I don't think it matters much at all, so long as you are getting real 64 bit numbers. I have seen people use double floats, which unfortunately have a pretty non-random upper 16 bits because of the exponent/sign in IEEE numbers. Another issue is "period" some sloppy multiplicative congruential generators have a majorly short period for some seeds. You do not want to have the same random number repeated, as that can certainly cause a serious issue...

lucasart · Post by **lucasart** » Fri Dec 17, 2010 1:25 pm

bob wrote: I don't think it matters much at all, so long as you are getting real 64 bit numbers. I have seen people use double floats, which unfortunately have a pretty non-random upper 16 bits because of the exponent/sign in IEEE numbers. Another issue is "period" some sloppy multiplicative congruential generators have a majorly short period for some seeds. You do not want to have the same random number repeated, as that can certainly cause a serious issue...

Ok, so I'll keep my generator: it is a *full* period one (2^64), and the numbers I use are those recommended by Donald Knuth (among many choices of full period), based on some empirical studies (statistic tests of autocorrelations, kolmogorov test of adequation to the uniform distribution etc.)
The only waekness I see of any full period LCG is that, for example, (rand() & 1) is a 2-periodic sequence. And more generally, rand() & (2^N-1) is 2^N periodic, for all N <= 64.
Although it could be an issue when doing statistics, it shouldn't be one for Zobrist purposes.

As for the TT bug. I FOUND IT ! I AM SUCH AN IDIOT...

I forgot to put a zobrist key for the turn of play... So the same position if white or black to play had the same key leading to disastrous results of course...

rbarreira · Post by **rbarreira** » Fri Dec 17, 2010 1:35 pm

random.org has files with raw random data that are easily converted into an array of zobrist keys... I found it easier to use one of those than to worry about getting a good pseudorandom generator...

debugging TT errors

debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors

Re: debugging TT errors