As I was preparing to implement hash tables in Abbess, I ran across a bit of code that looked like a no-brainer fix, but the change had a terrible impact on performance. After looping through the pseudo-legal moves, if none of them was legal, I want to return a checkmate or stalemate score. Since I use fail-hard alpha-beta, I thought the code below was wrong to return scores that were potentially above beta:
//------- E: IF NO MOVES WERE LEGAL, IT'S EITHER STALEMATE OR CHECKMATE --------
if(stale) {
    // No move has been found legal.
    // If we are not in check, this is stalemate.
    if(inCheck==0) {
        space->ply--;
        return Max(alpha,0);
    }
    else {
        // We are in a checkmate position.
        val=-MATE_SCORE+space->ply;
        space->ply--;
        if(-MATE_SCORE+space->ply>alpha)
            return -MATE_SCORE+space->ply;
        else
            return alpha;
    }
}
So I changed it to clamp the returned score into the [alpha, beta] window:
//------- E: IF NO MOVES WERE LEGAL, IT'S EITHER STALEMATE OR CHECKMATE --------
if(stale) {
    // No move has been found legal.
    // If we are not in check, this is stalemate.
    if(inCheck==0) {
        space->ply--;
        if(0>=beta)
            return beta;
        else if(0>alpha)
            return 0;
        else
            return alpha;
    }
    else {
        // We are in a checkmate position.
        val=-MATE_SCORE+space->ply;
        space->ply--;
        if(val>=beta)
            return beta;
        else if(val>alpha)
            return val;
        else
            return alpha;
    }
}
But in self-play, this performs significantly worse. Am I missing something? I don’t understand why this "fix" would be so detrimental.
I think that this tournament is a little too short to assess such a small Elo difference.
Could you try a tournament between A) & d) with more than 1000 games? 10000 should be a good number of games.
How long is each game? If the games are fast enough you can do more than 1000 games in a day; you can achieve more than 1000 games a day with games shorter than a minute.
elcabesa wrote:I think that this tournament is a little too short to assess such a small Elo difference.
Could you try a tournament between A) & d) with more than 1000 games? 10000 should be a good number of games.
How long is each game? If the games are fast enough you can do more than 1000 games in a day; you can achieve more than 1000 games a day with games shorter than a minute.
I play 50/1' for most of my testing, and only test overnight, so this is about the limit of what I can do in a day. I haven't been happy with cutechess, so that's about as fast as I can go.
With a 46 Elo difference between the versions with and without the change, and only ±28 Elo error bars, I think it's pretty clear even with this few games.
I will try running the two of them through cutechess tonight at a faster TC, though.
If you are running on Windows, you should use an older version of CuteChess; the latest one is unstable there.
Robert Pope wrote:With a 46 Elo difference with and without the change and only 28 Elo bars, I think it's pretty clear even with this few games
If I understand the Elo statistics correctly, there is a 95% probability that the true Elo lies within the two limits. So if the intersection of the intervals is not empty, you can't say anything about the relative strength of the two engine versions.
JVMerlino wrote:He said he uses fail-hard, so he always returns a score within the [alpha, beta] bounds.
Yes, I missed that, I always use fail-soft.
Still, if you're at a leaf node and return a score outside the alpha-beta bounds, wouldn't that cause a cutoff in the parent node, and you're back to fail-hard anyway? It seems to me that if you fail hard at any point the result would be the same as if you failed hard everywhere.
If that reasoning is correct, what's the point of all those tests to keep the score within the bounds?
I'm probably not thinking straight, that Famous Grouse went down way too easily.
Without knowing more specifics, I'd guess you get more cutoffs farther up the tree when you return the fail-soft scores in branches containing checkmates and stalemates. And more cutoffs possibly means fewer re-searches and more time spent examining other, less futile lines. But that depends on your implementation. Are you using PVS, MTD(f), or something else? If PVS, do you use the return value to decide whether to do a re-search on a fail high? Or do you always re-search fail highs no matter what the return value was? Etc...
There are many factors that come into play here. Doing fail-hard vs fail-soft isn't just about what values you return from a search, but also how you use the values returned from a search.
Also, is this still pre-transposition table? Or did this test run use transposition table?
So, maybe it wasn't so clear after all. 1000 games at a faster time control gave +8 and -8 Elo, with 10 Elo error bars. So it's within the margin, but this test again gave the higher score to the original code.