TT aging

Discussion of chess software programming and technical issues.

crybotmark
Posts: 37
Joined: Thu May 09, 2013 7:06 pm

TT aging

Post by crybotmark » Sun Jul 09, 2017 11:56 am

Hi all,

I've been trying to implement aging in my transposition table, but with no real gain. I used to clear my hash table before every search from a new position (i.e. on the UCI go command), so I tried instead to leave the hash untouched between positions and to add a flag to every TT entry indicating whether the entry comes from the current search or from a previous one. This actually takes no extra space, since 6 bits of the bound field were unused.
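For illustration, packing such a flag into the spare bits of the bound byte might look like this (layout and names are hypothetical, not Napoleon's actual code):

```cpp
#include <cstdint>

// Two bits hold the bound type (exact/lower/upper); the six spare bits
// hold a small age counter, so the entry grows by zero bytes.
constexpr uint8_t packBoundAge(uint8_t bound, uint8_t age) {
    return static_cast<uint8_t>((bound & 0x03) | ((age & 0x3F) << 2));
}
constexpr uint8_t boundOf(uint8_t packed) { return packed & 0x03; }
constexpr uint8_t ageOf(uint8_t packed)   { return packed >> 2; }
```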
The old replacement scheme was simple, preferring the lowest-depth entry within a bucket of size 4. The overwrite occurred only if the depth of the new result was greater than or equal to that of the entry to be replaced.
The new replacement scheme, now considering aging, always replaces the first entry in the bucket that comes from a previous search; if none is found, the entry with the lowest depth is replaced.
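A minimal sketch of that aging-aware victim selection (field and function names hypothetical):

```cpp
#include <array>
#include <cstdint>

struct Entry { uint8_t depth; uint8_t age; };  // only the fields the sketch needs

// Return the slot to overwrite: prefer any entry from a previous search
// (stale age); otherwise fall back to the entry with the lowest depth.
int victimSlot(const std::array<Entry, 4>& bucket, uint8_t currentAge) {
    int shallowest = 0;
    for (int i = 0; i < 4; ++i) {
        if (bucket[i].age != currentAge)
            return i;                          // stale entry: replace it first
        if (bucket[i].depth < bucket[shallowest].depth)
            shallowest = i;
    }
    return shallowest;
}
```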
Now, the results are really disappointing: I measured a difference of more than 80 Elo between the two schemes, obviously not in favor of aging. Also, since I collect the PV from the main hash table, I noticed that when the next position deviates from the expected PV of the last search, the new PV often gets truncated.
Should I look for a bug, or is this behavior that can happen in some engines?

hgm
Posts: 23713
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: TT aging

Post by hgm » Sun Jul 09, 2017 12:41 pm

crybotmark wrote: The old replacement scheme was simple, preferring the lowest-depth entry within a bucket of size 4. The overwrite occurred only if the depth of the new result was greater than or equal to that of the entry to be replaced.
This is very bad, as it would mean you don't always store the latest result. You should always store the latest result, even if it is a d=1 result overwriting a d=30 result. 'Recent' is far more important than 'deep'.
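That rule can be sketched like this (names hypothetical): the replacement loop only chooses the victim, and the store at the end is unconditional:

```cpp
#include <cstdint>

struct Entry { uint64_t key; uint8_t depth; /* score, move, bound... */ };

// The store never bails out: replacement picks *which* slot to overwrite,
// never *whether* to overwrite. Even a d=1 result evicts a d=30 entry
// if that entry is the chosen victim.
void ttStore(Entry* bucket, int n, const Entry& fresh) {
    int victim = 0;
    for (int i = 1; i < n; ++i)
        if (bucket[i].depth < bucket[victim].depth)
            victim = i;                     // shallowest-of-bucket
    bucket[victim] = fresh;                 // unconditional overwrite
}
```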
crybotmark wrote: Now, the results are really disappointing, since I got more than 80 Elo difference between the two schemes, obviously not in favor of aging.
That is very suspect. It suggests that hash entries made in the previous search damage the current one, rather than help it. Which again suggests that they contain wrong information rather than correct information. This could happen, for instance, when there is an error in updating the hash key when you make a move at game level, so that the position you probe and use the score of is an entirely different one than what you should probe.

Perhaps you should advance in small steps rather than trying too many changes at once. E.g. rather than erasing the entire TT before every move, just erase part of it, without altering the replacement scheme. Keeping 50% of the TT for the next move should never make the engine play worse than running it with half the hash-table size. If the typical search tree fits entirely in half the TT (which usually is the case with current memory sizes), it should not even be worse than running with the same, or with an infinite, hash size.
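A sketch of that partial-clear step (types and names hypothetical); entries wiped to depth 0 become the preferred replacement victims:

```cpp
#include <cstring>
#include <vector>
#include <cstdint>

struct Entry { uint64_t key; int16_t score; uint8_t depth; uint8_t age; };

// Wipe only the first `fraction` of the table instead of all of it,
// leaving the rest of the previous search's entries available.
void clearFraction(std::vector<Entry>& tt, double fraction) {
    std::size_t n = static_cast<std::size_t>(tt.size() * fraction);
    std::memset(tt.data(), 0, n * sizeof(Entry));
}
```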
crybotmark wrote: Also, since I collect the PV from the main hash table, I noticed that when the next position deviates from the expected PV of the last search, the new PV often gets truncated.
Should I look for a bug, or is this behavior that can happen in some engines?
With a TT, PVs can get truncated, whether you clear it or not.


Re: TT aging

Post by crybotmark » Sun Jul 09, 2017 1:04 pm

hgm wrote:
crybotmark wrote: The old replacement scheme was simple, preferring the lowest-depth entry within a bucket of size 4. The overwrite occurred only if the depth of the new result was greater than or equal to that of the entry to be replaced.
This is very bad, as it would mean you don't always store the latest result. You should always store the latest result, even if it is a d=1 result overwriting a d=30 result. 'Recent' is far more important than 'deep'.
I'm sorry, I may have described the situation badly: before implementing aging, the TT was cleared completely between moves, meaning newer searches were always stored in the hash table. What I was referring to is that new results for the same position were written over an existing entry only if their associated depth was large enough.

hgm wrote:
That is very suspect. It suggests that hash entries made in the previous search damage the current one, rather than help it. Which again suggests that they contain wrong information rather than correct information. This could happen, for instance, when there is an error in updating the hash key when you make a move at game level, so that the position you probe and use the score of is an entirely different one than what you should probe.
That is a good point; I need to check whether this is what's happening. How would you proceed?


Re: TT aging

Post by hgm » Sun Jul 09, 2017 1:27 pm

crybotmark wrote:What I was referring to is that new results of the same root position were replaced inside the TT only if their associated depth was large enough.
This is what I understood, and it sounds very bad, because the 'if' suggests that you would not store results whose associated depth is too low. You should always store the result of every search. Replacement is only about which entry you should overwrite, not about whether you should overwrite anything.

crybotmark wrote: That is a good point; I need to check whether this is what's happening. How would you proceed?
Once you are confident that entries left over from a previous search are helpful rather than toxic, you can try to keep track of which entries the search touches, and rather than clearing the entire table before every search, clear only the entries that were not accessed in the preceding one. This should be better than clearing half of the table randomly.

I am not sure how you clear the hash table now, and how your replacement algorithm benefits from some entries being 'clear'. Resetting the depth to 0 should be sufficient to make the normal 'shallowest-of-four' replacement prefer to overwrite the 'cleared' entries, and thus get rid of them quickly.
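Both ideas together might be sketched as follows (the 'accessed' flag is hypothetical bookkeeping that the probe/store code would have to set):

```cpp
#include <vector>
#include <cstdint>

struct Entry {
    uint8_t depth;
    bool accessed;   // set by probe/store during the search
};

// Between moves: soft-clear entries the last search never touched.
// No memset needed; depth 0 makes them the preferred replacement victims.
void softClear(std::vector<Entry>& tt) {
    for (Entry& e : tt) {
        if (!e.accessed)
            e.depth = 0;
        e.accessed = false;  // reset the flag for the next search
    }
}
```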


Re: TT aging

Post by crybotmark » Mon Jul 10, 2017 6:17 pm

Thank you hgm; I had to test Napoleon in order to give you more information to continue our discussion.
I tested several changes to my TT implementation (without aging):
1) Bucket size = 4, always replace lowest-depth entry, clear TT between moves - Elo = 2263
2) Bucket size = 4, old replacement strategy (i.e. replace lowest-depth entry only if the new depth is large enough), clear TT between moves - Elo = 2269
3) Bucket size = 1, always replace lowest-depth entry, clear TT between moves - Elo = 2278
4) Bucket size = 1, old replacement strategy (same as 2), clear TT between moves - Elo = 2265
5) Bucket size = 1, always replace lowest-depth entry, clear first half of TT between moves - Elo = 2276
6) Bucket size = 1, always replace lowest-depth entry, do not clear TT at all - Elo = ? (currently testing)
I forgot to mention that the hash key I store inside the TT is the full 64-bit Zobrist key. This is why (I think) buckets of 4 were hardly needed, as collisions are rare.
'Clearing the table' here means just a memset() over the selected memory area, which in turn sets each entry's depth to 0.

AlvaroBegue
Posts: 920
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: TT aging

Post by AlvaroBegue » Mon Jul 10, 2017 6:28 pm

crybotmark wrote: Thank you hgm; I had to test Napoleon in order to give you more information to continue our discussion.
I tested several changes to my TT implementation (without aging):
1) Bucket size = 4, always replace lowest-depth entry, clear TT between moves - Elo = 2263
2) Bucket size = 4, old replacement strategy (i.e. replace lowest-depth entry only if the new depth is large enough), clear TT between moves - Elo = 2269
3) Bucket size = 1, always replace lowest-depth entry, clear TT between moves - Elo = 2278
4) Bucket size = 1, old replacement strategy (same as 2), clear TT between moves - Elo = 2265
5) Bucket size = 1, always replace lowest-depth entry, clear first half of TT between moves - Elo = 2276
6) Bucket size = 1, always replace lowest-depth entry, do not clear TT at all - Elo = ? (currently testing)
How are you getting those Elo measurements? What kind of error bar do they have?

crybotmark wrote: I forgot to mention that the hash key I store inside the TT is the full 64-bit Zobrist key. This is why (I think) buckets of 4 were hardly needed, as collisions are rare.
No, that doesn't make a lot of sense: buckets exist to handle index collisions (different positions mapping to the same slot), and storing the full key doesn't prevent those.


Re: TT aging

Post by crybotmark » Mon Jul 10, 2017 6:36 pm

AlvaroBegue wrote: How are you getting those Elo measurements? What kind of error bar do they have?
I use cutechess-cli with a gauntlet of 6 other engines (3 stronger, 3 weaker). 2400 games at 15"+0.3" seem to converge to a stable result (even after 5000 games the Elo doesn't change by more than 1-2 points). But still, I'm only interested in stating that A is not weaker than B, and vice versa.
The Elo is computed with Ordo.


Re: TT aging

Post by AlvaroBegue » Mon Jul 10, 2017 7:20 pm

crybotmark wrote:
AlvaroBegue wrote: How are you getting those Elo measurements? What kind of error bar do they have?
I use cutechess-cli with a gauntlet of 6 other engines (3 stronger, 3 weaker). 2400 games at 15"+0.3" seem to converge to a stable result (even after 5000 games the Elo doesn't change by more than 1-2 points).
I would expect each Elo estimate to have an error bar of around 5, so the difference between two estimates would have an error bar of around 7 (=5*sqrt(2)).

So keep in mind that you don't have a whole lot of statistical significance here.
But still, I'm only interested in stating that A is not weaker than B, and vice versa.
The Elo is computed with Ordo.
If you want to state that A is not weaker than B, play a match between A and B directly, ignore the draws and then use the binomial distribution to get a p-value for the result under the hypothesis that A and B are equally strong.
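As a sketch (the function name is mine, not a standard API): with w wins and l losses for A after ignoring draws, each decisive game under the null hypothesis is a fair coin flip, and the two-sided p-value is:

```cpp
#include <cmath>
#include <algorithm>

// Two-sided binomial test under H0 "A and B are equally strong":
// every decisive game is a 50/50 coin flip, draws are ignored.
double binomialTwoSidedP(int w, int l) {
    int n = w + l;
    int k = std::max(w, l);
    double p = 0.0;
    // P(X >= k) for X ~ Binomial(n, 0.5), summed via log-binomial
    // coefficients (lgamma) for numerical stability.
    for (int i = k; i <= n; ++i) {
        double logc = std::lgamma(n + 1.0) - std::lgamma(i + 1.0)
                    - std::lgamma(n - i + 1.0);
        p += std::exp(logc - n * std::log(2.0));
    }
    return std::min(1.0, 2.0 * p);  // double the tail for a two-sided test
}
```

For example, 60 wins to 40 losses gives p ≈ 0.057, i.e. not quite significant at the 5% level.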


Re: TT aging

Post by crybotmark » Mon Jul 10, 2017 7:30 pm

Wouldn't playing self-games produce misleading results if I'm trying to generalize playing strength? A may be better than B but worse than C; playing strength is not transitive.
But I get your point. I don't have strong statistical evidence, yet I can't run much longer tests. I only have 2 machines with 4 cores each, and 2400 games at this TC take about 4 hours.


Re: TT aging

Post by AlvaroBegue » Mon Jul 10, 2017 7:53 pm

crybotmark wrote: Wouldn't playing self-games produce misleading results if I'm trying to generalize playing strength? A may be better than B but worse than C; playing strength is not transitive.
But I get your point. I don't have strong statistical evidence, yet I can't run much longer tests. I only have 2 machines with 4 cores each, and 2400 games at this TC take about 4 hours.
This is an old debate. The main distortion introduced by matching very similar versions of an engine is that the differences are magnified. So you might measure a change as a +10 Elo improvement, and when playing against a varied mix of opponents it might be only +5 Elo (completely made up numbers, for illustration). But if you just want to know which version is better, this distortion actually helps.

So what I do is match only with the previous version to determine if I accept a change, and then every so often I play matches against a few reference opponents to measure the magnitude of the improvement over the months.
