Aspiration Windows

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jd1
Posts: 269
Joined: Wed Oct 24, 2012 2:07 am

Aspiration Windows

Post by jd1 »

Hi,

I implemented aspiration windows in Toga based on Stockfish and Discocheck and tested at longer time controls. The result was +10-20 elo after about 6000 games.

However, testing at ultra fast (Toga gets to about depth 6 in the opening) shows that it is weaker at this time control.

Does this mean I am using aspiration windows at too low a depth? I start at depth 5, like Stockfish and DiscoCheck.

Thanks for your help,
Jerry
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Aspiration Windows

Post by jdart »

jd1 wrote: Does this mean I am using aspiration windows at too low a depth? I start at depth 5, like Stockfish and DiscoCheck.
I think what it really means is that you are using too fast a time control.

IMO there is no point tuning the search for optimal play at depth 5-6 because in a real game you won't be stopping at that depth.

--Jon
jd1
Posts: 269
Joined: Wed Oct 24, 2012 2:07 am

Re: Aspiration Windows

Post by jd1 »

Hi Jon,

Thanks for the input. Actually I did test at longer time controls and it works well, so I will commit the change in any case. I just decided to see what happens at very low depths (since it is quick to test). But perhaps the ultra-fast test result is meaningless as you say.

Jerry
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Aspiration Windows

Post by lucasart »

jd1 wrote: However, testing at ultra fast (Toga gets to about depth 6 in the opening) shows that it is weaker at this time control.
Jon is right. That is *too* fast to be meaningful. When I put aspiration windows in DiscoCheck, the elo gain was quite significant:
https://github.com/lucasart/chess/commi ... 8605b85717
After many failed attempts, I realized that the SF way was best.

The depth >= 5 condition doesn't really matter, unless you do razoring or futility pruning at PV nodes (especially at the root), which is why this min depth condition is there: to avoid bugs. Also, aspiration windows are useless (and even harmful) at such low depths. Once you have a sufficient time control, this condition becomes irrelevant anyway.

The main difference between DiscoCheck and SF is the speed at which the window is widened. I found that doubling delta performs better in testing than SF's delta += delta/2. Maybe this is true for DiscoCheck but not for SF. I don't know.

One more thing to note is that aspiration pollutes your move ordering! Moves are typically marked as "history good" or "history bad" depending on whether they fail high or not, but if the (rootAlpha, rootBeta) window is artificial you introduce some pollution. In theory that could be another reason for a min depth condition. Searching depth = 1, 2, 3, 4 you start building a history table that makes sense, and your move ordering starts to be good. In turn, history is used to reduce and even prune moves, so at high iterations you reduce and prune bad moves. But if your history was screwed up early on, there is some "memory" involved in the process, and you'll hurt the search at later iterations.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
jd1
Posts: 269
Joined: Wed Oct 24, 2012 2:07 am

Re: Aspiration Windows

Post by jd1 »

Thanks Lucas, that was a very informative and helpful post. In Toga the elo gain is also meaningful, 15-20 elo after 5000 games. I am actually using SF's delta += delta/2 formula but will try yours too. I'm surprised no one has tried this before as it takes about a minute to implement.

By the way, do you think super fast testing like that is worthwhile for minor eval changes?

Jerry
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Aspiration Windows

Post by lucasart »

jd1 wrote: By the way, do you think super fast testing like that is worthwhile for minor eval changes?
This is a difficult question, and I'm sure everyone is wondering too. I really don't know what the answer is, in general. All I can say from my experience is that:
- Different features scale differently. It is not uncommon to see (accounting for error bars) that a feature is: 1/ a clear regression at 25,000 nodes per move 2/ a clear improvement at 50,000 nodes per move 3/ a small regression at 10"+0.1". And feel free to mix and match 1/ 2/ and 3/ in all possible other combinations too!
- It is impossible to predict *a priori* how a feature will scale. Often one assumes that because a feature only affects the qsearch() or the search at low depth, it can reliably be tested at very low depth, but DiscoCheck proved me wrong on that many times... If things were so easy then eval changes (which mostly affect the qsearch()) could be reliably tested at a fixed depth of 5 or so, but this is a very easy way to introduce plenty of regressions into your engine, so don't do it!

So I would recommend:
- be consistent: decide on testing conditions and always apply the same ones.
- it is better to have twice as many games than a time control that is twice as slow. But there is a limit to everything! You need to find a balance depending on: 1/ the CPU power at hand 2/ your time (I run things overnight, or during the day while I am at work).

As for me, I try to stick to that:
- 4000 games in 10"+0.1", or 15"+0.15", depending on the time available. I have an i7, and use cutechess-cli with 7 concurrent games (Hyper Threading = ON). In practice this leaves enough CPU power for the other processes running in the background, and my results are not polluted in any meaningful way. So that takes on average 4000*(10+0.1*60)*2/7/3600 = 5.08 hours for 10"+0.1", and 7.62 hours for 15"+0.15".
- if after 4000 games the LOS is > 95% the patch is accepted.
- if it is between 50% and 95%, then it depends. If it's a code simplification I commit the patch (less is more, the only rule worth remembering, also applies very well to the eval). Otherwise I don't.
- if the LOS is < 50%, then I never commit, even if the patch is a code simplification.

The only patches I commit w/o testing are non functional patches that clean up code (at no measurable speed cost) or make it faster.

As a result, the development is slow, but at least every step I make goes in the right direction (hopefully).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
jd1
Posts: 269
Joined: Wed Oct 24, 2012 2:07 am

Re: Aspiration Windows

Post by jd1 »

lucasart wrote: Different features scale differently. It is not uncommon to see (accounting for error bars) that a feature is: 1/ a clear regression at 25,000 nodes per move 2/ a clear improvement at 50,000 nodes per move 3/ a small regression at 10"+0.1". And feel free to mix and match 1/ 2/ and 3/ in all possible other combinations too!
- It is impossible to predict *a priori* how a feature will scale. Often one assumes that because a feature only affects the qsearch() or the search at low depth, then it can reliably be tested at very low depth, but DiscoCheck proved me wrong on that many times... If things were so easy then eval changes (which mostly affect the qsearch()) could be reliably tested at a fixed depth of 5 or sth, but this is a very easy way to introduce plenty of regression into your engine, so don't do it!
Thanks! I can say the same from my experience.
Jerry