ROC analysis of Late Move Reductions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

ROC analysis of Late Move Reductions

Post by gavanewijk »

Hi all, I have recently used ROC curves to analyze different implementations of LMR. I wanted to share the method here because I found it very valuable, not only for LMR but for any kind of pruning, reduction, extension, etc. Also it might spark some new ideas, which of course I would like to hear from you.

Wikipedia is a good place to learn about ROC. Basically it is a graphical representation of the performance of binary classifiers. LMR is one such classifier: based on move count, the result of a search at reduced depth, etc., we decide whether we will search a move at full depth or prune it. If the model says we should prune, but a search at normal depth would actually return a score>alpha, we have a False Positive. If the model was right, it is a True Positive. A plot of the number of True Positives/Real Positives against False Positives/Real Negatives is an ROC plot. The further up and left, the better the model. An LMR model furthermore has parameters that can be tuned, e.g. the move number after which we test for pruning. Plotting TP/RP against FP/RN for a range of move numbers gives an ROC curve and provides very good insight into the effect of move number on accuracy (FP/RN) vs. efficiency (TP/RP). (Sorry, I deviate from standard classifier performance jargon, but I find this more understandable.)

In my first experiments, I have collected features that could be used in an LMR model for many different positions, plus of course the real
outcome (score<=alpha or not); see https://docs.google.com/file/d/0B0l5kw6 ... sp=sharing
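
For anyone who wants to try this on their own data, here is a minimal sketch of how one ROC point per move-number threshold can be computed. The file name and column names are made up for illustration; a "positive" is a move the model prunes, and a real positive is a move whose full-depth search returned score<=alpha.

Code: Select all

# Minimal sketch in Python; 'moves.csv' and its column names are hypothetical.
# positive = model prunes the move; real positive = full-depth score <= alpha.
import csv

with open('moves.csv') as f:
    rows = [(int(r['move_number']), r['fail_low'] == '1')
            for r in csv.DictReader(f)]

real_pos = sum(1 for _, fail_low in rows if fail_low)   # moves that really fail low
real_neg = len(rows) - real_pos                         # moves with score > alpha

for threshold in range(1, 31):
    # model: prune every move whose move number exceeds the threshold
    tp = sum(1 for mn, fail_low in rows if mn > threshold and fail_low)
    fp = sum(1 for mn, fail_low in rows if mn > threshold and not fail_low)
    print(f"move# > {threshold:2d}: TP/RP = {tp / real_pos:.4f}, "
          f"FP/RN = {fp / real_neg:.6f}")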

Some learnings I got from this exercise: only applying LMR in ALL nodes is pointless in my program; it hardly improves accuracy yet severely reduces efficiency. Same for move type (capture yes or no): no added value, so I now prune captures just as any other move. The search result at reduced depth is, however, indispensable. Without it, the result is very inaccurate.

The cool thing about this method is that you have to collect data only once to investigate different models and different settings. No need to play millions of fast games for each possible setting. The drawback is that you have to make a guess about what accuracy is required; I have used an FP/RN threshold of about 0.1% as limit, an arbitrary choice... Another drawback is that there is no direct result on the increase of playing strength. Despite the drawbacks, I plan to use ROC analysis for extensions and futility pruning too.

Anyone else looked at statistics of search strategies this way? Other uses for ROC analysis? And what accuracy is OK (how to find out?). I cannot get a grip on that last point so far.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: ROC analysis of Late Move Reductions

Post by Daniel Shawul »

gavanewijk wrote:
Hi all, I have recently used ROC curves to analyze different implementations of LMR. I wanted to share the method here because I found it very valuable, not only for LMR but for any kind of pruning, reduction, extension, etc. Also it might spark some new ideas, which of course I would like to hear from you.
Welcome, and thanks for sharing your work. I do not have much to say but just want to start a discussion, as I find the topic interesting.
It is good to have statistical analysis of pruning methods that are usually done in an ad-hoc manner. The way it usually works here is: come up with some reasonable idea, then throw a ton of processors at it and see if it is good. A lot of effort could be saved with your approach.
gavanewijk wrote:
Wikipedia is a good place to learn about ROC. Basically it is a graphical representation of the performance of binary classifiers. LMR is one such classifier: based on move count, the result of a search at reduced depth, etc., we decide whether we will search a move at full depth or prune it. If the model says we should prune, but a search at normal depth would actually return a score>alpha, we have a False Positive. If the model was right, it is a True Positive. A plot of the number of True Positives/Real Positives against False Positives/Real Negatives is an ROC plot. The further up and left, the better the model. An LMR model furthermore has parameters that can be tuned, e.g. the move number after which we test for pruning. Plotting TP/RP against FP/RN for a range of move numbers gives an ROC curve and provides very good insight into the effect of move number on accuracy (FP/RN) vs. efficiency (TP/RP). (Sorry, I deviate from standard classifier performance jargon, but I find this more understandable.)

In my first experiments, I have collected features that could be used in an LMR model for many different positions, plus of course the real outcome (score<=alpha or not); see https://docs.google.com/file/d/0B0l5kw6 ... sp=sharing
I am not familiar with ROC, but it sounds like hypothesis testing with type I and II errors. In that case, what are the significance levels for false positives and false negatives, given H0: don't prune this move? Is it asymmetrical, as is often the case? The two errors being: I) pruning a move that shouldn't have been pruned; II) failing to prune a move that should be pruned. How do you determine these levels? If you ask me, 50-50 sounds good, but this may require playing games to see what significance levels are appropriate.
gavanewijk wrote:
Some learnings I got from this exercise: only applying LMR in ALL nodes is pointless in my program; it hardly improves accuracy yet severely reduces efficiency. Same for move type (capture yes or no): no added value, so I now prune captures just as any other move. The search result at reduced depth is, however, indispensable. Without it, the result is very inaccurate.
As I mentioned before, the approach many use is to take whatever worked best in test games. Some time ago I learned that Rybka/Ippolit use the complete opposite of what I imagined would be the right reduction approach at CUT vs. ALL nodes. Well, clearly it worked better for them, but it didn't make much sense to me. As to captures, my engine clearly loses Elo if I even turn on reductions for losing SEE captures. Reducing winning/equal captures is very bad for me, but you can get a pretty large 'depth'. It may still be alright to reduce them by a much smaller amount compared to other moves, though. But note that the reductions are recursive, so they add up pretty quickly!
gavanewijk wrote:
The cool thing about this method is that you have to collect data only once to investigate different models and different settings. No need to play millions of fast games for each possible setting. The drawback is that you have to make a guess about what accuracy is required; I have used an FP/RN threshold of about 0.1% as limit, an arbitrary choice... Another drawback is that there is no direct result on the increase of playing strength. Despite the drawbacks, I plan to use ROC analysis for extensions and futility pruning too.
I guess that is the answer to my question above. So you still need to play games for fixing significance levels.

cheers
Daniel
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: ROC analysis of Late Move Reductions

Post by AlvaroBegue »

gavanewijk wrote:
If the model says we should prune, but a search at normal depth would actually return a score>alpha, we have a False Positive.
I haven't thought about it very carefully, but wouldn't a false positive be a move that is reduced, the reduced search returns score<=alpha, but the non-reduced search would return score>alpha? Perhaps that's what you are saying and I just didn't understand it...
gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

Re: ROC analysis of Late Move Reductions

Post by gavanewijk »

gavanewijk wrote:
If the model says we should prune, but a search at normal depth would actually return a score>alpha, we have a False Positive.
AlvaroBegue wrote:
I haven't thought about it very carefully, but wouldn't a false positive be a move that is reduced, the reduced search returns score<=alpha, but the non-reduced search would return score>alpha? Perhaps that's what you are saying and I just didn't understand it...
That was what I intended to say: we skip a full search (e.g. because a search at reduced depth gave score<=alpha) when we shouldn't have.

By the way, the decision to prune does not have to be based on a reduced search result; I have tried to base the decision on other features: only on move number, or on move number combined with alpha minus positional score, or on move number and move type (non-capture or capture), etc. It turns out, however, that using the result of a reduced search really improves the accuracy a lot and allows you to start pruning at fairly low move numbers.
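
To make the role of the reduced search concrete, here is a rough Python-style sketch of the usual pattern; the function names, the position object, and the threshold are made up for illustration, not taken from my actual engine. The reduced search acts as the classifier: only moves that fail low at reduced depth are "pruned", and a reduced-depth fail high triggers a full-depth re-search.

Code: Select all

# Hypothetical sketch; negamax and pos are stand-ins for a real engine's
# search function and position object, passed in so the fragment stands alone.
LMR_THRESHOLD = 4   # made-up move number after which we try to reduce

def search_move(negamax, pos, move, depth, alpha, move_number):
    reduction = 1 if move_number > LMR_THRESHOLD and depth >= 3 else 0
    pos.make(move)
    # the classifier: a null-window search at reduced depth
    score = -negamax(pos, depth - 1 - reduction, -alpha - 1, -alpha)
    if reduction and score > alpha:
        # reduced search says "don't prune" after all: verify at full depth
        score = -negamax(pos, depth - 1, -alpha - 1, -alpha)
    pos.unmake(move)
    return score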
gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

Re: ROC analysis of Late Move Reductions

Post by gavanewijk »

Hi Daniel, thanks for your thoughts.
Daniel Shawul wrote:
It is good to have statistical analysis of pruning methods that are usually done in an ad-hoc manner. The way it usually works here is: come up with some reasonable idea, then throw a ton of processors at it and see if it is good. A lot of effort could be saved with your approach.
In all fairness, I wish I could throw that processor power at it. Just don't have it.
Daniel Shawul wrote:
I am not familiar with ROC, but it sounds like hypothesis testing with type I and II errors. In that case, what are the significance levels for false positives and false negatives, given H0: don't prune this move? Is it asymmetrical, as is often the case? The two errors being: I) pruning a move that shouldn't have been pruned; II) failing to prune a move that should be pruned. How do you determine these levels? If you ask me, 50-50 sounds good, but this may require playing games to see what significance levels are appropriate.
Hadn't thought of it this way, but I think you're right. In that case, I fixed the confidence level for H0 at 99.9% (one-sided), because Type I errors are harmful for the result; Type II errors are just efficiency loss, so higher is better but there is no fixed value. Looking at my data, that level is at 85% (also one-sided).
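
Spelling out how those two one-sided levels map onto the ROC quantities, using only the numbers already stated above:

Code: Select all

# H0 = "don't prune this move".
# Type I  = prune a move that shouldn't be pruned -> rate = FP/RN
# Type II = fail to prune a prunable move         -> rate = FN/RP = 1 - TP/RP
conf_h0 = 0.999   # one-sided confidence for H0 (from the 0.1% FP/RN limit)
tp_rate = 0.85    # TP/RP observed in the data

type1_cap = 1.0 - conf_h0   # at most 0.1% of non-prunable moves get pruned
type2     = 1.0 - tp_rate   # 15% of prunable moves are still searched fully
print(f"Type I cap: {type1_cap:.2%}, Type II rate: {type2:.2%}")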

Daniel Shawul wrote:
As to captures, my engine clearly loses Elo if I even turn on reductions for losing SEE captures.
Interesting, I wonder if the same would happen for my engine. If so, that would mean that a Type I error for a capture is much more harmful than a Type I error for a non-capture, which is not visible in ROC analysis (or hypothesis testing).
gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

Re: ROC analysis of Late Move Reductions

Post by gavanewijk »

Daniel Shawul wrote:
As to captures, my engine clearly loses Elo if I even turn on reductions for losing SEE captures.
gavanewijk wrote:
Interesting, I wonder if the same would happen for my engine. If so, that would mean that a Type I error for a capture is much more harmful than a Type I error for a non-capture, which is not visible in ROC analysis (or hypothesis testing).
Nope, different effect in my engine: if I do not reduce captures, Elo is hardly affected. Results for a 50-game tournament:
Reduce captures: 25.5
Do not reduce captures: 24.5

Now to look for the optimal confidence interval...
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: ROC analysis of Late Move Reductions

Post by AlvaroBegue »

gavanewijk wrote:
Daniel Shawul wrote:
As to captures, my engine clearly loses Elo if I even turn on reductions for losing SEE captures.
Interesting, I wonder if the same would happen for my engine. If so, that would mean that a Type I error for a capture is much more harmful than a Type I error for a non-capture, which is not visible in ROC analysis (or hypothesis testing).
Nope, different effect in my engine: if I do not reduce captures, Elo is hardly affected. Results for a 50-game tournament:
Reduce captures: 25.5
Do not reduce captures: 24.5

Now to look for the optimal confidence interval...
You can't deduce much from a 50-game tournament.

In my notes I have this:
20130224: Reduce captures with SEE<0. +6 Elo (449-426-510)
The LOS is only 0.78, which means that after playing 1,385 games I am still not confident it's an improvement. I left it there because I thought it intuitively made sense, not because I believe the result.
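
For reference, that LOS number can be reproduced with the usual normal approximation, LOS = Phi((wins - losses) / sqrt(wins + losses)), in which draws drop out:

Code: Select all

# Reproduces the quoted LOS from the 449-426-510 result (normal approximation).
from math import erf, sqrt

def los(wins, losses):
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(f"LOS = {los(449, 426):.2f}")   # -> 0.78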
gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

Re: ROC analysis of Late Move Reductions

Post by gavanewijk »

AlvaroBegue wrote:
You can't deduce much from a 50-game tournament.

In my notes I have this:
20130224: Reduce captures with SEE<0. +6 Elo (449-426-510)
The LOS is only 0.78, which means that after playing 1,385 games I am still not confident it's an improvement. I left it there because I thought it intuitively made sense, not because I believe the result.
You're right, just posting the numbers is not well-founded. I was just reading about LOS in another thread; I have to get familiar with that.

Probably the number of games you need to prove superiority depends on how big a difference you want to prove. For an Elo difference of 100 points my guess is you will not need 1000 games to prove superiority; for 10 points you will need many games. Any idea how I can deduce the number of games required to get a LOS of 95% for a presumed Elo difference of, say, 100 points? The term Power Analysis comes to mind...
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Minimum number of games.

Post by Ajedrecista »

Hello Gerard:
gavanewijk wrote:
AlvaroBegue wrote:
You can't deduce much from a 50-game tournament.

In my notes I have this:
20130224: Reduce captures with SEE<0. +6 Elo (449-426-510)
The LOS is only 0.78, which means that after playing 1,385 games I am still not confident it's an improvement. I left it there because I thought it intuitively made sense, not because I believe the result.
You're right, just posting the numbers is not well-founded. I was just reading about LOS in another thread; I have to get familiar with that.

Probably the number of games you need to prove superiority depends on how big a difference you want to prove. For an Elo difference of 100 points my guess is you will not need 1000 games to prove superiority; for 10 points you will need many games. Any idea how I can deduce the number of games required to get a LOS of 95% for a presumed Elo difference of, say, 100 points? The term Power Analysis comes to mind...
I programmed a Fortran tool a while ago that is called Minimum_number_of_games. I think that it is exactly what you are looking for. Here is an example:

Code: Select all

Minimum_number_of_games, ® 2012-2013.

 Calculation of the minimum number of games in a match between two engines to ensure an Elo gain with a given LOS value:

Write down the wanted Elo gain between 0.1 and 40 Elo (it will be rounded up to 0.01 Elo):

10

 Write down the likelihood of superiority (in percentage) between 90% and 99.9% (LOS will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3
_______________________________________________________________________________

Score for a wanted gain of 10.00 Elo:                 51.4387 %
Sample standard deviation for 95.00 % of LOS:          0.8747 %
1.6449 sample standard deviations for 95.00 % of LOS:  1.4387 %

A LOS value of 95.00 % is equivalent to 90.00 % confidence in a two-sided test.

Minimum number of needed games:      3268 games.
_______________________________________________________________________________

If you want to include the draw ratio D, a good approximation is the next one:
Multiply the minimum number of needed games by (1 - D).

D < 1 (the used mathematical model fails with D = 1); close values to D = 1 will bring not reliable results.
_______________________________________________________________________________

End of the calculations. Approximated elapsed time:  16 ms.

Thanks for using Minimum_number_of_games. Press Enter to exit.
The input restrictions are Elo improvements from 0.1 Elo to 40 Elo and LOS from 90% to 99.9%. You can download all my tools from the link of my signature.
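
For the curious, the core of the calculation can be sketched in a few lines. This is my reading of what the output above implies (a logistic Elo-to-score model plus a one-sided normal approximation without draws), not the Fortran source itself:

Code: Select all

# Sketch of the calculation implied by the output above; not the actual tool.
from math import ceil
from statistics import NormalDist

def min_games(elo_gain, los, draw_ratio=0.0):
    mu = 1.0 / (1.0 + 10.0 ** (-elo_gain / 400.0))  # expected score: 51.4387% for 10 Elo
    z = NormalDist().inv_cdf(los)                   # 1.6449 for LOS = 95%
    # per-game standard deviation is 0.5 in a win/loss model;
    # require z * 0.5 / sqrt(n) <= mu - 0.5, then apply the (1 - D) draw correction
    return ceil((1.0 - draw_ratio) * (z * 0.5 / (mu - 0.5)) ** 2)

print(min_games(10.0, 0.95))   # -> 3268, matching the output above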

There is another programme, called Minimum_score_for_no_regression, which is intended to do the same task as Ciarrochi's table but with more games.

I hope that my tools will be useful for you. :)

Regards from Spain.

Ajedrecista.
gavanewijk
Posts: 9
Joined: Tue Mar 12, 2013 6:00 pm
Location: Netherlands

Re: Minimum number of games.

Post by gavanewijk »

Ajedrecista wrote: I hope that my tools will be useful for you. :)
Thanks Jesús, this is exactly what I need! I will have a look at the references in the download and try to understand.

Gerard