Elo uncertainties calculator.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

gladius wrote:Undefended_pieces_removal test is running now :).
Well, with the bishop/knight/rook code in, it only scored +3 in 8k games at 2" in a self-match. The tests against other engines are still running. I'm going to remove the rook bonus code and do a self-test at a longer TC.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: SF improvements in GitHub.

Post by zamar »

gladius wrote:
gladius wrote:Undefended_pieces_removal test is running now :).
Well, with the bishop/knight/rook code in, it only scored +3 in 8k games at 2" in a self-match. The tests against other engines are still running. I'm going to remove the rook bonus code and do a self-test at a longer TC.
+3 is not bad. It's not much, but a slight push is always welcome. I have to say that you have a good touch for these things; most people just make changes and do not bother verifying them.

One opinion: when you get +3 Elo in self-play, I'm afraid that running a gauntlet against other engines is a waste of time. You need many more games to get over the error bar, and I'm afraid that in practice that is not going to happen.

Evaluation changes like this are often quite neutral: if a change helps in self-play, it very likely also helps against other engines (although maybe a bit less).
Search tree tweaks and king safety are a completely different issue.

So if you want to further verify the change, I'd run a 20k or 30k self-play match. If it shows an improvement beyond one-sigma confidence, keep it.
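
To make that concrete, here is a rough sketch of the one-sigma Elo error bar for matches of this size, assuming the usual normal approximation and a typical draw ratio of about 60% (the helper below is only an illustration, not an existing tool):

Code: Select all

import math

def to_elo(score):
    # Logistic Elo model: a 50% score maps to 0 Elo.
    return -400 * math.log10(1 / score - 1)

def one_sigma_elo(games, draw_ratio=0.6):
    # Variance of a single game result (1, 0.5 or 0) around a 50% score.
    var = 0.25 - draw_ratio / 4
    sigma = math.sqrt(var / games)    # one sigma of the match score
    return to_elo(0.5 + sigma)        # the same offset expressed in Elo

print(one_sigma_elo(8000))     # ~2.5 Elo, so +3 is barely above one sigma
print(one_sigma_elo(20000))    # ~1.6 Elo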

(This is just a personal opinion; I know that some people here will disagree with me.)
Joona Kiiski
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

mcostalba wrote: I have already started the same test removing my patch on "pinning bonuses"....let's see how it goes.
The patch is not bad. After 5390 games at 10"+0.05 we have 1001 - 905 - 3484, ELO +6 (+- 5.6), so I stopped the test because it cannot be the cause of the regression.

I am now starting to test the "undefended pieces" patch; if this also proves good then maybe I have introduced a nasty bug somewhere, in which case I'd need to bisect a little. On the good side, my last verification run about one month ago, on revision 3dccdf5b835b9856bc vs SF_2.2.2, showed no problems (ELO -2 after 15467 games), so any possible regression should be later than this.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

zamar wrote:+3 is not bad. It's not much, but a slight push is always welcome. I have to say that you have a good touch for these things; most people just make changes and do not bother verifying them.

One opinion: when you get +3 Elo in self-play, I'm afraid that running a gauntlet against other engines is a waste of time. You need many more games to get over the error bar, and I'm afraid that in practice that is not going to happen.

Evaluation changes like this are often quite neutral: if a change helps in self-play, it very likely also helps against other engines (although maybe a bit less).
Search tree tweaks and king safety are a completely different issue.

So if you want to further verify the change, I'd run a 20k or 30k self-play match. If it shows an improvement beyond one-sigma confidence, keep it.

(This is just a personal opinion; I know that some people here will disagree with me.)
That makes a lot of sense. I'll get the 20k games ready to go next. Right now I'm running a test at 10"+1" and have 635-605-1948, but that's not nearly enough games yet.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

mcostalba wrote: The patch is not bad. After 5390 games at 10"+0.05 we have 1001 - 905 - 3484, ELO +6 (+- 5.6), so I stopped the test because it cannot be the cause of the regression.

I am now starting to test the "undefended pieces" patch; if this also proves good then maybe I have introduced a nasty bug somewhere, in which case I'd need to bisect a little. On the good side, my last verification run about one month ago, on revision 3dccdf5b835b9856bc vs SF_2.2.2, showed no problems (ELO -2 after 15467 games), so any possible regression should be later than this.
Perhaps the eval changes interact poorly? I can't really see how though. They seem pretty unrelated.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

gladius wrote: Perhaps the eval changes interact poorly? I can't really see how though. They seem pretty unrelated.
Our experience with tuning says that eval changes are largely independent of each other (of course, we are talking about different eval terms).

Your "undefended pieces" patch seems good at 10"+0.05 ( +10 ELO after 4194 games), so I have suspended the test and started a regression test between 3d0d0237c52474f vs d4c9abb9675586c6.

The good news is that all three patches seem very good (the first two lines below are measured from the side with the patch removed, so negative ELO there means the patch helps):

Code: Select all

"Retire undefended pieces" After 4194 games 673 - 804 - 2717 ELO -10 (+- 6.4)
"Retire pinning bonus" After 5390 games 905 - 1001 - 3484 ELO -6 (+- 5.6)
"Add Pawn Storm" After 10670 games 2277 - 1941 - 6452 ELO +11
So if we are very, very lucky and actually find a regression, we can aim at +15 to +20 ELO points. In the worst case, if I find nothing, I want to apply the patches directly to SF 2.2.2 and see how it goes.
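
For anyone wanting to reproduce these figures, here is a rough sketch of how such W-L-D lines translate into an Elo estimate and an error bar, assuming the usual logistic Elo model (the actual testing tool may compute the bound slightly differently):

Code: Select all

import math

def elo_and_error(wins, losses, draws, z=1.96):
    # Elo estimate and ~95% error bar from a wins/losses/draws result,
    # assuming the logistic Elo model and a normal approximation.
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    elo = -400 * math.log10(1 / score - 1)
    # Per-game variance of the result, then sigma of the mean score.
    var = (wins * (1 - score) ** 2 + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    sigma = math.sqrt(var / n)
    err = -400 * math.log10(1 / (score + z * sigma) - 1) - elo
    return elo, err

# "Retire pinning bonus": 905 - 1001 - 3484 after 5390 games
print(elo_and_error(905, 1001, 3484))   # about (-6.2, 5.5), close to the -6 (+- 5.6) above
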
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

gladius wrote:That makes a lot of sense. I'll get the 20k games ready to go next. Right now I'm running a test at 10"+1" and have 635-605-1948, but that's not nearly enough games yet.
So, after 16k games at 10"+1" on an i7, the undefended rook test does not look good (albeit by a very small margin).
3063 - 3093 - 9844 (-1).

I doubt that is causing the regression, but even so, it looks like it's not worth keeping, and we can go back to the simpler undefended minors check.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

gladius wrote:
gladius wrote:That makes a lot of sense. I'll get the 20k games ready to go next. Right now I'm running a test at 10"+1" and have 635-605-1948, but that's not nearly enough games yet.
So, after 16k games at 10"+1" on an i7, the undefended rook test does not look good (albeit by a very small margin).
3063 - 3093 - 9844 (-1).

I doubt that is causing the regression, but even so, it looks like it's not worth keeping, and we can go back to the simpler undefended minors check.
OK. I am still struggling with the regression tests. I have sent you a PM.
Ajedrecista
Posts: 1966
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

SF pull request #22 in GitHub: for Gary Linscott.

Post by Ajedrecista »

Hello Gary:

I took a look here and I wish you good luck, because SF is an engine I like a lot.

I see that you calculate error bars, and I also made a programme for the same purpose some months ago; it seems that our models are a bit different because I use the draw ratio where you do not, or at least that is what I think... no problem at all. I also see that you calculate LOS, and I figure that you do it as Rémi Coulom proposed in the last equation of this post. I get the same results as you (after 336 games and after 736 games) using this equation.

But I think I discovered an important mistake in your programme: while everything seems correct for 99% confidence, I get different results for 95% confidence. I had to do the calculations by hand so as not to include the draw ratio, but I think that I am not wrong this time. I guess that you use a normal distribution (as I do) for computing error bars: if I call one standard deviation sigma, I suppose that you use this formula:

Code: Select all

sigma = sqrt[score*(1 - score)/n]
Where 'score' is the number you call 'win' (not as a percentage) and 'n' is the number of games. We agree that a 99% confidence interval in a normal distribution is more or less mu ± 2.5758*sigma; OTOH a 95% confidence interval is more or less mu ± 1.96*sigma, and here is the mistake that I report: I do the calculations with the parameter z ~ 1.96 and I get different results from you... but I find that you are using z ~ 1.69, which looks like a typo in your programme. Please revise it, or let me know if I am wrong somewhere. I compared my results (using the draw ratio) with Marco's results and we get very similar numbers (I guess that his confidence interval was 95%), so everything seems OK.
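
To make the z values concrete, here is a small sketch of where the 95% and 99% multipliers come from and of the error bars they give with the formula above (the 0.54 score and 1000 games are made-up numbers, only for illustration):

Code: Select all

import math
from statistics import NormalDist

# Two-sided z multipliers for 95% and 99% confidence and the error bars
# they give with sigma = sqrt[score*(1 - score)/n]. The score of 0.54
# and n = 1000 are made-up illustration values.
score, n = 0.54, 1000
sigma = math.sqrt(score * (1 - score) / n)

for confidence in (0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # 1.9600 and 2.5758
    print(f"{confidence:.0%}: z = {z:.4f}, error bar = +- {z * sigma:.4f}")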

As a side note, I note that Ryan Taker wrote 'LOS ~ 58%' where the correct word is 'score' (or 'win' in your case). Using my programme for this match (+32 -19 =31), I get a LOS value (not taking draws into account) of around 96.48%... I expect that we agree here.
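
For reference, a sketch of the LOS formula as I understand it from Rémi Coulom's post (a normal approximation that ignores draws); it comes out very close to the 96.48% above, with the small difference presumably due to the exact method each programme uses:

Code: Select all

import math

def los(wins, losses):
    # Likelihood of superiority from wins and losses only (draws ignored),
    # using the normal-approximation formula attributed to Rémi Coulom.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

print(los(32, 19))   # ~0.966 for the +32 -19 =31 match above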

Thanks to the SF team and to the people who are trying to improve SF! A new official version is wanted! ;) Surely a match between SF 2.2.2 and the latest development version of SF would show a gain of 10 or 15 Elo.

Regards from Spain.

Ajedrecista.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF pull request #22 in GitHub: for Gary Linscott.

Post by mcostalba »

Ajedrecista wrote: Thanks to the SF team and to the people who are trying to improve SF! A new official version is wanted! ;) Surely a match between SF 2.2.2 and the latest development version of SF would show a gain of 10 or 15 Elo.
Currently it is closer to 10 than 15, so the improvement is too small to release, but if Gary or someone else is able to add another 10 ELO I will: 20 ELO over 2.2.2 would be OK, although not earth-shattering, but 10 is too small.