Elo uncertainties calculator.


Ajedrecista
Posts: 1969
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Elo uncertainties calculator.

Post by Ajedrecista »

Hello!

Yesterday I programmed a little executable in one of the newest programming languages... Fortran 95! :P

This tiny programme is very clumsy in the sense that it has no output formatting (for those who know Fortran: I simply use write(*,*), where the first asterisk indicates the unit to write to (in this case the command prompt, but it could be a file, .dat files, etc.; I was too lazy to open a file, write(111,*) to it and close it again), and the second asterisk is the format; * means the default list-directed format, which looks really ugly). I apologize, because it is really odd, but at least it is understandable. I do not know anything about format descriptors... I simply put * and that is enough for me.
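Just to illustrate what I mean (the explicit format below is only an example invented for this post, not something my programme actually uses):

Code: Select all

program write_demo
implicit none
real :: elo
elo = 8.73
! The first asterisk is the unit (* = the command prompt); the second one is the format.
write(*,*) 'Elo difference:', elo               ! default list-directed output (ugly)
write(*,'(A,F8.2)') ' Elo difference:', elo     ! explicit format: two decimals
end program write_demo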

Elo_uncertainties_calculator.rar (0.60 MB)

(The link will die in a few days, but you are free to mirror this freeware, open-source, clumsy software, always linking to this topic and/or quoting this post).

Elo_uncertainties_calculator.exe should work only on Windows; the code is not portable to other Fortran standards (77, 90, ...) due to my declaration of real(KIND=3) instead of real, but since I provide the source, this can be changed. If someone is interested in compiling it with the gfortran compiler from the command prompt, the real variables must be changed as I said before, the -fimplicit-none flag must be added (because I use implicit none at the start of my code), and libgfortran-3.dll must be copied into the same directory as the compiled executable, because the executable depends on it.

I compiled my source directly with the Plato IDE (Portable Silverfrost FTN95), so salflibc.dll must be in the same directory as the Elo_uncertainties_calculator programme, and my programme can be used even while compressed (I compressed it with WinRAR). Having said all this, I want to remark on two things:

a) I provide the compiled programme, so in principle there is no need to compile it yourself... I only mention that for the people who like to compile sources. ;) Note that due to the presence of salflibc.dll, the programme is a little slow to start (around six seconds on my computer), so please be patient.

b) I provide the source with .f95 and .txt extensions: with Notepad, everybody can see the equations I used, and my horrible programming skills... it is true: I hardly know how to program, even in Fortran 90/95, which are the two languages I use most at university.

Well, I have written a lot and I am sure that nobody knows yet how Elo_uncertainties_calculator.exe works: double-click on the executable (compressed or uncompressed, it should be the same) and the programme will ask for the number of wins, losses and draws of a match between two engines. I stress this because it cannot handle tournaments of more than two engines. As everyone can see, my programme is not very versatile. The input data must be typed by hand; the strong point I see is for cases where testers provide results without PGN files (of course, my programme is not smart enough to accept PGNs).

Once the number of wins, losses and draws has been read by the programme, the calculations are done in at most one second (at least I hope so). The results in the output are given for 1-sigma, 2-sigma and 3-sigma confidence (~ 68.27%, 95.45% and 99.73% confidence). The most common results from other software are for 95% confidence (~ 1.96-sigma), so I strongly recommend focusing on the 2-sigma confidence. The results of Elo_uncertainties_calculator should differ very little from those of other programmes, otherwise I am doing something really wrong. Needless to say, my programme cannot compete with BayesElo, EloStat, ... and that was never my intention: I only want to share it.

@Moderation team: I do not know if this is the best subforum for this thread. Please feel free to move it if necessary.

Here is an example:

https://github.com/mcostalba/Stockfish/pull/10

Gary Linscott is making an incredible effort to raise the Elo of Stockfish (thank you very much!). In his latest pull request on GitHub, he has posted these results:

Code: Select all

+2153 -1952 =3895
Around +9 Elo improvement at 40 moves/2 seconds TC; typing these numbers in my humble programme:

Code: Select all

 Elo_uncertainties_calculator, © 2012.

 Calculation of Elo uncertainties in a match between two engines:
 ----------------------------------------------------------------

 (The input and output data is referred to the first engine).

 Please write down non-negative integers.

 Write down the number of wins:

2153

 Write down the number of loses:

1952

 Write down the number of draws:

3895

 ***************************************
 1-sigma confidence ~ 68.27% confidence.
 2-sigma confidence ~ 95.45% confidence.
 3-sigma confidence ~ 99.73% confidence.
 ***************************************

 -----------------------------------------------------------------------

 Confidence interval for            1-sigma:

 Elo rating difference:      8.7311566219790740    Elo

 Lower rating difference:      5.9490758985007990    Elo
 Upper rating difference:      11.514357261062792    Elo

 Lower bound uncertainty:     -2.7820807234782751    Elo
 Upper bound uncertainty:      2.7832006390837182    Elo
 Average error: +-     2.7826406812809966    Elo

 K = (average error)*[sqrt(n)] =      248.88694881202540

 Elo interval: ]     5.9490758985007990    ,     11.514357261062792    [
 -----------------------------------------------------------------------

 Confidence interval for            2-sigma:

 Elo rating difference:      8.7311566219790740    Elo

 Lower rating difference:      3.1677578124844177    Elo
 Upper rating difference:      14.299035956673968    Elo

 Lower bound uncertainty:     -5.5633988094946563    Elo
 Upper bound uncertainty:      5.5678793346948943    Elo
 Average error: +-     5.5656390720947753    Elo

 K = (average error)*[sqrt(n)] =      497.80589213731082

 Elo interval: ]     3.1677578124844177    ,     14.299035956673968    [
 -----------------------------------------------------------------------

 Confidence interval for            3-sigma:

 Elo rating difference:      8.7311566219790740    Elo

 Lower rating difference:     0.38684567272318727    Elo
 Upper rating difference:      17.085551990018378    Elo

 Lower bound uncertainty:     -8.3443109492558867    Elo
 Upper bound uncertainty:      8.3543953680393035    Elo
 Average error: +-     8.3493531586475951    Elo

 K = (average error)*[sqrt(n)] =      746.78884923554435

 Elo interval: ]    0.38684567272318727    ,     17.085551990018378    [
 -----------------------------------------------------------------------

 Number of games of the match:         8000
 Score:      51.256250000000000    %
 Elo rating difference:      8.7311566219790740    Elo
 Draw ratio:      48.687500000000000    %

 *****************************************************************
 1-sigma:     0.40019281851650899    % of the points of the match.
 2-sigma:     0.80038563703301798    % of the points of the match.
 3-sigma:      1.2005784555495270    % of the points of the match.
 *****************************************************************

 End of the calculations.

 Thanks for using Elo_uncertainties_calculator. Press Enter to exit.
So, for 2-sigma confidence, I get ~ +8.73 ± 5.57 (rounded to 0.01 Elo), and the interval is ~ ]+3.17, +14.3[, which means no regression with ~ 95.45% confidence, given the results of this 8000-game match.
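For anyone who wants to check these numbers without running the executable, the calculation boils down to something like this (a minimal sketch of the usual normal approximation for the score; the variance estimator in my source may differ slightly, so do not expect every decimal to match):

Code: Select all

program elo_uncertainty_sketch
implicit none
real :: wins, losses, draws, n, mu, sigma, k
wins = 2153.0 ; losses = 1952.0 ; draws = 3895.0
n = wins + losses + draws
mu = (wins + 0.5*draws)/n                 ! score fraction of the first engine
! Standard deviation of the mean score (win/draw/loss model):
sigma = sqrt((wins*(1.0 - mu)**2 + draws*(0.5 - mu)**2 + losses*mu**2)/n)/sqrt(n)
k = 2.0                                   ! 2-sigma ~ 95.45% confidence
write(*,*) 'Elo rating difference:', 400.0*log10(mu/(1.0 - mu))
write(*,*) 'Lower bound (2-sigma):', 400.0*log10((mu - k*sigma)/(1.0 - mu + k*sigma))
write(*,*) 'Upper bound (2-sigma):', 400.0*log10((mu + k*sigma)/(1.0 - mu - k*sigma))
end program elo_uncertainty_sketch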

Any feedback, comments, complaints... are welcome, although I think that I will not improve this programme further. And, of course, I remind you that even Nosferatu is nicer than the format of the output.


(Nosferatu, eine Symphonie des Grauens; by F.W. Murnau, 1922).

Regards from Spain.

Ajedrecista.
Ajedrecista
Posts: 1969
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

SF improvements in GitHub.

Post by Ajedrecista »

Hello:

I have checked all the changes in Stockfish after SF 2.2.2 was released:

https://github.com/mcostalba/Stockfish/commits/master

Code: Select all

Jan 17, 2012.

Don't allow LMR to fall in qsearch.

After 10749 games:
Mod vs Orig: 1670 - 1676 - 7403 ELO 0 (+-3.7)

-----------

Jan 23, 2012.

Order bad captures by MVV/LVA.

After 10424 games:
Mod vs Orig 1639 - 1604 - 7181 ELO +1 (+-3.8)

-----------

Jan 27, 2012.

Restore LMR depth limit.

After 16003 games:
Mod vs Orig 2496 - 2421 - 11086 ELO +1 (+-3)

-----------

Feb 03, 2012.

Reduce lock contention in idle_loop.

After 7792 games with 4 threads at very fast TC (2"+0.05):
Mod vs Orig 1722 - 1627 - 4443 ELO +4 (+- 5.1)

-----------

Feb 21, 2012.

Don't update bestValue in check_is_dangerous().

After 24302 games at 2"+0.05:
Mod vs Orig 5122 - 5038 - 13872 ELO +1 (+- 2.9)

-----------

Feb 28, 2012.

Halve rook on open file bonus for endgame.

After 42206 fast games TC 2"+0.05:
Mod vs Orig 12871 - 16849 - 12486 ELO +3 (+- 2.6)

-----------

Mar 04, 2012.

Introduce pinning bonus.

After 27443 games at 2"+0.05:
Mod vs Orig 5900 - 5518 - 16025 ELO +4 (+- 2.7)

-----------

Mar 06, 2012.

Double pinner bonus.

After 34696 games at 2"+0.05:
Mod vs Orig 7474 - 7087 - 20135 ELO +3 (+- 2.4)

-----------

Mar 21, 2012.

Penalize undefended minors.

After 12112 games at 10"+0.05:
Mod vs Orig 2175 - 1997 - 7940 ELO +5 (+- 3.7)

-----------

Mar 22, 2012.

Use a local copy of tte->value().

After 3913 games at 10"+0.05:
Mod vs Orig 662 - 651 - 2600  ELO +0 (+- 6.4)

-----------

Mar 26, 2012.

Merge pull request #9 from glinscott/master.

After 17522 games at 10"+0.05:
Mod vs Orig 3064 - 2967 - 11491 ELO +2

-----------

Mar 27, 2012.

Merge pull request #11 from glinscott/squash.

After 10670 games at 10"+0.05:
Mod vs Orig 2277 - 1941 - 6452 ELO +11 !!!
I paid attention to the matches that test Elo improvements; I tried Elo_uncertainties_calculator on them, and this is what I got for 2-sigma confidence (~ 95.45% confidence):

Code: Select all

+1670 -1676 =7403

-0.19393559676639818 ± 3.7395202335661725

-----------

+1639 -1604 =7181

1.1665666913537868 ± 3.7962750993598505

-----------

+2496 -2421 =11086

1.6283109227786801 ± 3.0447954860176071

-----------

+1722 -1627 =4443

4.2361417066181864 ± 5.1610055702372184

-----------

+5122 -5038 =13872

1.2144102878671067 ± 2.9145428898148096

(The number of games is 24032, and not 24302).

-----------

+12871 =16849 -12486

3.1693695380260586 ± 2.6217624674768905

-----------

+5900 -5518 =16025

4.8365326904078752 ± 2.7055746404858327

-----------

+7474 -7087 =20135

3.8754654158819428 ± 2.4166688511261199

-----------

+2175 -1997 =7940

5.1063397513499519 ± 3.7053963364066128

-----------

+662 -651 =2600

0.97669345976465415 ± 6.4353888065642292

-----------

+3064 -2967 =11491

1.9233875170290040 ± 3.0797866550019397

-----------

+2277 -1941 =6452

10.944420506197918 ± 4.2286345407810256
My results do not differ much from the ones posted by Marco (this is good). I know that giving so many decimals (as I am doing right now) is ridiculous, but I will add these means and standard deviations (keeping all the decimals) in the following way:

Code: Select all

_i stands for the subindex i.

(Average mean) = SUM(mean_i)

(Average standard deviation) = sqrt(SUM{[(standard deviation)_i]²})
My nomenclature is this: each mean is the Elo difference (Mod. vs. Orig.) of each match, and each standard deviation is the uncertainty (the ± value for 2-sigma confidence) of each match. I think this is the way to add independent (uncorrelated) normal distributions, although I am sure that is not exactly the case here.
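In code form, the sums are simply these (a minimal sketch; only the first three of the twelve matches are filled in, just as an illustration):

Code: Select all

program combine_matches
implicit none
integer, parameter :: m = 3                  ! number of independent matches (twelve in my case)
real, dimension(m) :: mean, sigma
! Elo difference and 2-sigma uncertainty of each match (first three values from above):
mean  = (/ -0.1939, 1.1666, 1.6283 /)
sigma = (/  3.7395, 3.7963, 3.0448 /)
write(*,*) 'Sum of the means:                 ', sum(mean)
write(*,*) 'Standard deviations in quadrature:', sqrt(sum(sigma**2))
end program combine_matches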

I keep all the decimals and only round the final results:

Code: Select all

(Average mean) ~ 38.88370289050876647
(Average standard deviation) ~ 13.241666150449844666693947272699
So, would it be correct to say that SF is now ~ +38.88 ± 13.24 (with 2-sigma confidence) compared to SF 2.2.2? The overall number of games is 217562 (!!), split into twelve matches at different TC. I know that self-testing tends to exaggerate Elo rating differences, and that adding results from different TC is not the most correct thing... but taking the worst case, +38.88 - 13.24 = +25.64, can it be asserted that the current SF is (at least) 25 Elo stronger than version 2.2.2?

Thank you very much to Marco, Gary et al: you are doing a great job!

Regards from Spain.

Ajedrecista.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

Thanks to you for this analysis work on SF, very interesting!

I don't want to rain on your parade, but I can tell you that if we are very lucky the current SF is about 10 ELO stronger than 2.2.2, and almost all of the difference is due to Gary's patch of 27 March. You have to consider that results at very fast TC are not reliable for a quantitative ELO estimation: they just show that there _could_ be something interesting in the patch under test. Most of them do not survive the longer-TC verification (ask Gary!), and anyhow, even in the best case, the ELO advantage tends to smooth out at longer TC. Only very few patches show a scalable behavior, and these are the "gold" ones, the incredibly difficult to find and rare gold patches sought by all engine developers. Probably next week I will run a verification test between the current SF and 2.2.2 so as to have a direct measure of the advantage... but I can already tell you that I'd sign immediately for a +10 ELO.
Ajedrecista
Posts: 1969
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: SF improvements in GitHub.

Post by Ajedrecista »

Hello Marco:
mcostalba wrote:Thanks to you for this analysis work on SF, very interesting!

I don't want to rain on your parade, but I can tell you that if we are very lucky the current SF is about 10 ELO stronger than 2.2.2, and almost all of the difference is due to Gary's patch of 27 March. You have to consider that results at very fast TC are not reliable for a quantitative ELO estimation: they just show that there _could_ be something interesting in the patch under test. Most of them do not survive the longer-TC verification (ask Gary!), and anyhow, even in the best case, the ELO advantage tends to smooth out at longer TC. Only very few patches show a scalable behavior, and these are the "gold" ones, the incredibly difficult to find and rare gold patches sought by all engine developers. Probably next week I will run a verification test between the current SF and 2.2.2 so as to have a direct measure of the advantage... but I can already tell you that I'd sign immediately for a +10 ELO.
Thanks for your answer. I am aware that, too many times, changes that seem good at STC are almost nonexistent at LTC: I remember the case of Gull (from 1.1 to 1.2), when the author expected a big gain (+30 or +50 for example, I do not remember) based on STC results, but the real Elo gain at LTC was much smaller... the IPON list (maybe not a very long TC, but definitely not a short one) showed around +5 or +10 (which is not bad, but far from the author's expectations). More or less the same has happened more recently with Cheng (1.05 to 1.06 or 1.06 to 1.07, I do not remember the correct versions).

People will thank you for the test between the current SF development version and SF 2.2.2... me too! I think that nobody expected any improvement from SF 2.2.1 to SF 2.2.2 (maybe only bug fixes in time management, as in 2.2.1 fixing 2.2), and finally Ingo showed an improvement of around +8 or +10 (I know that this is inside the error bars). So, surprises exist! Good luck, and thanks again for this wonderful engine.

Regards from Spain.

Ajedrecista.
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: SF improvements in GitHub.

Post by mar »

Hi Jesus,
Ajedrecista wrote:More or less the same has happened more recently with Cheng (1.05 to 1.06 or 1.06 to 1.07, I do not remember the correct versions).
Yes, that's correct. 1.05 and 1.06 were a big disappointment (in CCRL they were actually worse than 1.04); for 1.07 I expected +30 but got only +13 in CEGT, while I got +60 in CCRL. The reasons are that

1) I can't test. I test only 1' games (1000 games), then I test 40/4 (200 games). Since 1.07 I also test 40/20 (200 games) when tests at 1' and 40/4' agree. I still want to revise that.
2) Cheng can't be compared to real, serious chess engines like Stockfish or Gull; it's still in a toy stage. This is not modesty but reality. A (very) long way to go...

Martin
Ajedrecista
Posts: 1969
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: SF improvements in GitHub.

Post by Ajedrecista »

Hello:
mcostalba wrote:Thanks to you for this analysis work on SF, very interesting!

I don't want to rain on your parade, but I can tell you that if we are very lucky the current SF is about 10 ELO stronger than 2.2.2, and almost all of the difference is due to Gary's patch of 27 March. You have to consider that results at very fast TC are not reliable for a quantitative ELO estimation: they just show that there _could_ be something interesting in the patch under test. Most of them do not survive the longer-TC verification (ask Gary!), and anyhow, even in the best case, the ELO advantage tends to smooth out at longer TC. Only very few patches show a scalable behavior, and these are the "gold" ones, the incredibly difficult to find and rare gold patches sought by all engine developers. Probably next week I will run a verification test between the current SF and 2.2.2 so as to have a direct measure of the advantage... but I can already tell you that I'd sign immediately for a +10 ELO.
This time I have checked the changes between SF 2.1.1 and SF 2.2.2:

https://github.com/mcostalba/Stockfish/commits/master

Code: Select all

May 28, 2011:

Retire mateKiller.

After 6456 games: 1281 - 1293 - 3882 ELO +0 (+- 5.5)

-----------

May 30, 2011:

New extended probcut implementation.

After 7917 games 4 threads 20"+0.1:
Mod vs Orig: 1261 - 1095 - 5561 ELO +7 (+- 4.2) LOS 96%

-----------

Jun 08, 2011:

Use refinedValue in ProbCut condition.

After 12613 games at 20"+0.1 on QUAD:
Mod vs Orig 1870 - 1863 - 8880 ELO +0 (+- 3.3)

-----------

Jun 12, 2011:

Increase LMR limit by one ply.

After 8278 games:
Mod vs Orig 1246 - 1265 - 5767 +0 ELO (+- 4.2)

-----------

Jun 15, 2011:

Try only recaptures in qsearch if depth is very low.

After 9078 games 20"+0.1 QUAD:
Mod vs Orig 1413 - 1319 - 6346 ELO +3 (+- 4)

-----------

Jun 24, 2011:

PHQ settings for King and Mobility.

After 8130 games on QUAD at 20"+0.1:
1342 - 1359 - 5429 ELO +0 (+- 4.4)

-----------

Jun 29, 2011:

Small tweaks to search().

To be sure verified in real games with 4 threads TC 2"+0.1:
After 11125 games 2497 - 2469 - 6159 ELO +0 (+- 4.4)

-----------

Aug 07, 2011:

Split at root!

After 5876 games at 10"+0.1:
Mod vs Orig: 1073 - 849 - 3954 ELO +13 (+- 5.2)

-----------

Aug 09, 2011:

Retire Rml full PV search at depth == 1.

No regression after 6400 games:
Mod vs Orig 1052 1012 4336 ELO +2 (+- 4.9)

-----------

Sep 24, 2011:

Update killers after a TT hit.

After 16707 games: 2771 - 2595 - 11341 ELO +3 (+- 3.2)

-----------

Oct 19, 2011:

Increase Mobility.

After 8736 games at 10"+0.1:
Mod vs Orig 1470 - 1496 - 5770 ELO -1 (+-4.3)

-----------

Nov 12, 2011:

Simplify passed pawns logic.

After 16284 games:

Mod vs Orig 2728 - 2651 - 10911 ELO +1 (+- 3.1)

-----------

Nov 14, 2011:

CLOP: Passed pawns weights tuning.

After 11720 games at 10"+0.1:
Mod vs Orig 1922 - 1832 - 7966 ELO +2 (+-3.6)

-----------

Nov 18, 2011:

Rewrite early stop logic.

After 12245 games at 30"+0.1:
Mod vs Orig 1776 - 1775 - 8694 ELO +0 (+-3.4)

-----------

Dec 06, 2011:

Retire all extensions (but checks) for non-PV nodes.

After 9555 games:
Mod vs Orig 1562 - 1540 - 6453 ELO +0 (+- 4)

-----------

Dec 08, 2011:

Set captureThreshold according to static evaluation.

After 7502 games:
Mod vs Orig 1225 - 1158 - 5119 ELO +3 (+- 4.5)

-----------

Dec 10, 2011:

Don't update bestValue when pruning & allow to prune also first move.

After 5817 games:
Mod vs Orig 939 - 892 - 3986 ELO +2 (+- 5.1)

-----------

Dec 13, 2011:

Simplify aspiration window calculation.

After 5350 games:
Mod vs Orig 800 - 803  - 3647 ELO +0 (+- 5.2)

-----------

Dec 24, 2011:

Don't update killers for evasions.

After 11893 games:
Mod vs Orig 1773 - 1696 - 8424 ELO +2 (+-3.4)
I hope that I have not overlooked any match. I ran Elo_uncertainties_calculator again to verify my tiny programme, and everything seems correct:

Code: Select all

+1281 -1293 =3882

-0.64579179487323807 ± 5.4611037458352477

-----------

+1261 -1095 =5561

7.2859367638477808 ± 4.2591507499082744

-----------

+1870 -1863 =8880

0.19282084740771891 ± 3.3661106598945216

-----------

+1246 -1265 =5767

-0.79744959133723688 ± 4.2065103817540560

-----------

+1413 -1319 =6346

3.5977211217999496 ± 4.0007632647889706

-----------

+1342 -1359 =5429

-0.72649613311599335 ± 4.4422065301987434

-----------

+2497 -2469 =6159

0.87444646615476786 ± 4.4018023484057432

-----------

+1073 -849 =3954

13.251072733698412 ± 5.1807926604982537

-----------

+1052 -1012 =4336

2.1715006845592905 ± 4.9328625222435862

-----------

+2771 -2595 =11341

3.6601978338675811 ± 3.0466000647072468

-----------

+1470 -1496 =5770

-1.0340375337973548 ± 4.3320945817152218

-----------

+2728 -2651 =10911

1.6422798852609790 ± 3.1285329280205616

(The number of games is 16290, and not 16284).

-----------

+1922 -1832 =7966

2.6680731540265548 ± 3.6326643698903733

-----------

+1776 -1775 =8694

2.83736697668427427E-02 ± 3.3816959148027923

-----------

+1562 -1540 =6453

0.79995775913816650 ± 4.0505400584952409

-----------

+1225 -1158 =5119

3.1030129517056121 ± 4.5216181970895560

-----------

+939 -892 =3986

2.8072593909360399 ± 5.1116846635594892

-----------

+800 -803 =3647

-0.19853464190784400 ± 5.2996274644077503

-----------

+1773 -1696 =8424

2.2494672410070339 ± 3.4412500050942415
These uncertainties are for 2-sigma confidence (~ 95.45% confidence). The main point of this post is not to test Elo_uncertainties_calculator, but two other things:

- To make people realize how hard it is to improve a top-level engine (I am sure people know it very well). Even better: how difficult it is to improve ANY engine.

- To compare the estimated Elo improvement (as I did in the second post of this topic) with the real Elo improvement (in IPON and in Clemens Keck's Base, for example).

Doing the same math as I did before (hoping no typos):

Code: Select all

(Average mean) ~ 40.9298108081450626127
(Average standard deviation) ~ 18.672345721847575450175753991207
If I am not wrong, the overall number of games is 181682, split into nineteen matches under different conditions.

A naive claim would be ~ +40.93 ± 18.67 (in the worst case: 40.93 - 18.67 = 22.26), but the reality in those lists is not even ~ +20. Looking only at the ratings (without considering the error bars):

+9 in IPON.
+13 in the Base.

So, Marco's guess of around +10 at this moment (current development version against 2.2.2) makes a lot of sense (maybe we will see it very soon!); my baseless guess is a little higher: around +15. To all programmers: please keep up the good work! Users are the main beneficiaries. Thank you very much!

Regards from Spain.

Ajedrecista.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

Ajedrecista wrote: So, Marco's guess of around +10 at this moment (current development version against 2.2.2) makes a lot of sense (maybe we will see it very soon!);
A bit short of that: after more than 13K games we are at +7 ELO (almost all due to Gary's last patch). I have stopped the test and moved on with the test queue.....
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

mcostalba wrote:A bit short of that: after more than 13K games we are at +7 ELO (almost all due to Gary's last patch). I have stopped the test and moved on with the test queue.....
Ah, unfortunate. I think the undefended pieces idea didn't really pan out. I'll run a test with/without that in the eval against a broader set of engines and see how things go.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SF improvements in GitHub.

Post by mcostalba »

gladius wrote:
mcostalba wrote:A bit short of that: after more than 13K games we are at +7 ELO (almost all due to Gary's last patch). I have stopped the test and moved on with the test queue.....
Ah, unfortunate. I think the undefended pieces idea didn't really pan out. I'll run a test with/without that in the eval against a broader set of engines and see how things go.
I have already started the same test removing my patch on "pinning bonuses"....let's see how it goes.

But my experience says that those 3 missing ELO points are missing forever ;-) The best we can do is to remove code that is not useful; this is something I did in the past and will do again in the next weeks: people always add stuff to the evaluation, but I am going the other route, trying to remove as much as possible from the evaluation without introducing regressions. The goal is to keep in the evaluation only stuff that really works. This paves the way for new advances; otherwise good stuff gets mixed with noise.

Note that removing from the evaluation what does not work is very risky and very, very difficult, because it requires super resolution: it is tempting to remove stuff that in tests shows +0 or even -1, -2, but after removing 2-3 of them you test the result and the engine is weaker. That's why I am very careful about adding new code to the evaluation: if it is useless, it becomes extremely difficult to step back once it is in.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: SF improvements in GitHub.

Post by gladius »

mcostalba wrote:But my experience says that those 3 missing ELO points are missing forever ;-) The best we can do is to remove code that is not useful; this is something I did in the past and will do again in the next weeks: people always add stuff to the evaluation, but I am going the other route, trying to remove as much as possible from the evaluation without introducing regressions. The goal is to keep in the evaluation only stuff that really works. This paves the way for new advances; otherwise good stuff gets mixed with noise.

Note that removing from the evaluation what does not work is very risky and very, very difficult, because it requires super resolution: it is tempting to remove stuff that in tests shows +0 or even -1, -2, but after removing 2-3 of them you test the result and the engine is weaker. That's why I am very careful about adding new code to the evaluation: if it is useless, it becomes extremely difficult to step back once it is in.
I think that's a great way to go about things. Especially as the extra code makes it harder to understand the evaluation, and how things work together. Makes it more difficult to tune as well.

Undefended_pieces_removal test is running now :).