measuring improvement of the engine

cdani · Post by **cdani** » Sat Jan 02, 2016 1:22 pm

I tried to improve the way I measure the improvements of Andscacs, and I wanted to share it.

The idea is to call Ordo to show the ratings of the gauntlet/self-play games, and also BayesElo to show error bars. Maybe the second part can be done with Ordo too, but I don't know how.

It is a bat file that takes several arguments, so that it can work with pgn files produced simultaneously on different computers.

First it copies the original pgn files to a folder "copiespgn".

Then it appends all the pgn files to the first one and deletes the other files.

Then it uses the findandreplace utility to change
Result "?"
to
Result "*"
because the former is incompatible with Ordo.

Then it calls Ordo and finally BayesElo.

To call this batch file from Windows Explorer, drag the pgn files onto the bat file.
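
For anyone not on Windows, the merge and clean-up steps described above can be sketched in Python. This is only an illustration, not part of the batch file: the function names `normalize_results` and `merge_pgn_texts` are made up for this example, and it assumes the unknown result appears as the PGN tag `[Result "?"]`.

```python
import re

# Matches the PGN result tag that carries the "?" marker.
UNKNOWN_RESULT = re.compile(r'\[Result\s+"\?"\]')


def normalize_results(pgn_text: str) -> str:
    """Rewrite [Result "?"] as [Result "*"], the PGN marker for an unknown result."""
    return UNKNOWN_RESULT.sub('[Result "*"]', pgn_text)


def merge_pgn_texts(texts):
    """Concatenate several PGN texts, separated by one blank line each."""
    cleaned = [normalize_results(t).strip() for t in texts if t.strip()]
    return "\n\n".join(cleaned) + "\n" if cleaned else ""
```

Reading each pgn file into a string, passing the list to `merge_pgn_texts` and writing the result to a new file leaves the originals untouched, so the separate backup copy is not needed in this variant.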

Maybe someone wants to comment on it.

Code: Select all

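@echo off

rem Switch to the cutechess-cli working folder.
f:
cd F:\p\c\cutechess-cli\

rem Back up every dropped pgn, then append all but the first one to the first.
rem %%~f and %~1 strip surrounding quotes so paths with spaces compare correctly.
for %%f in (%*) do (
	copy /Y "%%~f" F:\p\c\cutechess-cli\copiespgn
	IF NOT "%%~f" == "%~1" (
		type "%%~f" >> "%~1"
		del "%%~f"
	)
)

rem https://findandreplace.codeplex.com/
rem Ordo cannot read Result "?", so rewrite it as Result "*".
"fnr.exe" --silent --cl --dir "F:\p\c\cutechess-cli" --fileMask "%~n1.pgn" --excludeFileMask "*.dll, *.exe" --find "Result ""?""" --replace "Result ""*"""

xordo-win64.exe -a2856 -q -p %1 -G -W -D

rem Build the BayesElo command script, run it and keep only the Andscacs line.
echo readpgn %1 >elo.txt
echo elo  >>elo.txt
echo mm >>elo.txt
echo exactdist >>elo.txt
echo ratings >>elo.txt
echo x >>elo.txt
echo x >>elo.txt

type elo.txt|bayeselo.exe|grep Andscacs
del elo.txt
pause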

Result example:

Code: Select all

   # PLAYER                 : RATING    POINTS  PLAYED    (%)
   1 Equinox 3.20 x64mp     : 2934.5     180.5     272   66.4%
   2 Bouquet 1.8 x64        : 2907.5     173.5     276   62.9%
   3 Komodo 5.1r2 64-bit    : 2875.8     160.5     274   58.6%
   4 Gull 3 x64             : 2852.1     151.5     274   55.3%
   5 Fire 4 x64             : 2845.4     150.0     276   54.3%
   6 Komodo 7a 64-bit       : 2839.1     146.5     274   53.5%
   7 Elektro 1.2            : 2839.0     148.0     277   53.4%
   8 Stockfish_290915       : 2837.5     148.0     278   53.2%
   9 Protector 1.9.0        : 2835.4     144.0     272   52.9%
  10 BlackMamba 2.0 x64     : 2835.1     145.5     275   52.9%
  11 Andscacs 0.84048       : 2814.6    1200.0    2748   43.7%

White advantage = 28.56
Draw rate (equal opponents) = 31.57 %

2748 game(s) loaded, 0 game(s) with unknown result ignored.
  11 Andscacs 0.84048      -39   10   11  2748   44%     4   31%

Don't try to interpret the results, because each engine uses a different time control. Also, the number of games is very low because it is just a random pgn file I had.

Evert · Post by **Evert** » Sat Jan 02, 2016 3:42 pm

I use a Perl script to get ratings and LOS from a set of PGN files:

Code: Select all

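#!/usr/bin/perl
use strict;
use warnings;

# Feed every PGN file given on the command line to BayesElo,
# then ask for the ratings and the likelihood of superiority (LOS).
if ($#ARGV >= 0) {
   open ELO, "| ~/Program/Chess/BayesElo/bayeselo" or die "Cannot start bayeselo: $!";
   print ELO "prompt off\n";
   foreach (@ARGV) { print ELO "readpgn $_\n"; }
   print ELO "elo
      mm
      ratings
      los
      ";
   close ELO;
}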

I have a separate one that does a SPRT test on a match between two programs to see if one of them is superior to the other.
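
The SPRT script itself is not shown in the thread. As a rough idea of what such a test involves, here is a minimal Python sketch of Wald's sequential probability ratio test on win/draw/loss counts, using the BayesElo win/draw/loss model. Everything in it is an assumption made for illustration: the function names, the hypotheses `elo0` and `elo1`, the fixed `draw_elo` of 200 and the error rates `alpha` and `beta` are not taken from Evert's script.

```python
import math


def bayeselo_probs(elo, draw_elo):
    """Win, draw and loss probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss


def sprt_llr(wins, draws, losses, elo0, elo1, draw_elo=200.0):
    """Log-likelihood ratio of H1 (elo1) against H0 (elo0) for the observed counts."""
    w0, d0, l0 = bayeselo_probs(elo0, draw_elo)
    w1, d1, l1 = bayeselo_probs(elo1, draw_elo)
    return (wins * math.log(w1 / w0)
            + draws * math.log(d1 / d0)
            + losses * math.log(l1 / l0))


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping bounds for the log-likelihood ratio."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Return 'accept H0', 'accept H1' or 'continue' for the current ratio."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The match is stopped as soon as the ratio crosses one of the two bounds, which is why an SPRT usually needs fewer games than a fixed-length match. Note that `elo0` and `elo1` are on the BayesElo scale here, which is not the same as the plain logistic Elo scale.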

jdart · Post by **jdart** » Sat Jan 02, 2016 6:36 pm

It looks to me (from the Ordo manual) that the -s switch computes error bars.

--Jon

cdani · Post by **cdani** » Sat Jan 02, 2016 6:57 pm

jdart wrote:It looks to me (from the Ordo manual) that the -s switch computes error bars.

--Jon

I see. I didn't understand that it was the same thing. Thanks!

michiguel · Post by **michiguel** » Sat Jan 02, 2016 8:59 pm

cdani wrote:
jdart wrote:It looks to me (from the Ordo manual) that the -s switch computes error bars.

--Jon
I see. I didn't understand that it was the same thing. Thanks!

Correct, the errors are calculated after the gauntlets are simulated.

Ordo is well equipped to gauge the improvement of an engine over its previous version. What you may want to do is run something like this:

ordo -p "games.pgn" -A "engine 1.0" -a 0 -D -W -s1000 -n8

if you have an octa-core machine, -n4 if you have a quad-core, etc. (to speed up the calculation).

Then the rating of "engine 2.0" will be reported relative to the rating of "engine 1.0", with its specific error bar. With the switch -F you control the confidence level (95% is the default).
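
To give a feeling for what an error bar obtained by simulation means, here is a small Python sketch. It is not Ordo's algorithm (Ordo simulates whole gauntlets from the fitted ratings); it only resamples a single match from its observed win/draw/loss frequencies and reads the interval off the simulated Elo differences. The function names, the default of 1000 simulations and the fixed seed are assumptions chosen for the example.

```python
import math
import random


def elo_from_score(score):
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def simulated_elo_interval(wins, draws, losses, simulations=1000,
                           confidence=0.95, seed=0):
    """Return (point estimate, lower bound, upper bound) of the Elo difference."""
    n = wins + draws + losses
    rng = random.Random(seed)

    def clamp(score):
        # Keep the score away from 0 and 1 so the logarithm stays finite.
        return min(max(score, 0.5 / n), 1.0 - 0.5 / n)

    samples = []
    for _ in range(simulations):
        # Replay the match: each game is drawn from the observed frequencies.
        points = sum(rng.choices([1.0, 0.5, 0.0],
                                 weights=[wins, draws, losses], k=n))
        samples.append(elo_from_score(clamp(points / n)))
    samples.sort()

    tail = (1.0 - confidence) / 2.0
    lower = samples[int(tail * (simulations - 1))]
    upper = samples[int((1.0 - tail) * (simulations - 1))]
    point = elo_from_score(clamp((wins + 0.5 * draws) / n))
    return point, lower, upper
```

Raising `confidence` widens the interval and more simulations make its ends less noisy, which is the same trade-off the -F and -s switches expose.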

One alternative for a fast (but still very accurate) calculation is to use the "multiple anchors" option. I use this to check the rating during the run, since the calculation is almost instantaneous.

You need to provide a file with "fixed" ratings of the opponents ("anchors"), which you probably know after using them in multiple previous gauntlets. For instance, a file (named anchors.csv in the following example) containing

"Opponent 1", 3000
"Opponent 2", 2900
"Opponent 3", 2950
etc.

Then you run

ordo -p "games.pgn" -m "anchors.csv" -D -W

which will be really fast, and if you want errors

ordo -p "games.pgn" -m "anchors.csv" -D -W -s1000 -n8
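
To illustrate why fixed anchors make the calculation so cheap, here is a minimal Python sketch: when all opponent ratings are known, only one unknown remains, and it can be found by bisection on the expected score. This is not Ordo's implementation and it ignores draw rate and white advantage (the -D and -W switches); the function names and the dictionary layout are assumptions for the example.

```python
def expected_score(rating, opponent_rating):
    """Expected score of `rating` against `opponent_rating`, logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def anchored_rating(results, anchors, low=0.0, high=6000.0, tol=1e-6):
    """Rating at which the expected points equal the points actually scored.

    results: {opponent name: (points scored, games played)}
    anchors: {opponent name: fixed rating}
    """
    total_points = sum(points for points, _ in results.values())

    def surplus(rating):
        expected = sum(games * expected_score(rating, anchors[name])
                       for name, (_, games) in results.items())
        return expected - total_points

    # The expected points grow with the rating, so bisection converges.
    while high - low > tol:
        mid = (low + high) / 2.0
        if surplus(mid) > 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2.0
```

For example, `anchored_rating({"Opponent 1": (50, 100)}, {"Opponent 1": 3000})` comes out at about 3000, since a 50% score means equal strength.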

I hope I did not screw up the options or forget something. If I did, let me know.

Miguel

cdani · Post by **cdani** » Sun Jan 03, 2016 4:38 am

Muchas gracias! Thanks!
I will try this.

michiguel · Post by **michiguel** » Sun Jan 03, 2016 8:49 am

cdani wrote:Muchas gracias! Thanks!
I will try this.

Adam suggested that I do this, and I think it may help you too:
https://github.com/michiguel/Ordo/releases/tag/1.0-mf5

source code
https://github.com/michiguel/Ordo/tree/1.0-mf5

This version accepts several input files. No need to combine them beforehand. I think it will make scripts easier. Look at the example in manual.pdf or the readme file. For instance:

If several input files are used, they can be listed after '--' at the end of all switches:
ordo -a 2800 -A "Engine X" -o ratings.txt -- input1.pgn input2.pgn input3.pgn "My other file.pgn"
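
From a script, the same command line can be assembled as an argument list, which avoids quoting problems with file names that contain spaces. A minimal Python sketch, using only the switches from the example above; the helper name `ordo_command` and its default values are made up for this illustration.

```python
import shlex


def ordo_command(pgn_files, anchor_rating=2800, anchor_name="Engine X",
                 output="ratings.txt"):
    """Build the argument list: all switches first, then '--', then the input files."""
    return ["ordo", "-a", str(anchor_rating), "-A", anchor_name,
            "-o", output, "--", *pgn_files]


cmd = ordo_command(["input1.pgn", "input2.pgn", "input3.pgn", "My other file.pgn"])
# Show the command as it would be typed in a POSIX shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it, assuming an Ordo build that supports multiple input files is on the PATH.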

This is in the "dev" branch. If it works well, I will move it to "master" at some point, probably after incorporating some other feature.

Miguel
