self test

xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

self test

Post by xr_a_y »

I'm facing a self-test issue ... I have a +40 in self-testing that in fact seems to be a -20 against others ...
I'm quite aware that self-tests are often over-optimistic, but here :shock: :shock:

This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...)

Any advice?
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: self test

Post by hgm »

This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.
JVMerlino
Posts: 1357
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: self test

Post by JVMerlino »

xr_a_y wrote: Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 in self-testing that in fact seems to be a -20 against others ...
I'm quite aware that self-tests are often over-optimistic, but here :shock: :shock:

This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...)

Any advice?
My answer is pretty simple - I never self-test. :) I always run a gauntlet of 12 different engines that are mostly -20/+50 ELO compared to Myrddin whenever I have a new version to test.
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: self test

Post by fabianVDW »

Just because I am curious.

How many games were run, against whom (in self-play and vs. others), at which TC, and with what results?
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: self test

Post by D Sceviour »

JVMerlino wrote: Wed Sep 18, 2019 7:27 pm
xr_a_y wrote: Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 in self-testing that in fact seems to be a -20 against others ...
I'm quite aware that self-tests are often over-optimistic, but here :shock: :shock:

This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...)

Any advice?
My answer is pretty simple - I never self-test. :) I always run a gauntlet of 12 different engines that are mostly -20/+50 ELO compared to Myrddin whenever I have a new version to test.
It is useful to also include the old version (without the current patch) in the multi-engine gauntlet, so that it is tested alongside the new one. That way, one can make sure there is a definite Elo increase.
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: self test

Post by xr_a_y »

fabianVDW wrote: Wed Sep 18, 2019 8:12 pm Just because I am curious.

How many games were run, against whom (in self-play and vs. others), at which TC, and with what results?
1000 games in self-test, resulting in a +/-20 margin

5000 games in a 10-engine tourney, resulting in a +/-19 margin

But this was repeated twice with the same results.

TC is 40/20sec, with a 1024 MB TT.
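
As a rough sanity check on these numbers, the sketch below compares the two measurements quoted in this thread (+40 with a +/-20 margin in self-test, -20 with a +/-19 margin against others). It assumes both margins are at the same confidence level and that the two runs are independent; the function name is made up for illustration.

```python
import math


def compare_measurements(elo_a, margin_a, elo_b, margin_b):
    """Compare two Elo estimates that each come with a +/- margin.

    Assumption: both margins use the same confidence level and the two
    runs are independent, so the margins combine in quadrature.
    """
    diff = elo_a - elo_b
    combined_margin = math.hypot(margin_a, margin_b)
    # The gap is "beyond noise" only if it exceeds the combined margin.
    return diff, combined_margin, abs(diff) > combined_margin


# Numbers taken from the posts above: self-test vs. multi-engine tourney.
diff, combined, beyond_noise = compare_measurements(40, 20, -20, 19)
print(f"gap = {diff} Elo, combined margin = {combined:.1f} Elo, "
      f"beyond noise: {beyond_noise}")
```

Under these assumptions the 60 Elo gap is more than twice the combined margin, so statistical noise alone is an unlikely explanation for the disagreement.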
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: self test

Post by xr_a_y »

D Sceviour wrote: Wed Sep 18, 2019 8:31 pm
JVMerlino wrote: Wed Sep 18, 2019 7:27 pm
xr_a_y wrote: Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 in self-testing that in fact seems to be a -20 against others ...
I'm quite aware that self-tests are often over-optimistic, but here :shock: :shock:

This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...)

Any advice?
My answer is pretty simple - I never self-test. :) I always run a gauntlet of 12 different engines that are mostly -20/+50 ELO compared to Myrddin whenever I have a new version to test.
It is useful to also include the old version (without the current patch) in the multi-engine gauntlet, so that it is tested alongside the new one. That way, one can make sure there is a definite Elo increase.
I did! Two previous versions were included.
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: self test

Post by xr_a_y »

hgm wrote: Wed Sep 18, 2019 7:14 pm This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.
For example, I think it was indeed harder for Minic to match Winter than to match other ~2900-rated engines.
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: self test

Post by fabianVDW »

Very weird behaviour. I haven't experienced that against other engines (then again, I don't test that extensively against other engines). Usually the self-play gain is almost exactly what I get against other engines as well.

Your TC seems good to me. I usually test self-play at 10+0.1 and, if a patch passes, confirm it at 60+0.6. I would say 40/20 is roughly somewhere between 20+0.2 and 30+0.3, which should be sufficient.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: self test

Post by mar »

I always did self-testing only, and typically got about half of the expected gain in CCRL.

It certainly worked for me and works for others (if you play enough games).

I did something perhaps non-standard, namely always playing against the last released version (i.e. any fixed, stable previous version). This way I didn't fall into the trap of accumulating "improvements" when you are actually accumulating error (i.e. I wasn't chasing my own tail). This is especially true for small improvements.

I always played 10k games for small improvements, but the error bars were still 9 Elo.
Martin Sedlak
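
To get a feel for the error bars discussed in this thread, here is a minimal sketch that turns a win/draw/loss record into an Elo estimate with an approximate interval. It assumes the usual logistic Elo model, independent games, and a normal approximation of the score; the function names and the sample record are hypothetical, and real rating tools may report somewhat different margins.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a W/D/L record.

    Assumptions: games are independent and the mean score is roughly
    normal; z=1.96 corresponds to an approximate 95% interval.
    The record must not be so lopsided that score +/- z*stderr leaves (0, 1).
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    stderr = math.sqrt(variance / games)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))


# Hypothetical record: 10k games with 40% draws between equal engines.
elo, low, high = elo_with_margin(3000, 4000, 3000)
print(f"Elo {elo:+.1f} (interval {low:+.1f} .. {high:+.1f})")
```

Because the standard error shrinks with the square root of the number of games, halving the margin takes about four times as many games, which is why small improvements need so many games to confirm.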