I'm facing a self-test issue: I have a +40 in self-testing that in fact seems to be a -20 against others.
I'm quite aware that self-tests are often over-optimistic, but here
This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...).
Any advice?
self test
-
- Posts: 1871
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France
-
- Posts: 27793
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: self test
This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.
-
- Posts: 1357
- Joined: Wed Mar 08, 2006 10:15 pm
- Location: San Francisco, California
Re: self test
xr_a_y wrote: ↑Wed Sep 18, 2019 7:04 pm
I'm facing a self-test issue: I have a +40 in self-testing that in fact seems to be a -20 against others.
I'm quite aware that self-tests are often over-optimistic, but here
This concerns eval features that take into account "next move" possibilities (hanging pieces, forks, pawn pushes, ...).
Any advice?

My answer is pretty simple: I never self-test. Whenever I have a new version to test, I always run a gauntlet of 12 different engines that are mostly within -20/+50 Elo of Myrddin.
-
- Posts: 146
- Joined: Fri Mar 15, 2019 8:46 pm
- Location: Germany
- Full name: Fabian von der Warth
Re: self test
Just out of curiosity:
How many games were run, against whom (in self-play and vs. others), at which TC, and with what results?
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: self test
JVMerlino wrote: ↑Wed Sep 18, 2019 7:27 pm
My answer is pretty simple: I never self-test. Whenever I have a new version to test, I always run a gauntlet of 12 different engines that are mostly within -20/+50 Elo of Myrddin.

It is useful to include the old version (without the current patch) in the multi-engine gauntlet and test it there as well. That way, one can make sure there is a definite Elo increase.
-
- Posts: 1871
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France
Re: self test
1000 games in self-test, resulting in a +/-20 margin.
5000 games in a 10-player tourney, resulting in a +/-19 margin.
But this was repeated twice with the same results.
TC is 40 moves in 20 sec, with a 1024 MB TT.
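As a rough illustration of where such margins come from, here is a minimal Python sketch that turns a win/draw/loss count into an Elo difference with approximate 95% bounds, and then checks whether two measurements differ by more than their combined margin. The function names and the 350/300/350 result are hypothetical; the logistic Elo model, the normal approximation, and treating the two measurements as independent estimates of the same gain are all assumptions, not something stated in the thread.

```python
import math

def elo_from_score(score):
    """Convert an expected score strictly between 0 and 1 to an Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result; z=1.96 gives ~95% bounds."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

def disagree(elo_a, margin_a, elo_b, margin_b):
    """True if two independent estimates differ by more than their combined margin."""
    return abs(elo_a - elo_b) > math.hypot(margin_a, margin_b)

# Hypothetical 1000-game result (not from the thread): 350 wins, 300 draws, 350 losses.
elo, low, high = elo_with_margin(350, 300, 350)
print(f"{elo:+.1f} Elo, ~95% bounds [{low:+.1f}, {high:+.1f}]")

# Figures quoted in the thread: +40 +/-20 in self-test vs. -20 +/-19 in the tourney.
print(disagree(40, 20, -20, 19))
```

Under these assumptions, a 60 Elo gap between the two measurements is larger than the combined margin, which is consistent with the discrepancy not being pure noise.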
-
- Posts: 1871
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France
Re: self test
D Sceviour wrote: ↑Wed Sep 18, 2019 8:31 pm
It is useful to include the old version (without the current patch) in the multi-engine gauntlet and test it there as well. That way, one can make sure there is a definite Elo increase.

I did! Two previous versions were included.
-
- Posts: 1871
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France
Re: self test
hgm wrote: ↑Wed Sep 18, 2019 7:14 pm
This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.

I think, for example, that it was indeed harder for Minic to match Winter than other ~2900-rated engines.
-
- Posts: 146
- Joined: Fri Mar 15, 2019 8:46 pm
- Location: Germany
- Full name: Fabian von der Warth
Re: self test
Very weird behaviour; I haven't experienced that against other engines (then again, I don't test that extensively against other engines). Usually the self-play gain is almost exactly what I get against other engines as well.
Your TC seems good to me. I usually test self-play at 10+0.1 and, if a patch passes, confirm at 60+0.6. I would say 40/20 is roughly somewhere between 20+0.2 and 30+0.3, which should be sufficient.
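To make the time-control comparison concrete, here is a small Python sketch that computes the average time budget per move. The helper name, the 80-move game length, and the reading of 40/20 as a repeating "40 moves in 20 seconds" control are assumptions made for illustration only.

```python
import math

def avg_seconds_per_move(moves, base, increment=0.0, moves_per_period=None):
    """Average time budget per move over a game lasting `moves` moves."""
    if moves_per_period:
        # Repeating control such as 40/20: `base` seconds for each block of moves.
        total = math.ceil(moves / moves_per_period) * base
    else:
        # Increment control such as 20+0.2: base time plus an increment per move.
        total = base + increment * moves
    return total / moves

GAME_LENGTH = 80  # assumed game length in moves, purely illustrative

for name, kwargs in [
    ("40/20", dict(base=20, moves_per_period=40)),
    ("20+0.2", dict(base=20, increment=0.2)),
    ("30+0.3", dict(base=30, increment=0.3)),
]:
    print(name, round(avg_seconds_per_move(GAME_LENGTH, **kwargs), 3))
```

The ordering depends on the assumed game length: in shorter games the increment controls give more time per move, because the base time is spread over fewer moves.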
-
- Posts: 2554
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: self test
I always did only self-testing, and typically got about half of the expected gain in CCRL.
It certainly worked for me and works for others (if you play enough games).
I did something perhaps non-standard, namely always playing against the last released version (i.e. any fixed, stable previous version);
this way I didn't fall into the trap of accumulating "improvements" when you are actually accumulating error (i.e. not chasing my own tail);
this is especially true for small improvements.
Always 10k games for small improvements, but the error bars were still 9 Elo.
Martin Sedlak