self test

xr_a_y · Post by **xr_a_y** » Wed Sep 18, 2019 7:04 pm

I'm facing a self-test issue ... I have a +40 Elo gain in self-testing that in fact seems to be -20 Elo against other engines ...
I'm quite aware that self-tests are often over-optimistic, but here ...

This concerns eval features that take "next move" possibilities into account (hanging pieces, forks, pawn pushes, ...).

Any advice?

hgm · Post by **hgm** » Wed Sep 18, 2019 7:14 pm

This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.

JVMerlino · Post by **JVMerlino** » Wed Sep 18, 2019 7:27 pm

xr_a_y wrote: ↑Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 Elo gain in self-testing that in fact seems to be -20 Elo against other engines ...
I'm quite aware that self-tests are often over-optimistic, but here ...

This concerns eval features that take "next move" possibilities into account (hanging pieces, forks, pawn pushes, ...).

Any advice?

My answer is pretty simple - I never self-test.

Whenever I have a new version to test, I run a gauntlet of 12 different engines that are mostly within -20/+50 Elo of Myrddin.

fabianVDW · Post by **fabianVDW** » Wed Sep 18, 2019 8:12 pm

Just because I am curious.

How many games were run, against whom (in self-play and vs. others), at which TC, and with what results?

D Sceviour · Post by **D Sceviour** » Wed Sep 18, 2019 8:31 pm

JVMerlino wrote: ↑Wed Sep 18, 2019 7:27 pm
xr_a_y wrote: ↑Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 Elo gain in self-testing that in fact seems to be -20 Elo against other engines ...
I'm quite aware that self-tests are often over-optimistic, but here ...

This concerns eval features that take "next move" possibilities into account (hanging pieces, forks, pawn pushes, ...).

Any advice?
My answer is pretty simple - I never self-test. Whenever I have a new version to test, I run a gauntlet of 12 different engines that are mostly within -20/+50 Elo of Myrddin.

It is useful to include the old version (without the current patch) in the multi-engine gauntlet and test against it as well. That way, one can make sure there is a definite Elo increase.
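A minimal sketch of that comparison, assuming the old and the new version each played the same gauntlet opponents. The opponent names and win/draw/loss counts below are made up for illustration, and pooling all games into a single score is a simplification; dedicated rating tools fit all engines jointly.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def pool_score(results):
    """results maps opponent name -> (wins, draws, losses); returns the overall score fraction."""
    points = sum(w + 0.5 * d for w, d, _ in results.values())
    games = sum(w + d + l for w, d, l in results.values())
    return points / games

def gain_vs_pool(old_results, new_results):
    """Elo gain of the new version over the old one, both measured against the same pool."""
    return elo_from_score(pool_score(new_results)) - elo_from_score(pool_score(old_results))

# Hypothetical gauntlet results, not taken from the thread.
old_version = {"EngineA": (30, 40, 30), "EngineB": (25, 40, 35)}
new_version = {"EngineA": (35, 40, 25), "EngineB": (30, 40, 30)}
print(f"gain vs. pool: {gain_vs_pool(old_version, new_version):+.1f} Elo")
```

Because both versions face identical opponents, a patch that only exploits the engine's own weaknesses should show a smaller gain here than in a direct self-play match.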

xr_a_y · Post by **xr_a_y** » Wed Sep 18, 2019 8:36 pm

fabianVDW wrote: ↑Wed Sep 18, 2019 8:12 pm Just because I am curious.

How many games were run, against whom (in self-play and vs. others), at which TC, and with what results?

1000 games in self-test, resulting in a +/-20 Elo margin.

5000 games in a 10-engine tournament, resulting in a +/-19 Elo margin.

But this was repeated twice with the same results.

TC is 40/20sec, with a 1024 MB TT.
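For reference, margins of this size are what the usual normal approximation gives for about 1000 games per engine. The sketch below is one common way to estimate an Elo difference and an approximate 95% margin from win/draw/loss counts; the counts in the example are invented, and the function assumes the score stays strictly between 0 and 1 after adding the margin.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, margin); z=1.96 gives an approximate 95% margin."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Map the score interval to Elo and report half its width.
    lo = elo_from_score(score - z * stderr)
    hi = elo_from_score(score + z * stderr)
    return elo_from_score(score), (hi - lo) / 2.0

# Hypothetical 1000-game match, not the actual results from the thread.
elo, margin = elo_with_margin(wins=400, draws=300, losses=300)
print(f"{elo:+.1f} Elo, +/- {margin:.1f}")
```

The margin shrinks only with the square root of the number of games, so halving it takes roughly four times as many games.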

xr_a_y · Post by **xr_a_y** » Wed Sep 18, 2019 8:37 pm

D Sceviour wrote: ↑Wed Sep 18, 2019 8:31 pm
JVMerlino wrote: ↑Wed Sep 18, 2019 7:27 pm
xr_a_y wrote: ↑Wed Sep 18, 2019 7:04 pm I'm facing a self-test issue ... I have a +40 Elo gain in self-testing that in fact seems to be -20 Elo against other engines ...
I'm quite aware that self-tests are often over-optimistic, but here ...

This concerns eval features that take "next move" possibilities into account (hanging pieces, forks, pawn pushes, ...).

Any advice?
My answer is pretty simple - I never self-test. Whenever I have a new version to test, I run a gauntlet of 12 different engines that are mostly within -20/+50 Elo of Myrddin.
It is useful to include the old version (without the current patch) in the multi-engine gauntlet and test against it as well. That way, one can make sure there is a definite Elo increase.

I did! Two previous versions were included.

xr_a_y · Post by **xr_a_y** » Wed Sep 18, 2019 8:44 pm

hgm wrote: ↑Wed Sep 18, 2019 7:14 pm This is also a general concern; in a sense every test against other alpha-beta engines is a self-test, as these engines are so much alike ('incestuous testing'). So what we think is a huge improvement, in the end resulting in Elos rising from 1800 to 3300 or more, might be -500 Elo when tested against entities that think in an entirely different way.

I think, for example, that it was indeed harder for Minic to match Winter than other engines rated around 2900.

fabianVDW · Post by **fabianVDW** » Wed Sep 18, 2019 8:59 pm

Very weird behaviour; I haven't experienced that against other engines (then again, I don't test that extensively against other engines). Usually the self-play gain is almost exactly what I get against other engines as well.

Your TC seems good to me. I usually test self-play at 10+0.1 and, if a patch passes, confirm it at 60+0.6. I would say 40/20 is roughly somewhere between 20+0.2 and 30+0.3, which should be sufficient.

mar · Post by **mar** » Fri Sep 20, 2019 5:07 pm

I always did only self-testing, and typically got about half of the expected gain in CCRL.

It certainly worked for me and works for others (if you play enough games).

I did something perhaps non-standard, namely always playing against the last released version (i.e. any fixed, stable previous version); this way I didn't fall into the trap of accumulating "improvements" when you are actually accumulating error (i.e. I wasn't chasing my own tail). This is especially true for small improvements.

Always 10k games for small improvements, but the error bars were still 9 Elo.
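The "accumulating error" trap can be illustrated with a toy simulation. It assumes every patch is truly worth 0 Elo, that each measurement is Gaussian noise, and that a patch is accepted whenever it measures positive. The 4.5 Elo standard error is an assumption (a 9 Elo error bar read as roughly two standard errors), and the function names are hypothetical.

```python
import random

def claimed_gain_chained(n_patches, stderr_elo, rng):
    """Sum of the measured gains of accepted patches when each patch is tested
    against the previously accepted version and every patch is truly worth 0 Elo."""
    claimed = 0.0
    for _ in range(n_patches):
        measured = rng.gauss(0.0, stderr_elo)  # pure noise: the true gain is 0
        if measured > 0.0:                     # naive rule: accept anything that looks positive
            claimed += measured
    return claimed

def measured_gain_fixed_baseline(stderr_elo, rng):
    """One measurement of the current version against a fixed release;
    the true gain is still 0, so the result is a single noise sample."""
    return rng.gauss(0.0, stderr_elo)

rng = random.Random(0)
print("chained, 20 patches:", round(claimed_gain_chained(20, 4.5, rng), 1), "Elo claimed")
print("fixed baseline:     ", round(measured_gain_fixed_baseline(4.5, rng), 1), "Elo measured")
```

In this model the chained total can only grow, because only positive noise is ever added, while a match against a fixed release is off by a single measurement error no matter how many patches came before.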
