how to properly test the changes to the engine ?

Robert Pope · Post by **Robert Pope** » Wed Feb 01, 2017 4:43 pm

MahmoudUthman wrote:1-how many opponents to use ?

I do mostly self-play, and only periodically run a gauntlet against a half-dozen opponents (and an older benchmark version of my own engine). I used to strictly use gauntlets against 8 players, but I was using PSWBTM, and that didn't allow me to play fast enough games, so I always had too much noise in the testing.

2-how much should the rating difference between the weakest and the strongest opponents be ?

I think a range of -100 to +200 Elo relative to your own engine is good for lower-tier engines. You still score about 25% or better against everyone, and it allows you to keep using many of the same engines over time. It's important to know that the engines you are using are solid.
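As a rough check on that range, here is a minimal sketch using the standard logistic Elo formula. The function name and the list of offsets are only illustrative; offsets are the opponent's rating minus your engine's rating.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score for a player rated `elo_diff` points above its opponent
    (standard logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Opponent offsets relative to the engine under test (opponent minus engine).
for opponent_offset in (-100, 0, 100, 200):
    # From the engine's point of view the rating difference is the negated offset.
    score = expected_score(-opponent_offset)
    print(f"opponent {opponent_offset:+4d} Elo -> expected score {score:.1%}")
```

Against an opponent 200 Elo stronger the model gives a little under a quarter of the points, which is in line with the figure quoted above.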

3-how many different starting positions should I use ?

As many as you are playing games. You don't want your testing to repeat the same game, as that generates false precision. I use 8moves_GM_LB.pgn, which has at least a couple thousand openings.
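One way to guarantee that no opening repeats is to draw them without replacement. This is only a sketch: `assign_openings` is a hypothetical helper, the openings are assumed to be already loaded into a list, and it is not how any particular tournament manager does it.

```python
import random


def assign_openings(openings, num_games, seed=0):
    """Pick one distinct opening per game.

    Fails loudly if the book is smaller than the number of games, since
    reusing an opening risks replaying the same game.
    """
    if num_games > len(openings):
        raise ValueError(
            f"need {num_games} distinct openings, book has only {len(openings)}"
        )
    rng = random.Random(seed)  # fixed seed so a test run can be reproduced
    return rng.sample(openings, num_games)


# Hypothetical book of 2000 opening identifiers, 500 games to play.
book = [f"opening_{i}" for i in range(2000)]
picks = assign_openings(book, 500)
print(len(picks), len(set(picks)))
```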

4-smallest sufficient time control ? and should I test at different time control or should a single fixed time control be enough ?

Most of the time, I just test as fast as I can get away with. You almost have to, in order to get enough games in. I use 6+0.25, which is about as fast as my engine can handle without side effects.

5-How many games to play ?

I use SPRT, with a 2000 game maximum (the number my computer can play overnight).
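For readers who have not met SPRT, the following is a sketch of the stopping rule, not the exact code of any testing tool. It uses a common normal approximation to the log-likelihood ratio; the hypotheses `elo0=0` and `elo1=5` and the error rates `alpha=beta=0.05` are assumptions picked for illustration, and only the 2000-game cap comes from the post above.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_bounds(alpha: float = 0.05, beta: float = 0.05):
    """Wald's lower and upper stopping bounds for the log-likelihood ratio."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    variance = (wins + 0.25 * draws) / n - mean ** 2  # per-game score variance
    if variance <= 0.0:
        return 0.0  # all results identical so far; nothing to estimate yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05, max_games=2000):
    """Return 'accept H1', 'accept H0', 'inconclusive' or 'continue'."""
    lower, upper = sprt_bounds(alpha, beta)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    if wins + draws + losses >= max_games:
        return "inconclusive"  # hit the overnight game cap without a verdict
    return "continue"


print(sprt_decision(wins=600, draws=800, losses=400))
```

The point of the rule is that a clearly good or clearly bad patch stops early, while a marginal one runs until the cap and is reported as inconclusive rather than forced into a verdict.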

6-Is it okay to use concurrent games ?

As long as you have the physical CPU cores (not just hardware threads) to support it. I use 3 on my 4-core to leave a core free for Windows processing and occasional browsing.

7-anything else to consider ?

Stockfish has really highlighted the importance of a rigorous testing process. It's way too easy to code 5 things at once, pat yourself on the back and move on. But then you don't know if those changes were actually harmful because you didn't take the time to play enough games, or which of the changes were actually helpful. It might be a 10 Elo improvement, but was it +2+2+2+2+2, or +30+0+2-15-7? Things that I thought were going to be obvious leaps forward ended up being -50 Elo failures after adequate testing.

The worst feeling in the world is spending two weeks on 3 unrelated items, having a huge step back, and having no idea which piece ruined your program.

JVMerlino · Post by **JVMerlino** » Wed Feb 01, 2017 5:38 pm

MahmoudUthman wrote:1-how many opponents to use ?
2-how much should the rating difference between the weakest and the strongest opponents be ?
3-how many different starting positions should I use ?
4-smallest sufficient time control ? and should I test at different time control or should a single fixed time control be enough ?
5-How many games to play ?
6-Is it okay to use concurrent games ?
7-anything else to consider ?

1. I use 12 opponents.
2. I try to get them within a range of +/-100.
3. I always play from the starting position only.
4. I play 1-minute games. It's somewhat closer to "reality", in that Myrddin usually plays on ICS and some people like to play 1-minute games against it.
5. My gauntlet is embarrassingly short at only 60 games per opponent, so 720 games total. But the person I use for formal testing (since it's always good to have an independent tester) says that this is sufficient before sending a candidate version to him, as long as my testing shows at least +30 elo.
6. I have never used concurrent games, so I can't comment on if it's good or bad.
7. I used to test with pondering on before I implemented SMP. Now I test with 4 CPU and pondering off. Once I get a candidate version I also test with 1 CPU and pondering on, just to make sure I didn't break anything (although that code has been reliable for years).
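To put the 720-game gauntlet and the +30 Elo threshold in perspective, here is a back-of-the-envelope sketch of the 95% error margin. It assumes a score near 50% and a worst-case per-game standard deviation of 0.5 (no draws); both are simplifying assumptions, and the function names are illustrative.

```python
import math


def score_to_elo(score: float) -> float:
    """Convert a score fraction (0 < score < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_margin(num_games: int, per_game_std: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% error margin in Elo around an even score.

    per_game_std=0.5 is the worst case (no draws); draws make it smaller.
    """
    half_width = z * per_game_std / math.sqrt(num_games)  # margin in score units
    return score_to_elo(0.5 + half_width)


for games in (720, 2000):
    print(f"{games} games -> about +/-{elo_margin(games):.0f} Elo")
```

Under these assumptions 720 games give a margin of roughly 25 Elo, so a measured gain of +30 or more sits just outside the noise, while smaller gains cannot be distinguished from zero at that sample size.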

JVMerlino · Post by **JVMerlino** » Wed Feb 01, 2017 6:05 pm

One other thing is that I also occasionally use the STS suite at https://sites.google.com/site/strategictestsuite/ to ensure that the engine doesn't have any big holes in the eval. Well, ok, it DOES have some big holes in the eval. But at least this suite shows them to me in all their glaring awfulness....

jswaff · Post by **jswaff** » Fri Feb 03, 2017 4:06 pm

John, would you mind sharing with me what opponents you use? After a long absence I'm working on Prophet again. Right now I'm putting together a testing procedure myself, currently hunting down opponents similar in strength to my engine. Myrddin is one that I have found; it's close in strength (a little stronger, I think), reliable, and doesn't lose on time in fast time controls. Perhaps some of your sparring partners would work well for me too.

Guenther · Post by **Guenther** » Fri Feb 03, 2017 4:47 pm

You should try my chronology site for finding equivalent sparring partners.
Either Chess4J is much stronger than I have seen it before, or you missed how strong Myrddin has become in the meantime?
Anyhow, do you have an estimate for the latest Chess4j 3.0 on any known rating scale? (CCRL - CEGT - GURL - FCP and so on...)
http://rwbc-chess.de/chronology.htm

Guenther

Edit:

Ok, I confess I cheated just now ;-) I checked CCRL again and found
an entry for version 3.0 at 40/4 with a rating of around 1795 on their scale.
This makes it roughly class 'C2' in my system; before, it was 'F', which
was probably the first available version, still compiled by Jim Ablett.
Yet it is miles away from the latest Myrddin?

jswaff · Post by **jswaff** » Fri Feb 03, 2017 4:52 pm

Hey Guenther,
This is for Prophet, not chess4j. (Yes I have two engines.)

Current Prophet fits in around the 2300 range on the CCRL scale.

--
James

Guenther · Post by **Guenther** » Fri Feb 03, 2017 4:55 pm

Hi James, well, that's a different story then. Sorry for the misunderstanding.

Guenther

JVMerlino · Post by **JVMerlino** » Fri Feb 03, 2017 5:08 pm

Hi James,

I'm currently in the process of updating a few of my opponents. But here are the eleven that are on the list at this time:
AnMon 5.75
DanaSah 4.60
Dirty Apr 24 2011
Greko 7.2
Kiwi 0.6d
Patzer 3.80
Pepito 1.59
Sloppy 0.1.1
SlowBlitzWV
Trace 1.37a
Ufim 8.02

Very astute members of this board will notice one thing that all of these engines have in common -- they're all Winboard compatible. This is because I use Wildcat for my testing, and it only supports Winboard. But they're also all very stable and won't lose on time in 1-minute games.

I do tend to make things hard on myself, sometimes.

jm
