fast game testing

Engin · Post by **Engin** » Sun Jan 08, 2012 7:26 pm

Hi Jon,

I am testing at 20+0.2 as a minimum, with about 2500 games as a minimum. In the end this gives me an error bar of about +-10, which is accurate enough; more games are better, but it will not change much.
I can't believe that some people are proud because they played so many games, about 30k. Why? It will not change anything.

At long time controls it is of course not possible to play so many games; that would take up a lot of your limited spare time.
But I am sure you don't need more than 100 games to see which version is better and which is bad.

Engin · Post by **Engin** » Sun Jan 08, 2012 7:31 pm

No need for this math nonsense!

Evert · Post by **Evert** » Sun Jan 08, 2012 7:45 pm

Try randomly sampling 40 games from the 10+0.1 sample and compare the result from that with the 40/4 time control. Repeat several times.

That should confirm what others have already said: 40 games is not enough to draw conclusions. Of course, different time controls can change things in unpredictable ways depending on your time control algorithm. For instance, Jazz (and Sjaak) do poorly at sudden-death time controls (game in N seconds) because of the bad implementation I have for that (I have a better algorithm planned, but I'm not that interested in sudden-death time control).
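A minimal Python sketch of that resampling experiment, using only the 10+0.1 result quoted in this thread (+149 -154 =97). The function name `sample_scores`, the number of trials and the fixed seed are assumptions made for illustration; the individual game order is not known, so the games are rebuilt from the totals.

```python
import random

def sample_scores(wins, losses, draws, sample_size, trials, seed=0):
    """Draw `trials` random subsets of `sample_size` games (without
    replacement) and return the score fraction of each subset."""
    rng = random.Random(seed)
    # Rebuild the match from its totals: win = 1, loss = 0, draw = 0.5.
    games = [1.0] * wins + [0.0] * losses + [0.5] * draws
    return [sum(rng.sample(games, sample_size)) / sample_size
            for _ in range(trials)]

# 10+0.1 result quoted in this thread: +149 -154 =97 over 400 games.
scores = sample_scores(149, 154, 97, sample_size=40, trials=10000)
share_at_or_above_65 = sum(s >= 0.65 for s in scores) / len(scores)

print(f"lowest {min(scores):.3f}, highest {max(scores):.3f}")
print(f"share of 40-game samples scoring >= 65%: {share_at_or_above_65:.3f}")
```

The spread between the lowest and highest 40-game sample gives a feel for how far a single 40-game match can drift from the 400-game result purely by chance.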

Evert · Post by **Evert** » Sun Jan 08, 2012 7:53 pm

Engin wrote: I am testing at 20+0.2 as a minimum, with about 2500 games as a minimum. In the end this gives me an error bar of about +-10, which is accurate enough; more games are better, but it will not change much.
I can't believe that some people are proud because they played so many games, about 30k. Why? It will not change anything.

Can you prove that?
If the error bar is +/- 10 Elo, you cannot measure strength improvements that are smaller. In fact, you can barely resolve strength improvements that are of that order. So if that is your expected improvement, you need that many games.

At long time controls it is of course not possible to play so many games; that would take up a lot of your limited spare time.
But I am sure you don't need more than 100 games to see which version is better and which is bad.

You might think that, and you'd be wrong. I've seen many cases where a change seems very good (or very poor) at a few hundred games, and it converges down to nothing as the number of games approaches 1e4.
You don't need more than 100 games to determine which engine is better if the strength difference is large. But testing against much stronger or much weaker engines isn't particularly good (I'd even go so far as to say it's bad) and you're better off testing against engines of comparable strength. You actually need fewer games to resolve an improvement that way because the relative change in rating difference is bigger (err... well, actually, I didn't work out the statistics for that, but I'd be very surprised if it wasn't true).

lucasart · Post by **lucasart** » Sun Jan 08, 2012 11:40 pm

jdart wrote: I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

40 games are meaningless. I suggest you calculate the standard deviation of your 40-game estimator, and you'll understand.
It is true, however, that certain modifications scale better or worse at long time control. A recent example for me was the introduction of double reductions (reducing late quiet moves by 2 plies); they clearly scale better at longer time control.
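The standard deviation of the match-score estimator can be worked out directly from the win/loss/draw counts. The sketch below does this for the two results quoted above; the helper names `score_and_stderr` and `elo_diff`, the logistic Elo model and the 1.96 factor for an approximate 95% interval are assumptions chosen for illustration, and games are treated as independent.

```python
import math

def score_and_stderr(wins, losses, draws):
    """Return (mean score, standard error of the mean) for a match,
    counting a win as 1, a draw as 0.5 and a loss as 0."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score around the mean.
    var = (wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2
           + losses * (0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

def elo_diff(score):
    """Elo difference implied by an expected score (logistic model)."""
    return -400 * math.log10(1 / score - 1)

for label, w, l, d in [("10+0.1, 400 games", 149, 154, 97),
                       ("40/4, 40 games", 19, 7, 14)]:
    mean, se = score_and_stderr(w, l, d)
    lo, hi = mean - 1.96 * se, mean + 1.96 * se
    print(f"{label}: score {mean:.3f} +/- {1.96 * se:.3f} "
          f"(Elo {elo_diff(lo):+.0f} .. {elo_diff(hi):+.0f})")
```

The interval for the 40-game match comes out much wider than the one for the 400-game match, which is the point being made about small samples.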

Don · Post by **Don** » Sat Jan 14, 2012 5:05 am

Engin wrote: Hi Jon,

I am testing at 20+0.2 as a minimum, with about 2500 games as a minimum. In the end this gives me an error bar of about +-10, which is accurate enough; more games are better, but it will not change much.
I can't believe that some people are proud because they played so many games, about 30k. Why? It will not change anything.

Because the top programs are generally looking for 2 or 3 ELO improvements for a single change.

Measuring 2 ELO with 2500 games is a fool's errand. When we make a change that we hope is a small improvement we might run 20-50k games, but we still know that there is a non-trivial chance it's a regression - if the score is very close.
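A rough back-of-the-envelope sketch of why a few Elo need so many games. It asks how many games it takes before the error bar on the score shrinks to the score shift that a given Elo difference produces. The function name `games_needed`, the 30% draw ratio, the logistic Elo model and the 1.96 factor are all assumptions for illustration, not figures from this thread.

```python
import math

def games_needed(elo, draw_ratio=0.3, z=1.96):
    """Rough number of games for the z-sigma error bar of the score to
    shrink to the score shift that `elo` produces near a 50% result."""
    # Expected score for the given Elo difference (logistic model).
    expected = 1 / (1 + 10 ** (-elo / 400))
    shift = expected - 0.5
    # Per-game variance near 50% with the assumed draw ratio:
    # decisive games contribute 0.25 each, draws contribute 0.
    var = 0.25 * (1 - draw_ratio)
    return math.ceil(var * (z / shift) ** 2)

for elo in (2, 3, 10, 50):
    print(f"{elo:>3} Elo: about {games_needed(elo)} games")
```

Because the required number of games grows with the inverse square of the score shift, halving the Elo difference you want to resolve roughly quadruples the number of games.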

At long time controls it is of course not possible to play so many games; that would take up a lot of your limited spare time.
But I am sure you don't need more than 100 games to see which version is better and which is bad.

Ferdy · Post by **Ferdy** » Sat Jan 14, 2012 6:53 am

jdart wrote: I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

I don't test and compare like this. If I test at TC 10+0.1 and decide to do a shorter TC for comparison, I will use 3+0.1 or 2+0.1, not 40/x.
Also be careful doing comparisons when there are TC changes, because your sparring partner(s) may actually scale worse (or perhaps better) than your engine; in that case you can't put all the blame on your engine.

Dave_N · Post by **Dave_N** » Sat Jan 14, 2012 8:36 am

I tried 1s + 0 increment over the last week. One or two versions that didn't detect draws seemed to finish accurately within 2 seconds; the others did not. The current version seems to have too much processor time spent by the GUI rather than the engines, up to 40% if I try to read every ms, even after disabling the window redraws. So far Houdini seems to be by far the strongest at those time controls; however, I appreciate that the difference is the same as the difference for human players playing bullet or correspondence, for example. I am worried that Houdini is the only engine that can even think under these conditions.

I am thinking about implementing a multi-engine MCTS, i.e. 2 engines playing both sides for each position; however, I am not sure how many games are needed or whether it would be accurate, considering the engine performance at these time controls.

hgm · Post by **hgm** » Sat Jan 14, 2012 9:30 am

What is the GUI doing that makes it use so much CPU? Is that purely the effort in polling for input every ms?

Ideally you should not poll for input at all, but just have a dedicated input thread for each engine, which is just blocking on a read call. Then it will only (and instantly) get alive when there is input, so it can check if there is a move, and then pass it on to the opponent engine, before reverting to waiting for more input. That would give a faster response, as well as lower CPU usage.
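A minimal Python sketch of that design: one thread per engine that blocks on the read and hands complete lines to the rest of the program through a queue. The helper name `start_reader`, the queue-based hand-off and the stand-in child process (a Python one-liner that prints a single move) are assumptions for illustration; a real GUI would launch the actual engine binaries instead.

```python
import queue
import subprocess
import sys
import threading

def start_reader(stream, name, inbox):
    """Start a daemon thread that blocks on `stream` and forwards each
    line to `inbox` as (name, line). No polling loop is involved."""
    def pump():
        for line in stream:            # blocks until a full line arrives
            inbox.put((name, line.rstrip("\n")))
        inbox.put((name, None))        # end of stream: the process exited
    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return thread

# Stand-in for an engine: a child process that prints one move and exits.
child = subprocess.Popen(
    [sys.executable, "-c", "print('bestmove e2e4')"],
    stdout=subprocess.PIPE, text=True)

inbox = queue.Queue()
start_reader(child.stdout, "engine-a", inbox)

# The main thread also blocks, on the queue, instead of spinning.
name, line = inbox.get(timeout=10)
print(name, line)
child.wait()
```

With two engines, each gets its own reader thread feeding the same queue, and the consumer forwards a received move to the opponent before going back to waiting.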

Dave_N · Post by **Dave_N** » Sat Jan 14, 2012 3:23 pm

Yes, I think it is the thread calling CheckInput() that's causing the CPU overload. I have tried commenting out the calls that redraw the board and the notation, and these don't appear to be the problem; I also stopped writing to the hash table to make sure that wasn't the problem. The clock is ticked all the time (in case of a time-out) to update the time in milliseconds. Ironically, the previous and less accurate version was actually sending "go wtime {wseconds*1000} btime {bseconds*1000}", and this might have produced better CPU usage from the GUI. Houdini's results under those conditions, with "go wtime 1000 btime 1000", were worse.

I was trying to implement a blocking read call with the statement " while( inpt.CanRead() ) "; however, I think I have a mistake in my implementation.

The program is fine for bullet and blitz atm, since I change the time between reads. Perhaps a dedicated program in self-play conditions is the only good way to do a MCTS anyway.
