A poor man's testing environment


Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Lyudmil Tsvetkov wrote:
Richard Allbert wrote:Totally agree.

Computers are strong, no doubt, especially with tactics, but I think far too much emphasis is put on their evaluations.

For example, say after move 14 you see +0.50 from one engine playing another -> after 10 more moves this score is usually different.

This means, by definition, that the +0.50 was perhaps not accurate after all.

IIRC, on Houdini's website (or in an interview with CB), Houdart said that Houdini wins 90% of the games where it thinks it is > +0.8 in the middlegame.

That's a big margin of evaluation error, and thus I never understand the obsession with differentiating lines that differ by 0.1 :)

This was especially so during the London Chess Classic - many observers were constantly claiming "mistakes" by the players, due to a +0.5 swing in evaluation by their engine.

Interestingly, the GMs commentating were quite the opposite - they rarely believed the computer assessment unless it was massively in favour of one player (over +2.0 or so) or tablebases were accessed. Otherwise they used the engines for tactics checking.

Richard
Hi Richard.
The better an engine's evaluation is, the less it is going to jump (swing).
Older engines like Fritz, etc., would register even more astounding jumps in their evaluations. But next to the evaluations of humans (and humans still evaluate subliminally, as well as compute millions of variations per second subliminally, just as computers do), engine evaluations are simply outstanding. If you were able to see the jumps in human evaluations (you can infer them from the larger number of mistakes humans make in comparison to computers), you would know that human evaluations sometimes jump not by half a pawn, but by much bigger margins. The inability to foresee a tactical shot, for example, should register a jump of a couple of whole pawns.
I think that not only 0.1 pawn makes a difference, but even 0.01 pawn. In chess every single detail counts. An engine showing +30cp against an opponent of roughly equal strength actually has very good chances to gradually increase its score up to a winning margin, simply because the number of variations that still hold the score for the defending side decreases with every passing move. The chances that an advantage in score will be increased are much bigger than the chances that it will be decreased or even just maintained.
That is why it is important to evaluate every single detail.

Best regards, Lyudmil
Actually, the higher the resolution of your evaluation function, the better it plays - at least at the same depth. However, there is a bit of a trade-off in speed for this - presumably there are fewer tied scores, which can produce cutoffs.

A resolution of 1/50 of a pawn is probably as coarse as is reasonable. Komodo uses 1/100 (internally 1/1000, which is rounded before being passed to the search), and it's common to use 1/256.

It's not that computers can actually make these super-fine distinctions, but trying to does seem to have an impact.
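
Purely as an illustration of that last point, here is a minimal Python sketch - not Komodo's actual code, and the function name and numbers are invented - of an evaluation kept at a fine internal grain (1/1000 pawn) and rounded to the coarser resolution the search works with (1/100 pawn here):

Code: Select all

# Minimal sketch, not Komodo's actual code: the evaluation is computed
# internally in millipawns (1/1000 pawn) and rounded to the resolution
# the search sees (1/100 pawn here; 1/256 is another common choice).

INTERNAL_GRAIN = 1000   # evaluation units per pawn inside the evaluation
SEARCH_GRAIN = 100      # evaluation units per pawn seen by the search


def internal_eval(position) -> int:
    """Placeholder static evaluation in millipawns (hypothetical)."""
    return 347  # e.g. +0.347 pawns for the side to move


def eval_for_search(position) -> int:
    """Round the fine-grained score to the search's resolution."""
    score = internal_eval(position)
    return round(score * SEARCH_GRAIN / INTERNAL_GRAIN)  # 347 -> 35


if __name__ == "__main__":
    print(eval_for_search(None))  # prints 35 (centipawns)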
JBNielsen
Posts: 267
Joined: Thu Jul 07, 2011 10:31 pm
Location: Denmark

Re: A poor man's testing environment

Post by JBNielsen »

Rebel wrote:I'd like to present a page for starters on how to test a chess engine with limited hardware. I am interested in some feedback for further improvement.

http://www.top-5000.nl/tuning.htm
First I must say that Ed has an impressive site!!
Many interesting subjects, well written and with a nice layout.
I recommend everyone to explore it!

About doing tests with a lot of games:
I did not read anything about the reduced strength you can set up in Arena.
When I test my weak engine Dabbaba, I have used Rybka 4 with reduced strength as the opponent.
It replies 10 times faster than Dabbaba, so the games are played faster.
But Rybka sometimes plays a very fast move at depth 1 or 2, and it may be a bad move.
This adds some randomness to the test.
I don't know if it happens with other engines...

- - - -

"Another way to test your search is the use of a collection of special tactical and positional positions. Elo wise this has very little meaning. It depends on the goal you have in mind with your engine, if you want to compete in the elo lists don't give testsets much meaning, focus on eng-eng matches instead."

Is that really true?

From 1986 to 1991 I constructed a test based on more than 100 chess positions.
The goal was to be able to estimate the rating of the programs tested.
And use the test for improving the development of chess programs.

Many of the positions are too easy for the computers of today.
Many of the positions can be seen here: http://talkchess.com/forum/viewtopic.ph ... highlight=
(notice there are errors in a few of them)

You can read more about the test here: http://www.jens-musik.dk/skak.htm
Here you can find more details: the positions, the weights and how they were adjusted, the results, a detailed article about this test etc.

In my old test each position had its own weight, as some positions are more important than others. The adjustment of the weights ran automatically for hours, so that the test's calculated rating for each chess engine came as close as possible to the 'real' rating (the SSDF rating list) determined by playing a lot of engine vs engine games.

As far as I remember, the test could calculate the rating of 30-40 programs with an average error of 30-40 elo.
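
For what it's worth, the kind of automatic weight adjustment described above could look roughly like the Python sketch below - random hill-climbing on per-position weights so that a linear mapping from weighted test score to rating matches the known list ratings as closely as possible. The data structures (results, known_elo) and the linear score-to-Elo mapping are my assumptions for illustration, not the original 1986-1991 program:

Code: Select all

# Sketch only: results[engine] is a list of per-position scores
# (1.0 = solved, or partial credit), known_elo[engine] is its rating
# from an engine-vs-engine list. Both are assumed inputs.
import random


def predicted_elo(weights, scores, a, b):
    """Map a weighted test score linearly to an Elo estimate."""
    total = sum(w * s for w, s in zip(weights, scores))
    return a * total + b


def average_error(weights, results, known_elo, a, b):
    """Mean absolute difference between predicted and known ratings."""
    errors = [abs(predicted_elo(weights, results[e], a, b) - known_elo[e])
              for e in results]
    return sum(errors) / len(errors)


def tune_weights(results, known_elo, n_pos, a=10.0, b=2000.0, iters=100_000):
    """Random hill-climbing over the per-position weights."""
    weights = [1.0] * n_pos
    best = average_error(weights, results, known_elo, a, b)
    for _ in range(iters):
        i = random.randrange(n_pos)
        old = weights[i]
        weights[i] = max(0.0, old + random.uniform(-0.1, 0.1))
        err = average_error(weights, results, known_elo, a, b)
        if err < best:
            best = err          # keep the change
        else:
            weights[i] = old    # undo a change that did not help
    return weights, best

With a few dozen engines and a few hundred positions this runs in seconds on today's hardware; the hard part is the quality and coverage of the positions, not the fitting.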


How much effort has been put into constructing such a test since then?
Not very much, I think.
Not compared to what has been spent on programming evaluation, search, making test systems, etc.

But why should it not be possible?
We have hundreds of engines, and we know their ratings.
And we can have their results from any test position we want.
Of course it is important how fast the positions are solved.

It is also important whether they can solve a test position.
But if they fail in a position, it is very important how they fail.
Avoiding the bad moves is often more important than finding the best move.
The top engines can help us determine the value of every move in a test position.
I don't think this option has been used very much...

I think 1,000 positions would be enough.
Chess is complex, but the number of factors like material, mobility, king safety, the bishop pair, etc. is limited.

So if we put all the results in a database and (automatically) adjusted the weights, it could work.
I don't expect it to be easy - it will be hard work!

But I still think it is possible!
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: A poor man's testing environment

Post by Rebel »

Hi Jens,

I am not saying that testing a large set of tactical positions is worthless, as I have done so myself many times during the 80s and 90s. It's good for testing the correctness of the search, its tactical abilities and its playing style.

However, once you have passed this station and things are satisfying, there is little to gain elo-wise if you want to compete in the elo lists. In the elo lists the elo of your brainchild is not decided by special positions but by real games dominated by normal positions. On average every game contains 40-50 (test) positions where your engine has to do well; nowadays one small mistake can already be fatal.

As an example I can give you dozens and dozens of positions where my current engine will find tactical shots clearly later than my version of 10 years ago. However, in a direct match the oldie is crushed.

A second example comes from a self-play match I am currently running. The "TEST" version uses an aggressive form of reducing the tree; it finds many tactical shots one or two plies later, and yet, as it currently looks, it is a significant improvement.

Code: Select all

TEST (elo 2500) vs MAIN (elo 2500) estimated TPR 2520 (+20)
393-594-311 (1298) match score 690.0 - 608.0 (53.2%)
Won-loss 393-311 = 82 (1298 games) draws 45.8%
LOS = 99.9%  Elo Error Margin +15 -15

(E)xit (S)ave (C)rosstable (O)rdo (P)roDeo (H)tml 0-9 duration
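
For anyone who wants to check numbers like these, the score percentage, the implied Elo difference and the likelihood of superiority (LOS) follow from standard formulas; the short Python sketch below (not Ed's tool, just the textbook formulas applied to the result above) reproduces roughly the +20 and exactly the 99.9% LOS shown in the table:

Code: Select all

# Estimate performance and LOS from a match result; standard formulas,
# not the tool that printed the table above.
from math import erf, log10, sqrt

wins, draws, losses = 393, 594, 311
games = wins + draws + losses

score = (wins + 0.5 * draws) / games                   # 0.532 -> 53.2%

# Logistic Elo model: rating difference implied by the score percentage.
elo_diff = -400.0 * log10(1.0 / score - 1.0)           # about +22

# Likelihood of superiority: probability that the side with more wins is
# really stronger, based only on the decisive games.
los = 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(f"{100 * score:.1f}%  {elo_diff:+.0f} Elo  LOS {100 * los:.1f}%")

The small difference between the +22 printed here and the +20 in the table presumably comes from the particular model Ed's tool uses to estimate the TPR.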
JBNielsen
Posts: 267
Joined: Thu Jul 07, 2011 10:31 pm
Location: Denmark

Re: A poor man's testing environment

Post by JBNielsen »

Rebel wrote:Hi Jens,

I am not saying that testing a large set of tactical positions is worthless, as I have done so myself many times during the 80s and 90s. It's good for testing the correctness of the search, its tactical abilities and its playing style.

However, once you have passed this station and things are satisfying, there is little to gain elo-wise if you want to compete in the elo lists. In the elo lists the elo of your brainchild is not decided by special positions but by real games dominated by normal positions. On average every game contains 40-50 (test) positions where your engine has to do well; nowadays one small mistake can already be fatal.

As an example I can give you dozens and dozens of positions where my current engine will find tactical shots clearly later than my version of 10 years ago. However, in a direct match the oldie is crushed.

A second example comes from a self-play match I am currently running. The "TEST" version uses an aggressive form of reducing the tree; it finds many tactical shots one or two plies later, and yet, as it currently looks, it is a significant improvement.

Code: Select all

TEST (elo 2500) vs MAIN (elo 2500) estimated TPR 2520 (+20)
393-594-311 (1298) match score 690.0 - 608.0 (53.2%)
Won-loss 393-311 = 82 (1298 games) draws 45.8%
LOS = 99.9%  Elo Error Margin +15 -15

(E)xit (S)ave (C)rosstable (O)rdo (P)roDeo (H)tml 0-9 duration
Hi Ed

Congrats on your new improvements.

Your argument for why a test set of positions is not good enough is that many positions may indicate a wrong result.
But the same happens when you play thousands of eng-eng games. The weaker engine still wins a lot of the games.
You have given the answer yourself: "Volume is the way to weed out randomness".

Perhaps 1,000 positions are not enough.
Perhaps 4,000 positions are needed (corresponding to 100 games). If 2,000 of these involve tactics, almost every single aspect of a tactical search must be covered.

What is the best attempt in history to construct such an almost complete test?
I hope it is not mine.

If so, the more scientific part of the computer chess community has very low ambitions and has, in my opinion, failed seriously.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: A poor man's testing environment

Post by Houdini »

The ability to solve test positions should not be confused with general chess strength. Contrary to test suite positions, the vast majority of positions in a chess game don't have a forced solution or a single best move.

In this context two Houdini-related observations:

1) The Tactical Mode in Houdini 3 is a small exercise in creating an engine that is proficient at solving tactical test positions (it is probably the strongest tactical solver ever) at the expense of some normal chess strength.

2) Houdini 3 doesn't score any higher than Houdini 1.5a on the STS test suite with 1400 positions. Yet Houdini 3 is clearly stronger in every phase of the game, it has significantly better evaluation and more efficient search.

In short, I find test suites not very helpful for improving Houdini.

Robert
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Houdini wrote:The ability to solve test positions should not be confused with general chess strength. Contrary to test suite positions, the vast majority of positions in a chess game don't have a forced solution or a single best move.

In this context two Houdini-related observations:

1) The Tactical Mode in Houdini 3 is a small exercise in creating an engine that is proficient at solving tactical test positions (it is probably the strongest tactical solver ever) at the expense of some normal chess strength.

2) Houdini 3 doesn't score any higher than Houdini 1.5a on the STS test suite with 1400 positions. Yet Houdini 3 is clearly stronger in every phase of the game, it has significantly better evaluation and more efficient search.

In short, I find test suites not very helpful for improving Houdini.

Robert
Ditto for Komodo. We gave up looking at test suites long ago for that reason. I will still occasionally run one, but more out of my own curiosity. Sometimes we discover that one version performs better or worse than a version several weeks or months old, and yet that doesn't correlate with chess strength.
JBNielsen
Posts: 267
Joined: Thu Jul 07, 2011 10:31 pm
Location: Denmark

Re: A poor man's testing environment

Post by JBNielsen »

Houdini wrote:The ability to solve test positions should not be confused with general chess strength. Contrary to test suite positions, the vast majority of positions in a chess game don't have a forced solution or a single best move.

In this context two Houdini-related observations:

1) The Tactical Mode in Houdini 3 is a small exercise in creating an engine that is proficient at solving tactical test positions (it is probably the strongest tactical solver ever) at the expense of some normal chess strength.

2) Houdini 3 doesn't score any higher than Houdini 1.5a on the STS test suite with 1400 positions. Yet Houdini 3 is clearly stronger in every phase of the game, it has significantly better evaluation and more efficient search.

In short, I find test suites not very helpful for improving Houdini.

Robert
You are returning to the talk of small test sets.

Read my posts again. I am talking about a big and more complete test that is adjusted to reproduce the known eng-eng ratings of the tested engines - a never-ending process that will never be 100% correct, but it would be interesting to see how close we could get.

I also wrote about test positions with more than one right move. I would actually prefer a majority of the positions to be normal positions. The starting position could be one of them. Each of the 20 possible moves could be chosen by an engine, and we must try to judge its rating depending on its choice in this and many other positions.
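
To make that idea concrete, here is a tiny Python sketch of scoring by move choice rather than solved/failed. The move values are invented for illustration; in practice they would come from deep analysis by several top engines, as suggested earlier in the thread:

Code: Select all

# Sketch: credit an engine with the value of the move it chose in each
# test position. The values below are invented; real ones would come
# from deep analysis by several top engines.

position_values = {
    "startpos": {"e2e4": 1.0, "d2d4": 1.0, "g1f3": 0.9, "c2c4": 0.9,
                 "b2b3": 0.5, "g2g4": 0.0},   # any other move gets 0.3
}


def credit(position: str, chosen_move: str, default: float = 0.3) -> float:
    """Value of the move the engine actually played."""
    return position_values[position].get(chosen_move, default)


# An engine's raw test score is the (weighted) sum of these credits over
# all positions; a calibration like the one sketched earlier maps that
# sum to an estimated rating.
print(credit("startpos", "e2e4"))  # 1.0
print(credit("startpos", "g2g4"))  # 0.0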