'STS' Test Suite (v2.0): Open Files and Diagonals Released

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: results

Post by Uri Blass »

bob wrote:
swami wrote:
bob wrote:
swami wrote:Whoa. That's a pretty high score from Crafty. 8-)
I think much of this is tactical in nature. What I've always looked for is positions such as 1. e4 c5 2. Nf3 any 3. d4, where d4 is a pretty obvious move to control the center, nullify c5's attack on d4, etc. There are other moves that are perfectly playable, but d4 strikes right at the crux of the position without being a move that wins anything. The BK test pawn-lever positions are similar: either a program "gets it" or it doesn't. Depth is not particularly important, although some require some depth to see the ultimate point of the correct move. I think the way you screened these is backward. I'd toss out positions where the best move scores significantly better than the next-best, if you are using a computer to choose them. Some positional scores might well be .2 to .3 (if they don't include king-safety issues), but most are a razor's edge away from the second-best, which is what makes a GM's best move better than my best move.

I'll try to look at these in some detail when I have time to see which look like the kind of positional tests I'd like to keep for eval testing and tuning...
Well, I have chosen only positions where the evaluation score for the best move is at least 0.20 higher than for the second-best move, and each has been verified after 5 hours of analysis by Dann. These score differences are agreed on by Rybka/Zappa/Naum in unison; otherwise the positions wouldn't pass the criteria.

You have a point that the +4 scores in some tests are really tactical in nature, although there were only a few such positions. I should stop calling the test suite positional and instead call it a set of puzzles where undermining occurs. That would make more sense.

I don't trust GMs' moves. I took a look at a GM games database and had a tough time trying to find any good positions; it took me a long time to come up with a few. It's like sitting by the river trying to catch a fish when there are hardly any.

The next day, I looked at Rybka's games and easily found many positions that could make a good test suite. All I had to do was check the score difference between the best and second-best move from Rybka, and see whether the position in question would qualify as an "undermining" pattern. If it qualified, I sent it to Dann, who would then run a deep analysis for hours with the top 3 engines, and if they all agreed in unison, he'd put the position into the 'Qualified' list. That was fun, really.

I'd think an easier and quicker way to create more positions would be to study correspondence games, especially ones played with the use of computers for days. I don't know where those games can be downloaded, but I'll have to ask around.

I do see some engines clearly doing better at undermining but doing fantastically badly at open files and diagonals, while others did better at the latter than the former. I hope to get the 3rd test suite ready. It's a good hobby, I should tell you; I really enjoyed every moment of it! :wink:
This has been the "Holy Grail" of testing for years. It is a tough problem. One fairly good indicator is that faster hardware produces better results, while in a true positional test this would not be the case: either you have the knowledge or you don't. For example, consider a position where you can take black's a-pawn and give yourself a "distant pawn majority" (which turns into a distant passed pawn eventually), or you can take black's g-pawn, which weakens his pawns a bit but not nearly as much as the majority. The right position won't be depth-sensitive; it will simply determine whether the program understands majorities or not. A book like PPD or something similar might give some good positions...
I disagree here.

Even in a true positional test, meaning one with no material gain that computers can see, computers can perform better at longer time controls.

Suppose that a program does not know about candidate pawns but does know about passed pawns.

At small depth, the program may see only a position with a candidate pawn when analyzing the right move, so it is not going to find that move.

At bigger depth, the program may see a passed pawn, because the right move forces a passed pawn if you search deep enough, and it may find the move for reasons that are not tactics in the sense of winning material.

This means that more time can help the program find the best move.
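To put this scenario in concrete terms, here is a minimal sketch of the two pieces of pawn knowledge involved. The board representation is deliberately toy (each side's pawns as a set of (file, rank) pairs, ranks 0-7 from White's side), the candidate test is a simplification, and nothing here comes from any actual engine. An eval that implements only is_passed has to search until the candidate actually becomes a passer before the right move's score changes.

Code: Select all

# Toy sketch: passed vs. candidate pawns for White. Pawns are
# (file, rank) pairs, files a..h = 0..7, ranks 0..7 from White's side.

def is_passed(pawn, enemy_pawns):
    """No enemy pawn ahead on this file or an adjacent file."""
    f, r = pawn
    return not any(abs(ef - f) <= 1 and er > r for ef, er in enemy_pawns)

def is_candidate(pawn, own_pawns, enemy_pawns):
    """Simplified candidate test: half-open file, and at least as many
    friendly helpers (adjacent files, not ahead of the pawn) as enemy
    sentries (adjacent files, ahead of the pawn)."""
    f, r = pawn
    if any(ef == f and er > r for ef, er in enemy_pawns):
        return False   # file is closed
    if is_passed(pawn, enemy_pawns):
        return False   # already passed, not merely a candidate
    sentries = sum(1 for ef, er in enemy_pawns
                   if abs(ef - f) == 1 and er > r)
    helpers = sum(1 for of, osq in own_pawns
                  if abs(of - f) == 1 and osq <= r)
    return helpers >= sentries

# 2-vs-1 majority: White pawns a2, b2 vs. Black pawn a7.
own, enemy = {(0, 1), (1, 1)}, {(0, 6)}
assert not is_passed((1, 1), enemy)      # b2 is not passed yet...
assert is_candidate((1, 1), own, enemy)  # ...but it is a candidate

With only is_passed, nothing in the eval changes until a search line actually clears the a-file, which is exactly the depth dependence described above.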
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: results

Post by bob »

Uri Blass wrote:
bob wrote:
[earlier nested quotes snipped]
This has been the "Holy Grail" of testing for years. It is a tough problem. One fairly good indicator is that faster hardware produces better results, while in a true positional test this would not be the case: either you have the knowledge or you don't. For example, consider a position where you can take black's a-pawn and give yourself a "distant pawn majority" (which turns into a distant passed pawn eventually), or you can take black's g-pawn, which weakens his pawns a bit but not nearly as much as the majority. The right position won't be depth-sensitive; it will simply determine whether the program understands majorities or not. A book like PPD or something similar might give some good positions...
I disagree here.

Even in a true positional test, meaning one with no material gain that computers can see, computers can perform better at longer time controls.

Suppose that a program does not know about candidate pawns but does know about passed pawns.

At small depth, the program may see only a position with a candidate pawn when analyzing the right move, so it is not going to find that move.

At bigger depth, the program may see a passed pawn, because the right move forces a passed pawn if you search deep enough, and it may find the move for reasons that are not tactics in the sense of winning material.

Then we simply disagree about what "positional" means. If A understands candidates and B only understands passers, then there are positions A will play correctly where B will screw up. What happens if your search goes way deep and ends up with a majority at the tips? A will recognize that and B will not. And A will play better.

That was my point: you need to be able to see deeply enough to see the "positional" issue, which sometimes means pretty deeply. If you understand weak pawns, you will figure this out; if you have to search until the pawn is actually lost, you may well not.

That's a positional test...

This means that more time can help the program find the best move.
And that's a surprise? If I can search long enough, I can find the correct move in _any_ position. We already know that, and can prove it. But it is what happens when I _can't_ search long enough that is the issue here. What might be nice is a 1M or 10M node search limit, so that you can spend the nodes however you want, but you are constrained to find the correct move with relatively little effort, to prove that you understand the positional idea being studied. I can think of things like weak pawns, isolated pawns, weak/strong passed pawns, backward pawns, immobile pawns, candidate passed pawns, distant candidates, distant passers, and so forth. Each idea is distinct in what it means, and it is not a depth issue. Otherwise we would not evaluate passed pawns; we could just search until they promote.
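A fixed-node harness along these lines is straightforward to script. The sketch below assumes the python-chess library and a UCI engine that honors "go nodes"; the engine path and suite filename are placeholders, not real files.

Code: Select all

# Fixed-node test harness: every engine gets the same node budget per
# position, so hardware speed drops out and the score mostly reflects
# evaluation knowledge.
import chess
import chess.engine

NODE_LIMIT = 1_000_000  # the "1M node search limit" idea

def run_suite(engine_path, epd_file, nodes=NODE_LIMIT):
    solved = total = 0
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        for line in open(epd_file):
            line = line.strip()
            if not line:
                continue
            board = chess.Board()
            ops = board.set_epd(line)        # position plus EPD opcodes
            best_moves = ops.get("bm", [])   # the expected move(s)
            result = engine.play(board, chess.engine.Limit(nodes=nodes))
            total += 1
            solved += result.move in best_moves
    finally:
        engine.quit()
    return solved, total

solved, total = run_suite("./engine", "sts_open_files.epd")
print(f"solved {solved} of {total} within {NODE_LIMIT:,} nodes each")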
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by michiguel »

swami wrote:Testing still in progress but results so far:

Q6600 2.4 GHz, 32-bit, all engines use 1 CPU.
Open Files and Diagonals.
10 sec each move:

Code: Select all

Fruit - 85
TwistedLogic - 80
Toga - 80


Bright - 79
ETChess - 78
Glaurung - 78
Hamsters - 76
The King TrailBlazer - 75
Movei - 74
Delfi - 72
Alaric - 72
Pharaon - 71


Zappa 1.1 - 69
Cerebro - 68
Scorpio - 68
Crafty - 67 
Chiron - 67
Tao - 67
Kiwi - 67
The Baron - 66
NOW - 66
Arion - 65
Aristarch - 65
Slowchess - 65
BugChess - 65
List512 - 65
Jonny - 64
Deep Patzer - 64
Alfil - 63
Pro Deo - 63
Natwarlal - 63
Queen - 62
Abrok - 62
Delphil - 62
LearningLemming - 61
Green Light Chess - 61
Comet - 61
Yace - 61
Gaia - 61
Lambchop - 61
Ufim - 60
Trace - 60


Arasan - 59
Asterisk - 59 
Nejmet - 59
Amyan - 58
Romichess - 57
King of Kings - 56
Rotor - 56
Phalanx - 54
Pepito - 53
Knight Dreamer - 53
Horizon - 51
Alarm - 51
Booot - 50


Little Thought - 49
ZcT - 40
Bestia - 40 


Beowulf - 39
RDChess - 36
Gaviota 0.60 (unreleased), AMD 2.4 GHz,
10s / position, solved = 67

Miguel
swami
Posts: 6640
Joined: Thu Mar 09, 2006 4:21 am

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by swami »

michiguel wrote:Gaviota 0.60 (unreleased), AMD 2.4 GHz,
10s / position, solved = 67

Miguel
Very good score indeed, Miguel. Thanks for posting it. I'd estimate it to be rated 2500-2600 already, if this test score is any indication. Out of curiosity, how much did it score on the Undermining test suite?
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by michiguel »

swami wrote:
michiguel wrote:Gaviota 0.60 (unreleased), AMD 2.4 GHz,
10s / position, solved = 67

Miguel
Very good score indeed, Miguel. Thanks for posting it. I'd estimate it to be rated 2500-2600 already, if this test score is any indication. Out of curiosity, how much did it score on the Undermining test suite?
That score does not correlate with Gaviota's overall strength. Gaviota is not even close to Crafty, and both got the same score. In the Undermining test, Gaviota got 56, which I think is more realistic. At one minute, both scores are 71.

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by bob »

swami wrote:
michiguel wrote:Gaviota 0.60 (unreleased), AMD 2.4 GHz,
10s / position, solved = 67

Miguel
Very good score indeed, Miguel. Thanks for posting it. I'd estimate it to be rated 2500-2600 already, if this test score is any indication. Out of curiosity, how much did it score on the Undermining test suite?
Before you go any further: you are now off the thin ice and into the deep salty water. You are _not_ going to be able to predict program ratings based on how many positions they get right or wrong in a test suite. This has been tried for many years. It has _never_ worked.

The problem is this:

You make up a problem set. You run it against several programs of known rating. You then fit some sort of curve (whether it is linear, quadratic, exponential, or whatever does not matter at all) to that observed data. Then a new program comes along that was not part of the original curve fitting, and your curve is wrong for that program. I could tell you an amusing story about Larry Kaufman and his old CCR rating tests/formulas if you are interested. But this simply doesn't work, and probably never will. There are too many parts of a chess program that can vary...
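For what it's worth, the procedure (and its failure) is a couple of lines of numpy. The numbers below are synthetic, purely to illustrate the shape of the mistake; they are not measurements of any engine.

Code: Select all

# The rating-from-test-score recipe described above, with synthetic
# numbers: fit a curve to (suite score, known Elo) pairs, then apply
# it to a program that was not part of the fit.
import numpy as np

scores  = np.array([40., 50., 55., 60., 65., 70., 75.])   # made up
ratings = np.array([2150., 2300., 2370., 2440., 2500., 2560., 2610.])

predict = np.poly1d(np.polyfit(scores, ratings, 2))  # quadratic fit

# Inside the calibration range the fit looks fine...
print(predict(62))   # plausible-looking number
# ...but a program with a different architecture (say, heavy eval
# knowledge solving nearly everything) lands far outside the range:
print(predict(98))   # an absurd "rating" -- cf. the Cray Blitz story below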
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: results

Post by Dann Corbit »

Uri Blass wrote:
bob wrote:
[earlier nested quotes snipped]
This has been the "Holy Grail" of testing for years. It is a tough problem. One fairly good indicator is that faster hardware produces better results, while in a true positional test this would not be the case: either you have the knowledge or you don't. For example, consider a position where you can take black's a-pawn and give yourself a "distant pawn majority" (which turns into a distant passed pawn eventually), or you can take black's g-pawn, which weakens his pawns a bit but not nearly as much as the majority. The right position won't be depth-sensitive; it will simply determine whether the program understands majorities or not. A book like PPD or something similar might give some good positions...
I disagree here.

Even in a true positional test, meaning one with no material gain that computers can see, computers can perform better at longer time controls.

Suppose that a program does not know about candidate pawns but does know about passed pawns.

At small depth, the program may see only a position with a candidate pawn when analyzing the right move, so it is not going to find that move.

At bigger depth, the program may see a passed pawn, because the right move forces a passed pawn if you search deep enough, and it may find the move for reasons that are not tactics in the sense of winning material.

This means that more time can help the program find the best move.
Strategy is just very deep tactics.

Given enough time, any program with a correct search and a decent eval will solve every position. Of course, it may take close to infinite time, but the results will still improve over time.

The idea of these test sets is to highlight a certain concept. It is useful to engine authors to see whether their engine knows how to correctly handle a particular category of problems. It is useful to human players for the same reason. These tests are not intended to be super difficult like LCT II and the like. The intention is to have a battery of tests that are connected to a singular concept. We have even thrown out positions that were correct but too hard (e.g., if it takes Rybka 50 hours to solve a position, it may be both correct and interesting, but it is not useful for current testing because nobody has that much time to verify their engines).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by bob »

ack. enough already. :)

I've gotten 10 emails about this already, so here's what happened.

In the late '80s and early '90s, there was a (monthly?) publication called "Computer Chess Reports". I think it came from ICD Chess, but I am not sure. It carried all the latest news, particularly about commercial chess programs and machines. Larry developed a set of positions he used against each new entrant (which were all microcomputers at the time) and derived a rating curve based on the solution times for his test suite.

At the 1993 or 1994 ACM computer chess event (I do not remember which), we were sitting around talking and he asked me if I would be willing to try the rating test on Cray Blitz, if we could get machine time. I had the Cray C90 dedicated to me for the tournament, so I said "more than happy to do that." And things turned "interesting".

He gave me position 1, and CB announced the correct move after < 1 second of computing (it reported time in 1-second intervals and reported 0 seconds for anything less than 1 whole second). He gave me position 2: 0 seconds. Position 3: 0 seconds. I think maybe one position out of the entire batch took 1+ seconds. The resulting rating for Cray Blitz was something like 2900. We both knew that was wrong. We talked at length, and the first issue was the reported time. Since we were truncating to the second, he said "maybe I should report 1 second for every other position." I agreed. Very little difference.

We then started to look at the "why". One position traded into a king-and-pawn ending where both sides had "runners" and both pawns needed the same number of moves to promote. Programs with the "square of the pawn" eval rule would say this was a draw. But one pawn queened with check, and Cray Blitz understood that in the eval and realized that the opponent could then not promote because of the check. This got him thinking, so he slightly modified the position to put the enemy king close to the promoting square, and we got that one right as well, because we realized (still in the evaluation code) that queening with check prevents the opponent from queening unless the opponent's king is close to the queening square. He tried again with an extra pair of pawns that changed the king trajectories a bit, so that the pawn could make one extra move before the opposing king could catch it (the pawn was on the critical diagonal from the king's square to the promotion square). CB got that right in the evaluation as well. Then he thought, aha, here's one: White pawn on c6, two squares from promoting; Black pawn on c2, one square from promoting; Black king on c4; White to move. It goes like this: c7 c1=Q c8=Q+, and the black king has to move, and has to move out of the way, allowing Qc8xc1 and winning. And CB understood _that_ in the eval as well.

He finally said, "OK, but surely you can't catch _all_ cases?" And I said, "No, but you have not found one yet..."
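(For reference, the "square of the pawn" rule at issue is a constant-time eval test for whether the defending king can catch a lone runner. Below is a minimal sketch of the naive version, assuming a White pawn, no obstructions, and none of the queening-with-check refinements described above; it is an illustration, not anyone's actual code.)

Code: Select all

# Naive "square of the pawn" rule for a lone White runner -- the
# version the story says the micros relied on. Squares are
# (file, rank) pairs, ranks 0..7 from White's side; obstructions and
# the attacking king are ignored.

def king_catches_pawn(pawn, black_king, defender_to_move):
    pf, pr = pawn
    kf, kr = black_king
    if pr == 1:
        pr = 2                    # double step: treat as one rank on
    moves_to_promote = 7 - pr
    king_dist = max(abs(kf - pf), abs(kr - 7))   # to promotion square
    budget = moves_to_promote + (1 if defender_to_move else 0)
    return king_dist <= budget

# Pawn e5, king h5 (corner of the square): caught even with White
# to move (1. e6 Kg6 2. e7 Kf7 and the promotion falls).
assert king_catches_pawn((4, 4), (7, 4), defender_to_move=False)
# King h4, White to move: one step outside the square, too late.
assert not king_catches_pawn((4, 4), (7, 3), defender_to_move=False)

# This rule calls the mutual-runner races above "both promote, draw";
# Cray Blitz's extra eval terms noticed when the first promotion
# comes with check (e.g. c7 c1=Q c8=Q+ in the position above), which
# flips the result.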

The point was that we could do things in our evaluation that the micros had no hope of doing, due to speed. We had vector hardware that made analyzing such data much more efficient, and we could get away with this. The bottom line of all this was that (a) we were not a 2900+ program (there were none back then, and there still probably are not any today); and (b) his data was calibrated against a group of programs that had a somewhat common computer architecture, which meant a somewhat common software architecture as well. It didn't fit a different type of program very well at all.

The only way to get a rating estimate is to play real games against known real opponents...
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by Dann Corbit »

Another factor to consider:

The best settings for test solving are clearly not the best settings for play.

I wrote a program to do a simple parabolic least-squares fit, given a large number of data points. The program is given the evaluation terms of the engine to fit. After a day or more, it writes out the constants that maximize the number of solutions found on a 12,000-problem test set. After this tuning, the engine will tear test problem sets to shreds. However, it will get clobbered in real game play.

The lesson?
Test sets are usually tuned to sacrifices or difficult-to-see moves. Solid, boring, sensible moves are much more the norm. When you tune for the tricky stuff, you get tricky play. But it won't help -- in fact, it hurts -- in actual game play. It can be useful for getting a feasible starting point for hand tuning, however.
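The fitting step described here can be sketched as coordinate-wise parabolic tuning: sample a few values of one eval constant, fit a parabola to (value, problems solved), and jump to the vertex. Everything below is a guess at the shape of such a tool, not the actual program; in particular, solve_count is a hypothetical callback that configures the engine with the trial constants and runs the whole suite.

Code: Select all

# Coordinate-wise parabolic least-squares tuning against a test
# suite, in the spirit of the tool described above. solve_count() is
# a hypothetical, expensive callback: set the engine's eval constants
# and return how many suite positions it solves.
import numpy as np

def tune_term(params, i, samples, solve_count):
    ys = []
    for x in samples:
        trial = list(params)
        trial[i] = x
        ys.append(solve_count(trial))     # one full suite run per sample
    a, b, _ = np.polyfit(samples, ys, 2)  # y = a*x^2 + b*x + c
    if a < 0:
        best = -b / (2.0 * a)             # vertex = interior maximum
    else:
        best = samples[int(np.argmax(ys))]  # no peak: keep best sample
    params = list(params)
    params[i] = best
    return params

def tune_all(params, solve_count, passes=3, spread=0.5):
    # Repeated sweeps over every term (assumes nonzero starting
    # values). With thousands of positions per solve_count() call,
    # this is what takes "a day or more" -- and since the objective is
    # problems solved, not game results, the result overfits the suite.
    for _ in range(passes):
        for i, p in enumerate(params):
            samples = np.linspace(p * (1 - spread), p * (1 + spread), 5)
            params = tune_term(params, i, samples, solve_count)
    return params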
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: results

Post by bob »

Dann Corbit wrote:
[earlier nested quotes snipped]
Strategy is just very deep tactics.

Given enough time, any program with a correct search and a decent eval will solve every position. Of course, it may take close to infinite time, but the results will still improve over time.

The idea of these test sets is to highlight a certain concept. It is useful to engine authors to see whether their engine knows how to correctly handle a particular category of problems. It is useful to human players for the same reason. These tests are not intended to be super difficult like LCT II and the like. The intention is to have a battery of tests that are connected to a singular concept. We have even thrown out positions that were correct but too hard (e.g., if it takes Rybka 50 hours to solve a position, it may be both correct and interesting, but it is not useful for current testing because nobody has that much time to verify their engines).
My kind of "test suite" is a "solve within N nodes" type set of positions, with N as small as practical. For some positions, say the "undermine a piece/pawn/square" type, a little searching is necessary, but it ought to be a "little" searching, so that the onus is on the evaluation rather than on the search. Mate-in-N positions need not apply, nor the "white to move and win" type. I want "white to move and create a positional edge, or neutralize a positional edge of black's" type positions. Everyone knows how to test the "square of the king/pawn" type passed-pawn evaluation; it should take no search at all given the right starting positions. Ditto for positions where white wants to freeze a black pawn on a weak square, and black wants to advance/trade it before white freezes it. That should be resolvable in a couple of plies in the right kinds of positions.
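Concretely, such a suite could be plain EPD records carrying the expected move and a node budget. A made-up example built from the 1. e4 c5 2. Nf3 line mentioned earlier in the thread (position after 2...Nc6; acn is the standard EPD "analysis count nodes" opcode, repurposed here as a per-position budget, and the id/comment are invented):

Code: Select all

r1bqkbnr/pp1ppppp/2n5/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - bm d4; acn 1000000; id "eval.center.001"; c0 "1. e4 c5 2. Nf3 Nc6: d4 strikes at the crux without winning anything";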