A new way to compare chess programs


lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

A new way to compare chess programs

Post by lkaufman »

When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
Here are some advantages:
1. If you double the speed of your program, the result will be 100% at any time control; there is no need to test. We would say the new engine is "100% faster" than the old one. Any stated superiority would have a clear meaning: the advantage is the same as you would get by increasing the speed of your hardware by that percentage.
2. The gain you measure for your engine at bullet chess is likely to be similar at slow chess, unless you know of a reason otherwise. This offsets the obvious problem of needing to fine-tune the times in a match to get the result. We've all been disappointed in the past to see large Elo gains observed at bullet speed drop drastically in serious chess. This shouldn't happen if measured this way.
3. The superiority sounds more impressive. A player rated, say, 1800 may not care if a 3000+ program is improved by 75 Elo, but if you tell him that it will get the same answers in half the time, he may be more impressed.
4. The advantage of using more cores is pretty much known regardless of the engine. Thus, if you can get a 3-to-1 speedup by using 4 cores, it means that MP on a quad is worth 200% regardless of the engine.
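
As a rough sketch of how such a measurement might be automated (an illustration only, not a description of how Komodo is actually tested): bisect on the weaker engine's time multiplier until it scores about 50%. The play_match() helper below is hypothetical; it stands in for whatever match runner you use, run with ponder off.

Code: Select all

# Sketch: find the extra-time percentage the weaker engine needs to score 50%.
# play_match() is a hypothetical placeholder for your own match runner; it
# should give the weaker engine base_time * time_factor, the stronger engine
# base_time, and return the weaker engine's score as a fraction (0.0 to 1.0).

def play_match(time_factor: float, games: int = 1000) -> float:
    raise NotImplementedError("hook up your own ponder-off match runner here")

def equivalent_speedup(lo: float = 1.0, hi: float = 2.0,
                       tol: float = 0.01, games: int = 1000) -> float:
    """Bisect on the weaker engine's time multiplier until it scores ~50%."""
    # search window: 0% to 100% extra time; widen hi if the gap is larger
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if play_match(mid, games) < 0.5:
            lo = mid        # still scoring under 50%: needs more extra time
        else:
            hi = mid        # already at 50% or better: try less extra time
    return ((lo + hi) / 2.0 - 1.0) * 100.0   # e.g. 1.25 -> "25% superiority"

In practice each probe needs a lot of games because the score is noisy, so interpolating between two or three fixed handicaps is probably cheaper than a full bisection.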

Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
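
For readers who want to check the arithmetic: under the common assumption that each doubling of thinking time is worth a roughly fixed number of Elo at a given time control, a 25% effective speedup is worth about elo_per_doubling * log2(1.25). The per-doubling values below are illustrative assumptions chosen only to reproduce the ranges quoted above; they are not measurements.

Code: Select all

import math

def speedup_to_elo(speedup_pct: float, elo_per_doubling: float) -> float:
    """Elo gain from an effective speedup, assuming a fixed Elo value per
    doubling of thinking time at the given time control."""
    return elo_per_doubling * math.log2(1.0 + speedup_pct / 100.0)

# Assumed (illustrative) Elo-per-doubling values at three time controls.
for label, per_doubling in [("slow", 70), ("blitz", 100), ("bullet", 130)]:
    print(f"{label:>6}: ~{speedup_to_elo(25, per_doubling):.0f} Elo for a 25% speedup")
# prints roughly 23, 32 and 42 Elo, in line with the 20-25 / 30-35 / 40-45 ranges above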
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A new way to compare chess programs

Post by Laskos »

lkaufman wrote: I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
Don has shown that Komodo improves with the number of doublings a bit faster than Houdini, beyond the simple rating compression at longer TC. So, say, 50% for Houdini at short TC is probably 30% at long TC: not the same constant, but still much closer to a constant than Elo points, which might be 70 points at short TC and 20 points at long TC. Anyway, what you propose makes more sense than a simple "engine X is 50 points above engine Y", and for most engines the constants you propose are really pretty constant.
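
The same relation can be run in reverse to make this concrete: converting an Elo gap back into an equivalent time handicap needs an Elo-per-doubling value, which itself shrinks at longer time controls. The per-doubling numbers below are illustrative assumptions only, picked so that 70 and 20 Elo map to roughly the 50% and 30% figures above.

Code: Select all

import math

def elo_to_speedup(elo_gap: float, elo_per_doubling: float) -> float:
    """Equivalent speedup (in %) for an Elo gap, assuming a fixed Elo value
    per doubling of time at that time control."""
    return (2.0 ** (elo_gap / elo_per_doubling) - 1.0) * 100.0

# Assumed per-doubling values; they are smaller at long TC, which is why the
# Elo gap shrinks much faster than the equivalent time handicap does.
print(elo_to_speedup(70, 120))   # short TC: ~50%
print(elo_to_speedup(20, 55))    # long TC:  ~29%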

Kai
Uri Blass
Posts: 10280
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: A new way to compare chess programs

Post by Uri Blass »

lkaufman wrote: When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
I disagree with your assumption, and I think that in most cases it is going to be less than 25% at one hour plus one minute.

Note that this is based on testing experience: I tested Movei in the past at unequal time controls and found that the time advantage it needs for a 50% score against Rybka is bigger at slower time controls.

It is not that Movei is especially strong at blitz relative to other programs of similar strength, but making it 10 times faster with no other change is going to make it relatively stronger at blitz.

With a difference of 1:4 rather than 1:10 you may have exceptions, but the general tendency is that stronger programs gain more from additional time (not in the sense of rating points, but in the sense of the time handicap that you need to get 50%).

Edit: I see that your 25% is not a difference of 1:4 but something like 1:1.25, and of course with these small differences there may be more exceptions, but basically I am against the idea of measuring improvement with a single time control.

It is better if you have two numbers that are not Elo, one for 6"+0.1" and one for 60"+1".

Uri
MM
Posts: 766
Joined: Sun Oct 16, 2011 11:25 am

Re: A new way to compare chess programs

Post by MM »

lkaufman wrote: Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
Hi Larry, I apologize for quoting only the last part of your post, but it is the part that interests me most.

Frankly, I'm not interested in the MP version of the next Komodo for now, but in the strength of a single core, so I wonder if there will be a release in a reasonable time.

Keep in mind that in September (or before) Houdini 3 will be released, and at that point the chance that Komodo will be stronger than it at blitz will be much lower than it is now.

No intention to press you, of course, just to let you know.

30-35 Elo in blitz should guarantee the lead in the rating lists, or at least the superiority of Komodo over Houdini 2.0 in engine-vs-engine matches right now. But, honestly, I suspect that you want more.

Can you explain your plans?

Thank you

P.S. Don't forget tactics; they are the weak part of your engine.

Best Regards
MM
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: A new way to compare chess programs

Post by lkaufman »

Uri Blass wrote:
lkaufman wrote: When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
I disagree with your assumption, and I think that in most cases it is going to be less than 25% at one hour plus one minute.

Note that this is based on testing experience: I tested Movei in the past at unequal time controls and found that the time advantage it needs for a 50% score against Rybka is bigger at slower time controls.

It is not that Movei is especially strong at blitz relative to other programs of similar strength, but making it 10 times faster with no other change is going to make it relatively stronger at blitz.

With a difference of 1:4 rather than 1:10 you may have exceptions, but the general tendency is that stronger programs gain more from additional time (not in the sense of rating points, but in the sense of the time handicap that you need to get 50%).

Edit: I see that your 25% is not a difference of 1:4 but something like 1:1.25, and of course with these small differences there may be more exceptions, but basically I am against the idea of measuring improvement with a single time control.

It is better if you have two numbers that are not Elo, one for 6"+0.1" and one for 60"+1".

Uri
I agree that it is better to talk about the effective speedup at two (or even three) time controls than just to quote one number, since indeed the number is not necessarily a constant. This does not detract from the idea itself, though. It is also probably generally true that weaker programs need larger time odds to catch stronger ones at longer time controls, because the stronger ones are likely to be stronger in part due to better "scaling". If a program author believes that his new version has better scaling than the old one, it would be to his advantage to quote two speedup numbers at different time controls, just as you suggest. If he believes scaling was not affected, he can just measure the fast speedup and leave it to others to confirm that it is similar at longer time controls.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: A new way to compare chess programs

Post by lkaufman »

MM wrote:
lkaufman wrote: Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
Hi Larry, I apologize for quoting only the last part of your post, but it is the part that interests me most.

Frankly, I'm not interested in the MP version of the next Komodo for now, but in the strength of a single core, so I wonder if there will be a release in a reasonable time.

Keep in mind that in September (or before) Houdini 3 will be released, and at that point the chance that Komodo will be stronger than it at blitz will be much lower than it is now.

No intention to press you, of course, just to let you know.

30-35 Elo in blitz should guarantee the lead in the rating lists, or at least the superiority of Komodo over Houdini 2.0 in engine-vs-engine matches right now. But, honestly, I suspect that you want more.

Can you explain your plans?

Thank you

P.S. Don't forget tactics; they are the weak part of your engine.

Best Regards
We are satisfied with the current Komodo as good enough to release as Komodo 5, but we have some obligation to release Komodo 4 MP first if at all possible. It is working now, but the speedups are too low and we don't yet know why. As soon as we figure out what the problem is and fix it, we can do a release within a few days.
Mike S.
Posts: 1480
Joined: Thu Mar 09, 2006 5:33 am

Re: A new way to compare chess programs

Post by Mike S. »

MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
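
One caveat worth quantifying: with only 100 positions, differences of a few points are well within the noise. A quick sketch of rough 95% intervals (normal approximation, using the solve counts from the post above):

Code: Select all

import math

def solve_rate_ci(solved: int, total: int, z: float = 1.96):
    """Rough 95% confidence interval for a solve rate (normal approximation)."""
    p = solved / total
    half = z * math.sqrt(p * (1.0 - p) / total)
    return p - half, p + half

for name, solved in [("Komodo 4 (1 core)", 64), ("Komodo 3", 71),
                     ("Zappa Mexico II", 65), ("Rybka 4.1", 70)]:
    lo, hi = solve_rate_ci(solved, 100)
    print(f"{name:20s} {solved}/100  ~{lo:.0%}-{hi:.0%}")
# the intervals overlap heavily, so gaps of this size could easily be noise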
Regards, Mike
MM
Posts: 766
Joined: Sun Oct 16, 2011 11:25 am

Re: A new way to compare chess programs

Post by MM »

Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
Thank you for the link.
In my opinion this list is pretty unsound, mainly for three reasons:

1. It mixes results from 1-core and 4-core engines.
2. You can't be sure that an engine which solves a certain number of positions with one core will solve many more on 4 cores.
3. Not all tactical tests are identical. One engine may be able to solve a certain kind of tactical test and not another; it can depend on the "theme" of the test.

Anyway, I'm pretty confident about what I wrote, because I have watched with my own eyes hundreds of games of Komodo 4 (which I bought) and seen its tactical weakness, at least against Houdini.

I simply compared Komodo's strength in tactics with its strength in positional play, and I see that Komodo's problem is mainly in tactics.
Best Regards
MM
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A new way to compare chess programs

Post by Don »

MM wrote:
Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
Thank you for the link.
In my opinion this list is pretty unsound, mainly for three reasons:

1. It mixes results from 1-core and 4-core engines.
2. You can't be sure that an engine which solves a certain number of positions with one core will solve many more on 4 cores.
3. Not all tactical tests are identical. One engine may be able to solve a certain kind of tactical test and not another; it can depend on the "theme" of the test.

Anyway, I'm pretty confident about what I wrote, because I have watched with my own eyes hundreds of games of Komodo 4 (which I bought) and seen its tactical weakness, at least against Houdini.

I simply compared Komodo's strength in tactics with its strength in positional play, and I see that Komodo's problem is mainly in tactics.
Best Regards
When you see such a position, put it in a file with a description and the FEN, and please send it to us when you have a few. We are only interested if the move represents a real blunder, not a move that loses in a position that was lost anyway.

We have had too many people send us examples that did not hold up by this measure; they would just show us positions where Komodo was already losing, and then Komodo would play some move that would be met by a spectacular response, making the move look like a terrible blunder (but the position was already dead lost). So please make sure you have a legitimate blunder and not just a move that you don't like. There were 2 or 3 shown on this forum and I refuted them all by showing that ALL moves lose.

The converse happens too: someone showed us a position where Komodo "missed" the winning move, but there was nothing wrong with Komodo's move; it was just not as spectacular as the more obvious winning move.

But we are always interested in legitimate examples, so please feel free to bundle up some examples and send them to us.
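
One possible way to bundle such reports (an illustration only; no particular format was asked for): a plain text file with one FEN plus a short description per line, sanity-checked with the python-chess library before sending.

Code: Select all

# Sketch: validate a file of "FEN ; description" lines before sending it off.
# Requires the python-chess package (pip install chess); the file format here
# is just an illustration, not anything the Komodo team specified.
import chess

def check_report_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fen, _, description = line.partition(";")
            try:
                board = chess.Board(fen.strip())
            except ValueError:
                print(f"line {lineno}: bad FEN")
                continue
            if not board.is_valid():
                print(f"line {lineno}: illegal position")
            if not description.strip():
                print(f"line {lineno}: missing description")

# Example line (hypothetical position and note):
# r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3 ; description of the suspected blunder goes here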
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: A new way to compare chess programs

Post by Werewolf »

Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:
Rybka 4.1 has two entries: one on four cores, which scored 79/100, and one on a single core, which scored 70/100.