A new way to compare chess programs


lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

A new way to compare chess programs

Post by lkaufman »

When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
Here are some advantages:
1. If you double the speed of your program, the result will be 100% at any time control; there is no need to test. We would say the new engine is "100% faster" than the old one. Any stated superiority would have a clear meaning: the advantage is the same as you would get by increasing the speed of your hardware by that percentage.
2. The gain you measure for your engine at bullet chess is likely to be similar at slow chess, unless you know of a reason otherwise. This offsets the obvious problem of needing to fine-tune the times in a match to get the result. We've all been disappointed in the past to see large Elo gains observed at bullet speed drop drastically in serious chess. This shouldn't happen if measured this way.
3. The superiority sounds more impressive. A player rated, say, 1800 may not care if a 3000+ program is improved by 75 Elo, but if you tell him that it will get the same answers in half the time, he may be more impressed.
4. The advantage of using more cores is pretty much known regardless of the engine. Thus, if you can get a 3-to-1 speedup by using 4 cores, it means that MP on a quad is worth 200% regardless of the engine.
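
As a rough sketch of how such a measurement might be automated (an illustration only, not a description of how Komodo is actually tested): bisect on the weaker engine's time multiplier until it scores about 50%. The play_match() helper below is hypothetical; it stands in for whatever match runner you use, run with ponder off.

Code: Select all

# Sketch: find the extra-time percentage the weaker engine needs to score 50%.
# play_match() is a hypothetical placeholder for your own match runner; it
# should give the weaker engine base_time * time_factor, the stronger engine
# base_time, and return the weaker engine's score as a fraction (0.0 to 1.0).

def play_match(time_factor: float, games: int = 1000) -> float:
    raise NotImplementedError("hook up your own ponder-off match runner here")

def equivalent_speedup(lo: float = 1.0, hi: float = 2.0,
                       tol: float = 0.01, games: int = 1000) -> float:
    """Bisect on the weaker engine's time multiplier until it scores ~50%."""
    # search window: 0% to 100% extra time; widen hi if the gap is larger
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if play_match(mid, games) < 0.5:
            lo = mid        # still scoring under 50%: needs more extra time
        else:
            hi = mid        # already at 50% or better: try less extra time
    return ((lo + hi) / 2.0 - 1.0) * 100.0   # e.g. 1.25 -> "25% superiority"

In practice each probe needs a lot of games because the score is noisy, so interpolating between two or three fixed handicaps is probably cheaper than a full bisection.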

Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
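
For readers who want to check the arithmetic: under the common assumption that each doubling of thinking time is worth a roughly fixed number of Elo at a given time control, a 25% effective speedup is worth about elo_per_doubling * log2(1.25). The per-doubling values below are illustrative assumptions chosen only to reproduce the ranges quoted above; they are not measurements.

Code: Select all

import math

def speedup_to_elo(speedup_pct: float, elo_per_doubling: float) -> float:
    """Elo gain from an effective speedup, assuming a fixed Elo value per
    doubling of thinking time at the given time control."""
    return elo_per_doubling * math.log2(1.0 + speedup_pct / 100.0)

# Assumed (illustrative) Elo-per-doubling values at three time controls.
for label, per_doubling in [("slow", 70), ("blitz", 100), ("bullet", 130)]:
    print(f"{label:>6}: ~{speedup_to_elo(25, per_doubling):.0f} Elo for a 25% speedup")
# prints roughly 23, 32 and 42 Elo, in line with the 20-25 / 30-35 / 40-45 ranges above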
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A new way to compare chess programs

Post by Laskos »

lkaufman wrote: I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
Don has shown that Komodo improves with the number of doublings a bit faster than Houdini, beyond the simple rating compression at longer TC. So, say, 50% for Houdini at short TC is probably 30% at long TC: not the same constant, but still much closer to a constant than Elo points, which might be 70 points at short TC and 20 points at long TC. Anyway, what you propose makes more sense than a simple "engine X is 50 points above engine Y", and for most engines the constants you propose are really pretty constant.
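
The same relation can be run in reverse to make this concrete: converting an Elo gap back into an equivalent time handicap needs an Elo-per-doubling value, which itself shrinks at longer time controls. The per-doubling numbers below are illustrative assumptions only, picked so that 70 and 20 Elo map to roughly the 50% and 30% figures above.

Code: Select all

import math

def elo_to_speedup(elo_gap: float, elo_per_doubling: float) -> float:
    """Equivalent speedup (in %) for an Elo gap, assuming a fixed Elo value
    per doubling of time at that time control."""
    return (2.0 ** (elo_gap / elo_per_doubling) - 1.0) * 100.0

# Assumed per-doubling values; they are smaller at long TC, which is why the
# Elo gap shrinks much faster than the equivalent time handicap does.
print(elo_to_speedup(70, 120))   # short TC: ~50%
print(elo_to_speedup(20, 55))    # long TC:  ~29%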

Kai
Uri Blass
Posts: 10280
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: A new way to compare chess programs

Post by Uri Blass »

lkaufman wrote: When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
I disagree with your assumption, and I think that in most cases it is going to be less than 25% at one hour plus one minute.

Note that this is based on testing experience: I tested Movei in the past at unequal time controls and found that the time advantage it needs for a 50% score against Rybka is bigger at slower time controls.

It is not that Movei is especially strong at blitz relative to other programs of similar strength, but making it 10 times faster with no other change is going to make it relatively stronger at blitz.

With a difference of 1:4 rather than 1:10 you may have exceptions, but the general tendency is that stronger programs gain more from additional time (not in the sense of rating points, but in the sense of the time handicap that you need to get 50%).

Edit: I see that your 25% is not a difference of 1:4 but something like 1:1.25, and of course with these small differences there may be more exceptions, but basically I am against the idea of measuring improvement with a single time control.

It is better if you have two numbers that are not Elo, one for 6"+0.1" and one for 60"+1".

Uri
MM
Posts: 766
Joined: Sun Oct 16, 2011 11:25 am

Re: A new way to compare chess programs

Post by MM »

lkaufman wrote: Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
Hi Larry, I apologize for quoting only the last part of your post, but it is the part that interests me most.

Frankly, I'm not interested in the MP version of the next Komodo for now, but in the strength of a single core, so I wonder if there will be a release in a reasonable time.

Keep in mind that in September (or before) Houdini 3 will be released, and at that point the chance that Komodo will be stronger than it at blitz will be much lower than it is now.

No intention to press you, of course, just to let you know.

30-35 Elo in blitz should guarantee the lead in the rating lists, or at least the superiority of Komodo over Houdini 2.0 in engine-vs-engine matches right now. But, honestly, I suspect that you want more.

Can you explain your plans?

Thank you

P.S. Don't forget tactics; they are the weak part of your engine.

Best Regards
MM
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: A new way to compare chess programs

Post by lkaufman »

Uri Blass wrote:
lkaufman wrote: When comparing one engine to another, whether a predecessor or an unrelated engine, we specify the superiority of the stronger program in Elo points. There is, in my view, one major problem with this (aside from the difficulty of measurement): it is highly dependent on the time limit. In general, the longer the time limit, the smaller the Elo difference will be, though of course there are plenty of exceptions. We can argue about whether this is due to the higher draw percentage as the quality of play goes up, or to the lessened importance of tactics with depth, or both, or something else, but it doesn't matter; the result is the same.
I propose a new way to specify superiority of one engine to another. The idea is to specify how much additional time, as a percentage, the weaker program needs to score 50% against the stronger one, in a match with ponder off. Of course the result will still usually depend somewhat on the precise time control, but not in a predictable way. If the figure is 25% at 6" + 0.1", 25% is probably the best guess you can make for the result at one hour plus one minute, absent other information.
I disagree with your assumption, and I think that in most cases it is going to be less than 25% at one hour plus one minute.

Note that this is based on testing experience: I tested Movei in the past at unequal time controls and found that the time advantage it needs for a 50% score against Rybka is bigger at slower time controls.

It is not that Movei is especially strong at blitz relative to other programs of similar strength, but making it 10 times faster with no other change is going to make it relatively stronger at blitz.

With a difference of 1:4 rather than 1:10 you may have exceptions, but the general tendency is that stronger programs gain more from additional time (not in the sense of rating points, but in the sense of the time handicap that you need to get 50%).

Edit: I see that your 25% is not a difference of 1:4 but something like 1:1.25, and of course with these small differences there may be more exceptions, but basically I am against the idea of measuring improvement with a single time control.

It is better if you have two numbers that are not Elo, one for 6"+0.1" and one for 60"+1".

Uri
I agree that it is better to talk about the effective speedup at two (or even three) time controls than just to quote one number, since indeed the number is not necessarily a constant. This does not detract from the idea itself, though. It is also probably generally true that weaker programs need larger time odds to catch stronger ones at longer time controls, because the stronger ones are likely to be stronger in part due to better "scaling". If a program author believes that his new version has better scaling than the old one, it would be to his advantage to quote two speedup numbers at different time controls, just as you suggest. If he believes scaling was not affected, he can just measure the fast speedup and leave it to others to confirm that it is similar at longer time controls.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: A new way to compare chess programs

Post by lkaufman »

MM wrote:
lkaufman wrote: Currently I'm attempting to measure the improvement of Komodo since our last release in this manner. At the moment it appears to be just a tad under 25%. This would translate to something like 20-25 Elo at intermediate levels, 30-35 at blitz, and 40-45 at bullet chess, give or take, which is roughly what we observe.
Hi Larry, I apologize for quoting only the last part of your post, but it is the part that interests me most.

Frankly, I'm not interested in the MP version of the next Komodo for now, but in the strength of a single core, so I wonder if there will be a release in a reasonable time.

Keep in mind that in September (or before) Houdini 3 will be released, and at that point the chance that Komodo will be stronger than it at blitz will be much lower than it is now.

No intention to press you, of course, just to let you know.

30-35 Elo in blitz should guarantee the lead in the rating lists, or at least the superiority of Komodo over Houdini 2.0 in engine-vs-engine matches right now. But, honestly, I suspect that you want more.

Can you explain your plans?

Thank you

P.S. Don't forget tactics; they are the weak part of your engine.

Best Regards
We are satisfied with the current Komodo as good enough to release as Komodo 5, but we have some obligation to release Komodo 4 MP first if at all possible. It is working now, but the speedups are too low and we don't yet know why. As soon as we figure out what the problem is and fix it, we can do a release within a few days.
Mike S.
Posts: 1480
Joined: Thu Mar 09, 2006 5:33 am

Re: A new way to compare chess programs

Post by Mike S. »

MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
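
One caveat worth quantifying: with only 100 positions, differences of a few points are well within the noise. A quick sketch of rough 95% intervals (normal approximation, using the solve counts from the post above):

Code: Select all

import math

def solve_rate_ci(solved: int, total: int, z: float = 1.96):
    """Rough 95% confidence interval for a solve rate (normal approximation)."""
    p = solved / total
    half = z * math.sqrt(p * (1.0 - p) / total)
    return p - half, p + half

for name, solved in [("Komodo 4 (1 core)", 64), ("Komodo 3", 71),
                     ("Zappa Mexico II", 65), ("Rybka 4.1", 70)]:
    lo, hi = solve_rate_ci(solved, 100)
    print(f"{name:20s} {solved}/100  ~{lo:.0%}-{hi:.0%}")
# the intervals overlap heavily, so gaps of this size could easily be noise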
Regards, Mike
MM
Posts: 766
Joined: Sun Oct 16, 2011 11:25 am

Re: A new way to compare chess programs

Post by MM »

Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
Thank you for the link.
In my opinion this list is pretty unsound, mainly for three reasons:

1. It mixes results from 1-core and 4-core engines.
2. You can't be sure that an engine which solves a certain number of positions with one core will solve many more on 4 cores.
3. Not all tactical tests are identical. One engine may be able to solve a certain kind of tactical test and not another; it can depend on the "theme" of the test.

Anyway, I'm pretty confident about what I wrote, because I have watched with my own eyes hundreds of games of Komodo 4 (which I bought) and seen its tactical weakness, at least against Houdini.

I simply compared Komodo's strength in tactics with its strength in positional play, and I see that Komodo's problem is mainly in tactics.
Best Regards
MM
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A new way to compare chess programs

Post by Don »

MM wrote:
Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:

An engine known to have both a tactical style and tactical strength, Spark 1.0 (though it's clearly weaker than Komodo overall, in games), solved 79 on four cores.

The test is not public but I think it is quite difficult. The best result so far was 90/100.
Thank you for the link.
In my opinion this list is pretty unsound, mainly for three reasons:

1. It mixes results from 1-core and 4-core engines.
2. You can't be sure that an engine which solves a certain number of positions with one core will solve many more on 4 cores.
3. Not all tactical tests are identical. One engine may be able to solve a certain kind of tactical test and not another; it can depend on the "theme" of the test.

Anyway, I'm pretty confident about what I wrote, because I have watched with my own eyes hundreds of games of Komodo 4 (which I bought) and seen its tactical weakness, at least against Houdini.

I simply compared Komodo's strength in tactics with its strength in positional play, and I see that Komodo's problem is mainly in tactics.
Best Regards
When you see such a position, put it in a file with a description and the FEN, and please send it to us when you have a few. We are only interested if the move represents a real blunder, not a move that loses in a position that was lost anyway.

We have had too many people send us examples that did not hold up by this measure; they would just show us positions where Komodo was already losing, and then Komodo would play some move that would be met by a spectacular response, making the move look like a terrible blunder (but the position was already dead lost). So please make sure you have a legitimate blunder and not just a move that you don't like. There were 2 or 3 shown on this forum and I refuted them all by showing that ALL moves lose.

The converse happens too: someone showed us a position where Komodo "missed" the winning move, but there was nothing wrong with Komodo's move; it was just not as spectacular as the more obvious winning move.

But we are always interested in legitimate examples, so please feel free to bundle up some examples and send them to us.
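
One possible way to bundle such reports (an illustration only; no particular format was asked for): a plain text file with one FEN plus a short description per line, sanity-checked with the python-chess library before sending.

Code: Select all

# Sketch: validate a file of "FEN ; description" lines before sending it off.
# Requires the python-chess package (pip install chess); the file format here
# is just an illustration, not anything the Komodo team specified.
import chess

def check_report_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fen, _, description = line.partition(";")
            try:
                board = chess.Board(fen.strip())
            except ValueError:
                print(f"line {lineno}: bad FEN")
                continue
            if not board.is_valid():
                print(f"line {lineno}: illegal position")
            if not description.strip():
                print(f"line {lineno}: missing description")

# Example line (hypothetical position and note):
# r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3 ; description of the suspected blunder goes here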
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: A new way to compare chess programs

Post by Werewolf »

Mike S. wrote:
MM wrote: P.S. Don't forget tactics; they are the weak part of your engine.
The following test result indicates the opposite:

http://rybkaforum.net/cgi-bin/rybkaforu ... pid=414852

Most results are from four CPU cores, but Komodo 4's result on one core only (64/100) is not far from e.g. Zappa Mexico II (65) or Rybka 4.1 (70), which used four cores each. Komodo 3 scored even better: 71/100. :mrgreen:
Rybka 4.1 has two entries: one on four cores, which scored 79/100, and one on a single core, which scored 70/100.